Thanks for finding a way to provide the logs. Here are a few things I observed while reviewing them:
- So far, we only have what you might call a "random sampling" of what is going on. Some of the time, the logs were limited to a small size, and filled up almost instantly, before anything important happened, so they don't show anything useful. When the logs were allowed to grow, an extremely large log covering several hours contained only 6 place calls, none of which had an error. With only 6 place calls, it is hard to say this is in any way unexpected, since we don't yet know what the problem is.
- There was one application error, which follows:
Quote:
2008/07/18 18:00:11.390: ERROR: Connect: SQLConnect failed.
2008/07/18 18:00:11.390: SQL Error State: 08001
Native Error Code: 11
ODBC Error: [Microsoft][ODBC SQL Server Driver][Shared Memory]SQL Server does not exist or access denied.
2008/07/18 18:00:11.390: SQL Error State: 01000
Native Error Code: 2
ODBC Error: [Microsoft][ODBC SQL Server Driver][Shared Memory]ConnectionOpen (Connect()).
2008/07/18 18:00:11.390: FAILURE
2008/07/18 18:00:11.390: [215] Fatal Stop
2008/07/18 18:00:11.390: FATAL
2008/07/18 18:00:11.390: Attempting to jump to Fatal Error Global Event label: 'APPLICATION_ERROR'
If periodic losses of the database connection are acceptable for your system, it might be wise to put a retry loop around the DB Connect step: try to connect, Wait for 60 seconds, and try again. If the second connection attempt also fails, go to a Fatal Stop just as you do here. Alternatively, a longer wait or more retries might be in order, depending on the situation.
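TeleFlow steps aren't written as code, but to make the shape of the retry logic concrete, here is a minimal sketch in Python. The `connect_with_retry` function and the `ConnectionError` it catches are illustrative stand-ins for the DB Connect step and its failure path, not part of any TeleFlow API:

```python
import time

RETRY_WAIT_SECONDS = 60  # wait between attempts, per the suggestion above
MAX_ATTEMPTS = 2         # the initial try plus one retry; raise this for more retries

def connect_with_retry(connect):
    """Try to connect; on failure, wait and retry before giving up entirely."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == MAX_ATTEMPTS:
                raise  # the equivalent of the Fatal Stop in the log above
            time.sleep(RETRY_WAIT_SECONDS)
```

The idea is simply that a single transient outage (like the "SQL Server does not exist or access denied" error above) no longer takes the line down; only a sustained outage does.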
- The Wait step the application runs between each pass of the main application loop is only 58 milliseconds long. At present, I don't believe there is anything to suggest this is causing problems. However, if you ever need to run full event logging again, it will make your log files grow incredibly fast. It could also be a problem if you have many application instances, all rapidly polling with nothing to do; that much polling could overwhelm other services, such as the database. You would have to keep an eye on overall performance to determine whether it causes any problems. One thing to consider, though: if there is no reason to poll this frequently (i.e., waiting 60 seconds between checks won't cause a backlog of jobs, and there is no "just in time" demand from the application's perspective), reducing the frequency now could save you some trouble down the road.
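To put a rough number on that polling rate, here is the back-of-envelope arithmetic (ignoring the time each loop pass itself takes, so the real figures will be somewhat lower):

```python
WAIT_MS = 58  # current Wait step duration between loop passes

polls_per_second = 1000 / WAIT_MS
polls_per_hour = polls_per_second * 3600

print(round(polls_per_second, 1))  # roughly 17.2 polls per second
print(round(polls_per_hour))       # roughly 62,069 polls per hour, per instance
```

Multiply that by the number of application instances and it is easy to see how full logging fills a size-limited log almost instantly, and why a 60-second Wait (60 polls per hour) is a very different load profile.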
All that said, since you have indicated that you do see yellow lights in TeleFlow Monitor, and we haven't been successful in getting good logs showing what is going on, here is how I think we should proceed to get to the bottom of the issues most quickly:
In TeleFlow Config, make the following settings:
- Check "Unlimited Log File Size"
- Set "Log Lines" to 2000
In your active application list file, set all lines as follows:
- Log output: Errors Only
- Append to file: Checked
- No limit: NOT checked
- Restart Count: 5
The settings listed above will have the following effects:
- Every time an application error occurs, 2000 lines of log will result. This should be more than enough to see what is going wrong.
- The log will be appended to, allowing you to see multiple errors if they occur.
- After any single line stops because of application errors more than 5 times, it will no longer be restarted. (This is a safeguard against a situation where the same line goes into a spin of starting, encountering an error, restarting, encountering the same error, and so on.)
Once you restart TeleFlow Server with these settings in place, you can watch for errors by looking for red and yellow lights in TeleFlow Monitor. Every red/yellow light indicates that an application error has occurred and a log has been generated. (I know you have something restarting the TeleFlow Server service every now and then. Restarting TFServer will set the lights back to green and reset the restart count to 0, so if you leave your restart process in place, it will interfere with your ability to use TeleFlow Monitor to see that errors have occurred. For these reasons, I would recommend disabling that process for now. If you can't do so for service reasons, you will have to keep an eye on the log directory for new log files.)
With the settings listed above, the logs will only be created when application errors occur, which means it should be fairly easy to determine what is going wrong.
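If you do end up watching the log directory by hand, a small script can do the watching for you. This is just a sketch; the `LOG_DIR` path is a hypothetical placeholder you would replace with your actual TeleFlow log directory:

```python
import os
import time

LOG_DIR = r"C:\TeleFlow\Logs"  # hypothetical path; substitute your real log directory
CHECK_SECONDS = 60             # how often to check for new files

def new_log_files(log_dir, known):
    """Return any files in log_dir that are not in the `known` set."""
    return sorted(set(os.listdir(log_dir)) - set(known))

def watch(log_dir=LOG_DIR, check_seconds=CHECK_SECONDS):
    """Report each new log file as it appears."""
    known = set(os.listdir(log_dir))
    while True:
        for name in new_log_files(log_dir, known):
            print("New error log:", name)
            known.add(name)
        time.sleep(check_seconds)
```

Since logs are now only written when an application error occurs, every new file this reports corresponds to an error worth investigating.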