Thanks for finding a way to provide the logs. Here are a few things I observed while reviewing them:
- So far, we only have what you might call a "random sampling" of what is going on. Some of the time, the logs were limited to a small size, and filled up almost instantly, before anything important happened, so they don't show anything useful. When the logs were allowed to grow, an extremely large log covering several hours contained only 6 place calls, none of which had an error. With only 6 place calls, it is hard to say this is in any way unexpected, since we don't yet know what the problem is.
- There was one application error, which follows:
Quote:
2008/07/18 18:00:11.390: ERROR: Connect: SQLConnect failed.
2008/07/18 18:00:11.390: SQL Error State: 08001
Native Error Code: 11
ODBC Error: [Microsoft][ODBC SQL Server Driver][Shared Memory]SQL Server does not exist or access denied.
2008/07/18 18:00:11.390: SQL Error State: 01000
Native Error Code: 2
ODBC Error: [Microsoft][ODBC SQL Server Driver][Shared Memory]ConnectionOpen (Connect()).
2008/07/18 18:00:11.390: FAILURE
2008/07/18 18:00:11.390: [215] Fatal Stop
2008/07/18 18:00:11.390: FATAL
2008/07/18 18:00:11.390: Attempting to jump to Fatal Error Global Event label: 'APPLICATION_ERROR'
If periodic losses of the database connection are acceptable for your system, it might be wise to put a retry loop around the DB Connect step: try to connect, Wait for 60 seconds, and try again. If the second connection attempt also fails, go to a Fatal Stop just as you do here. Alternatively, a longer wait or more retries might be in order, depending on the situation.
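TeleFlow steps aren't written as code, but to make the shape of the retry logic concrete, here is a minimal sketch in Python. The `connect_with_retry` function and the `ConnectionError` it catches are illustrative stand-ins for the DB Connect step and its failure path, not part of any TeleFlow API:

```python
import time

RETRY_WAIT_SECONDS = 60  # wait between attempts, per the suggestion above
MAX_ATTEMPTS = 2         # the initial try plus one retry; raise this for more retries

def connect_with_retry(connect):
    """Try to connect; on failure, wait and retry before giving up entirely."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == MAX_ATTEMPTS:
                raise  # the equivalent of the Fatal Stop in the log above
            time.sleep(RETRY_WAIT_SECONDS)
```

The idea is simply that a single transient outage (like the "SQL Server does not exist or access denied" error above) no longer takes the line down; only a sustained outage does.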
- The Wait step the application runs between each pass of the main application loop is only 58 milliseconds long. At present, I don't believe there is anything to suggest this is causing problems. However, if you ever need to run full event logging again, it will make your log files grow incredibly fast. It could also be a problem if you have many application instances, all rapidly polling with nothing to do; that much polling could overwhelm other services, such as the database. You would have to keep an eye on overall performance to determine whether it causes any problems. One thing to consider, though: if there is no reason to poll this frequently (i.e., waiting 60 seconds between checks won't cause a backlog of jobs, and there is no "just in time" demand from the application's perspective), reducing the frequency now could save you some trouble down the road.
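To put a rough number on that polling rate, here is the back-of-envelope arithmetic (ignoring the time each loop pass itself takes, so the real figures will be somewhat lower):

```python
WAIT_MS = 58  # current Wait step duration between loop passes

polls_per_second = 1000 / WAIT_MS
polls_per_hour = polls_per_second * 3600

print(round(polls_per_second, 1))  # roughly 17.2 polls per second
print(round(polls_per_hour))       # roughly 62,069 polls per hour, per instance
```

Multiply that by the number of application instances and it is easy to see how full logging fills a size-limited log almost instantly, and why a 60-second Wait (60 polls per hour) is a very different load profile.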
All that said, since you have indicated that you do see yellow lights in TeleFlow Monitor, and we haven't been successful in getting good logs showing what is going on, here is how I think we should proceed to get to the bottom of the issues most quickly:
In TeleFlow Config, make the following settings:
- Check "Unlimited Log File Size"
- Set "Log Lines" to 2000
In your active application list file, set all lines as follows:
- Log output: Errors Only
- Append to file: Checked
- No limit: NOT checked
- Restart Count: 5
The settings listed above will have the following effects:
- Every time an application error occurs, 2000 lines of log will result. This should be more than enough to see what is going wrong.
- The log will be appended to, allowing you to see multiple errors if they occur.
- After any single line stops because of application errors more than 5 times, it will no longer be restarted. (This is a safeguard against a situation where the same line goes into a spin of starting, encountering an error, restarting, encountering the same error, and so on.)
Once you restart TeleFlow Server with these settings in place, you can watch for errors by looking for red and yellow lights in TeleFlow Monitor. Every red/yellow light indicates that an application error has occurred and a log has been generated. (I know you have something restarting the TeleFlow Server service every now and then. Restarting TFServer will set the lights back to green and reset the restart count to 0, so if you leave your restart process in place, it will interfere with your ability to use TeleFlow Monitor to see that errors have occurred. For these reasons, I would recommend disabling that process for now. If you can't do so for service reasons, you will have to keep an eye on the log directory for new log files.)
With the settings listed above, the logs will only be created when application errors occur, which means it should be fairly easy to determine what is going wrong.
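If you do end up watching the log directory by hand, a small script can do the watching for you. This is just a sketch; the `LOG_DIR` path is a hypothetical placeholder you would replace with your actual TeleFlow log directory:

```python
import os
import time

LOG_DIR = r"C:\TeleFlow\Logs"  # hypothetical path; substitute your real log directory
CHECK_SECONDS = 60             # how often to check for new files

def new_log_files(log_dir, known):
    """Return any files in log_dir that are not in the `known` set."""
    return sorted(set(os.listdir(log_dir)) - set(known))

def watch(log_dir=LOG_DIR, check_seconds=CHECK_SECONDS):
    """Report each new log file as it appears."""
    known = set(os.listdir(log_dir))
    while True:
        for name in new_log_files(log_dir, known):
            print("New error log:", name)
            known.add(name)
        time.sleep(check_seconds)
```

Since logs are now only written when an application error occurs, every new file this reports corresponds to an error worth investigating.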