How to Troubleshoot Intermittent SQL Timeout Errors

.NetSqlSql ServerSql Server-2008Timeout

.Net Problem Overview


We've been having a few instances per day where we get a slew of SQL Timeout errors from multiple applications (System.Data.SqlClient.SqlException: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.) We have over 100 different applications on our network, both web and desktop apps. Everything from VB6 and Classic ASP to .NET 4. I can find all kinds of data that show the side effects but can't pinpoint what is causing this. Our DBA says nothing is wrong with the SQL server, and IT says there's nothing wrong with the web servers or network, so of course I'm left in the middle trying to troubleshoot this.

I'm really just looking for suggestions on what other troubleshooting I can do to try and track this down.

We're running SQL Server 2008 R2 in a cluster. There's a handful of different servers that connect to it, ranging from Windows server 2003 to 2008 of different varieties.

Here's what I've done so far:

  • Run SQL trace of long running queries and deadlocks. This shows no deadlocks at the times of the problems, and long running queries all coincide with our timeout errors, but look to be a side effect, and not the cause. Queries that are very basic that typically return instantly end up taking 30, 60 or 120 seconds to run at times. This happens for a few minutes then everything picks up and works fine after that.
  • Use performance monitor to track connection pool connections. This sometimes shows some spikes in the number of connections near the times of the timeouts, but still not even halfway to the default 100 connection limit. Again, nothing here that seems to point to a cause.
  • Separate web applications into different App Pools. We tried to narrow down the apps we thought may be the main problem (most chatty, etc) and put them in separate Application Pools but that doesn't seem to affect anything or help us narrow down anything.
  • Monitor disk usage on SQL Server. We've done some monitoring on the SQL server and see no spikes or any signs of problems when these timeouts are occurring.
  • Verified TempDB was not the cause of the problem.

I'll come back and add more if I think of what else we've tried. Please let me know some ideas on what to troubleshoot next.

.Net Solutions


Solution 1 - .Net

> Run SQL trace of long running queries and deadlocks. This shows no > deadlocks at the times of the problems, and long running queries all > coincide with our timeout errors, but look to be a side effect, and > not the cause. Queries that are very basic that typically return > instantly end up taking 30, 60 or 120 seconds to run at times. This > happens for a few minutes then everything picks up and works fine > after that.

It looks like some queries/transaction lock your database till they are done. You have to find out which queries are blocking and rewrite them/run them at an other time to avoid blocking other processes. At this moment the waiting queries just timeout.

An extra point to dig into is the auto increment size of your transaction log and database. Set them on a fixed size instead of a percentage of the current files. If files are getting taller the time it takes to allocate enough space will eventually longer as your transaction timeout. And your db comes to a halt.

Solution 2 - .Net

Performance problems boil down to CPU, IO, or Lock contention. It sounds like you have ruled out IO. I would guess CPU is not a problem since this is a database, not a number cruncher. So, that leaves lock contention.

If you can execute a sp_who2 while the queries are timing out, you can use the BlkBy column to trace back to the holding the lock that everyone else is waiting on. Since this is only happening a few times a day, you may have trouble catching enough data if you are running this manually, so I suggest you rig up an automated system to dump this output on a regular basis, or maybe to be triggered by the application timeout exceptions. You can also use the Activity Monitor to watch the degradation of query responsiveness in real-time, as suggested by peer.

Once you find the long-running query and the application that executes it, you can immediately resolve the domino of timeouts by reducing the timeout for that single application below all the others (right now, it must be longer). Then, you should inspect the code to determine a better solution. You could reduce the time the lock is held by committing the transaction sooner within a sproc, or reduce the lock required by the reading query with hints such as NOLOCK or UPDLOCK.

Here's some more reading on sp_who2: http://sqlserverplanet.com/dba/using-sp_who2/

And query hints: http://msdn.microsoft.com/en-us/library/ms181714.aspx http://msdn.microsoft.com/en-us/library/ms187373.aspx

Solution 3 - .Net

Bit of a long shot, but on a lab a while back, we had a situation where a SQL Server appeared unresponsive, not because we had spiked the CPU or anything we could track within SQL Server, it appeared operational to all tests but connections failed under some load.

The issue turned out to be due to the volume of traffic against the server meant we were triggering the in built windows Syn Attack Flood Protection within Windows. Annoyingly when you hit this, there is no logged message within windows server, or within SQL - you only see the symtpoms which are connections failing to be made - this is because windows slows down on accepting the messages and let's a queue build. From the connection standpoint, the server appears to not respond when it should (it doesn't even acknowledge the message arrived)

http://msdn.microsoft.com/en-us/library/ee377084(v=bts.10).aspx

Scroll down to SynAttackProtect and you will see the default in windows server 2003 sp1 onwards was to enable this feature by default. It is a DDOS protection mechanism in effect, and the lack of logging that it is triggering makes it incredibly difficult to detect when your server does this.

It took 3 days within the MS lab before it was figured out.

You mentioned 100 conenctions, we had an app that constantly connected, ran queries and then disconnected, it did not hold the connections open. This meant that we had multiple threads on each machine connectiong doing this, 10 machines, multiple threads per machine, and it was considered enough different connections consistently being made / dropped to trigger the defense.

Whether you are at that level (since it is not a clearly defined threshold by MS) is hard to say.

Solution 4 - .Net

Like the other posters have suggested, it sounds like you have a lock contention issue. We faced a similar issue a few weeks back; however, ours was much more intermittent, and often cleared up before we could get a DBA onto the server to run sp_who2 to trace down the issue.

What we ended up doing was implement an e-mail notification if a lock exceeded a certain threshold. Once we put this in place, we were able to identify the processes that were locking, and change the isolation level to read uncommitted where appropriate to fix the issue.

Here's an article that provides an overview of how to configure this type of notification.

If locking turns out to be the issue, and if you're not already doing so, I would suggest looking into configuring row versioning-based isolation levels.

Solution 5 - .Net

You are on the right track with your tracing and profiling. what you need to do is look for what the queries that time-out have in common - it is likely they will all be hitting a small subset of tables or indexes. I suspect some application has a long-running update/insert that affects queries on tables that use indexes affected by the updates/inserts.

You have to work backwards a little - given the subset of tables you see timing out, see what indexes are on those tables. Look for other queries that are running at the smae time that touch those tables/indexes. I'm betting you will find a small set of updates/inserts doing this.

Then you have some decisions to make. One option is to changing the locking hints on the queries that are timing out. But that is geenrally bad practice because it will mask the real problem for a while. While you mgiht see the timeouts go away for a while, depending on the hint you choose you might end up with dirty reads and then bogus data coming back from those queries. That might turn out to be worse than the timeouts - hard to say.

Best bet is to figure out which of your applications are submitting the update/inserts you found and dig in to figure out why they take so long.

Solution 6 - .Net

I suggest you have a deep look at the super cool SQL Server's Dynamic Management Views feature:

> Dynamic management views and functions return server state information > that can be used to monitor the health of a server instance, diagnose > problems, and tune performance.

This article is a good start with DMVs, although it was written for SQL 2005 (DMVs feature first appearance): Troubleshooting Performance Problems in SQL Server 2005, especially the 'blocking' chapters.

Solution 7 - .Net

My experience with these issues (not on SQL Server though) is that overdone multi-tasking is often the cause of the problem. If there is similar/connected data/tables queried at (almost) the same time by many connections, the DBMS may have trouble keeping all the isolation at check. This is not that much of an issue of disk usage as to making some connections wait for things to be done by other ones. Synchronization is very expensive in terms of CPU usage.

The 100 connections is way too much in my opinion. (In my experience again) even 20 connections asked to be done by one machine may be overly optimistic.

Solution 8 - .Net

Sounds like you may already have your answer but in case you need one more place to look you may want to check out the size and activity of your temp DB. We had an issue like this once at a client site where a few times a day their performance would horribly degrade and occasionally timeout. The problem turned out to be a separate application that was thrashing the temp DB so much it was affecting overall server performance.

Good luck with the continued troubleshooting!

Solution 9 - .Net

I've seen similar problems happen if anti-virus was installed on the SQL server. The AV's auto-update features were clocking the server and not allowing enough CPU for SQL Server.

Also, have you put a small application on the SQL server itself that verifies that connections can be made or runs very basic SQL like "SELECT GETDATE();"? This would eliminate network possibilities.

Solution 10 - .Net

Since I do troubleshooting everyday as a part of my job, here is what I would like to do:

  1. Since it's SQL Server 2008 R2, you can run SQLDiag which comes as a part of the product. You can refer books online for more details. In brief, capture Server Side trace and blocker script.

  2. Once trace is captured, look for "Attention" event. That would be the spid which has received the error. If you filter by SPID, you would see RPC:Completed event before "Attention". Check the time over there. Is that time 30 seconds? If yes, then client waited for 30 second to get response from SQL and got "timed out" [This is client setting as SQL would never stop and connection]

  3. Now, check if the query which was running really should take 30 seconds?

  4. If yes then tune the query or increase the timeout setting from the client.

  5. If no then this query must be waiting for some resources (blocked)

  6. At this point go back to Blocker Script and check the time frame when "Attention" came

Above is assuming that issue is with SQL Server not network related!

Solution 11 - .Net

We experienced this with SQL Server 2012 / SP3, when running a query via an SqlCommand object from within a C# application. The Command was a simple invocation of a stored procedure having one table parameter; we were passing a list of about 300 integers. The procedure in turn called three user-defined functions and passed the table as a parameter to each of them. The CommandTimeout was set to 90 seconds.

When running precisely the same stored proc with the same argument from within SQL Server Management Studio, the query ran in 15 seconds. But when running it from our application using the above setup, the SqlCommand timed out. The same SqlCommand (with different but comparable data) had been running successfully for weeks, but now it failed with any table argument containing more than 20 or so integers. We did a trace and discovered that when run from the SqlCommand object, the database spent the entire 90 seconds acquiring locks, and would invoke the procedure only at about the moment of the timeout. We changed the CommandTimeout time, and no matter time what we selected the stored proc would be invoked only at the very end of that period. So we surmise that SQL Server was indefinitely acquiring the same locks over and over, and that only the timeout of the Command object caused SQL Server to stop its infinite loop and begin executing the query, by which time it was too late to succeed. A simulation of this same process on a similar server using similar data exhibited no such problem. Our solution was to reboot the entire database server, after which the problem disappeared.

So it appears that there is some problem in SQL Server wherein some resource gets cumulatively consumed and never released. Eventually when connecting via an SqlConnection and running an SqlCommand involving a table parameter, SQL Server goes into an infinite loop acquiring locks. The loop is terminated by the timeout of the SqlCommand object. The solution is to reboot, apparently restoring (temporary?) sanity to SQL Server.

Solution 12 - .Net

The issue is because of a bad query the time to executing query is taking more than 60 seconds or a Lock on the Table

> The issue looks like a deadlock is occurring; we have queries which are blocking the queries to complete in time. The default timeout for a query is 60 secs and beyond that we will have the SQLException for timeout. > > Please check the SQL Server logs for deadlocks. The other way to solve the issue to to increase the Timeout on the Command Object (Temp Solution).

Solution 13 - .Net

Are these servers virtualized? On another post I've read about a SQL server running sometimes very slowly because of lack of sufficient memory. This in turn was caused by a so-called memory balloon that the virtualizer used to limit the amount of memory used by that virtual server. It was hard to find because the pressure on physical memory had nothing to do with the SQL server itself.

Another common cause for a temporary performance degradation might be a virus scanner. When a new virus definition is installed, all other processes will suffer and run very slow. Check out any other automatic update process, this might also take a lot of resources quite unexpectedly. Good luck with it!

Solution 14 - .Net

Our issue was resolved by turning off the TLS-DHE* ciphers by windows team.

Issue we had: we have SSIS packages that run from one server (sql server 2012 and windows 2012 R2) and connect to another server (SQL server 2016 SP2 and windows 2019) and we used to have timeouts now and then for some of the SSIS packages and it was randomly failing. After turning off the TLS-DHE ciphers by windows team the issue was resolved.

https://support.microsoft.com/en-us/topic/transport-layer-security-tls-connections-might-fail-or-timeout-when-connecting-or-attempting-a-resumption-326bd5b1-52a1-b367-8179-b154e5c01e90

Solution 15 - .Net

I had an issue similar to this and found out is was due to a default .Net framework setting

Sqlcommand.Timeout

http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlcommand.commandtimeout(v=VS.100).aspx

The default is 30 seconds as sated in the above url by Microsoft, try setting this to a higher number of seconds or maybe -1 before opening the connection to see if this solves the issue.

It maybe a setting in your web.config or app.config files or on you applicaiton / web server config files.

Solution 16 - .Net

I am experiencing the same issue.. and I build some logging into several functions that I could identify that were frequently running long. when I say frequently I mean about 2% of the time. So part of the log inserted the start time and the end time of the procedure or query. I then produced a simple report sorting several days of logs by the total execution time descending. here is what I found.

the long running instances always started between HH:00 and HH:02 or HH:30 an HH:32 and none of the short running queries ran between those times. Interesting....

Now It seems that there is actually more order to the chaos that I was experiencing. I was using a recovery target of 0 this implemented "indirect checkpoints" in my databases so that my recovery time could be achieved at nearly 1 minute. causing these checkpoints to be created every 30 mins.

WOW, what a coincidence!

In Microsoft's online documentation about changing the recovery time of a database comes with this little warning...

"An online transactional workload on a database that is configured for indirect checkpoints could experience performance degradation."

Wow go figure...

so I modified my recovery time and bango no more issues.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionShawn StewardView Question on Stackoverflow
Solution 1 - .NetPeterView Answer on Stackoverflow
Solution 2 - .NetMatt FausView Answer on Stackoverflow
Solution 3 - .NetAndrewView Answer on Stackoverflow
Solution 4 - .NetMichael FredricksonView Answer on Stackoverflow
Solution 5 - .Netn8wrlView Answer on Stackoverflow
Solution 6 - .NetSimon MourierView Answer on Stackoverflow
Solution 7 - .NetMarianPView Answer on Stackoverflow
Solution 8 - .NetCarthView Answer on Stackoverflow
Solution 9 - .NetUnhandledExcepSeanView Answer on Stackoverflow
Solution 10 - .NetBalmukund LakhaniView Answer on Stackoverflow
Solution 11 - .NetDave ZifferView Answer on Stackoverflow
Solution 12 - .NetAmit BaggaView Answer on Stackoverflow
Solution 13 - .NetDonyView Answer on Stackoverflow
Solution 14 - .Netuser16441078View Answer on Stackoverflow
Solution 15 - .NetJRGWVView Answer on Stackoverflow
Solution 16 - .NetryagerView Answer on Stackoverflow