Database Corruption and IO Errors

A quick way to have your day turned upside down and rip your gut out with nerves and anxiety is to come in one day to find that users are panicked, applications are not working and the HelpDesk team is curled up in the fetal position in the corner. Why? The sky is falling and everybody thinks the database has blown up.

Calmly, you settle in and check the server and eventually find your way to the error logs to see the following:

Msg 823, Level 24, State 2, Line 1

The operating system returned error 1(Incorrect function.) to SQL Server during a read at offset 0x0000104c05e000 in file ‘E:\Database\myproddb.mdf’. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Suddenly you understand and feel the collective fear and paranoia. What do you do now that the world has seemingly come to an end for your database?

Corruption

What exactly does this error message mean? Well, typically, an 823 error is a very strong indicator that there is some sort of problem with the storage system, hardware or driver that is in the path of the I/O request.

Great! That fear is getting a little heavier knowing what the error represents. This doesn’t bode well for the database. Let’s go ahead and crack out the list of what we can do or check when a problem like this hits:

  1. Check msdb.dbo.suspect_pages
  2. Run a consistency check for all databases on the same volume
  3. Check Logs (SQL, Windows, Storage system) to see if there may be additional info (via different errors/warnings) in close proximity to the 823 error.
  4. Check your drivers
  5. Restore the database

This is where your experience, training, and preparedness come in handy. An experienced data professional will be prepared with database backups (including log backups). So you are not concerned here because all of your backups are reporting successful. As you prep to pull the backups (for the past couple of days just in case) you notice that there are no available backups in your repository. Looking closer at your backup jobs you discover that the backups completed in mere seconds where they normally take hours for this database.

Now that your heart is racing, forehead is beading up with sweat, gut is sinking and the fear is very palpable – what do you do? Time to step through the rest of the steps and pull out your lucky charms, right?

Querying against suspect_pages, you find the table to be completely empty. You know that checkdb runs regularly but maybe it didn’t run last night. That is easy enough to check with a little query from here. Since a consistency check does not seem to have run (as confirmed by the script) and is the second item on the checklist, let’s go ahead and run it now.

Msg 0, Level 11, State 0, Line 0

A severe error occurred on the current command.  The results, if any, should be discarded.

Msg 0, Level 20, State 0, Line 0

A severe error occurred on the current command.  The results, if any, should be discarded.

Crud. Blood pressure and nerves are getting a little more frazzled now. Maybe we can cycle through the database and find which table is causing the problem. Let’s try a checktable of every table in the database. Before doing the checktable, one more check against suspect_pages still shows no rows to be found.

Running the checktable, every table is coming up clean except one. That one table produces the same sort of error as the checkdb and just so happens to be the largest and most critical table to the database. Blood pressure is closing in on critical now. We have a corruption issue that is severe enough that checktable cannot complete, we know we have 823 errors and some sort of IO issue and do not have a backup.

Wait…backup. Let’s try to force a backup and see what happens. We can tell the backup to continue after error so let’s see what happens. Maybe that will allow you to move the database to a different server or different spindles to try and just recover the data.

Msg 3202, Level 16, State 2, Line 1

Write on “E:\SQLBackups\myproddb.bak” failed: 1(Incorrect function.)

Msg 3013, Level 16, State 1, Line 1

BACKUP DATABASE is terminating abnormally.

The situation just does not want to get any better at this point. Time for drastic measures – shut down the SQL Server services and try to xcopy the data and log files to a different server and try to re-attach from there. Anything is worth a shot, right?

Error 1: Incorrect Function

Glad to know it is not just SQL Server throwing the errors – sorta. The corruption is ever present and there is nothing more that can be done, right? All hope is lost. Time to fill out the resume and move on to something else, right? Give it one more shot. A spark of insanity hits and you wonder if a mere query against the clustered index will work, if not then maybe something against any of the indexes to try and spare any data at all.

You rub your luck horseshoe and query the table (yes the table that checktable aborts because of corruption). Lo and behold you get results from this very simple query. How is that possible. On a whim, you drop all the Non-Clustered Indexes and try a fresh backup.

Hallelujah! The backup completes without error. Time to take this backup and restore the database to a completely different server. Then on the new server run a consistency check to determine if it is all clear. To your liking, there is absolutely no corruption at this point so the non-clustered indexes can be recreated (easy to do because you have a script with the index definitions handy).

Wrap

This journey from fire and brimstone and the world ending to the epiphany and then the sweet euphoric feelings of success is not a normal resolution for these types of errors. This sort of thing happened for a client that called trying to get around the 823 errors. We had absolutely no indication whatsoever of where the corruption was beyond knowing we had failing disks at the time. We got lucky in that the non-clustered indexes in this case ended up being stored on the bad sectors and dropping those indexes allowed us to recover the database and make it usable.

When everything is breaking, the unconventional thought (especially without having a checkdb complete to tell you which index or which pages are corrupt) of dropping indexes may just save your bacon. It saved my clients bacon!

As a follow-up item, it is important to regularly check the dbccLastKnownGood for each database. That can be done by following the script in this article.

2 Comments - Leave a comment
  1. Great article Jason. When I saw your initial list of things to do and #5 was “Restore”, my initial thought was “wait, not all corruption needs a restore to fix it.” I was very delighted that this was the exact path your story went down.

Leave a comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.



Calendar
January 2018
M T W T F S S
« Dec   Mar »
1234567
891011121314
15161718192021
22232425262728
293031  

Welcome , today is Sunday, June 17, 2018