Database Corruption and IO Errors

A quick way to have your day turned upside down and your gut ripped out with nerves and anxiety is to come in one morning to find that users are panicked, applications are not working, and the HelpDesk team is curled up in the fetal position in the corner. Why? The sky is falling and everybody thinks the database has blown up.

Calmly, you settle in and check the server and eventually find your way to the error logs to see the following:

Msg 823, Level 24, State 2, Line 1

The operating system returned error 1(Incorrect function.) to SQL Server during a read at offset 0x0000104c05e000 in file ‘E:\Database\myproddb.mdf’. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Suddenly you understand and feel the collective fear and paranoia. What do you do now that the world has seemingly come to an end for your database?

Corruption

What exactly does this error message mean? Well, typically, an 823 error is a very strong indicator that there is some sort of problem with the storage system, hardware or driver that is in the path of the I/O request.

Great! That fear is getting a little heavier knowing what the error represents. This doesn’t bode well for the database. Let’s go ahead and crack out the list of what we can do or check when a problem like this hits:

  1. Check msdb.dbo.suspect_pages (see the sample query after this list)
  2. Run a consistency check for all databases on the same volume
  3. Check Logs (SQL, Windows, Storage system) to see if there may be additional info (via different errors/warnings) in close proximity to the 823 error.
  4. Check your drivers
  5. Restore the database
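For that first item, a quick peek at the suspect_pages table might look like the following (a minimal sketch; these are the standard columns of msdb.dbo.suspect_pages):

-- Check for any pages SQL Server has flagged as suspect.
-- event_type values 1-3 cover 823/824 errors, bad checksums, and torn pages.
SELECT database_id,
       file_id,
       page_id,
       event_type,
       error_count,
       last_update_date
FROM msdb.dbo.suspect_pages;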

This is where your experience, training, and preparedness come in handy. An experienced data professional will be prepared with database backups (including log backups), so you are not overly concerned at first because all of your backups have been reporting successful. But as you prep to pull the backups (for the past couple of days, just in case), you notice that there are no available backups in your repository. Looking closer at your backup jobs, you discover that the backups completed in mere seconds, where they normally take hours for this database.

Now that your heart is racing, forehead is beading up with sweat, gut is sinking and the fear is very palpable – what do you do? Time to step through the rest of the steps and pull out your lucky charms, right?

Querying suspect_pages, you find the table to be completely empty. You know that checkdb runs regularly, but maybe it didn't run last night. That is easy enough to confirm with a little query from here. Since a consistency check does not seem to have run (as confirmed by the script), and it is the second item on the checklist, let's go ahead and run it now.
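Running the check is a one-liner, something along these lines (using the database name from the error message above):

DBCC CHECKDB ('myproddb') WITH NO_INFOMSGS, ALL_ERRORMSGS;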

Msg 0, Level 11, State 0, Line 0

A severe error occurred on the current command.  The results, if any, should be discarded.

Msg 0, Level 20, State 0, Line 0

A severe error occurred on the current command.  The results, if any, should be discarded.

Crud. Blood pressure and nerves are getting a little more frazzled now. Maybe we can cycle through the database and find which table is causing the problem. Let's try a checktable of every table in the database. Before doing the checktable, one more check against suspect_pages still shows no rows.
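One quick-and-dirty way to cycle through every table is the undocumented sp_MSforeachtable procedure (a sketch; a cursor over sys.tables would work just as well):

-- Run checktable against every table in the current database,
-- printing each table name so the failing table is easy to spot.
EXEC sp_MSforeachtable N'PRINT ''Checking ?''; DBCC CHECKTABLE (''?'') WITH NO_INFOMSGS, ALL_ERRORMSGS;';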

Running the checktable, every table comes up clean except one. That one table produces the same sort of error as the checkdb, and it just so happens to be the largest and most critical table in the database. Blood pressure is closing in on critical now. We have a corruption issue severe enough that checktable cannot complete, we know we have 823 errors and some sort of IO issue, and we do not have a backup.

Wait…backup. Let's try to force a backup and see what happens. We can tell the backup to continue after error. Maybe that will allow you to move the database to a different server or different spindles and try to just recover the data.
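The attempt is just a standard backup with the CONTINUE_AFTER_ERROR option tacked on (a sketch using the path from the errors below):

BACKUP DATABASE myproddb
    TO DISK = 'E:\SQLBackups\myproddb.bak'
    WITH INIT, CONTINUE_AFTER_ERROR;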

Msg 3202, Level 16, State 2, Line 1

Write on “E:\SQLBackups\myproddb.bak” failed: 1(Incorrect function.)

Msg 3013, Level 16, State 1, Line 1

BACKUP DATABASE is terminating abnormally.

The situation just does not want to get any better at this point. Time for drastic measures – shut down the SQL Server services and try to xcopy the data and log files to a different server and try to re-attach from there. Anything is worth a shot, right?

Error 1: Incorrect Function

Glad to know it is not just SQL Server throwing the errors – sorta. The corruption is ever present and there is nothing more that can be done, right? All hope is lost. Time to fill out the resume and move on to something else, right? Give it one more shot. A spark of insanity hits, and you wonder: will a mere query against the clustered index work? If not, maybe something against any of the indexes will spare at least some of the data.

You rub your lucky horseshoe and query the table (yes, the table that checktable aborts on because of corruption). Lo and behold, you get results from this very simple query. How is that possible? On a whim, you drop all of the non-clustered indexes and try a fresh backup.
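If you ever find yourself acting on the same whim, a sketch like the following will generate the DROP statements for review (the table name is hypothetical; script out the index definitions before dropping anything):

-- Build DROP INDEX statements for the nonclustered indexes on the corrupt table.
-- Indexes backing PRIMARY KEY or UNIQUE constraints are excluded; those require
-- dropping the constraint instead.
SELECT 'DROP INDEX ' + QUOTENAME(i.name)
     + ' ON ' + QUOTENAME(SCHEMA_NAME(t.schema_id)) + '.' + QUOTENAME(t.name) + ';'
FROM sys.indexes AS i
    INNER JOIN sys.tables AS t
        ON i.object_id = t.object_id
WHERE i.type_desc = 'NONCLUSTERED'
    AND i.is_primary_key = 0
    AND i.is_unique_constraint = 0
    AND t.name = 'MyCriticalTable';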

Hallelujah! The backup completes without error. Time to take this backup and restore the database to a completely different server. Then, on the new server, run a consistency check to determine whether it is all clear. Much to your relief, there is absolutely no corruption at this point, so the non-clustered indexes can be recreated (easy to do because you have a script with the index definitions handy).

Wrap

This journey from fire and brimstone and the world ending, to the epiphany, and then to the sweet euphoria of success is not a normal resolution for these types of errors. This sort of thing happened for a client that called trying to get around the 823 errors. We had absolutely no indication of where the corruption was beyond knowing we had failing disks at the time. We got lucky in that the non-clustered indexes in this case were stored on the bad sectors, and dropping those indexes allowed us to recover the database and make it usable.

When everything is breaking, the unconventional thought of dropping indexes (especially without a completed checkdb to tell you which index or which pages are corrupt) may just save your bacon. It saved my client's bacon!

As a follow-up item, it is important to regularly check the dbi_dbccLastKnownGood value for each database. That can be done by following the script in this article.

Last Time CheckDB was Run

Published on: January 17, 2013

Back in November 2012, I shared a story about checking the last known good checkdb in the boot page of a database. You can read that story here.

This topic is important enough that it is worth repeating frequently, if for no other reason than to continue to hammer home how important it is to both run checkdb and know the last time that checkdb ran successfully.

Alas, I am writing to fix a few things with the script that I shared in that last post.

I run this script on every server I touch to get a report for the last known good checkdb for every single database.  I had been running the script flawlessly across many servers without error.  Then it happened.  The script failed with a nasty error.

After a bit of looking, the flaw in my script became apparent. I had not written the script with case sensitivity (CS) in mind. I touch so few CS servers that I sometimes forget to check for that. Slap my hands, and I will work on that going forward.

So here is the update to the script.

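The code block itself was lost from this page, so what follows is a reconstructed sketch of the approach described: cursor through every database, capture DBCC DBINFO() WITH TABLERESULTS into a temp table, and report the dbi_dbccLastKnownGood value. The case-sensitivity fix amounts to spelling every object, column, and variable name consistently, which this sketch does.

-- Reconstructed sketch; not necessarily the original script line for line.
CREATE TABLE #DBInfo
(
    ParentObject VARCHAR(255),
    [Object] VARCHAR(255),
    Field VARCHAR(255),
    [Value] VARCHAR(255)
);

CREATE TABLE #LastKnownGood
(
    DatabaseName SYSNAME,
    LastGoodCheckDb DATETIME
);

DECLARE @DBName SYSNAME,
        @SQL NVARCHAR(MAX);

DECLARE DBCursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT name
    FROM sys.databases
    WHERE state_desc = 'ONLINE';

OPEN DBCursor;
FETCH NEXT FROM DBCursor INTO @DBName;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Dump the boot page contents for this database into the temp table
    SET @SQL = N'DBCC DBINFO (' + QUOTENAME(@DBName, '''')
             + N') WITH TABLERESULTS, NO_INFOMSGS;';

    DELETE FROM #DBInfo;

    INSERT INTO #DBInfo (ParentObject, [Object], Field, [Value])
        EXEC (@SQL);

    -- Keep just the last known good checkdb date
    INSERT INTO #LastKnownGood (DatabaseName, LastGoodCheckDb)
        SELECT @DBName, MAX(CONVERT(DATETIME, [Value]))
        FROM #DBInfo
        WHERE Field = 'dbi_dbccLastKnownGood';

    FETCH NEXT FROM DBCursor INTO @DBName;
END

CLOSE DBCursor;
DEALLOCATE DBCursor;

SELECT DatabaseName, LastGoodCheckDb
FROM #LastKnownGood
ORDER BY DatabaseName;

DROP TABLE #DBInfo;
DROP TABLE #LastKnownGood;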

On the Sixth Day…

What better way to kick off the sixth day of pre-Christmas than with six slices of foie gras? No animals were hurt in the making of this post!

This could be a pretty festive dish thanks in part to those geese.  I enjoy a little foie gras every now and again.

No, this was not just a clever ploy to bring fat into another post.  Rather, I feel that foie gras is more of a delicacy than eggs.

Today, we will be talking of a delicacy in the database world. It is a feature that should help every DBA sleep better at night. Sadly, this delicacy can be tainted by another feature that may just give you heartburn instead.

First though, let’s recap.

  1. Table Compression – Day 5
  2. Exercise for msdb – Day 4
  3. Backup, Job and Mail History Cleanup – 3rd Day
  4. Service Broker Out of Control – 2nd Day
  5. Maint Plan Logs – 1st Day

On the Sixth Day of pre-Christmas…

Picture Courtesy of AllCreatures.org

My DBA gave to me a way to ensure my databases are not corrupt.

Sadly, the method chosen is to force feed (gavage) the checkdb through an SSIS-style maintenance plan. Have you ever tried to run checkdb through a maintenance plan? It doesn't always work so well.

Many times, when running checkdb, you may run into this pretty little error message.

Msg 5808, Level 16, State 1, Line 2
Ad hoc update to system catalogs is not supported.

Well, that is just downright wrong.  If I am provided a tool to create maintenance plans, then the tool should not generate an error such as that when I want to run consistency checks against the database.  This is akin to force feeding us something that isn’t all that good for our health.  There was even a connect item filed for this behavior, here.

So, what is it that causes this behavior?

Well, there are a couple of things that contribute to this behavior, and it can be reproduced from T-SQL as well. To find the cause, I created a small test database and then created a maintenance plan to run consistency checks for that database. Then I reached for a tool that is a bit more reliable: Profiler.

Next up is to run Profiler with a filter for the test database and then start the maintenance plan. It shouldn't take too long for the maintenance plan to complete. When it does, it is time to investigate the T-SQL captured by Profiler, and it should become apparent pretty quickly what is being run during the maintenance plan that causes the checkdb to fail with the above-mentioned error.

Are you ready?  This code is pretty complex and perplexing.  You better sit down so you don’t fall from the surprise and shock.

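The captured snippet was lost from this page; based on the description that follows, it amounts to an sp_configure call plus a RECONFIGURE issued before the integrity check, something like this (the specific option being configured is an assumption on my part; it is the RECONFIGURE that matters):

-- Reconstructed from the surrounding description, not the original trace output.
-- When 'allow updates' has been set to 1, this RECONFIGURE fails with Msg 5808.
EXEC sys.sp_configure N'user options', 0;
RECONFIGURE;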

Why would a maintenance plan need to run that snippet every time that a checkdb is performed?

Now, there is another piece to this puzzle. The error is thrown only when another configuration setting is present; if we change that setting, the error no longer happens. Soooo, a quick fix would be to change that setting. The setting in question is "Allow Updates". It has a value of 1 and must be changed to 0. Since SQL Server 2005, we want it to be 0 anyway.

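The fix is the standard way to flip that setting back (reconstructed; the original block was lost from this page):

-- 'allow updates' has had no effect since SQL Server 2005,
-- but a value of 1 still causes a plain RECONFIGURE to fail.
EXEC sys.sp_configure N'allow updates', 0;
RECONFIGURE;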

Now, another option for this issue would be that the maintenance plan not run the sp_configure at all.  Or, if it is deemed necessary, that it be changed to the following.

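That change (again reconstructed from the description) would be to use the override flavor, which does not throw Msg 5808 even while "Allow Updates" is 1:

RECONFIGURE WITH OVERRIDE;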

The best option, in my opinion, is that a maintenance plan not be used for consistency checks. I'd prefer to see a custom-built routine, a routine using sp_MSforeachdb, or the maintenance routines developed by Ola Hallengren that can be found here. All of these methods require that a little more thought be put into the consistency checking of your database, but that means you will get better sleep through the night.

Last Known Good CheckDB

Published on: November 20, 2012

Diligent DBAs like to check their databases for a little thing called corruption.  Some of those DBAs are masterful at keeping records and logging that the database was checked and came up with a clean bill of health.

There are many different ways of logging this kind of activity.  Today I will share one such way to track when the last successful run of Checkdb happened.

First, a little back story

A question came across twitter on the SQLhelp hashtag. The question was “Does restoring a database clear the dbi_dbccLastKnownGood value on the boot page?”

This question prompted me to do a quick test. The test is simple: create a small database, back up the database, run checkdb, check the dbi_dbccLastKnownGood value, restore the previous backup, and check the dbi_dbccLastKnownGood value again.

So here is a quick script

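The script itself was lost from this page; here is a reconstructed sketch of the test as described (the database name and backup path are hypothetical):

USE master;
GO
CREATE DATABASE CheckDbTest;
GO
BACKUP DATABASE CheckDbTest
    TO DISK = 'C:\Backup\CheckDbTest.bak'
    WITH INIT;
GO
DBCC CHECKDB ('CheckDbTest') WITH NO_INFOMSGS;
GO
-- Inspect the boot page and note the dbi_dbccLastKnownGood value
DBCC DBINFO ('CheckDbTest') WITH TABLERESULTS;
GO
RESTORE DATABASE CheckDbTest
    FROM DISK = 'C:\Backup\CheckDbTest.bak'
    WITH REPLACE;
GO
-- Check dbi_dbccLastKnownGood again; it now reflects the restored boot page
DBCC DBINFO ('CheckDbTest') WITH TABLERESULTS;
GO
DROP DATABASE CheckDbTest;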

If you run this little test, you should observe that the date value for dbi_dbccLastKnownGood changes with each step. Now let's discuss the answer to that original question about whether or not this value gets cleared.

The value does not get cleared. The value does not get reset. The change in value that you observed is due simply to the boot page having been restored. If CheckDB has never been run on the database, you will get the SQL Server default date of '1900-01-01 00:00:00.000'.

And then…

Now that the back story is told, that brings us to one method of tracking this. I had to cover the back story since it is what prompted me to write a quick script to gather this information so I could use it for monitoring. Yes, it is just another tool to throw into the toolbox. And to reiterate, it is by no means the only way to gather or track this type of information. Some methods are more elaborate than others. It is up to you to choose.

Keeping in mind that we can get the last time Checkdb completed without a report of corruption, I delve into this cursor-based method to retrieve the dbi_dbccLastKnownGood value for all databases on your SQL Server instance.

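The original script was lost from this page. The reconstruction shown with the January 2013 update earlier on this page is the same routine (that update only corrected name casing for case-sensitive servers), so here is a condensed sketch of the cursor-based loop:

CREATE TABLE #DBInfo
(
    ParentObject VARCHAR(255),
    [Object] VARCHAR(255),
    Field VARCHAR(255),
    [Value] VARCHAR(255)
);

DECLARE @DBName SYSNAME,
        @SQL NVARCHAR(MAX);

DECLARE DBCursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT name FROM sys.databases WHERE state_desc = 'ONLINE';

OPEN DBCursor;
FETCH NEXT FROM DBCursor INTO @DBName;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Capture the boot page for this database
    SET @SQL = N'DBCC DBINFO (' + QUOTENAME(@DBName, '''')
             + N') WITH TABLERESULTS, NO_INFOMSGS;';

    INSERT INTO #DBInfo (ParentObject, [Object], Field, [Value])
        EXEC (@SQL);

    -- Report the last clean checkdb date for this database
    SELECT @DBName AS DatabaseName,
           MAX(CONVERT(DATETIME, [Value])) AS LastGoodCheckDb
    FROM #DBInfo
    WHERE Field = 'dbi_dbccLastKnownGood';

    DELETE FROM #DBInfo;
    FETCH NEXT FROM DBCursor INTO @DBName;
END

CLOSE DBCursor;
DEALLOCATE DBCursor;
DROP TABLE #DBInfo;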

You can use this, or find something else if you desire. The sole purpose of this script is to provide a means to quickly gather the last date on which Checkdb reported a clean bill of health. From there, I will leave it to you to add it to another process in order to track it over the long term.

You can read the twitter conversation here.  I also have it on good authority that Bob Pusateri (twitter) will be blogging about the same topic.
