Ghosts in your Database

Yes Virginia, there are ghosts in your database.  More specifically, there are ghosts in your SQL Server database.  They are not there to haunt you.  They are not there just for this holiday season (speaking of Halloween Month).

How can there be ghosts in the database?

Why would there be ghosts in the database?

Do they happen because somebody issued a KILL statement?

Let’s address each of those in turn.  A database ghost record is (in a very basic form) one that has just been deleted in an index on a table.  Delete operations don’t actually physically remove records from pages – they only mark them as having been deleted (ghosted).  Now why is it done this way?  The answer here is largely performance based.  This is a performance optimization that allows delete operations to complete more quickly.  Additionally, it allows the rollback of delete operations to process more quickly.  The rollback processes faster because all that needs to happen is to “flip the flag” marking the records as deleted/ghosted, instead of having to reinsert the deleted records.  That may be a bit over-generalized, but I hope you get the gist.  In short, records are marked as “ghosted” when a delete operation is performed; and to roll it back, you simply undo that mark.

Now, what about this KILL statement thing?  The KILL statement is pure Halloween fun and does not create ghost records.

Ghost Hunting

Now that we have established the purpose of Ghosts in the database, how do you verify the existence of Ghosts?  In other words, what can we do to prove there really are spectral things in the database?  This is where the fun really begins.  First, we need to get out the equipment and tools (as any good ghost hunter would do) so we can capture these phantasms.  Let’s call the first tool the “trap”.  Here is what you will need for it.
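The original build script is fairly simple; here is a minimal sketch of it, where the database name, column definitions, and row padding are assumptions sized to produce a table large enough for the hunt:

CREATE DATABASE Spectre;
GO
USE Spectre;
GO
CREATE SCHEMA Halloween;
GO
CREATE TABLE Halloween.Ghosts
(
      GhostID INT IDENTITY(1,1) NOT NULL
    , GhostName VARCHAR(64) NOT NULL
    , SlimerDate DATE NOT NULL
    , TheBlob CHAR(8000) NOT NULL DEFAULT (REPLICATE('boo', 2666)) -- pads every row to bloat the table
);
GO
-- two indexes, so the ghosts can be counted on each of them later
CREATE UNIQUE CLUSTERED INDEX CIX_Ghosts_GhostID ON Halloween.Ghosts (GhostID);
CREATE NONCLUSTERED INDEX IX_Ghosts_SlimerDate ON Halloween.Ghosts (SlimerDate);
GO
-- load one million rows; exactly 666 of them get the mistaken Oct. 30 date used later
WITH nums AS
(
    SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
    FROM sys.all_columns AS a
    CROSS JOIN sys.all_columns AS b
)
INSERT INTO Halloween.Ghosts (GhostName, SlimerDate)
SELECT 'Ghost' + CAST(rn AS VARCHAR(12))
     , CASE WHEN rn <= 666 THEN '20141030' ELSE DATEADD(DAY, rn % 300, '20140101') END
FROM nums;
GO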

This trap, err database, can be a bit large.  As currently configured, we will need about 16GB of disk space to support it.  If that is too much, I recommend removing the last column – “TheBlob”.  As you can see, we are setting a rather large trap.  The table we create (Halloween.Ghosts) will receive One Million records.  This is most probably overkill to catch these ghosts, so you can also cut back on the number of records to be affected.

Now, to make sure we have some data and that we can use the table, let’s just run a little test query.
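A hypothetical sampling query such as this will do:

SELECT TOP (100) g.GhostID, g.GhostName, g.SlimerDate
FROM Halloween.Ghosts AS g
ORDER BY g.GhostID;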

Excellent, we have a good sample of data.

[Figure: sample rows returned from Halloween.Ghosts]

At this point, it is important to note that we have done nothing that will cause database ghost records.  All that has been done is to set the framework so we can see the ghosts.  With the framework in place, let’s try to catch some ghosts.  To do so, we need to try to delete something.  Since we just happen to have had a clerical error in our database, we have 666 prime candidates to try and fix.  We happen to have several records that were supposed to be given a Slimer date of Halloween.  The clerk, being absent minded, thought that Halloween was supposed to be on Oct. 30.  Our business model dictates that the invalid records must be deleted first and then we can try to enter the replacement records.  So, let’s go ahead and try to remove those records.

Before we remove the records though, we need to discuss one important requirement for us to be able to see the ghosts.  Let’s call it spectral vision goggles.  In the database realm, we call it a trace flag.  In order to see the ghosts on the pages, we need to enable TF 661.  We can do that with the following statement.  There is a serious side effect to this method too – it alters the behavior of the Ecto Containment Unit, or automatic ghost cleanup process.  If you enable this, you will need to disable it later and/or manually run a ghost cleanup.
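The statement is a simple global trace flag toggle:

DBCC TRACEON (661, -1); -- disables the automatic ghost cleanup task; remember to turn it back off later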

Now that we have the last piece of equipment in place, let’s go ahead and try to delete some records.
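Assuming the table sketched earlier, the delete looks something like this:

DELETE FROM Halloween.Ghosts
WHERE SlimerDate = '20141030'; -- the 666 records mistakenly dated Oct. 30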

With all of those records deleted (all 666 of them), let’s see what we might have captured.  First, let’s take a look at some index stats.
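A query against sys.dm_db_index_physical_stats, roughly like this one, surfaces the ghost counts (the DETAILED mode matters here – LIMITED returns NULL for these columns):

SELECT OBJECT_NAME(ps.object_id) AS TableName
     , ps.index_id
     , ps.ghost_record_count
     , ps.version_ghost_record_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('Halloween.Ghosts'), NULL, NULL, 'DETAILED') AS ps
WHERE ps.index_level = 0;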

If we look at the output of this query, we will see that we did indeed attempt to delete 666 records.  Those records will now display in the ghost_record_count column.  We will also see that, since we had two indexes on the table, there are 666 ghost records marked on each index.

[Figure: index physical stats output showing a ghost_record_count of 666 on each index]

Very cool!  We are definitely on the right track to capturing those ghosts.  We have a trail showing that they exist in the database.  Let’s keep going and see where we can find them.  You should note that there is an additional column in our result set that looks like it might be related to ghost records.  We are going to leave the discovery of version_ghost_record_count as a homework experiment for you to perform.  It is beyond the current scope of this article.
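The next piece of evidence comes from the transaction log, via the undocumented fn_dblog function.  Here is a hedged approximation of that query; the hex-to-integer page conversion is the “little bit of work” discussed below:

SELECT dl.[Current LSN]
     , dl.Operation
     , dl.Context
     , dl.AllocUnitName
     , dl.[Page ID]
     , CONVERT(INT, CONVERT(VARBINARY(4), RIGHT(dl.[Page ID], 8), 2)) AS PageID -- hex page number to integer
FROM fn_dblog(NULL, NULL) AS dl
WHERE dl.Operation = 'LOP_DELETE_ROWS';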

Now this is getting exciting.  We have stronger evidence in the log showing that these ghosts are hanging around in the database.  Not only are they hanging around in the database, we can see on which pages they are trying to hide.

[Figure: fn_dblog output listing the ghosted records with their page IDs]

This is really solid information!  fn_dblog is giving us just about everything we need in order to get those ghosts.  It took a little bit of work since the log reports the page number in hex.  Converting that to an integer page number is essential for us to look at the page (besides, integer values are easier for most people to interpret).  Now I can take the PageID for any of the records reported by fn_dblog and pass it into yet another undocumented command known as DBCC PAGE.

When looking to use DBCC PAGE, we can either look at the PFS page and see more pages that have ghost record counts, or we can take the results from the fn_dblog output and then look at the contents of the page to catch those ghosts.  We will take a quick look at the PFS page first, then at an index page.  In this database that we have created, the PFS page will show several other pages that have ghost records on them.  Due to the size of the database, we only see index pages with ghost records in that result.  If our database were smaller, we would quite possibly see data pages in the first PFS page of the database.  Let’s see a sample from the first PFS in this database.
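Assuming the Spectre database from the earlier sketch, that looks like this (the first PFS page is page 1 of file 1):

DBCC TRACEON (3604);            -- route DBCC output to the client
DBCC PAGE ('Spectre', 1, 1, 3); -- database, file id, page id, print option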

[Figure: PFS page output showing pages flagged as having ghost records, displayed thanks to the trace flag]

We can follow that link from this point to page 126.  Page 126 happens to be an index page similar to the following.  There are a couple of indicators that this is an index page.  The first is that when we run DBCC PAGE with a format of 3, we will see two result sets, the second of which shows statistics and index information.  The second indicator is in the image attached after the query.  We will leave it as an exercise for you to find other ways to demonstrate that this is an index page.
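Again assuming the sketched database name:

DBCC PAGE ('Spectre', 1, 126, 3); -- print option 3 returns the second, index-statistics result set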

[Figure: DBCC PAGE output for index page 126 showing the ghost record]

That is great, but we have more ghosts to find.  Let’s look at a ghost on a data page.  Randomly picking a PageID from that list that was output from fn_dblog, let’s see what DBCC Page will provide to us.
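With a hypothetical PageID of 1667 plucked from those results, the call is the same shape as before:

DBCC PAGE ('Spectre', 1, 1667, 3); -- 1667 is a stand-in for any data page from the fn_dblog output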

[Figure: DBCC PAGE output showing the ghost count on a data page]

Conclusion

Well, isn’t that just cool!  We have trapped a bunch of ghosts and were even able to see them.  This has been a fantastic deep dive into the crypts of the database.  This is merely a scratch on the surface though.  We hope this will encourage you to explore a bit and at least try one of the homework assignments we left behind in this article.

With all of that, we have a bit of cleanup to do.  The cleanup comes in one of two methods.  Method one involves manual labor.  Method two involves our friendly little trace flag we already used.  Since most DBAs prefer automated mechanisms over manual ones, let’s just discuss method two for now.  It is practically effortless.
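Assuming you took the trace flag route earlier, the cleanup is simply turning TF 661 back off (and, since this is a sandbox, dropping the trap database):

DBCC TRACEOFF (661, -1); -- the automatic ghost cleanup process resumes
GO
USE master;
GO
DROP DATABASE Spectre;   -- sandbox cleanup only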

That will put the system back to the way it was when we started (and of course we trust that nobody did this on their prod box).

This has been one of a few articles about ghosts in the database. You can check out some of the others here and here.

PowerShell ISE Crashes

Working with PowerShell brings a lot of advantages and power to help manage a server. The more current your PoSh version, the more efficiently you will be able to manage your server. Sometimes getting to the current PoSh versions comes with a little pain, such as ISE crashes.

I recently had the misadventure of working through some ISE crashes after bringing some systems up to PoSh 5.1 that were on either PoSh 2.0 or 3.0. It’s not a fun situation to run a WMF update and then run into a crash of any type when testing whether it worked. Your first thought is that something terrible has happened.

As it stands, the problem is more of a nuisance than a critical failure. That said, it is enough of a problem that anyone who uses the ISE or .Net applications may experience a slight cardiac event.

Fonts

As you work to quickly recover from your missed heartbeat, you start digging through logs and then hitting good old trusty Google.

Diving through the logs, you might just happen across an error similar to the following:

Problem signature:
Problem Event Name: PowerShell
NameOfExe: PowerShell_ISE.exe
FileVersionOfSystemManagementAutomation: 6.1.7600.16385
InnermostExceptionType: System.Xml.XmlException
OutermostExceptionType: System.Reflection.TargetInvocation
DeepestPowerShellFrame: indows.PowerShell.GuiExe.Internal.GPowerShell.Main
DeepestFrame: indows.PowerShell.GuiExe.Internal.GPowerShell.Main
ThreadName: unknown

Maybe the first error you encounter might look like this one instead:

System.TypeInitializationException

“FileFormatException: No FontFamily element found in FontFamilyCollection
that matches current OS or greater: Win7SP1”.

Inner exception originates from: CompositeFontParser

Either way, the error shoots us back to the same fundamental problem: the ISE won’t load, you get an error message, and you can’t confirm that the WMF patch was applied properly.

As you work your fingers faster and faster through the pages on Google, you discover that this problem is caused more explicitly by a patch for the .NET Framework and not necessarily by the work to upgrade your PoSh version. It only waited to manifest itself after the upgrade.

That’s gravy and all, but how does one fix the problem? For me, the quickest and most reliable fix was to simply jump straight to the root of the problem – fonts. The ISE is a WPF application and it also requires a fallback font (if a character isn’t present in your font set, then the app chooses a substitute from the fallback font – or something like that).

The fix is extremely simple and really underscores why this is merely a nuisance issue and not a critical problem. Thus it shouldn’t cause any sort of sinking internal feelings. There are a few plausible fixes floating around out there. I recommend just doing a manual font replacement. It is but three simple steps:

  1. Download GlobalUserInterface.CompositeFont
  2. XCOPY the font to %windir%\Microsoft.NET\Framework\v4.0.30319\WPF\Fonts
  3. XCOPY the font to %windir%\Microsoft.NET\Framework64\v4.0.30319\WPF\Fonts
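In PoSh terms, the copy steps look something like this sketch (assuming the downloaded font file sits in your current directory):

# copy the fallback font into both the 32-bit and 64-bit WPF font directories
Copy-Item .\GlobalUserInterface.CompositeFont "$env:windir\Microsoft.NET\Framework\v4.0.30319\WPF\Fonts\"
Copy-Item .\GlobalUserInterface.CompositeFont "$env:windir\Microsoft.NET\Framework64\v4.0.30319\WPF\Fonts\"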

After you have copied the font to those two directories, all that needs to be done is to launch the ISE. I ran into the same problem on three or four servers, and the fix took no more than five minutes on each of them.

Conclusion

I previously mentioned that I have been working more and more with PoSh to try and improve my skillset there. This is one of those very low-level trinkets that I ran into while honing my skills in that tech. For other, possibly interesting, articles about my experiences with PowerShell, you can check out these articles.

Syspolicy Phantom Health Records

SQL Server comes with a default SQL agent job installed (for most installations) to help manage the collection of system health data. I would dare say this job is ignored by most people and few probably even know it exists.

This topic is not new to me. You may recall a previous article I wrote entailing one type of failure and how to resolve that failure. That article can be found here.

I recently ran into a variant of the problem outlined in that previous article that requires just a bit of a different approach. The errors turn out to be similar on the surface but really are different upon closer inspection.

Phantom Health Records

If you are unfamiliar with the topic, I recommend reading the previous article. Then after reading that article, maybe brush up a little bit on the SQL Agent. The failures we will be looking at are within the SQL Agent and come from the job called: syspolicy_purge_history.

For this latest bout of failure, I will start basically where I left off in the last article. The job fails on a server, and I have access to other servers of the same SQL version where the job works.

Before diving any further into this problem, let’s look at what the error is:

A job step received an error at line 1 in a PowerShell script.
The corresponding line is ‘set-executionpolicy RemoteSigned -scope process -Force’.
Correct the script and reschedule the job. The error information returned by PowerShell is:
‘Security error. ‘. Process Exit Code -1. The step failed.

Having the error in hand, and knowing that the job works elsewhere, my next logical step (again based on experience from the last article with this job) is to script the job from another server and use it to replace the job on the server where it fails. In principle this is an AWESOME idea.

Sadly, that idea was met with initial failure. As it turns out, the error remained exactly the same. This is good and unfortunate at the same time. Good in that I was able to confirm that the job was correctly configured with the following script in the job:
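For reference, the script in that job step reads along these lines (paraphrased from the error text and the default syspolicy_purge_history job definition; your build may differ slightly):

set-executionpolicy RemoteSigned -scope process -Force
if ('$(ESCAPE_SQUOTE(INST))' -eq 'MSSQLSERVER') {$a = '\DEFAULT'} ELSE {$a = ''};
(Get-Item SQLSERVER:\SQLPolicy\$(ESCAPE_NONE(SRVR))$a).EraseSystemHealthPhantomRecords()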

Since the step fails from SQL Server, let’s see what else we can do to make it run. Let’s take that code and try it from the PowerShell ISE. So, for giggles, let’s cram that script into PowerShell and see what blows up!

Now isn’t that a charming result! This result should be somewhat expected, since the code I just threw into the ISE is not entirely PowerShell. If you look closer at the code, you will notice that it is using sqlcmd-like conventions to execute a parameterized PowerShell script. Now, that makes perfect sense, right? So let’s clean it up to look like a standard PoSH script. We need to replace some parameters and then try again.
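Cleaned up, with the agent tokens swapped for literal values (the server name here is a hypothetical default instance), the script becomes plain PoSH:

# 'SomeServer' is a stand-in; substitute your own server name
Set-ExecutionPolicy RemoteSigned -Scope Process -Force
(Get-Item SQLSERVER:\SQLPolicy\SomeServer\DEFAULT).EraseSystemHealthPhantomRecords()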

This will result in the following (resume reading after you scratch your head for a moment):
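The failure, paraphrased rather than quoted exactly, amounts to PoSH complaining along these lines:

Get-Item : Cannot find drive. A drive with the name 'SQLSERVER' does not exist.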

The key to this failure happens to be in the SQLSERVER: path. PoSH thinks we are trying to reference a drive when we are just trying to access the SQLServer provider. Depending on your version of Windows Server, SQL Server, and PoSH, you may need to do one of a couple of different things. For this particular client/issue, this is what I had to do to get the script to work.
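In this case, the workaround amounted to loading the older SQLPS module before touching the SQLSERVER: path – a sketch, not the verbatim fix:

# load the legacy SQLPS module so the SQLSERVER: provider exists in the session
Import-Module SQLPS -DisableNameChecking
Set-ExecutionPolicy RemoteSigned -Scope Process -Force
(Get-Item SQLSERVER:\SQLPolicy\SomeServer\DEFAULT).EraseSystemHealthPhantomRecords()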

If you read the previous article, you may notice this command looks very much like the command that was causing the problems detailed there. Yes, we have just concluded a 180-degree turn back to where we started a few years back. Suffice it to say, this is expected to be a temporary fix until we are able to update the system to PoSH 5 and install the updated sqlserver module.

As is, this script is not quite enough to make the job succeed now. To finish that off, I created a ps1 file to house this script. Then from a new step (defined as a CmdExec step type) in the syspolicy purge job, I execute that PowerShell script as follows:
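The step body amounts to shelling out to powershell.exe with that saved file; the path and file name here are hypothetical:

powershell.exe -ExecutionPolicy RemoteSigned -File "C:\Database\Scripts\PurgeSystemHealthPhantoms.ps1"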

Tada, nuisance job failure alert is resolved and the system is functioning again.

Conclusion

I dare say the quickest resolution for this job is probably to just disable it. I have seen numerous servers with this job disabled entirely, for the simple reason that it fails frequently and just creates noise alerts when it fails. Too many fixes abound for this particular job and too few resolve the failures permanently.

I would generally err on the side of fixing the job. Worst case, you learn 1000 ways of what not to do to fix it. 😉

Given this job is tightly related to the system_health black box sessions (sp_server_diagnostics and system_health xe session), I recommend fixing the job. In addition, I also recommend reading the following series about XE and some of those black box recorder sessions – here.

Index Cannot Be Reorganized…

Working diligently as any good DBA might, you have established maintenance routines for each of the SQL Servers under your purview.

As a part of this maintenance you have scripted solutions to intelligently manage and maintain the fragmentation levels for each of the indexes within each database on each instance.

To further show how diligent you are as a DBA, the outcomes of each maintenance run are logged and you review the logs each morning. This routine helps keep you on top of everything that is happening within the environment.

For months, maybe even years, things are running smoothly. Never a failure. Never an error. Just routine log review day after day. Then one day it happens – there is an error. The index maintenance script failed one night.

Index Cannot be Reorganized…

You receive the error message similar to the following:

Msg 2552, Level 16, State 1, Line 1
The index “blah” (partition 1) on table “blah_blah_blah” cannot be reorganized because page level locking is disabled.

Immediately, you start double-checking yourself and verifying that it worked the previous night. You even go so far as to confirm that the same index was previously reorganized. How is it possible that it is failing now on this index? What has changed? Has something changed?

Time for a little digging and investigating, so the dirty work begins. On one side of the coin you are relieved to be able to do something different. On the other side of that coin, you are rather annoyed that something seems to have changed. These feelings are perfectly normal!

First things first – you investigate the indexes in question to confirm what the error is saying. This is easily done with a query such as the following:
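A sketch of such a query against sys.indexes is all it takes:

SELECT OBJECT_NAME(i.object_id) AS TableName
     , i.name AS IndexName
     , i.allow_page_locks
     , i.allow_row_locks
FROM sys.indexes AS i
ORDER BY TableName, IndexName;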

Scrolling through the results, you notice (eventually) that the IX_SpecialOfferProduct_ProductID index in the AdventureWorks2014 database has page locks disabled. You are absolutely certain that this index was created allowing page locks. Pondering the problem for a moment, you recall having read about the default trace (there are several articles on the default trace – here), and the thought occurs to try and see if there is a breadcrumb there about the change. Using the query from that default trace article, a picture starts to unscramble. Here is that query and a snippet of the results:
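An approximation of that default trace query (the exact original lives in the linked article):

SELECT t.StartTime
     , t.DatabaseName
     , t.ObjectName
     , te.name AS EventName
     , t.LoginName
FROM sys.fn_trace_gettable(CONVERT(VARCHAR(150),
        (SELECT TOP (1) f.[value]
         FROM sys.fn_trace_getinfo(DEFAULT) AS f
         WHERE f.traceid = 1 AND f.property = 2)), DEFAULT) AS t
INNER JOIN sys.trace_events AS te
    ON t.EventClass = te.trace_event_id
WHERE te.name = 'Object:Altered'
ORDER BY t.StartTime DESC;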

[Figure: default trace audit results showing the Object:Altered events for the index]

This is a great start. Not seen in the results is the timestamp showing when it was done – which was due solely to snipping. Also not shown is the text of the statement that was run for those three events. So close, yet so far away. This is not quite enough to have the smoking gun evidence to show Jason (me) that I did something wrong and unauthorized. What to do now?

All is not lost yet! Your stunning memory kicks in and you recall several articles about using Extended Events to audit server and database changes. Better yet, you recall that you deployed an XE session to the server where this error occurred. And yes, you are slightly embarrassed that you failed to remove the XE session after fiddling with it. We won’t tell anybody that you deployed a test XE session to a production server as if it were your sandbox. The session currently deployed is trapping all object changes, unlike the following session that is filtered down to just index objects.
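For reference, a filtered-down session of that sort might look like the following sketch; the session name and file target path are assumptions, and the 'INDX' map value for indexes should be verified against sys.dm_xe_map_values on your build:

CREATE EVENT SESSION AuditIndexChanges ON SERVER
ADD EVENT sqlserver.object_altered
(
    ACTION (sqlserver.sql_text, sqlserver.session_id, sqlserver.username)
    WHERE (object_type = 'INDX') -- only index objects
)
ADD TARGET package0.event_file (SET filename = N'C:\Database\XE\AuditIndexChanges.xel')
WITH (STARTUP_STATE = ON);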

You query the trace file with a query like this:
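A hedged approximation, reading the event_file target assumed above and shredding the XML:

SELECT x.event_data_xml.value('(event/@timestamp)[1]', 'datetime2') AS event_time
     , x.event_data_xml.value('(event/data[@name="object_name"]/value)[1]', 'nvarchar(128)') AS ObjectName
     , x.event_data_xml.value('(event/data[@name="ddl_phase"]/text)[1]', 'varchar(16)') AS ddl_phase
     , x.event_data_xml.value('(event/action[@name="sql_text"]/value)[1]', 'nvarchar(max)') AS sql_text
FROM
(
    SELECT CONVERT(XML, t.event_data) AS event_data_xml
    FROM sys.fn_xe_file_target_read_file(N'C:\Database\XE\AuditIndexChanges*.xel', NULL, NULL, NULL) AS t
) AS x;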

Wow! Look at the results! There is a smoking gun finally.

[Figure: Extended Events audit results pinning down the index change]

Two results happen to pin the root cause of the change squarely on my shoulders. Yup, Jason changed that index to disallow page locks. I highlighted three different areas of interest in the results. The yellow and green indicate the DDL phase. One row for the start of the statement, and another row for the commit of that statement. The red highlight shows me the exact change that was made to this index. This is all very good info!

What Now?

It really is great to have the smoking gun. If something is broken and it worked previously, it is essential to find the root cause. With a root cause under the belt, what needs to be done to fix the failure? That is a little bit easier than finding the root cause. That is, unless there is a technical reason for the index to no longer allow page locks (maybe that smoking gun is less of a smoking gun and more like baby spittle after all).

Here is how you fix the problem:
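Assuming no technical reason to keep page locks disabled, re-enable them on the index from this example:

ALTER INDEX IX_SpecialOfferProduct_ProductID
ON Sales.SpecialOfferProduct
SET (ALLOW_PAGE_LOCKS = ON);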

But, But, But…

The Extended Events session would be very noisy and capture every alter index statement, right? It should capture statements like the following, right?
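For instance, statements along these lines (reusing the index from the example above):

ALTER INDEX IX_SpecialOfferProduct_ProductID ON Sales.SpecialOfferProduct REBUILD;
ALTER INDEX IX_SpecialOfferProduct_ProductID ON Sales.SpecialOfferProduct REORGANIZE;
ALTER INDEX IX_SpecialOfferProduct_ProductID ON Sales.SpecialOfferProduct SET (ALLOW_ROW_LOCKS = ON);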

The answer to that question is: Yes, Yes, Yes. These statements are all captured due to the use of the ALTER statement. Here are the results of the XE session with all of these scripts having been executed:

[Figure: Extended Events audit results capturing all of the ALTER INDEX statements]

If you want to audit when the indexes are changing and how they are changing, this will do just the trick. If there are a ton of changes, then be prepared for a deluge of information.

Compressing Encrypted Backups

A common requirement, whether it be born out of pure want or true necessity, is to make a large, encrypted database backup file much smaller.

This was a knock on the early days of Transparent Data Encryption (circa SQL Server 2012). If TDE was enabled, a compressed backup (though compression was available) was not really an option. Not only did compression in the 2012 implementation of TDE fail to make the database backup smaller, it occasionally caused the backup to be larger.

This was a problem.  And it is still a problem if you are still on SQL 2012. Having potentially seen this problem, amongst many others, Ken Wilson (blog | twitter) decided to ask us to talk about some of these things as a part of the TSQL Tuesday Blog party. Read all about that invite here.

Encrypted and Compressed

Well, thankfully Microsoft saw the shortcoming as well. With SQL Server 2014, MS released some pretty cool changes to help us encrypt and compress our database backups at rest.

Now, instead of a database backup that could potentially get larger due to encryption and compression combined, we have a significant hope of reducing the encrypted backup footprint to something much smaller. Here is a quick example using the AdventureWorks2014 database.

In this little exercise, I will perform three backups. But before I can even get to those, I need to ensure I have a Master Key set and a certificate created. The encrypted backups will require the use of that certificate.

Do this in a sandbox environment please. Do not do this on a production server.
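A sketch of that prep work; the password and certificate name are placeholders, not recommendations:

USE master;
GO
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'SuperDuperPassword!1'; -- hypothetical password
GO
CREATE CERTIFICATE BackupCert
    WITH SUBJECT = 'Backup Encryption Certificate';
GO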

In the first backup, I will attempt to backup the AW database using both encryption and compression. Once that is finished, then a backup that utilizes the encryption feature only will be done. And the last backup will be a compressed only backup. The three backups should show the space savings and encryption settings of the backup if all goes well. The compressed and encrypted backup should also show an equivalent savings as the compression only backup.
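Here is a sketch of those three backups; the file paths are assumptions:

BACKUP DATABASE AdventureWorks2014
    TO DISK = N'C:\Database\Backup\AW2014_CompEncrypt.bak'
    WITH COMPRESSION
       , ENCRYPTION (ALGORITHM = AES_256, SERVER CERTIFICATE = BackupCert);

BACKUP DATABASE AdventureWorks2014
    TO DISK = N'C:\Database\Backup\AW2014_EncryptOnly.bak'
    WITH ENCRYPTION (ALGORITHM = AES_256, SERVER CERTIFICATE = BackupCert);

BACKUP DATABASE AdventureWorks2014
    TO DISK = N'C:\Database\Backup\AW2014_CompOnly.bak'
    WITH COMPRESSION;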

With that script executed, I can query the backup information in the msdb database to take a peek at what happened.
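A query roughly like this one does the peeking:

SELECT bs.database_name
     , bs.backup_start_date
     , bs.backup_size / 1048576.0 AS BackSizeMB
     , bs.compressed_backup_size / 1048576.0 AS CompBackSizeMB
     , bs.key_algorithm
     , bs.encryptor_type
FROM msdb.dbo.backupset AS bs
WHERE bs.database_name = 'AdventureWorks2014'
ORDER BY bs.backup_start_date DESC;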

This should produce results similar to the following:

[Figure: msdb backup history results comparing the sizes of the three backups]

Looking at the results, I can see that the compression only backup and the compression with encryption backup show very similar space savings. The compression only backup dropped to 45.50MB and the compression with encryption backup dropped to 45.53MB. The encryption only backup showed that, interestingly, the CompBackSizeMB (compressed_backup_size – the actual size on disk of this particular backup) got larger.

At any rate, the compression now works with an encrypted backup and your backup footprint can be smaller while the data is protected at rest. Just don’t go using the same certificate and password for all of your encrypted backups. That would be like putting all of your eggs in one basket.

With the space savings available in 2014, and if you are using SQL 2014, why not use encrypted backups?

What is DBCC?

If I ask you what DBCC is, what would your answer be?

I want you to think about that one for just a little bit. What comes to mind when you hear DBCC? Is it the default go-to used when talking about consistency checks? Is it something more involved? Maybe it is something entirely different?

I ponder this because I hear it used from time to time as a complete replacement for CheckDB. I have to stop and think for a minute, because as of SQL Server 2016 there are 35 documented DBCC statements of various use. Never mind the bounty of undocumented statements (e.g. PAGE, IND).

So, I ask, what does DBCC mean to you?

In trying to come up with some sort of answer, I did a little bit of research. If you look in the documentation (including the SQL 2000 documentation), one will find this definition for DBCC:

The Transact-SQL programming language provides DBCC statements that act as Database Console Commands for SQL Server.

It seems the documentation says it pretty clearly that DBCC is Database Console Commands. (And the SQL 2000 documentation says something similar). But every now and then you hear somebody say it means Database Consistency Checker. And it seems the root of that may come from the SQL 6.5 or earlier days. I found this snippet in BOL from SQL 6.5:

Used to check the logical and physical consistency of a database, check memory usage, decrease the size of a database, check performance statistics, and so on. DBCC is the SQL Server “database consistency checker.” DBCC helps ensure the physical and logical consistency of a database; however, DBCC is not corrective. It is recommended that you make periodic checks to ensure the logical and physical consistency of your data.

This seems a little odd to me since not all DBCC statements are used for database consistency checking as this definition would call it. It seems more of a documentation bug that has taken hold than a legitimate acronym. Just think about it. With that definition, one will also see a list of DBCC statements – many of which do not perform consistency checks. Let’s look at them:
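That full list is not reproduced here, but a few representative statements make the point (the PINTABLE arguments are hypothetical IDs):

DBCC CHECKDB ('AdventureWorks2014'); -- a genuine consistency check
DBCC PINTABLE (10, 117575457);       -- pins a table in memory; no consistency checking involved (and long since a no-op)
DBCC TRACESTATUS (-1);               -- reports trace flag status; no consistency checking involved
DBCC SQLPERF (LOGSPACE);             -- log space statistics; no consistency checking involved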

For instance, DBCC PINTABLE is hardly useful for checking consistency. The TRACESTATUS is also one that does not quite fit this acronym. That is likely why you will see the difference in the SQL 2000 and beyond documentation for DBCC.

Don’t be surprised if you hear me ask which DBCC statement is being inferred if I hear somebody say they ran DBCCs. It is just plain more clear to hear “CheckDB came up clean” over the alternative “DBCC came up clean”. There is little room for interpretation there.

In short, DBCC is an acronym for Database Console Commands, and it seems it was more of a documentation mistake when it was called Database Consistency Checker.

Another interesting thought. How many people say DBCC Command(s)? Think about that one for a minute – Database Console Command Command(s). That one is less of an issue imho than the prior.

Just remember, think about what you mean to say when talking about a specific DBCC statement and if you really mean a consistency check, try saying CheckDB instead. I bet the clarity in the conversation will improve and there will be less hair tugging.

HealthySQL – A Review

How well do you know the health of your SQL Servers and databases? Ever wonder about the current health or even what the health was like a few months back? Sure, one could anecdotally say the databases and servers are healthy, but how do you quantify it?

Many have asked or pondered what to do to quantify a healthy SQL server. How do you get the hard numbers to show whether the server is or is not healthy? How do you get the numbers to show that the server is performing just as well or better than it did three months ago?

Now we have a way to answer those questions and more! Robert Pearl has finally published his book on Healthy SQL. In the book, Robert takes you on a journey from mapping out a plan, to capturing data, storing that data, and then how to report on that data. You will learn what to capture and what some of the various things actually mean (e.g. waits and indexes).

Throughout the book, you will see that Robert will introduce you to various tools. These tools can either be native to SQL Server or they could be third party tools. You will get a good primer on what the tool is and how to quickly use it as you become acquainted with the internals and the health of your database and server.

As you progress through the book and learn about what makes a database healthy, you will encounter a chapter on how to store the data and make a repository to track your database health. You will want to create a repository so you can progress into the sections on reporting about your database health. Who doesn’t want to market to management how healthy the databases are?

With the tools in this book, and the repository you will create, audits and discussions will become much easier. You will be able to implement processes to help make you more aware of the environment and help keep you from getting that annoying page in the middle of the night.

Take a look at the book and enjoy.  You can get it from Amazon here.

If you are a young DBA, new to being a DBA, an accidental DBA, or just any DBA looking for a little extra help in figuring out how to check and track the health of your server, it is worth reading this book.

T-SQL Tuesday #050: Automating Database Restores

Here it is time for the party of the month for the SQL Server acolytes, and I was running a bit behind.  Why?  Well, that was due in part to me rushing around trying to do some of what this month’s topic is about.  Some would call that pretty good timing.

Hemanth.D (blog|twitter) is hosting the T-SQL Tuesday blogging party this month, and he wants all of us to post about Automation. As Hemanth.D put it on his blog in his own words:

You could write about, what options you would consider when automating something? Where do you draw the line? What are our preferred tools for automation? T-SQL, PowerShell, VBScript or Batch files(?) or maybe just share something that you automated in the last couple of years.

You can read the invite he posted here.

History

As Hemanth.D mentioned in his invitation, this is not the first time this topic has come up for TSQLTuesday.  As it would happen, I also participated in the first go around with my contribution about sizing databases on limited information.  You can read that here.

This time around, I have a little bit of a different topic to approach.  I hadn’t considered this until after having read that Wayne Sheffield wrote about his efforts to verify backup files via script (automation).  You can read what Wayne wrote at this link.

Having read that, it seemed painfully obvious to me that I should go ahead and write about my efforts to automate backup restores.  After all, if you are verifying the backup files existence, you might also want to test the backups to ensure they are good.  Besides, we all need to test our backups anyway, right?

Present

I have a few different methods I have used over the years to automate restores.  In one iteration, the restore operations were hard coded in a procedure that was scheduled to run on a nightly or weekly basis.  It probably was also just hard coded to be specific to a database.  That kind of setup is not super useful except for that one database.

With that in mind, I worked on several iterations to help create a script for myself that would automate the restores of any database, with any number of backup files, to a different server, and not have file paths/names/anything hard-coded.  Well – there ended up being one thing hard-coded but that can be changed easily enough.

I decided on a script that would read the backup path for the most recent backup file from the msdb database of the server where the database was backed up.  I do this via a linked server that can be created/destroyed in the proc or that can reside permanently (I prefer to leave it in place).  Take the filepath of that backup file and restore it to the destination server.  All of this via T-SQL.
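Stripped to its core, the idea looks something like this sketch; the linked server, database name, and recovery options are assumptions, and the real procedure also handles file moves and renames:

DECLARE @BackupFile NVARCHAR(260);

-- grab the most recent full backup path from the source server's msdb
SELECT TOP (1) @BackupFile = bmf.physical_device_name
FROM [SourceServer].msdb.dbo.backupset AS bs
INNER JOIN [SourceServer].msdb.dbo.backupmediafamily AS bmf
    ON bs.media_set_id = bmf.media_set_id
WHERE bs.database_name = N'SomeDB'
    AND bs.type = 'D' -- full backups only
ORDER BY bs.backup_finish_date DESC;

RESTORE DATABASE SomeDB
    FROM DISK = @BackupFile
    WITH RECOVERY;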

Now a caveat with this solution is that the backup path works best if it is a UNC path.  Even if you are backing up the database to the local server, backing up to a UNC path means that the restore operation can just grab that path and not encounter errors due to drive mappings (e.g. backed up to D, but the D on the restore server is the CD-ROM drive).

What if you don’t want to restore the source database with the same name to the new server?  Well, that has also been considered and a parameter can be passed to the stored procedure to allow for a new database name.  What if the default file paths are different?  That consideration has been made too!  All of that said, more testing is always welcome.

The script can be evaluated from here.

Next steps:

With the script, the next things to do would be to create SQL Agent jobs to run the script on a routine basis.  Test the script and verify it.

User of this script assumes all risk.

Day 12 – High CPU and Bloat in Distribution

This is the final installment in the 12 day series for SQL tidbits during this holiday season.

Previous articles in this mini-series on quick tidbits:

  1. SQL Sat LV announcement
  2. Burning Time
  3. Reviewing Peers
  4. Broken Broker
  5. Peer Identity
  6. Lost in Space
  7. Command ‘n Conquer
  8. Ring in The New
  9. Queries Going Boom
  10. Retention of XE Session Data in a Table
  11. Purging syspolicy

Working with replication quite a bit for some clients, you might run across some particularly puzzling problems.  This story should shed some light on one such issue that I have seen on more than one occasion.

In working with a multi-site and multi-package replication topology, the CPU was constantly running above 90% utilization, and there seemed to be a general slowness even in Windows operations.

Digging into the server took some time to find what might have been causing the slowness and high CPU.  Doing an overall server health check helped point in a general direction.

Some clues from the general health check were as follows.

  1. distribution database over 20GB.  This may not necessarily have been a bad thing, but the databases across all the publications weren’t that big.
  2. distribution cleanup job taking more than 5 minutes to complete.  Had the job been cleaning up records, this might not have been an indicator.  In this case, 0 records were cleaned up on each run.

The root cause seemed to be pointing to a replication mis-configuration.  The mis-configuration could have been anywhere from the distribution agent to an individual publication.  Generally though, it seems the real problem lies more with the configuration of an individual publication than with any other setting.

When these conditions are met, it would be a good idea to check the publication properties for each publication.  Dive into the distribution database and try to find if any single publication is the root cause and potentially is retaining more replication commands than any other publication.  You can use sp_helppublication to check the publication settings for each publication.  You can check MSrepl_commands in the distribution database to find a correlation of commands retained to publication.
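A rough correlation sketch (the joins are approximate; validate against your distributor before trusting the numbers):

USE distribution;
GO
SELECT p.publication
     , COUNT(*) AS retained_commands
FROM dbo.MSrepl_commands AS rc
INNER JOIN dbo.MSpublisher_databases AS pd
    ON rc.publisher_database_id = pd.id
INNER JOIN dbo.MSarticles AS a
    ON rc.article_id = a.article_id
    AND a.publisher_db = pd.publisher_db
INNER JOIN dbo.MSpublications AS p
    ON a.publication_id = p.publication_id
GROUP BY p.publication
ORDER BY retained_commands DESC;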

Once having checked all of this information, it’s time to put a fix in place.  It is also time to do a little research before actually applying this fix.  Why?  Well, because you will want to make sure this is an appropriate change for your environment.  For instance, you may not want to try this for a peer-to-peer topology.  In part because one of the settings can’t be changed in a peer-to-peer topology.  I leave that challenge to you to discover in a testing environment.

The settings that can help are as follows.

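The properties in play are allow_anonymous and immediate_sync.  A hedged sketch using sp_changepublication follows; the publication name is hypothetical, this runs at the publisher in the published database, and allow_anonymous must be disabled before immediate_sync can be:

EXEC sp_changepublication
    @publication = N'SomePublication'
  , @property = N'allow_anonymous'
  , @value = 'false';

EXEC sp_changepublication
    @publication = N'SomePublication'
  , @property = N'immediate_sync'
  , @value = 'false';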

These settings can have a profound effect on the distribution retention, the cleanup process and your overall CPU consumption.  Please test and research before implementing these changes.

Besides the potential benefits just described, there are other benefits to changing these settings.  For instance, changing replication articles can become less burdensome with these settings disabled.  Disabling them can help reduce the snapshot load and allow a single article to be snapped to the subscribers instead of the entire publication.

Day 11 – Purging syspolicy

This is the eleventh installment in the 12 day series for SQL tidbits during this holiday season.

Previous articles in this mini-series on quick tidbits:

  1. SQL Sat LV announcement
  2. Burning Time
  3. Reviewing Peers
  4. Broken Broker
  5. Peer Identity
  6. Lost in Space
  7. Command ‘n Conquer
  8. Ring in The New
  9. Queries Going Boom
  10. Retention of XE Session Data in a Table

Did you know there is a default job in SQL Server that is created with the purpose of removing system health phantom records?  This job also helps keep the system tables that are related to Policy-Based Management nice and trim, if you have Policy-Based Management enabled.  This job could fail for one of a couple of reasons, and when it fails it could be a little annoying.  This article discusses fixing one of the causes for this job to fail.

I want to discuss when the job will fail due to the job step related to the purging of the system health phantom records.  Having run into this on a few occasions, I found several proposed fixes, but only one really worked consistently.

The error that may be trapped is as follows:

A job step received an error at line 1 in a PowerShell script.
The corresponding line is ‘(Get-Item SQLSERVER:\SQLPolicy\SomeServer\DEFAULT).EraseSystemHealthPhantomRecords()’.
Correct the script and reschedule the job. The error information returned by PowerShell is:
‘SQL Server PowerShell provider error: Could not connect to ‘SomeServer\DEFAULT’.
[Failed to connect to server SomeServer. –>

The first proposed fix came from Microsoft at this link.  In the article it proposed the root cause of the problem being due to the servername not being correct.  Now that article is specifically for clusters, but I have seen this issue occur more frequently on non-clusters than on clusters.  Needless to say, the advice in that article has yet to work for me.

Another proposed solution I found was to try deleting the “\Default” in the agent job step that read something like this.

(Get-Item SQLSERVER:\SQLPolicy\SomeServer\Default).EraseSystemHealthPhantomRecords()

Yet another wonderful proposal from the internet suggested using Set-ExecutionPolicy to change the execution policy to UNRESTRICTED.

Failed “fix” after failed “fix” is all I was finding.  Then it dawned on me.  I had several servers where this job did not fail.  I had plenty of examples of how the job should look.  Why not check those servers and see if something is different.  I found a difference and ever since I have been able to use the same fix on multiple occasions.

The server where the job was succeeding had this in the job step instead of the previously pasted code.

if ('$(ESCAPE_SQUOTE(INST))' -eq 'MSSQLSERVER') {$a = '\DEFAULT'} ELSE {$a = ''};
(Get-Item SQLSERVER:\SQLPolicy\$(ESCAPE_NONE(SRVR))$a).EraseSystemHealthPhantomRecords()

That, to my eyes, is a significant difference.  Since changing the job step to use this version, the job has been running successfully for me without error.

I probably should have referenced a different server instead of resorting to the internet in this case.  That advice stands for many things – check a different server, see if there is a difference, and see if you can get it to work there.  I could have saved time and frustration by simply looking at local “resources” first.

If you have a failing syspolicy purge job, check to see if it is failing on the phantom record cleanup.  If it is, try this fix and help that job get back to dumping the garbage from your server.
