Day 12 – High CPU and Bloat in Distribution

This is the final installment in the 12 day series for SQL tidbits during this holiday season.

Previous articles in this mini-series on quick tidbits:

  1. SQL Sat LV announcement
  2. Burning Time
  3. Reviewing Peers
  4. Broken Broker
  5. Peer Identity
  6. Lost in Space
  7. Command ‘n Conquer
  8. Ring in The New
  9. Queries Going Boom
  10. Retention of XE Session Data in a Table
  11. Purging syspolicy

distribution

Working with replication quite a bit with some clients you might run across some particularly puzzling problems.  This story should shed some light on one particularly puzzling issue I have seen on more than one occasion.

In working with a multi-site replication and multi-package replication topology, the cpu was constantly running above 90% utilization and there seemed to be a general slowness even in Windows operations.

Digging into the server took some time to find what might have been causing the slowness and high CPU.  Doing an overall server health check helped point in a general direction.

Some clues from the general health check were as follows.

  1. distribution database over 20GB.  This may not have been a necessarily bad thing but the databases between all the publications weren’t that big.
  2. distribution cleanup job taking more than 5 minutes to complete.  Had the job been cleaning up records, this might not have been an indicator.  In this case, 0 records were cleaned up on each run.

The root cause seemed to be pointing to a replication mis-configuration.  The mis-configuration could have been anywhere from the distribution agent to an individual publication.  Generally, it seems that the real problem is more on a configuration of an individual publication more than any other setting.

When these conditions are met, it would be a good idea to check the publication properties for each publication.  Dive into the distribution database and try to find if any single publication is the root cause and potentially is retaining more replication commands than any other publication.  You can use sp_helppublication to check the publication settings for each publication.  You can check MSrepl_commands in the distribution database to find a correlation of commands retained to publication.

Once having checked all of this information, it’s time to put a fix in place.  It is also time to do a little research before actually applying this fix.  Why?  Well, because you will want to make sure this is an appropriate change for your environment.  For instance, you may not want to try this for a peer-to-peer topology.  In part because one of the settings can’t be changed in a peer-to-peer topology.  I leave that challenge to you to discover in a testing environment.

The settings that can help are as follows.

[codesyntax lang="tsql"]

[/codesyntax]

These settings can have a profound effect on the distribution retention, the cleanup process and your overall CPU consumption.  Please test and research before implementing these changes.

Besides the potential benefits just described, there are other benefits to changing these commands.  For instance, changing replication articles can become less burdensome by disabling these settings.  The disabling of these settings can help reduce the snapshot load and allow a single article to be snapped to the subscribers instead of the entire publication.

Day 11 – Purging syspolicy

This is the eleventh installment in the 12 day series for SQL tidbits during this holiday season.

Previous articles in this mini-series on quick tidbits:

  1. SQL Sat LV announcement
  2. Burning Time
  3. Reviewing Peers
  4. Broken Broker
  5. Peer Identity
  6. Lost in Space
  7. Command ‘n Conquer
  8. Ring in The New
  9. Queries Going Boom
  10. Retention of XE Session Data in a Table

Garbage-Dump

Did you know there is a default job in SQL Server that is created with the purpose of removing system health phantom records?  This job also helps keep the system tables ,that are related to policy based management, nice and trim if you have policy based management enabled.  This job could fail for one of a couple of reasons.  And when it fails it could be a little annoying.  This article is to discuss fixing one of the causes for this job to fail.

I want to discuss when the job will fail due to the job step related to the purging of the system health phantom records.  Having run into this on a few occasions, I found several proposed fixes, but only one really worked consistently.

The error that may be trapped is as follows:

A job step received an error at line 1 in a PowerShell script.
The corresponding line is ‘(Get-Item SQLSERVER:\SQLPolicy\SomeServer\DEFAULT).EraseSystemHealthPhantomRecords()’.
Correct the script and reschedule the job. The error information returned by PowerShell is:
‘SQL Server PowerShell provider error: Could not connect to ‘SomeServer\DEFAULT’.
[Failed to connect to server SomeServer. -->

The first proposed fix came from Microsoft at this link.  In the article it proposed the root cause of the problem being due to the servername not being correct.  Now that article is specifically for clusters, but I have seen this issue occur more frequently on non-clusters than on clusters.  Needless to say, the advice in that article has yet to work for me.

Another proposed solution I found was to try deleting the “\Default” in the agent job that read something like this.

(Get-Item SQLSERVER:\SQLPolicy\SomeServer\Default).EraseSystemHealthPhantomRecords()

Yet another wonderful proposal from the internet suggested using Set-ExecutionPolicy to change the execution policy to UNRESTRICTED.

Failed “fix” after failed “fix” is all I was finding.  Then it dawned on me.  I had several servers where this job did not fail.  I had plenty of examples of how the job should look.  Why not check those servers and see if something is different.  I found a difference and ever since I have been able to use the same fix on multiple occasions.

The server where the job was succeeding had this in the job step instead of the previously pasted code.

if (‘$(ESCAPE_SQUOTE(INST))’ -eq ‘MSSQLSERVER’) {$a = ‘\DEFAULT’} ELSE {$a = ”};
(Get-Item SQLSERVER:\SQLPolicy\$(ESCAPE_NONE(SRVR))$a).EraseSystemHealthPhantomRecords()

That, to my eyes, is a significant difference.  Changing the job step to use this version of the job step has been running successfully for me without error.

I probably should have referenced a different server instead of resorting to the internet in this case.  And that stands for many things – check a different server and see if there is a difference and see if you can get it to work on a different server.  I could have saved time and frustration by simply looking at local “resources” first.

If you have a failing syspolicy purge job, check to see if it is failing on the phantom record cleanup.  If it is, try this fix and help that job get back to dumping the garbage from your server.

Day 10 – Retention of XE Session Data in a Table

This is the tenth installment in the 12 day series for SQL tidbits during this holiday season.

Previous articles in this mini-series on quick tidbits:

  1. SQL Sat LV announcement
  2. Burning Time
  3. Reviewing Peers
  4. Broken Broker
  5. Peer Identity
  6. Lost in Space
  7. Command ‘n Conquer
  8. Ring in The New
  9. Queries Going Boom

food-storage

 

Gathering event information is a pretty good thing.  It can do wonders for helping to troubleshoot.  What do you do if you need or want to be able to review the captured information in 3 months or maybe 12 months from now?

Retaining the session data for later consumption is often a pretty essential piece of the puzzle.  There is more than one way to accomplish that goal.  I am going to share one method that may be more like a sledgehammer for some.  It does require that Management Data Warehouse be enabled prior to implementing.

When using MDW to gather and retain the session data, you create a data collector pertinent to the data being collected and retained.  In the following example, I have a data collector that is created to gather deadlock information from the system health session.  In this particular case, I query the XML in the ring buffer to get the data that I want.  Then I tell the collector to gather that info every 15 minutes.  The collection interval is one of those things that needs to be adjusted for each environment.  If you collect the info too often, you could end up with a duplication of data.  Too seldom and you could miss some event data.  It is important to understand the environment and adjust accordingly.

Here is that example.

[codesyntax lang="tsql"]

[/codesyntax]

Looking this over, there is quite a bit going on.  The keys are the following paramaters: @parameters and @interval.  The @parameters parameter stores the XML query to be used when querying the ring buffer (in this case).  It is important to note that the XML query in this case needs to ensure that the values node is capped to a max of varchar(4000) like shown in the following.

[codesyntax lang="tsql"]

[/codesyntax]

With this data collector, I have trapped information and stored it for several months so I can compare notes at a later date.

Day 9 – Queries Going Boom

This is the ninth installment in the 12 day series for SQL tidbits during this holiday season.

Previous articles in this mini-series on quick tidbits:

  1. SQL Sat LV announcement
  2. Burning Time
  3. Reviewing Peers
  4. Broken Broker
  5. Peer Identity
  6. Lost in Space
  7. Command ‘n Conquer
  8. Ring in The New

Kaboom

Ever see an error like this??

The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information.

Error 8623
Severity 16

That is a beautiful error.  The message is replete with information and gives you everything needed to fix the problem, right?  More than a handful of DBAs have been frustrated by this error.  It’s not just DBAs that this message seems to bother.  I have seen plenty of clients grumble about it too.

The obvious problem is that we have no real information as to what query caused the error to pop.  The frustrating part is that the error may not be a persistent or consistent issue.

Thanks to the super powers of XE (extended events), we can trap that information fairly easily now.  Bonus is that to trap that information, it is pretty lightweight as far as resource requirements go.

Without further ado, here is a quick XE session that could be setup to help trap this bugger.

[codesyntax lang="tsql"]

[/codesyntax]

And now for the caveats.  This session will only work on SQL 2012.  The second caveat is that there are two file paths defined in this session that must be changed to match your naming and directory structure for the output files etc.

Should you try to create this session on SQL Server 2008 (or 2008 R2) instead of SQL Server 2012, you will get the following error.

Msg 25706, Level 16, State 8, Line 1
The event attribute or predicate source, “error_number”, could not be found.

Now that you have the session, you have a tool to explore and troubleshoot the nuisance “complex query” error we have all grown to love.  From here, the next step would be to explore the output.

Day 7 – Command ‘n Conquer

This is the seventh installment in the 12 day series for SQL tidbits during this holiday season.

Previous articles in this mini-series on quick tidbits:

  1. SQL Sat LV announcement
  2. Burning Time
  3. Reviewing Peers
  4. Broken Broker
  5. Peer Identity
  6. Lost in Space

As a DBA, we sometimes like to shortcut things.  Not shortcutting a process or something of importance.  The shortcuts are usually in the realm of trying to shortcut time, or shortcut the number of steps to perform a task or shortcutting by automating a process.

We seldom like to perform the same task over and over and over again.  Click here, click there, open a new query window, yadda yadda yadda.  When you have 100 or so servers to run the same script against – it could be quite tedious and boring.  When that script is a complete one-off, there probably isn’t much sense in automating it either.

To do something like I just described, there are a few different methods to get it done.  The method I like to use is via SQLCMD mode in SSMS.  Granted, if I were to use it against 100 servers, it would be a self documenting type of script.  I like to use it when setting up little things like replication.

How many times have you scripted a publication and the subscriptions?  How many times have you read the comments?  You will see that the script has instructions to run certain segments at the publisher and then other segments at the subscriber.  How many times have you handed that script to somebody else to run and they just run it on the one server?

Using SQLCMD mode and then adding a CONNECT command in the appropriate places could solve that problem.  The only thing to remember is to switch to SQLCMD mode in SSMS.  Oh and switching to SQLCMD mode is really easy.  The process to switch to SQLCMD mode is even documented.  You can read all about that here.

And there you have it, yet another simple little tidbit to take home and play with in your own little lab.

Day 6 – Lost in Space

This is the sixth installment in the 12 day series for SQL tidbits during this holiday season.

Previous articles in this mini-series on quick tidbits:

  1. SQL Sat LV announcement
  2. Burning Time
  3. Reviewing Peers
  4. Broken Broker
  5. Peer Identity

Credit: NASA/JPL

One of the more frequently recurring themes I see in my travel and work is the perpetual lack of space.

For instance, every time I fly there is inevitably a handful of people that have at least three carry-on items and at least one of those items is larger than the person trying to “carry” it on the plane.  Imagine watching these people trying to lift 100+ pound bags over their heads to put them into these small confined overhead storage compartments.  We are talking bags that are easily 2-3 times larger than the accepted dimensions, yet somehow this person made it on the plane with such a huge bag for such a tiny space.

Another favorite of mine is watching what appears to be a college student driving home in a GEO Metro.  A peek inside the vehicle might reveal 5 or 6 baskets of soiled laundry and linens.  A look at the vehicle as a whole might reveal a desert caravan’s worth of supplies packed onto the vehicle.  Watching the vehicle for a while you might notice that it can only lumber along at a top speed of 50 mph going downhill and good luck getting back up the hill.  It is just far too over-weighted and over-packed.  The vehicle obviously does not have enough room internally.

In both of these examples we have a limited amount of storage space.  In both of these examples we see people pushing the boundaries of those limitations.  Pushing those boundaries could lead to some unwanted consequences.  The GEO could break down leaving the college student stranded with dirty laundry.  The air-traveler just may have to check their dog or leave it home.

But what happens when people try to push the boundaries of storage with their databases?  The consequences can be far more dire than either of the examples just shared.  What if pushing those boundaries causes an outage and your database is servicing a hospital full of patient information (everything from diagnostics to current allergies – like being allergic to dogs on planes)?  The doctor needs to give the patient some medication immediately or the patient could die.  The doctor only has two choices and one of those could mean death the other could mean life.  All of this is recorded in the patient records but the doctor can’t access those records because the server is offline due to space issues.

Yeah that would pretty much suck.  But we see it all the time.  Maybe nothing as extreme as that case, but plenty of times I have seen business lose money, revenue, and sales because the database was offline due to space.  The company wants to just keep pushing those boundaries.

In one case, I had a client run themselves completely out of disk space.  They wouldn’t allocate anymore space so it was time to start looking to see what could be done to alleviate the issue and get the server back online.

In digging about, I found that this database had 1Tb of the 1.8TB allocated to a single table.  That table had a clustered index built on 6 columns.  The cool thing about this clustered index is that not a single query ever used that particular combination.  Even better was that the database was seldom queried.  I did a little bit of digging and found that there really was a much better clustered index for the table in question.  Changing to the new clustered index reduced the table size by 300GB.  That is a huge chunk of waste.

Through similar exercises throughout the largest tables in the database, I was able to reduce index space waste by 800GB.  Disk is cheap until you can’t have anymore.  There is nothing wrong with being sensible about how we use the space we have been granted.

Thinking about that waste, I was reminded of a great resource that Paul Randal has shared.  You can find a script he wrote, to explore this kind of waste, from this link.  You can even read a bit of background on the topic from this link.

Day 5 – Peer Identity

This is the fifth installment in the 12 day series for SQL tidbits during this holiday season.

Previous articles in this mini-series on quick tidbits:

  1. SQL Sat LV announcement
  2. Burning Time
  3. Reviewing Peers
  4. Broken Broker

identityIn the digital age it seems we are constantly flooded with articles about identity crises.  From identity theft to mistaken identity.  SQL server is not immune to these types of problems and stories.  Whether SQL Server was housing the data that was stolen ,leading to identity theft, or if SQL Server is having an identity management issue of its own – SQL Server is definitely susceptible to the identity issues.

The beauty of SQL Server is that these identity issues seem to be most prevalent when trying to replicate data.  Better yet is when the replication multiple peers setup in a Peer-to-Peer topology.

When these Identity problems start to crop up there are a number of things that can be done to try and resolve them.  One can try to manually manage the identity ranges or one can flip the “Not for Replication” attribute on the table as two possible solutions.

The identity crisis in replication gets more fun when there are triggers involved.  The triggers can insert into a table that is not replicated or can insert into a table that is replicated.  Or even better is when the trigger inserts back into the same table it was created on.  I also particularly like the case when the identity range is manually managed but the application decides to reseed the identity values (yeah that is fun).

In one particular peer-to-peer topology I had to resort to a multitude of fixes depending on the article involved.  In one case we flipped the “Not for Replication” flag because the tables acted on via trigger were not involved in replication.  In another we disabled a trigger because we determined the logic it was performing was best handled in the application (it was inserting a record back into the table the trigger was built on).  And there was that case were the table was being reseeded by the application.

In the case of the table being reseeded we threw out a few possible solutions but in the end we felt the best practice for us would be to extend the schema and extend the primary key.  Looking back on it, this is something that I would suggest as a first option in most cases because it makes a lot of sense.

In our case, extending the schema and PK meant adding a new field to the PK and assigning a default value to that field.  We chose for the default value to be @@ServerName.  This gave us a quick location identifier for each location and offered us a quick replication check to ensure records were getting between all of the sites (besides relying on replication monitor).

When SQL Server starts throwing a tantrum about identities, keep in mind you have options.  It’s all about finding a few possible solutions or mix of solutions and proposing those solutions and then testing and implementing them.

One of the possible errors you will see during one of these tantrums is as follows.

Explicit value must be specified for identity column in table ‘blah’ either when IDENTITY_INSERT is set to ON or when a replication user is inserting into a NOT FOR REPLICATION identity column.

On the Ninth Day…

It’s the end of the world as we know it.  And as the song goes…I feel fine!  But hey, we didn’t start the fire.

Those are a couple of songs that pop into my head every time somebody starts talking doomsday and doomsday prophecies.

If you are reading this, I dare say that the world is still turning.  And that is a good thing because that gives us a chance to talk about the 12 days of pre-Christmas.

Today we will be talking about a tool that can be at the DBAs disposal to help in tracking performance as well as problems.

First there are a couple of items of housekeeping.  First item is that I only realized with this post that the post on the first day was wrong.  I had miscalculated my twelve days to end on Christmas day.  That is wrong!  Counting down from that first post on the 13th means the 12th day will end up on December 24th.  Maybe I will throw in a bonus 13th day post, or delay a post for a day, or just go with it.  It will be a mystery!

Second item is naturally the daily recap of the 12 days to date.

On the Ninth Day of pre-Christmas…

My DBA gave to me a Data Collection!

If only my DBA had told me that I would need to have Management Data Warehouse (MDW) preconfigured.  Well, that is not a problem.  We can handle that too.  For a tutorial on how to setup MDW, I am going to provide a link to one that has been very well written by Kalen Delaney.

The article written by Kalen covers the topic of MDW very well all the way from setting up the MDW, to setting up the Data Collectors, and all the way down to the default reports you can run for MDW.  The MDW and canned Data Collectors can provide some really useful information for your environment.

What I want to share though is a means to add custom data collections to your MDW.  To create a custom collection, we need to rely on two stored procedures provided to us by Microsoft.  Those stored procedures are: sp_syscollector_create_collection_item and sp_syscollector_create_collection_set.  Both of these stored procedures are found in the, you guessed it, msdb database.

Each of these stored procedures has a number of parameters to help in the creation of an appropriate data collection.  When creating a data collection, you will first create the collection set, and then you add collection items to that set.

There are a few notable parameters for each stored procedure that I will cover.  Otherwise, you can refer back to the links for each of the stored procedures to see more information about the parameters.

Starting with the sp_syscollector_create_collection_set stored procedure, I want to point out the @schedule_name, @name, and @collection_mode parameters.  The name is pretty obvious – make sure you have a distinguishable name that is descriptive (my opinion) or at least have good documentation.  The collection mode has two values.  As noted in the documentation, you will want to choose one value over the other depending on the intended use of this data collector.  If running continuously, just remember to run in cached mode.  And lastly is the schedule name.  This will help determine how frequently the job runs.

Unfortunately, the schedule names are not found in the documentation, but rather you are directed to query the sysschedules tables.  To help you find those schedules, here is a quick query.

[codesyntax lang="tsql"]

[/codesyntax]

Now on to the sp_syscollector_create_collection_item stored procedure.  There are three parameters that I want to lightly touch on.  For the rest, you can refer back to the documentation.  The parameters of interest here are @parameters, @frequency and @collector_type_uid.  Starting with the frequency parameter, this tells the collector how often to upload the data to the MDW if running in cached mode.  Be careful here to select an appropriate interval.  Next is the parameters parameter which is really the workhorse of the collection item.  In the case of the custom data collector that I will show in a bit, this is where the tsql query will go.

Last parameter to discuss is the collector type uid.  Like the schedule for the previous proc, the documentation for this one essentially refers you to a system view - syscollector_collector_types.  Here is a quick query to see the different collector types.

[codesyntax lang="tsql"]

[/codesyntax]

The collector type that I will be using for this example is Generic T-SQL Query Collector Type.  A discussion on the four types of collectors can be reserved for another time.

Let’s move on to the example now.  This custom data collector is designed to help troubleshoot deadlock problems.  The means I want to accomplish this is by querying the system_health extended event session.

I can query for deadlock information direct to the system_health session using a query like the following.

[codesyntax lang="tsql"]

[/codesyntax]

You may notice that I have converted to varchar(4000) from XML.  This is in large part to make sure the results will play nicely with the data collector.  Now to convert that to a query that can be used in the @parameters parameter, we get the following.

[codesyntax lang="tsql"]

[/codesyntax]

With this query, we are loading the necessary schema nodes that correlate to the Data Collector Type that we chose.  Since this parameter is XML, the schema must match or you will get an error.  We are now ready to generate a script that can create a deadlock data collector.

[codesyntax lang="tsql"]

[/codesyntax]

Upon creation, this will create a SQL Agent job with the defined schedule.  Since this is a non-cached data collector set, the Agent job will adhere to the schedule specified and upload data on that interval.

Now all we need to do is generate a deadlock to see if it is working.  It is also a good idea to introduce you to the table that will be created due to this data collector.  Once we create this collector set, a new table will be created in the MDW database.  In the case of this collector set, we have a table with the schema and name of custom_snapshots.systemhealthdeadlock.

This new table will have three columns.  One column represents the DeadlockGraph as we retrieved from the query we provided to the @parameters parameter.  The remaining columns are data collector columns for the collection date and the snapshot id.

Now that we have covered all of that, your favorite deadlock query has had enough time to finally fall victim to a deadlock.  We should also have some information recorded in the custom_snapshots.systemhealthdeadlock table relevant to the deadlock information (if not, it will be there once the agent job runs, or you can run a snapshot from SSMS of the data collector).  With a quick query, we can start looking into the deadlock problem.

[codesyntax lang="tsql"]

[/codesyntax]

This query will give me a few entries (since I went overkill and created a bunch of deadlocks).  If I click the DeadlockGraph cell in the result sets, I can then view the XML of the DeadlockGraph, as in the following.

[codesyntax lang="xml"]

[/codesyntax]

Code to generate deadlock courtesy of “SQL Server 2012 T-SQL Recipes: A Problem-Solution Approach” by Jason Brimhall, Wayne Sheffield et al (Chapter 12, pages 267-268).  If you examine the deadlock graph you will see the code that generated the deadlock.

Since this is being pulled from the RingBuffer target of the system_health, it can prove useful to store that data into a table such as I have done.  The reason being that the Ringbuffer can be overwritten, and with good timing on the data collector, we can preserve this information for later retrieval and troubleshooting.  Deadlocks don’t always happen at the most opportune time and even less likely to occur when we are staring at the screen waiting for them to happen.

As you read more about the stored procedures used to create a data collector, you will see that there is a retention parameter.  This helps prevent the table from getting too large on us.  We can also ensure that an appropriate retention is stored for these custom collectors.

Conclusion

Creating a custom data collector can be very handy for a DBA.  Especially in times of troubleshooting.  These collectors are also quite useful for trending and analysis.  Consider this a tool in the chest for the poor man. ;)

Enjoy and stay tuned!

All scripts and references were for SQL 2008 R2.  The Deadlock script was pulled from the 2012 book, but the script runs the same for the 2008 version of the AdventureWorks database.

page 1 of 1




Calendar
September 2014
M T W T F S S
« Aug    
1234567
891011121314
15161718192021
22232425262728
2930  
Content
SQLHelp

SQLHelp


Welcome , today is Tuesday, September 2, 2014