Get a List of Files with Data

The trick here is to find a way to search thousands of files quickly and filter easily the bad from the good. Sounds like an opportunity for an automation or script.

Suppose you have a directory on your server that houses over 300,000 csv files. These files are automatically created by an automated process related to your SQL Server operations and are automatically created when the process runs. The process runs on a schedule every 15 minutes.

Now suppose that you have been tasked with figuring out which of the files have data of value in them and which can be ignored. The files that can be ignored will all be either 0k or 1k in size. However, some of the 1k files can also be ignored because they only contain a header. Every csv file should have a header. So, we must figure out a way to filter out those files that are 1k or less and only have a header row.

Filtering out these files with this criteria will allow for us to find files that have useful data in them. Now why do you need to figure out which ones have useful data? Let’s just say that maybe some of the files were created with some bad data in them and you need to figure out which files may have bad data so those particular files can be regenerated. As luck would have it, the automated process does not have any sort of logging and does the bare minimum to create the files in the first place. (Sometimes you just inherit a flawed process where inadequate thought was given.)

Finding the Files

The hard part has been done at this point. At least you know some of the attributes that will help distinguish wanted files from the unwanted files. Sorting through these files by hand could be rather bothersome and cumbersome. The trick here is to find a way to search all of these files quickly and filter easily the bad from the good. Sounds like an opportunity for another automation or script. I have just the powershell script for that.

With this script, I have a couple of things to help me find files of interest quickly. First, I have a filter in place to allow me to search a specific number of days worth of files. Second, I have a filter in place to ensure it only returns file names where there is more than just a header row present.

Now, instead of spending hours perusing files trying to find something of value, I have reduced my time spent to just a mere fraction of that. Next steps after this would be to go and add some additional logging and better robustness to the initial process to reduce the chance of bad data being put into the files in the first place.

Put a bow on it

Automated processes are fantastic. We as DBAs strive to have more tasks automated than not. That said, automation without planning is just creating a time sink later on in the job. At some point, you may need to employ some powershell script similar to this in order to find where your automated process has gone wrong (if you did not plan that process well enough from the start).

Interested in learning about some deep technical information instead? Check these out!

Want to learn more about your indexes? Try this index maintenance article or this index size article.

This is the sixth article in the 2019 “12 Days of Christmas” series. For the full list of articles, please visit this page.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.