Using SSIS to Maintain the File System
We have all run into a need or a desire to clean up old stale files from the file system, whether it be to remove old backup files or flat files that are created from one process or another. And based on this need/desire, we have all come up with a method to help with achieve that goal.
Some of the methods might be to include a flag in a maintenance plan that may be used. Other methods may be to use a SQL script employing xp_cmdshell and delete statements. Yet another may utilize the sp_oa stored procs and DMO. And still others may have ventured into powershell to accomplish the same task. The point is, there are many methods.
I am adding yet another method to the mix. Why? I didn’t much like the option of using the sp_oa method or the xp_cmdshell route. I am very novice with powershell and it would take a lot more tinkering to get the script working properly. Also, I felt pretty comfortable with SSIS and had approval to try and get this done using that method. And just because I am a novice with powershell, does not mean that I will not circle back around to try and accomplish this task via that means.
Note: This article was originally written in 2011 and got stuck in an unpublished state. Things have changed since then so i will definitely be circling back around for a powershell version.
The method employed needs to be able to do the following:
- Remove multiple file types
- Be configurable
- Clean out files from numerous directories
- Remove files older than a specified number of days.
The solution I chose utilizes SSIS. It also requires that there be a table in a database that helps to drive the package.
The table looks like the following.
/****** Object: Table [dbo].[FilePaths] Script Date: 12/05/2011 23:36:26 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[FilePaths]') AND type in (N'U'))
DROP TABLE [dbo].[FilePaths]
/****** Object: Table [dbo].[FilePaths] Script Date: 12/05/2011 23:36:26 ******/
SET ANSI_NULLS ON
SET QUOTED_IDENTIFIER ON
SET ANSI_PADDING ON
CREATE TABLE [dbo].[FilePaths](
[PathID] [int] IDENTITY(1,1) NOT NULL,
[Process] [varchar](32) NULL,
[FilePath] [varchar](256) NULL,
CONSTRAINT [PK_FilePaths] PRIMARY KEY CLUSTERED
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
SET ANSI_PADDING OFF
The filepath column holds the FileSystem Path for each directory that needs to be cleaned. Paths that are supported are local (e.g. C:\temp ) and unc paths (\\machine\c$\temp). I set this attribute to a length of 256, but if you have a longer path, you will want to adjust the length.
The Process column will hold a value describing what that path relates to, such as MaintainDirectory. In my example, I am using MaintainDirectory to control which directories hold files that potentially need to be deleted.
Here is an example of the contents of that table I am using currently.
The last piece of the setup before we start working on the SSIS package is the need for a string splitting function. Pick the string splitter of your liking. I have one that I like and am sure you have one that you prefer. The SSIS package relies on the return field from the splitter being named “Item.” If it is named something else, please make the adjustments in the package as necessary.
The package I created has been created in SSIS 2008. To meet the requirements already set forth, I utilized the following objects: ADO.Net Data Source, 2 Execute SQL Tasks, 2 ForEach Loop Containers, a Script Task, and 8 variables. Let’s take a look at these starting with the variables.
- SQLServerName – The value held here is used in an Expression for the Data Source. This will overwrite the ServerName value in the Data Source.
- DatabaseName – Used alongside the SQLServerName variable in an Expression for the Data Source. This value will overwrite the InitialCatalog value in the Data Source. This should be the name of the database where the FilePaths table and String Split function exist.
- DaysToKeep – This value is the cutoff point for which files to keep and which files will be deleted. This variable is used as a ReadOnly variable in the Script Task.
- obj_FileExtension – This object variable is used to store the result set from one of the Execute SQL tasks and the results of the string split function from the FileExtensionList variable.
- FileExtensionList – This is a delimited list of file extensions that need to be evaluated for deletion. It is important to note that the file extensions that are to be processed are case sensitive. The extension must appear in this list as it appears in the file system.
- FileExtension – to be used in one of the ForEach loops. This variable will receive the FileExtension from the obj_FileExtension variable one at a time.
- obj_ListOfDirectories – This variable will receive the result set of an Execute SQL Task to be later consumed by one of the ForEach loops.
- DirectoryToMaintain – receives one at a time the Directory to process for file deletion. The ForEach loop stores a value from obj_ListOfDirectories in this variable for processing.
Execute SQL Tasks
The two Execute SQL Tasks are simple in function. One is to get the list of directories to maintain from the FilePaths table. The other is strictly to split the string for the FileExtensionList variable.
The first is named “Get Directory List” and should receive the Full Result Set from the following query.
from dbo.FilePaths FP
where fp.Process = 'MaintainDirectory'
order by fp.PathID
The Result Set tab of this task also needs to be modified. it should look like this.
From this task, we flow to the next Execute SQL Task named “Split FileList.” The setup of this task is very much like the previous task. We want to receive the full result set. We have a configuration to make on the result set tab. We also need to map a parameter. Let’s take a look at those real quick.
And this is the query that we will be executing.
Notice that the Parameter we named in the Parameter Mapping tab is being used in the function call. I chose this method because I could see and understand how it works better.
The next stop in the flow is the ForEach Loop – Directory object. As the name implies, this ForEach Loop is designed to work with the obj_ListOfDirectories variable/array.
With this first Loop container, we have two tabs that need to be configured in the properties. Both Loop containers are similar in that they need the same tabs to be configured. First, let’s talk about the Collection tab.
On the Collection tab, we need to set the Enumerator option to “ForEach ADO Enumerator.” Then we need to select the obj_ListOfDirectories from the drop down labeled “ADO Source Object Variable.” Your screen should look like the following image.
With this tab configured, we can focus our attention to the quick changes that need to be made on the Variable Mappings tab. On this tab, we are telling the enumerator how to handle the data from the object variable. We are mapping columns from the result set to variables for further consumption. When configured, it should look like the following.
Inside of this ForEach loop container, we have another ForEach loop container. This second ForEach loop container handles the file extensions that we listed out in delimited fashion in the FileExtensionList variable. I have called this container “ForEach Loop – FileExtension” (just keeping it simple).
The collection tab follows the same configuration setup. The difference of course being that this container will use the obj_FileExtension object from the source variable dropdown menu.
The variable mapping tab is also slightly different. We will be mapping column 0 of the object to the FileExtension variable. The explanation for the different number between the two loop container variable mappings is simple. In obj_ListOfDirectories, we have multiple columns being returned. In obj_FileExtension, we have but one single column being returned.
This inner Loop container will loop through each of the extensions for each of the directories that have been returned to the outer loop container. the inner loop container has the remainder of the workload in it via the Script Task.
It is via the script task that we actually get to start deleting files. This was the most difficult piece of the entire package – though the script is not very large.
For the script task, I chose to implement it via the Visual Basic option (instead of C#). I have three ReadOnlyVariables employed by the script. Those variables are: User::DaysToKeep,User::DirectoryToMaintain, and User::FileExtension.
Once you have set those on the script tab, the next step is to click the Edit Script… button where we need to place the following script.
' Microsoft SQL Server Integration Services Script Task
' Write scripts using Microsoft Visual Basic 2008.
' The ScriptMain is the entry point class of the script.
<System.AddIn.AddIn("ScriptMain", Version:="1.0", Publisher:="", Description:="")> _
Partial Public Class ScriptMain
Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success
Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
Public Sub Main()
Dim SourcePath As String
Dim PurgeDays As Integer
Dim FileExtension As String
Dim thisFileInUse As Boolean = False
PurgeDays = CInt(Dts.Variables("User::DaysToKeep").Value)
SourcePath = CStr(Dts.Variables("User::DirectoryToMaintain").Value)
FileExtension = CStr(Dts.Variables("User::FileExtension").Value)
For Each file As FileInfo In New DirectoryInfo(SourcePath).GetFiles()
If ((Now - file.LastWriteTime).Days > PurgeDays) And (file.Extension = FileExtension) Then
Catch ex As Exception
Dts.TaskResult = ScriptResults.Success
An important note of interest is the need for the Try…Catch. Without this block as it is, you could run into an issue where the file (such as those pesky temp files) may be in use by some process and cause the package to error. The Try…catch will move past that nasty error and delete the files that it can.
Inside this script, you will see that I am comparing the LastWriteTime to the PurgeDays and ensuring that the file extension matches one that is in the list. Then we move into the try…catch and either delete the file that matches those criteria or throw an exception and move on to the next file.
When all is said and done, your package should look something like this.
You should also have a variable list that looks like this.
Each variable that is not an Object has a value assigned to it at this point. These values will be overwritten where applicable.
Having this package is a good start. But unless you are prepared to manually run this on a daily basis, it needs to be added to a job and scheduled. There are two ways to go about scheduling this package.
The first option is to configure the FileExtensionList and DaysToKeep variables and save the package with those values. Then run this package through SQL Agent with those values every time. The drawback to this method is that if you need to add or remove a file extension (as an example) then you need to edit the package and re-save it.
The alternative option is pass the values through the job to overwrite those variables as the job runs. Should you need to remove or add a file extension, it would just be done at the job definition level.
Let’s take a look at this second option. I will skip past how to create the job as an SSIS job in SQL Server and we will look directly how to modify those variables from the job properties.
To configure these variables directly from the SQL Agent job, open the Job properties and click on the Set Values tab (assuming you have defined this job as an SSIS Job type). You should get a screen similar to this (void of the set values shown in the pic). Just add the parameters (variables) we have discussed to this point with appropriate values to fit your needs/environment.
I have chosen to only include the four variables shown above since the remaining variables are either objects or get overwritten in the ForEach loops during processing. The only thing remaining now is to set the schedule for the job. Once set, the job (and package) will take care of the rest.
I have now shown you how to maintain some of the directories on your system through the use of SSIS and SQL server. There are many methods to accomplish this goal, it is up to each of us to choose the best method for our environment and comfort level (by means of supporting the chosen solution).