SQL Server Integration Services
SSIS is the string that binds all data together. Being able to take separate pieces of data and combine them all together is the job of SSIS. Data is brought in using many different types of data-types and transformations to bring the data to a standardizes structure where it can be combined into a SQL Table, Aggregate SQL Table or SQL OLTP data-mart. Using SSIS you are able to use many inputs data-types and use Visual Basic or C#(SQL 2008) to customize each data stream to transform into the excepted data format. This allows new data storage methods such as a Unstructured, JSON or XML data types to be used as a data-source. In my first example I will demonstrate a powerful way to read just about any data and normalize it.
Example of a non-supported datasource
In this example I am going to use a file that was created on a Linux machine created by PERL. It is use to track all activity on our site. The format was decided by someone who needed to look at activity but is not in a native sql format. This is one of the most common problems a DBA has to deal with is data in varying formats and bringing it all together to get a single picture of what is happening on your site company, ETL Extract Transform and Load using SQL Server SSIS is the way. Again this example is not how you would work with this format but how to work with any format. I will demonstrate XML and JSON later.
What you need
SQL Server 2008 Standard Edition or Developer Edition
In my example
The script has 6 components to it
- Session: A line separator for all we care
- Source: The page the activity was done on
- Action: What happened on the page
- Date: The date the record was created
- SessionId: The unique identifier of the whole session from landing to to navigate in this example
- InstanceId: A id for every event
So if someone navigated to derekevans.us from google on landing there for the first time SessionId would be the same as their InstanceId. When I click on a link my Source is the same with a new InstanceId. This allows me to see the session from beginning to end and ordered by date I can see the activity.
Date: 2011-11-01 16:38:24.670
So if someone where to search google for derekevans.us and land on my page search for dbcc and look at the first 2 code samples it would look like this
There are some ugly ways in SSIS to normalize this file I will not go into any of them. What you need is a Script Component as your data source. With this you can perform normalizations and transformations to turn this into a regular datasource.
Create a SSIS Project in SQL Server 2008 Business Intelligence Studio
Add a DataFlow Task and add a script component within it. You need to set it as a Source as you will be adding all the outputs from the raw file.
Create your outputs. This is a great place to document what the columns are in the description. This becomes you meta data. You should describe what the columns contain and will help you in the long run. I will cover this later.
The script to normalize session file
When you open the script component you need to add your outputs
Just a note if you want to add your columns from first to last you have to select Output Columns when you click add column.
- Session: Ignore dont add it just a line seperator for all we care
- Source: [DT_STR] 2000. The source web address string
- Destination: [DT_STR] 2000. The destination web address
- Date: [Date] the date the action happened or when the row we are creating was made.
- SessionId: unique identifier [DT_GUID] A identifier of the time the user was on the page. It will be the same unless the wait over 30 minutes.
- InstanceId: unique identifier [DT_GUID] A unique identifier of the activity may it be click, navigate, AJAX script. Its unique and can be used as the primary key.
Create the script
If you have done everyting right you should see
Cusomizing the script to nomalize the file
In the declarations add using System.IO;
This will allow us to work with the file
In this example all you need to work with is CreateNewOutputRows()
delete public override void PreExecute()
delete public override void PostExecute()
In public override void CreateNewOutputRows() change to
public override void CreateNewOutputRows()
String sPath = @”C:\temp\Session.txt”;
using (StreamReader sr = new StreamReader(sPath))
while ((line = sr.ReadLine()) != null)
SessionBuffer.Source = line.Substring(line.IndexOf(“:”) + 2);
SessionBuffer.Destination = line.Substring(line.IndexOf(“:”) + 2);
SessionBuffer.Action = line.Substring(line.IndexOf(“:”) + 2);
SessionBuffer.Date = Convert.ToDateTime(line.Substring(line.IndexOf(“:”) + 2));
SessionBuffer.SessionId = line.Substring(line.IndexOf(“:”) + 2);
SessionBuffer.InstanceId = line.Substring(line.IndexOf(“:”) + 2);
Save and close script component
Add a flat file destination
Click on destination and add a delimited format
Make sure to do the mappings and you now should be able to run and convert to csv Here is the final project