SSIS

SQL Server Integration Services

SSIS is the string that binds all data together. Being able to take separate pieces of data and combine them all together is the job of SSIS. Data is brought in using many different types of data-types and transformations to bring the data to a standardizes structure where it can be combined into a SQL Table, Aggregate SQL Table or SQL OLTP data-mart. Using SSIS you are able to use many inputs data-types and use Visual Basic or C#(SQL 2008) to customize each data stream to transform into the excepted data format. This allows new data storage methods such as a Unstructured, JSON or XML data types to be used as a data-source. In my first example I will demonstrate a powerful way to read just about any data and normalize it.

Example of a non-supported datasource

In this example I am going to use a file that was created on a Linux machine created by PERL. It is use to track all activity on our site. The format was decided by someone who needed to look at activity but is not in a native sql format. This is one of the most common problems a DBA has to deal with is data in varying formats and bringing it all together to get a single picture of what is happening on your site company, ETL Extract Transform and Load using SQL Server SSIS is the way. Again this example is not how you would work with this format but how to work with any format. I will demonstrate XML and JSON later.

What you need

SQL Server 2008 Standard Edition or Developer Edition

In my example

The script has 6 components to it

  • Session: A line separator for all we care
  • Source: The page the activity was done on
  • Action: What happened on the page
  • Date: The date the record was created
  • SessionId: The unique identifier of the whole session from landing to to navigate in this example
  • InstanceId: A id for every event

So if someone navigated to derekevans.us from google on landing there for the first time SessionId would be the same as their InstanceId. When I click on a link my Source is the same with a new InstanceId. This allows me to see the session from beginning to end and ordered by date I can see the activity.

First Session

Session:
Source: http://www.google.com/#sclient=psy-ab&hl=en&source=hp&q=derekevans.us&pbx=1&oq=derekevans.us
Destination: http://www.derekevans.us/
Action: navigate
Date: 2011-11-01 16:38:24.670
SessionId: 9584FD07-F1E9-4580-BBD6-4C4CF7D96301
InstanceId: 9584FD07-F1E9-4580-BBD6-4C4CF7D96301

So if someone where to search google for derekevans.us and land on my page search for dbcc and look at the first 2 code samples it would look like this

Session

There are some ugly ways in SSIS to normalize this file I will not go into any of them. What you need is a Script Component as your data source. With this you can perform normalizations and transformations to turn this into a regular datasource.

 Create a SSIS Project in SQL Server 2008 Business Intelligence Studio

Create Project SSIS SQL 2008

Add a DataFlow Task and add a script component within it. You need to set it as a Source as you will be adding all the outputs from the raw file.

 

Create Script Component

Create your outputs. This is a great place to document what the columns are in the description. This becomes you meta data. You should describe what the columns contain and will help you in the long run. I will cover this later.

The script to normalize session file

When you open the script component you need to add your outputs

Create your ourput rows

 Just a note if you want to add your columns from first to last you have to select Output Columns when you click add column.

  • Session: Ignore dont add it just a line seperator for all we care
  • Source: [DT_STR] 2000. The source web address string
  • Destination: [DT_STR] 2000. The destination web address
  • Action: [DT_STR] 50. The action that happened on the page. I only use navigate but for Java, Javascript and AJAX you will need to create others because these activities happened at the client end but we need to track them.
  • Date:  [Date] the date the action happened or when the row we are creating was made.
  • SessionId: unique identifier [DT_GUID] A identifier of the time the user was on the page. It will be the same unless the wait over 30 minutes.
  • InstanceId: unique identifier [DT_GUID] A unique identifier of the activity may it be click, navigate, AJAX script. Its unique and can be used as the primary key.

Create the script

If you have done everyting right you should see

ScriptComponent

Cusomizing the script to nomalize the file

In the declarations add using System.IO;

This will allow us to work with the file

In this example all you need to work with is CreateNewOutputRows()

delete public override void PreExecute()

delete public override void PostExecute()

In public override void CreateNewOutputRows() change to

public override void CreateNewOutputRows()
{
String sPath = @”C:\temp\Session.txt”;
using (StreamReader sr = new StreamReader(sPath))
{
String line;
while ((line = sr.ReadLine()) != null)
{
if (line.Contains(“Session:”))
{
SessionBuffer.AddRow();
}
if (line.Contains(“Source:”))
{
SessionBuffer.Source = line.Substring(line.IndexOf(“:”) + 2);
}
if (line.Contains(“Destination:”))
{
SessionBuffer.Destination = line.Substring(line.IndexOf(“:”) + 2);
}
if (line.Contains(“Action:”))
{
SessionBuffer.Action = line.Substring(line.IndexOf(“:”) + 2);
}
if (line.Contains(“Date:”))
{
SessionBuffer.Date = Convert.ToDateTime(line.Substring(line.IndexOf(“:”) + 2));
}
if (line.Contains(“SessionId:”))
{
SessionBuffer.SessionId = line.Substring(line.IndexOf(“:”) + 2);
}
if (line.Contains(“InstanceId:”))
{
SessionBuffer.InstanceId = line.Substring(line.IndexOf(“:”) + 2);
}
}
}

Save and close script component
 Add a flat file destination

SSIS Flat file destination

Click on destination and add a delimited format

Create DTS Output

Make sure to do the mappings and you now should be able to run and convert to csv Here is the final project

SSIS_Normailize_Session