Earlier today someone asked me about the steps I take when creating an SSIS package. That was a great question. The creative process involved is often decidedly non-linear, and it helps to force it into a more linear form. The answer I gave was a list of points grouped under broader categories. It went something like this:
- Define the flow between the end points.
For some this might seem to be the whole business of designing an ETL package. This is where we list what we are extracting, where we are loading it, and what transformations have to take place for that load to be successful. Sometimes this will involve simply copying data to a staging system and triggering a stored procedure to load transactional data into a warehouse. This approach is sometimes called ELT. Other times we are building ETL in support of data integration and need to match data from multiple source systems to arrive at a single version of it. And sometimes we are extracting data from one system to distribute it to multiple consumers.
Our sources could be any number of things. They may be database servers, spreadsheets, flat files, WMI providers, web services, XML files, or any sort of custom source. Databases could be from multiple vendors, and each can require its own drivers and permissions. The account that the ETL runs as will need read permissions to extract data (not to mention connection permissions). It is important to get details about the sources up front to avoid a lot of back and forth getting connections to work.
Destinations could be any number of the same types of things as our sources. There are also additional destinations specific to ETL tools such as SSIS; SSIS has destinations related to SSAS and data warehouse dimensions. Knowing the destinations and their capabilities can make transformation choices less burdensome.
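In SSIS the flow between end points is laid out graphically, but the ELT pattern mentioned above — stage the raw rows, then let the destination do set-based work — can be sketched in a few lines of Python. This is an illustration only, using an in-memory SQLite database as a stand-in for the staging/warehouse server, and the table and function names are invented for the example:

```python
import sqlite3

# In-memory database standing in for the staging/warehouse server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_orders (order_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL)")

def extract():
    # Stand-in for reading from the source system (flat file, web service, etc.).
    return [(1, 19.99), (2, 5.00), (3, 42.50)]

def load_staging(rows):
    # "E" and "L": copy source rows into staging with no transformation.
    conn.executemany("INSERT INTO staging_orders VALUES (?, ?)", rows)

def transform_in_database():
    # "T": the destination server does the set-based work, standing in for
    # the stored procedure an ELT design would trigger after the copy.
    conn.execute("INSERT INTO fact_orders SELECT order_id, amount FROM staging_orders")
    conn.execute("DELETE FROM staging_orders")

load_staging(extract())
transform_in_database()
loaded = conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
```

The point of the shape is that the transformation logic lives on the destination side, which is exactly what makes it ELT rather than ETL.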
- List the transformation(s).
Different sources and destinations will have different specific implementations of data types. They may likewise have differing code pages and collations. These sorts of transformations may be necessary even when loading to a staging environment. Other transformations include matching data between separate sources, decoding industry codes, extracting nested data, adding audit and lineage data, or capturing statistics about data flowing between systems. One of the most common types will be the schema transformations necessary to load transactional data into a warehouse form.
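Two of the transformations listed above — decoding an industry code and normalizing a data type — can be sketched as a row-level function. This is Python rather than an SSIS Derived Column or Lookup transform, and the lookup table and column names are made up for the example:

```python
# Hypothetical lookup for decoding industry codes carried in the source feed.
CARRIER_CODES = {"AA": "American Airlines", "DL": "Delta Air Lines"}

def transform(row):
    """Apply two of the row-level transformations listed above:
    data type normalization and industry-code decoding."""
    out = dict(row)
    out["amount"] = float(row["amount"])  # source delivers numerics as text
    out["carrier_name"] = CARRIER_CODES.get(row["carrier"], "UNKNOWN")
    return out

row = transform({"carrier": "DL", "amount": "19.99"})
```

In SSIS the same work would typically be a Lookup transformation plus a Data Conversion or Derived Column component in the data flow.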
- Define the environmental parameters
The environmental parameters address more specific issues about the source and destination, as well as performance requirements and scheduling. It is good to list these items even when some of them will not make it into an initial design, because doing so exposes options and shows due diligence in selecting from the techniques available.
- List the preparation(s).
Does a process have to execute to generate the data? If so, the account the ETL package runs as will need execute permissions, or will need to be called by something that executes the data-generating process. Is there a list that can be used to limit operations? Are there variables that must be set before proceeding to the data flow?
- List the cleanup(s).
Do connections need to be closed? Does a completion message need to be logged or sent to a receiver? Is there another process that should be started, as in the case of an ELT-style load, which should trigger stored procedures on the destination side?
- List the logging.
To what extent should the ETL package perform its own logging? Are there custom events for the package that need to be logged? Does the ETL controller log for the package? Should the logging be pushed to the system logs or to custom ones?
- List the error handling.
How does the package handle errors? If the volume of data is large, the load should be broken into discrete batches so that a failure rolls back to an intermediate state rather than all the way to the start. If the package is broken into several that run on separate servers (e.g. the source system has an extractor and the destination a loader), how should errors be communicated between them?
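The batched-rollback idea can be sketched as follows — a minimal illustration in Python with SQLite standing in for the destination, not SSIS checkpoint/transaction configuration. Committing after each batch means a failure loses only the in-flight batch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (n INTEGER)")
conn.commit()

def load_in_batches(rows, batch_size=100):
    """Commit after each batch so a failure rolls back only the
    current batch, leaving previously committed batches in place."""
    loaded = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        try:
            conn.executemany("INSERT INTO target VALUES (?)", ((n,) for n in batch))
            conn.commit()
            loaded += len(batch)
        except sqlite3.Error:
            conn.rollback()  # only the failed batch is lost
            raise            # surface the error for logging/alerting
    return loaded

count = load_in_batches(list(range(250)), batch_size=100)
```

Restartability then becomes a matter of recording which batches committed, so a rerun can resume from the intermediate state instead of reloading everything.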
- List the configuration(s)/environment(s).
While developing the package, what systems will be used? How will this change when moving the package to the testing and production environments? What variables and package elements will be affected by this move?
- List the relevant infrastructure element(s).
This can be broken down into several subordinate questions about the physical locations of the source and destination systems, as well as those systems themselves. In many respects this is a list of opportunities and conjectures related to performance enhancement. Listing these items can help to identify major issues to work around up front, but it is often best used as a place to revisit ideas for performance after the ETL is actually working.
- Same server
Sometimes ETL is a matter of moving some data around on the same server. It still pays to be aware of which disks the files reside on and how they are formatted, as well as what kind of I/O load can be expected.
- Server to Server (with ETL controller)
Whether or not there is an ETL controller can determine how a package is run. A separate ETL controller will have its own scheduling system and will not run against limits the source or destination servers might have owing to their own scheduled operations. It might be desirable to have the source and destination each handle their own scheduling and operate asynchronously (i.e. the source system outputs to a file on the ETL server, which the destination server reads when resources become available).
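That asynchronous file hand-off can be sketched in a few lines — again an illustration, not SSIS itself, with a temporary directory standing in for the ETL server's file share and the file names invented for the example:

```python
import json
import tempfile
from pathlib import Path

share = Path(tempfile.mkdtemp())  # stand-in for the ETL server's file share

def source_extract(rows):
    # Source side: drop the extract on the share on its own schedule.
    tmp = share / "orders.json.tmp"
    tmp.write_text(json.dumps(rows))
    tmp.rename(share / "orders.json")  # atomic rename signals "file is complete"

def destination_poll():
    # Destination side: load whenever resources allow and the file exists.
    f = share / "orders.json"
    if not f.exists():
        return None
    rows = json.loads(f.read_text())
    f.unlink()  # consume the hand-off file
    return rows

source_extract([{"order_id": 1}, {"order_id": 2}])
received = destination_poll()
```

The write-to-temp-then-rename step matters in practice: it keeps the destination from picking up a half-written extract.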
- Server to Multi-server (with ETL controller)
As with server-to-server ETL, a broadcast-type ETL might benefit from separating the extract and load stages. It might also benefit from a T-like operation, where the flow between the source and (at least one) destination runs directly but also puts a copy down to a share that can be picked up by asynchronous destinations.
- Multi-site (with SAN Replication)
When the source and destination server(s) are located in separate buildings, understanding the WAN dynamics can help avoid frustrations. There are also opportunities for asynchronous ETL where SAN replication is involved (i.e. an extract can write to the SAN while a load monitors the local SAN replica for the file and loads it when it arrives). Bandwidth restrictions can cause SLA headaches, and it is best to note these things up front rather than fail by exceeding the limitations of the network. Even with fiber between the sites there are still routers and switches, which are electronic and therefore not moving at the speed of light (and even light is not instantaneous transmission). ETL is also not the only traffic, and your network admin can help you understand how much bandwidth can be used, if only by getting you some data to ETL so you can analyze it.
How often is the ETL going to run? What other processes will be run that can interfere with it? Does it have to run 24/7 or are there windows of outage? How and when are updates to the source system going to be tested for their impact on the ETL?
How much delay between source system and destination system changes is tolerable (ideally/realistically, cost/benefit)? If this is low, can a subset of the ETL'd data be treated as higher priority (i.e. if certain columns in a table contribute to an aggregate, shouldn't they arrive first, followed by the drill-down detail data)?
- Define the data quality parameters
What are the meanings of numbers and letters in fields? How will they be stored on the destination side?
Are there organizations that govern any aspect of the data and how it is to be represented for the industry (e.g. ICD-9, ICD-10, IATA, NAIC, etc.)?
- Code Page
What code page translation issues must be specifically handled (e.g. squares or question marks for unknown characters)?
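The question-mark symptom is easy to demonstrate. A minimal Python sketch (the sample string is invented) shows what happens when text containing characters outside the destination code page — here Windows-1252 — is converted strictly versus with replacement:

```python
# A string containing characters outside the cp1252 target code page
# (Polish Ł and ź, Greek Σ); ó does map, so it survives.
text = "Łódź office Σ totals"

strict_ok = True
try:
    text.encode("cp1252")  # strict conversion fails on unmappable characters
except UnicodeEncodeError:
    strict_ok = False

# errors="replace" substitutes "?" -- the question marks the text above warns about.
lossy = text.encode("cp1252", errors="replace").decode("cp1252")
```

Whether the right behavior is to fail the row, substitute, or redirect to an error output is exactly the kind of decision to record up front.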
- Data Types
What data types are supported by the destinations? What are the default mappings to the source data types? What will be the process for changing a data type or its scope for either the source or destination?
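One way to make the default mappings reviewable is to write them down as a plain lookup that fails loudly on anything unmapped, so a new source type forces an explicit design decision. The sketch below is Python, with a hypothetical subset of SQL Server source types mapped to SSIS pipeline types (DT_I4, DT_WSTR, etc. are real SSIS type names; the selection and function are illustrative):

```python
# Hypothetical default mapping from SQL Server-style source types
# to SSIS pipeline data types.
DEFAULT_TYPE_MAP = {
    "int":      "DT_I4",
    "bigint":   "DT_I8",
    "varchar":  "DT_STR",
    "nvarchar": "DT_WSTR",
    "datetime": "DT_DBTIMESTAMP",
    "decimal":  "DT_NUMERIC",
}

def map_type(source_type):
    """Resolve a source column type, raising on anything unmapped so a
    new or changed type cannot slip through without a decision."""
    try:
        return DEFAULT_TYPE_MAP[source_type]
    except KeyError:
        raise ValueError(f"no default mapping for source type {source_type!r}")

mapped = map_type("nvarchar")
```

The "process for changing a data type" question then has a concrete answer: the change is a reviewed edit to the mapping table.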
- Traceability (Data Lineage)
What source system data will be retained as evidence of the who, what, where, when, and why of the source system data? What data will be added in ETL as evidence of the who, what, where, when, and why of the load?
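The load-side lineage evidence usually takes the form of columns stamped onto every row. A minimal Python sketch of the idea (the column names here are illustrative, not a standard):

```python
import uuid
from datetime import datetime, timezone

def add_lineage(rows, source_system, package_name):
    """Stamp each row with who/what/when lineage columns; the
    column names are hypothetical, not a fixed convention."""
    batch_id = str(uuid.uuid4())  # ties every row to one load run
    loaded_at = datetime.now(timezone.utc).isoformat()
    for row in rows:
        row["lineage_source"] = source_system    # where the row came from
        row["lineage_package"] = package_name    # what loaded it
        row["lineage_batch_id"] = batch_id       # which run loaded it
        row["lineage_loaded_at"] = loaded_at     # when it was loaded
    return rows

stamped = add_lineage([{"order_id": 1}], "CRM", "LoadOrders.dtsx")
```

A shared batch identifier is the piece people most often regret omitting: it is what lets you trace, or back out, everything a single run touched.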