Extract Transform Load Design

Earlier today someone asked me about the steps I take when creating an SSIS package. That was a great question. The creative process involved is often decidedly non-linear, and it helps to force it into a more linear form. The answer I gave was a list of points grouped under broader categories. It went something like this:

  • Define the flow between the end points.

For some this might seem to be the whole business of designing an ETL package. This is where we list what we are extracting, where we are loading it, and what transformations have to take place for that load to be successful. Sometimes this will involve simply copying data to a staging system and triggering a stored procedure to load transactional data into a warehouse. This approach is sometimes called ELT. Other times we are building ETL in support of data integration and need to match data from multiple source systems together to arrive at a single version of the data. Sometimes we are extracting data from one system to distribute it to multiple consumers.
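
For the ELT variant described above, the orchestration can be as thin as a bulk copy followed by a stored procedure call. The sketch below illustrates that shape in Python with the pyodbc library; the DSNs, table, and procedure names are hypothetical placeholders, not part of any actual package.

    # Minimal ELT sketch: land source rows in a staging table, then hand off
    # to a stored procedure on the warehouse side to do the heavy lifting.
    # All object names (StagingSales, usp_LoadSalesFact, DSNs) are hypothetical.
    import pyodbc

    src = pyodbc.connect("DSN=SourceOltp")        # read-only account on the source
    dst = pyodbc.connect("DSN=WarehouseStaging")  # account with insert/execute rights

    rows = src.execute("SELECT SaleId, SaleDate, Amount FROM dbo.Sales").fetchall()

    cur = dst.cursor()
    cur.fast_executemany = True
    cur.executemany(
        "INSERT INTO stg.StagingSales (SaleId, SaleDate, Amount) VALUES (?, ?, ?)",
        [tuple(r) for r in rows],
    )
    dst.commit()

    # The 'L' of ELT: the destination server transforms staged rows into the warehouse.
    cur.execute("EXEC dw.usp_LoadSalesFact")
    dst.commit()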

  • List the source(s).

Our sources could be any number of things. They may be database servers, spreadsheets, flat files, WMI providers, web services, XML files, or any sort of custom source. Databases could be from multiple vendors and can require drivers and permissions. The account that the ETL runs as will need read permissions to extract data (not to mention connection permissions). It is important to get details about the sources up front to avoid a lot of back and forth getting connections to work.

  • List the destination(s).

Like our sources, destinations could be any number of the same types of things. There are also additional destinations specific to ETL tools such as SSIS, which has destinations related to SSAS and data warehouse dimensions. Knowing the destinations and their capabilities can help to make transformation choices less burdensome.

  • List the transformation(s).

Different sources and destinations will have different specific implementations of data types. They may likewise have differing code pages and collations. These sorts of transformations may be necessary even when loading to a staging environment. Other transformations can include matching data between separate sources, decoding industry codes, extracting nested data, adding audit and lineage data, or capturing statistics about data flowing between systems. One of the more common types of transformation will be the schema transformations necessary to load transactional data into a warehouse form.
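
As a concrete illustration of the row-level transformations listed above, the sketch below trims a fixed-width field, decodes an industry code, and coerces a source data type. The field names and the code lookup are hypothetical examples, not taken from any real source.

    # Sketch of simple row-level transformations: trimming, code decoding, and
    # data type coercion. Field names and the status code table are hypothetical.
    from decimal import Decimal

    STATUS_CODES = {"A": "Active", "I": "Inactive", "P": "Pending"}  # assumed lookup

    def transform_row(row: dict) -> dict:
        return {
            "customer_name": row["cust_nm"].strip(),                  # trim fixed-width padding
            "status": STATUS_CODES.get(row["status_cd"], "Unknown"),  # decode industry code
            "balance": Decimal(str(row["bal_amt"])),                  # float -> exact decimal
        }

    print(transform_row({"cust_nm": " Acme Corp ", "status_cd": "A", "bal_amt": 125.5}))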

  • Define the environmental parameters

The environmental parameters address more specific issues about the source and destination as well as performance requirements and scheduling. It is good to list these items even when some may not make it into an initial design, because they expose options and show the due diligence undertaken in selecting among the available techniques.

    1. List the preparation(s).

Does a process have to execute to generate the data? The account the ETL package runs as will need execute permissions, or will need to be called by something that executes the data-generating process. Is there a list that can be used to limit operations, or are there variables that must be set, before proceeding to the data flow?
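
As a small illustration of a preparation step, the sketch below executes a data-generating procedure and reads a watermark variable before the data flow is allowed to proceed. The DSN, procedure, and control table names are assumptions for the example.

    # Preparation sketch: run the process that generates the extract data, then
    # read a watermark used to limit the data flow. Object names are hypothetical.
    import pyodbc

    con = pyodbc.connect("DSN=SourceOltp")  # account needs EXECUTE permission
    con.execute("EXEC dbo.usp_BuildDailyExtract")
    con.commit()

    # Variable that must be set before proceeding to the data flow.
    last_loaded = con.execute(
        "SELECT MAX(LoadedThrough) FROM etl.LoadControl"
    ).fetchval()
    print("Extracting rows changed after", last_loaded)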

    2. List the cleanup(s).

Do connections need to be closed? Does a completion message need to be logged or sent to a receiver? Is there another process that should be started, as in the case of an ELT-style load that should trigger stored procedures on the destination side?

    3. List the logging.

To what extent should the ETL package perform its own logging? Are there custom events for the package that need to be logged? Does the ETL controller log for the package? Should the logging be pushed to the system logs or to custom ones?
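
One lightweight way to reason about the system-log versus custom-log question is to route the package's custom events through a single logger with more than one handler, as in this hedged sketch. The logger name, file name, and events are assumptions for illustration only.

    # Logging sketch: one logger, two handlers, so the same custom ETL events go
    # both to the controller's captured output and to a package-owned log file.
    import logging

    log = logging.getLogger("etl.daily_sales")
    log.setLevel(logging.INFO)
    log.addHandler(logging.StreamHandler())                      # picked up by the scheduler/controller
    log.addHandler(logging.FileHandler("etl_daily_sales.log"))   # custom, package-owned log

    log.info("package started")
    log.info("rows extracted: %d", 15000)        # custom event worth keeping in both logs
    log.warning("lookup miss on status code %r", "X")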

    4. List the error handling.

How does the package handle errors? If the volume of data is large, the load should be broken into discrete batches so that it can roll back to an intermediate state rather than all the way to the start. If the package is broken into several packages that run on separate servers (e.g. the source system has an extractor and the destination a loader), how should the errors be communicated between them?
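
A hedged sketch of the "break the load into batches" idea: commit each batch and record the last good one, so a failure rolls back only the in-flight batch rather than the whole load. The connection, table, and checkpoint names are hypothetical.

    # Error-handling sketch: load in discrete batches, commit each batch, and record
    # progress so a failure leaves us at a known intermediate state instead of zero.
    import pyodbc

    dst = pyodbc.connect("DSN=WarehouseStaging", autocommit=False)
    cur = dst.cursor()

    def load_batches(batches):
        for n, batch in enumerate(batches, start=1):
            try:
                cur.executemany(
                    "INSERT INTO stg.StagingSales (SaleId, SaleDate, Amount) VALUES (?, ?, ?)",
                    batch,
                )
                cur.execute("UPDATE etl.LoadControl SET LastBatch = ?", n)  # checkpoint
                dst.commit()
            except pyodbc.Error:
                dst.rollback()   # only the failing batch is lost
                raise            # surface the error to the controller / alerting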

    5. List the configuration(s)/environment(s).

While developing the package, what systems will be used? How will this change when moving the package to the testing and production environments? What variables and package elements will be affected by this move?
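
One way to keep the dev/test/production question visible is to externalize the environment-specific values, as in this minimal sketch. The variable name and DSNs are assumptions; SSIS itself would typically handle this with package configurations or environments in the SSIS catalog.

    # Configuration sketch: pull environment-specific values from the environment so
    # moving the package from dev to test to production changes no package logic.
    import os

    ENV = os.environ.get("ETL_ENVIRONMENT", "dev")

    CONNECTIONS = {
        "dev":  {"source": "DSN=SourceOltp_Dev",  "staging": "DSN=Staging_Dev"},
        "test": {"source": "DSN=SourceOltp_Test", "staging": "DSN=Staging_Test"},
        "prod": {"source": "DSN=SourceOltp",      "staging": "DSN=Staging"},
    }

    conn_strings = CONNECTIONS[ENV]
    print(f"Running against {ENV}: {conn_strings['source']} -> {conn_strings['staging']}")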

    6. List the relevant infrastructure element(s).

This can be broken down into several subordinate questions about the physical locations of the source and destination systems, as well as about those systems themselves. In many respects this is a list of opportunities and conjectures related to performance enhancement. Listing these items can help to identify major issues to work around up front, but it often serves best as a place to revisit performance ideas after the ETL is actually working.

      1. Same server

Sometimes ETL is a matter of moving some data around on the same server. It still pays to be aware of what disks the files reside on and how they are formatted, as well as what kind of IO load can be expected.

      2. Server to Server (with ETL controller)

Whether or not there is an ETL controller can determine how a package is run. A separate ETL controller will have its own scheduling system and will not run up against limits the source or destination servers might have owing to their own scheduled operations. It might be desirable to have the source and destination each handle their own scheduling and operate asynchronously (i.e. the source system outputs to a file on the ETL server, which is read by the destination server when resources become available on the destination).
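
The asynchronous file hand-off mentioned above can be sketched, on the source side, as an extract that writes a data file plus a small "done" marker to a share on the ETL controller. The UNC path, file names, and layout here are assumptions for illustration.

    # Source-side sketch of the asynchronous hand-off: write the extract to a share
    # on the ETL controller, then drop a zero-byte marker so the destination knows
    # the file is complete. Paths and file names are hypothetical.
    import csv
    from pathlib import Path

    DROP = Path(r"\\etl-controller\drops\sales")   # assumed UNC share

    def write_extract(rows, name="sales_20150109"):
        data_file = DROP / f"{name}.csv"
        with data_file.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["SaleId", "SaleDate", "Amount"])
            writer.writerows(rows)
        (DROP / f"{name}.done").touch()   # marker: safe for the destination to pick up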

      3. Server to Multi-server (with ETL controller)

As with server to server ETL, a broadcast-type ETL might benefit from a separation of the extract and load stages. It might also benefit from a T-like operation, where the flow between the source and (at least one) destination runs directly, but a copy is also put down to a share that can be picked up by asynchronous destinations.

      4. Multi-site (with SAN Replication)

When the source and destination server(s) are located in separate buildings, understanding the WAN dynamics can help in avoiding frustrations. There are also opportunities with asynchronous ETL where SAN replication is involved (i.e. we can have an extract write to the SAN and a load process monitor its local SAN replica for the file and load when it arrives). Bandwidth restrictions can cause SLA headaches, and it is best to note these things up front rather than fail when exceeding the limitations of the network. Even with fiber between the sites there are still going to be routers and switches, which are electronic and therefore not moving at the speed of light (which itself is not instantaneous transmission). ETL is also not the only traffic, and your network admin can help you understand how much can be used, if only by getting you some data to ETL so you can analyze it.
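
The destination side of that replicated hand-off can be a simple poller that watches the local SAN replica for the marker file and loads when it appears. This is a hedged sketch with hypothetical paths, a made-up polling interval, and a placeholder load step.

    # Destination-side sketch: poll the local SAN replica for the '.done' marker
    # written by the extract, load the matching data file, then archive both.
    # Paths, polling interval, and the load step are assumptions.
    import shutil
    import time
    from pathlib import Path

    REPLICA = Path(r"E:\san_replica\drops\sales")
    ARCHIVE = REPLICA / "archive"

    def load_file(path: Path):
        print("loading", path)   # placeholder for the real load (bulk insert, BCP, etc.)

    def watch_and_load(poll_seconds=60):
        ARCHIVE.mkdir(exist_ok=True)
        while True:
            for marker in REPLICA.glob("*.done"):
                data_file = marker.with_suffix(".csv")
                if data_file.exists():
                    load_file(data_file)
                    shutil.move(str(data_file), ARCHIVE / data_file.name)
                    marker.unlink()
            time.sleep(poll_seconds)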

    7. Scheduling

How often is the ETL going to run? What other processes will be run that can interfere with it? Does it have to run 24/7 or are there windows of outage? How and when are updates to the source system going to be tested for their impact on the ETL?

    8. Synchronicity

How much delay between changes on the source system and on the destination system is tolerable (ideally/realistically, cost/benefit)? If the tolerable delay is low, can a subset of the ETL’d data be treated as higher priority (e.g. if certain columns in a table contribute to an aggregate, shouldn’t those arrive first, followed by the drill-down detail data)?

  • Define the data quality parameters
    1. Encoding

What are the meanings of numbers and letters in fields? How will they be stored on the destination side?

    2. Standards

Are there organizations that govern any aspect of the data and how it is to be represented for the industry? (e.g. ICD9, ICD10, IATA, NAIC, etc)

    3. Code Page

What code page translation issues must be specifically handled (e.g. squares or question marks for unknown characters)?
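
A small illustration of where the unknown-character squares and question marks come from: decoding bytes with the wrong code page, or forcing them into a narrower one, silently substitutes characters, so it pays to decode with the source's declared code page explicitly. The code pages and sample text below are examples only.

    # Code page sketch: the same bytes read differently depending on the code page
    # assumed, and a lossy round-trip is where the '?' characters come from.
    raw = "café €5".encode("cp1252")       # bytes as produced by a Windows-1252 source

    print(raw.decode("cp1252"))            # 'café €5'  -- correct, declared code page used
    print(raw.decode("latin-1"))           # the € sign becomes an invisible control character
    print(raw.decode("cp1252").encode("ascii", "replace").decode("ascii"))  # 'caf? ?5'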

    4. Data Types

What data types are supported by the destinations? What are the default mappings to the source data types? What will be the process for changing a data type or its scope for either the source or destination?
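
A simple way to make the default type mappings explicit, and to make a change on either side a visible and reviewable edit, is to keep them in a small table-like structure. The source and destination types below are hypothetical examples.

    # Data type mapping sketch: record the default source-to-destination mappings so
    # a change on either side is a visible, reviewable edit. Types shown are examples.
    DEFAULT_TYPE_MAP = {
        # source (e.g. Oracle)  ->  destination (e.g. SQL Server)
        "NUMBER(10,2)":             "decimal(10,2)",
        "VARCHAR2(100)":            "nvarchar(100)",
        "DATE":                     "datetime2(0)",
        "CLOB":                     "nvarchar(max)",
    }

    def destination_type(source_type: str) -> str:
        try:
            return DEFAULT_TYPE_MAP[source_type]
        except KeyError:
            raise ValueError(f"No agreed mapping for source type {source_type!r}")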

    5. Traceability (Data Lineage)

What source system data will be retained as evidence of the who, what, where, when, and why of the source system data? What data will be added in ETL as evidence of the who, what, where, when, and why of the load?
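
As an illustration, lineage can be as simple as stamping each row with a handful of audit columns as it passes through the ETL. The column names here follow common audit conventions and are an example, not a mandate.

    # Lineage sketch: stamp every row with who/what/where/when evidence of the load.
    # Column names follow common audit conventions and are only an example.
    from datetime import datetime, timezone
    import getpass
    import socket

    def add_lineage(row: dict, source_system: str, batch_id: int) -> dict:
        row.update({
            "etl_source_system": source_system,             # where the row came from
            "etl_batch_id": batch_id,                       # which load produced it
            "etl_loaded_at": datetime.now(timezone.utc),    # when it was loaded
            "etl_loaded_by": getpass.getuser(),             # who (the ETL account)
            "etl_host": socket.gethostname(),               # where it was processed
        })
        return row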

Let’s Talk About Dataflow Architecture

This morning I (Thomas Marshall) put together this drawing to facilitate a discussion on generic high level aspects of data architecture.


2015-1-9 Conceptual Generic High Level Data Flow

  • Application
    • These will be the primary business applications.
  • OLTP Data store
    • These are the databases behind the applications used by your production employees, sales, and customer service people. Some may be purely transactional, some will be more hybrid, and others will tend to be special purpose. Not all of these will have a relational database backend. Deciding what to include in this category of server is an important step, as these are the systems we will be bringing together to give you a more comprehensive picture of your organization’s relationship with its industry, customers, vendors, profits, sales, and competition.
  • External Data Source
    • External data sources will include government data, data from credit bureaus, industry organizations, even vendors and customers. This data will need to be processed both to fit it to your system and to clean it so that it cannot carry an attack payload. This data is often important in predictive analytics. Knowing things like how market conditions affect customer credit ratings, and being able to correlate that with marketing campaign success or customer loyalty, can help a business turn crisis into opportunity.
  • Integration Staging Data store
    • The integration staging data store will serve as an intermediary and temporary store for the data that comes in from multiple sources. This can also be a staging source for transforming the OLTP data to the warehouse model. The reason for putting a staging store in has to do with contention for resources as well as the timing of delivery and validation of external data. The OLTP system will tend to be tied directly to revenue, and we will want to minimize our time reading and processing data on the OLTP server. External data will be out of an organization’s control and its model can change without notice (especially when it is free data).
  • Integration ETL
    • Integration ETL extracts data from transactional (application) and non-transactional source systems (industry publications, government and vendor supplied data). In larger organizations this might also include data stores for enterprise monitoring, service tickets, and asset management software, which can help when it comes to generating reports on IT operations. A minimal sketch of how these stages chain together appears after this list.
  • Data Warehouse
    • Many people see data warehouse as synonymous with star schema. It will actually tend to be a constellation of stars and snowflakes, as well as more specifically purpose-oriented models. That is to say that the stars are generally optimal for analyzing data over specific dimensions (e.g. time), but some data can be included without such transformation (this is part of the impetus for big data). The primary purpose of the data warehouse is that it serves as a hub for distributing data throughout an organization. It will ideally be tended by data stewards applying techniques to deduplicate and standardize the data.
  • Dissemination ETL
    • This ETL process transforms the more generic read friendly model of the data warehouse into forms more specific to the consumers. These forms will be specific to reports and analytic structures. It will also tend to produce small reports about the delta on the warehouse.
  • Report Data store
    • Depending on the scale of the organization, the report data store can reside on its own database instance, across several instances, or even on the same server as the data warehouse. Initially an organization will begin by running its reports against its OLTP server, and later against another server for load balancing. It might call this second server (or group of servers) a replica, data warehouse, data mart, or reporting server. This data store will evolve over time in the interest of report performance, and will do so along with the ETL that feeds it. Because the reports will tend to have been written against the OLTP server, they will have complicated joins that reflect a system modeled for insert and update efficiency rather than read efficiency. As the ETL starts to fill a model specific to the reports, the report logic will tend to become simpler and subsequently the reports will run faster.
  • Descriptive reports
    • Descriptive reports are the standard reports most business stakeholders are familiar with. These tell the organization where it has been, what it has done, how many sales, how much money, and where other key performance indicators sit.
  • Analytic Data store
    • The analytic data store is an important component of predictive analytics. In the Microsoft BI stack this will be your SQL Server Analysis Services server(s) (tabular or cube). The analytic data store will have the results of calculations stored at several levels of aggregation, which helps to speed the return of results because those calculations aren’t performed for each request.
  • Predictive reports
    • These reports will tend to be the most complex to create. They will have underlying data mining projects which generate the trends, clusters, and lift metrics that help your organization anticipate future trends. Some of these will need to be designed to match up with descriptive reports; these will help to produce trend lines. Others won’t match up, as they will be making predictions that feed the assumptions behind those trend lines, or be used to create marketing plans.
  • Production reports
    • These reports will tend to be simple and serve to alert business stakeholders to issues with the dataflow. They can also point to reasons for delays in processing and, subsequently, in the descriptive and predictive reports. They also provide talking points and serve to delineate or codify the aspects of the dataflow that the stakeholders are concerned with.
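
To tie the components above together, here is a minimal sketch of the same flow expressed as a list of hops. It only restates the diagram in code form, and the exact hop list is an interpretation of the drawing rather than a specification.

    # High-level sketch of the dataflow in the diagram as an ordered set of hops.
    # Names mirror the components described above; the hop list is illustrative.
    DATAFLOW = [
        ("OLTP data stores",      "Integration ETL",   "Integration staging"),
        ("External data sources", "Integration ETL",   "Integration staging"),
        ("Integration staging",   "Integration ETL",   "Data warehouse"),
        ("Data warehouse",        "Dissemination ETL", "Report data store"),
        ("Data warehouse",        "Dissemination ETL", "Analytic data store"),
    ]

    for source, mover, destination in DATAFLOW:
        print(f"{source:<25} --[{mover}]--> {destination}")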