If you haven’t heard of Matillion, it’s time you did. If you’ve used ETL tools in the past, like the traditional SSIS for Microsoft’s SQL Server, Matillion will feel very familiar. Where SSIS was tailored for SQL Server, Matillion is tailored for Amazon’s data warehouse offering, Amazon Redshift.
Matillion’s user interface is very intuitive as it groups all its components into three natural categories: those for extract, those for transform, and those for load.
Since Matillion sits on top of Redshift, both must live within AWS, so the extract features include components for RDS (and any other relational database that supports a JDBC driver), as well as components for DynamoDB, S3, MongoDB, Cassandra, and Couchbase.
The transformation components provide the ability to aggregate, filter, rank, and even roll up sums over multiple records, taking data from a transient “staging” Redshift table and loading it into a more permanent Redshift table. Under the hood, these components are in fact executing Redshift query commands.
Matillion is quite mature, with countless component features at your fingertips. Much of the transformation logic could certainly be written by hand using Redshift’s comprehensive documentation of its ANSI SQL commands, but what Matillion provides is a visual representation of that logic, which significantly improves collaboration: every team member can be equally responsible for maintaining the ETL process efficiently.
Many could be forgiven for associating an ETL process with the old techniques of extracting large amounts of data from a monolithic data store to push into a read store using the typical CQRS architecture, but Matillion can be used for much more than that.
Our use case at Campaign Monitor required not only data from a snapshot-style structure, as found in a traditional monolithic relational database, but also data sourced from an event stream for a rolling daily aggregation.
To achieve this, we needed an initial query to obtain the total snapshot on a particular day, which can then be stored in a “roll-up” table in Redshift as the first day of aggregated totals. Let’s take the following table as an example snapshot to serve as the baseline for any future events:
ListID | Subscribers | Date |
---|---|---|
1 | 1000 | 2017-10-01 |
2 | 2000 | 2017-10-01 |
3 | 3000 | 2017-10-01 |
4 | 4000 | 2017-10-01 |
This can be done using a traditional ETL job in Matillion as follows:
There is a lot of sophisticated magic under the hood in the Database Query component. The SQL query run against the relational database returns a resultset that Matillion can chunk into separate concurrent processes. Each of these processes creates an underlying S3 file before loading the data into the “rs_lists_subscriber_rollup_staging” table in Redshift.
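For illustration, the query feeding the Database Query component could look something like the sketch below. The “list_subscriptions” table and its columns are hypothetical stand-ins, not Campaign Monitor’s actual schema:

```sql
-- Hypothetical snapshot query for the Database Query component.
-- Counts the active subscribers per list as of the snapshot date.
SELECT
    list_id           AS ListID,
    COUNT(*)          AS Subscribers,
    DATE '2017-10-01' AS SnapshotDate
FROM list_subscriptions
WHERE subscribed_at < DATE '2017-10-02'   -- everything up to the snapshot day
  AND (unsubscribed_at IS NULL
       OR unsubscribed_at >= DATE '2017-10-02')
GROUP BY list_id;
```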
Once we have a snapshot on a particular date (in this case 2017-10-01 in the example above), all events for subscribers that come in after this date can be streamed from our event source, and then aggregated into the same table with a rolling count of subscribers for subsequent dates.
This requires some form of external consumer to receive the incoming events, which can be batched into S3 files and then persisted into an ephemeral “raw” table in Redshift using the COPY command (a sketch of this load step follows the table below). This raw table would look as follows:
ListID | SubscriberID | Subscribe Tally | Date |
---|---|---|---|
1 | 1 | 1 | 2017-10-02 |
1 | 2 | -1 | 2017-10-02 |
3 | 1 | 1 | 2017-10-02 |
2 | 1 | 1 | 2017-10-02 |
2 | 2 | 1 | 2017-10-02 |
2 | 3 | 1 | 2017-10-02 |
4 | 1 | -1 | 2017-10-02 |
4 | 2 | -1 | 2017-10-02 |
4 | 3 | -1 | 2017-10-02 |
4 | 4 | 1 | 2017-10-02 |
4 | 1 | 1 | 2017-10-03 |
4 | 5 | 1 | 2017-10-03 |
4 | 6 | 1 | 2017-10-03 |
4 | 7 | 1 | 2017-10-03 |
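As a rough sketch, creating the raw table and bulk-loading the batched event files might look like the following. The table name, bucket path, and IAM role are placeholders, not our actual setup:

```sql
-- Ephemeral "raw" events table (names are illustrative).
CREATE TABLE rs_subscriber_events_raw (
    list_id         INT,
    subscriber_id   INT,
    subscribe_tally SMALLINT,  -- +1 for a subscribe, -1 for an unsubscribe
    event_date      DATE
);

-- Bulk-load the batched event files from S3 using the COPY command.
COPY rs_subscriber_events_raw
FROM 's3://example-bucket/subscriber-events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load'
FORMAT AS CSV;
```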
A contact generally subscribes to or unsubscribes from a list at a specific time, and these events need to be aggregated daily per list, so the data above would aggregate as follows:
ListID | Subscribers | Date |
---|---|---|
1 | 0 | 2017-10-02 |
2 | 3 | 2017-10-02 |
3 | 1 | 2017-10-02 |
4 | -2 | 2017-10-02 |
4 | 4 | 2017-10-03 |
This aggregation can also be done in Matillion very easily from the raw table:
The Aggregate component simply needs to sum the Subscribe Tally column with a grouping of ListID and Date. Once this is achieved, the result needs to be unioned with the existing snapshot data so there is a baseline to work from for the rolling counts. Only then can the very powerful Window Calculation component be used to roll up the sum of subscribers based on a partition (in this case ListID) and a partition ordering (in this case Date). The Matillion job would look as follows:
There are further properties of the Window Calculation component that need to be considered, including setting the lower bound to “unbounded preceding”. This ensures that the calculated sum is rolled up from the very first record of the dataset unioned in the previous component. The properties would look as follows:
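Expressed as plain Redshift SQL, the whole transformation roughly corresponds to the sketch below. This is an approximation of what Matillion generates, not its literal output, and the snapshot table name is illustrative:

```sql
-- Aggregate the raw events per list per day, union in the snapshot
-- baseline, then roll up a running total per list ordered by date.
WITH daily AS (
    SELECT list_id, SUM(subscribe_tally) AS subscribers, event_date
    FROM rs_subscriber_events_raw
    GROUP BY list_id, event_date
),
combined AS (
    SELECT list_id, subscribers, snapshot_date AS event_date
    FROM rs_lists_subscriber_snapshot          -- the 2017-10-01 baseline
    UNION ALL
    SELECT list_id, subscribers, event_date
    FROM daily
)
SELECT
    list_id,
    SUM(subscribers) OVER (
        PARTITION BY list_id                   -- one running total per list
        ORDER BY event_date
        ROWS UNBOUNDED PRECEDING               -- roll up from the first row
    ) AS subscribers,
    event_date
FROM combined;
```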
There are of course other edge cases to consider that can slowly make your transformation job increasingly complex. These include lists that were created after the snapshot was taken, or, more interestingly, days that receive no events, which need to be populated with a default of zero so the roll-up still produces a row for that day (one way to zero-fill those days is sketched below). Thankfully Matillion makes it simple to build on the above transformation so it remains easy to maintain.
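As a minimal sketch of that zero-fill, assuming a hypothetical “calendar_dates” table with one row per reporting day, every list can be cross joined with every date and days without events defaulted to zero before the rolling sum is applied:

```sql
-- Zero-fill days that have no events (table names are illustrative).
SELECT
    l.list_id,
    COALESCE(SUM(e.subscribe_tally), 0) AS subscribers,  -- 0 on quiet days
    c.calendar_date                     AS event_date
FROM calendar_dates c
CROSS JOIN (SELECT DISTINCT list_id
            FROM rs_subscriber_events_raw) l              -- every known list
LEFT JOIN rs_subscriber_events_raw e
       ON e.list_id = l.list_id
      AND e.event_date = c.calendar_date
GROUP BY l.list_id, c.calendar_date;
```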
The final table would then look as follows:
ListID | Subscribers | Date |
---|---|---|
1 | 1000 | 2017-10-01 |
2 | 2000 | 2017-10-01 |
3 | 3000 | 2017-10-01 |
4 | 4000 | 2017-10-01 |
1 | 1000 | 2017-10-02 |
2 | 2003 | 2017-10-02 |
3 | 3001 | 2017-10-02 |
4 | 3998 | 2017-10-02 |
1 | 1000 | 2017-10-03 |
2 | 2003 | 2017-10-03 |
3 | 3001 | 2017-10-03 |
4 | 4002 | 2017-10-03 |
This final table can then be used for reporting over date ranges from your front-end reporting application.
This example is just one of many use cases where Matillion has helped Campaign Monitor fulfill its ETL requirements for reporting on large datasets. In particular, event-based data has significantly improved our SLAs, since aggregation can run within minutes, compared to traditional relational-data ETLs that process large queries over many hours and refresh report data far less frequently.