Wednesday, September 13, 2017

Gotchas with Altering Table Schemas in Redshift

As I recently learned the hard way, even a simple operation like adding a column via ALTER TABLE doesn't behave the same way in a Redshift data warehouse as it does in a regular relational database.

Running the following SQL statement in Redshift actually has a lot more issues than you would expect:

ALTER TABLE rs_table ADD COLUMN new_column numeric(10);

Despite the simple nature of this statement, many issues begin to emerge.

One of these is table locking. ALTER TABLE takes an exclusive lock on the table, so any application connecting via JDBC will see its connections queue up while the statement runs; and a single Redshift cluster allows at most 500 concurrent connections, as per the online documentation.
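While this is happening, you can see which tables are locked by querying Redshift's STV_LOCKS system table (the join to stv_tbl_perm to resolve table names is just one way to do it; column choice here is illustrative):

```sql
-- Inspect outstanding table locks and which session holds them
SELECT l.table_id,
       t.name AS table_name,
       l.lock_owner_pid,
       l.lock_status
FROM stv_locks l
JOIN stv_tbl_perm t ON t.id = l.table_id
GROUP BY 1, 2, 3, 4;
```

A blocked session's PID can then be terminated with PG_TERMINATE_BACKEND if necessary.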

Moreover, a number of duplicate records appeared where an insert on a matching key should have updated or replaced the existing record instead. This was most likely because the table was mid-alter when the insert ran, so the key match could not be found.

This meant that an alternative approach was needed to avoid the locking and concurrency issues.
The following steps were tried and tested, and proved to work well:
  • Create the new table with the new schema suffixed with _new
  • Copy data from the old table to the new table
  • Rename old table suffixed with _old
  • Rename new table removing _new suffix
This is done in SQL as follows:

-- Create new table with new schema
CREATE TABLE rs_table_new (
    id numeric(10),
    name varchar(250),
    date timestamp,
    new_column numeric(10)
) SORTKEY(id, date);
-- Copy data from old table to new
INSERT INTO rs_table_new
SELECT id, name, date, NULL
FROM rs_table;
-- Rename table
ALTER TABLE rs_table RENAME TO rs_table_old;
ALTER TABLE rs_table_new RENAME TO rs_table;
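Once the renames complete, it's worth sanity-checking the copy before reclaiming the space held by the old table. A minimal check (the count comparison is just one possible verification):

```sql
-- Confirm the new table holds the same number of rows as the old one
SELECT (SELECT COUNT(*) FROM rs_table) AS new_count,
       (SELECT COUNT(*) FROM rs_table_old) AS old_count;

-- Once verified, drop the old table to free its storage
DROP TABLE rs_table_old;
```

Keeping rs_table_old around until the counts match also gives you a quick rollback path: just reverse the two renames.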

Amazingly, the whole process for a table with over 10 million records completed in under 2 minutes, as opposed to the initial 2 hours.