[Redshift Week] Loading Data

Loading Data with Amazon Redshift: procedure

S3 Redshift loading workflow

Loading data into Redshift requires using the AWS S3 service.

How to best export data for Redshift?

  • Exports should be done in differential (incremental) mode to reduce the workload on Redshift:
    • For fact tables: remember the last event/row exported and start the next export right after it. You can, for example, use a server-side timestamp column to know where your next export should start/end (see the sketch after this list)
    • For dimension tables: export only the modified/inserted records, with the help of insert/modification dates if they are accurate
  • Exported files should use the CSV format, meaning with correct escaping of double quotes if necessary
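
A minimal sketch of such a differential fact export, run against the source database. The fact_events table, the server-side event_ts timestamp column, and the :last_exported_ts placeholder (the boundary recorded at the end of the previous export) are all hypothetical:

--Export only the rows added since the previous run,
--ordered so the last row read gives the next starting point
select *
from fact_events
where event_ts > :last_exported_ts
order by event_ts;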


Importing Data

  • COPY command
    • Works best with gzip-compressed input files: the Redshift COPY command is able to decompress them on the fly while loading (see the sketch below)
    • Supports loading the whole content of a directory/prefix, with options (exclude/include…)
  • Import can be done in two ways:
    • Direct loading into the final tables if no transformation is required / only append-only data is loaded (e.g. event data)
    • Indirect loading into staging tables if a transformation or an upsert is required
      • Note: performing transformations in RS leverages the cluster resources… but impacts query performance too.
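
A minimal sketch of a COPY load from S3: the events target table, the bucket/prefix, and the IAM role ARN are all placeholders (older clusters would pass credentials through the CREDENTIALS clause instead of IAM_ROLE):

--Load every gzip-compressed CSV file found under the S3 prefix
copy events
from 's3://my-bucket/exports/events/'
iam_role 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
csv
gzip;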


Upsert procedure

The upsert operation is typically designed for loading dimension data. It consists of first updating the existing rows with the help of a primary/unique key, and then inserting the new rows:
--First update data existing on both sides
update destination a1
set nonpkcols = a2.nonpkcols
from source a2
where a1.pkcols = a2.pkcols;

--Then insert the rows that exist only in source
insert into destination
select a1.cols
from source a1
left join destination a2 on (a1.pkcols = a2.pkcols)
where a2.pkcols is null;
--ADD THE DIST KEY IF NOT EQUAL TO THE PK
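
A concrete version of the same pattern, as a minimal sketch: dim_customer is a hypothetical dimension table keyed (and distributed) on customer_id, and stg_customer is the staging table just loaded with COPY. Wrapping both statements in a single transaction keeps readers from seeing a half-applied upsert:

begin;

--Update the rows already present in the dimension
update dim_customer a1
set name = a2.name,
    email = a2.email
from stg_customer a2
where a1.customer_id = a2.customer_id;

--Insert the rows that only exist in the staging table
insert into dim_customer (customer_id, name, email)
select a1.customer_id, a1.name, a1.email
from stg_customer a1
left join dim_customer a2 on (a1.customer_id = a2.customer_id)
where a2.customer_id is null;

commit;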