[Redshift Week] Partitioning Data

Data Partitioning with Amazon Redshift

Overview

  • We need partitioning to optimize queries on deep historical tables
    • Several years of history at the day/session/event level
  • Redshift does not offer a data partitioning feature for the moment & none is known to be on their roadmap.
  • But there is a workaround using the sort key…


Sort key

  • But Redshift offers a sort key mechanism: the database keeps table data sorted on this key & uses the order to skip irrelevant blocks when a query has a condition on it.
  • The idea is, for fact tables or huge dimensions, to sort on a date (not timestamp) field such as the event date or the creation date.
  • As with the distribution key, this sort key can be propagated to child tables. For example, if you distribute retrievals & searches using the session key & sort the session table on the session creation date, you have to add this date to retrievals & searches & sort them on it too.
  • You will then have to join them using both keys for better selectivity & efficiency (see the first sketch after this list).
  • The main issue with this mechanism is that these tables have to be re-sorted regularly (VACUUM) if the data is not appended in ascending sort key order. This operation gets more & more costly as the history grows, and keeping statistics up to date (ANALYZE) also costs more as the data expands (see the maintenance sketch below).
  • The other issue is that dropping partitions is not possible. Rows can be deleted, but that is costly on Redshift (DELETE + VACUUM).
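A minimal sketch of this setup, assuming a session fact table with a retrieval child table; all table & column names (session_key, session_date, …) are illustrative:

    -- Parent table: distributed on the session key, sorted on the creation date
    CREATE TABLE session (
        session_key   BIGINT NOT NULL,
        session_date  DATE   NOT NULL,   -- creation date, used as the sort key
        channel       VARCHAR(32)
    )
    DISTKEY (session_key)
    SORTKEY (session_date);

    -- Child table: the parent's sort date is propagated as an extra column
    CREATE TABLE retrieval (
        session_key   BIGINT NOT NULL,
        session_date  DATE   NOT NULL,   -- copied from session at load time
        item_id       BIGINT
    )
    DISTKEY (session_key)
    SORTKEY (session_date);

    -- Join on both keys so the date condition prunes blocks on both sides
    SELECT s.session_key, r.item_id
    FROM session s
    JOIN retrieval r
      ON  r.session_key  = s.session_key
      AND r.session_date = s.session_date
    WHERE s.session_date BETWEEN '2014-01-01' AND '2014-01-31';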
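And the maintenance this implies, on the same hypothetical table:

    -- Re-sort rows & reclaim space left by deletes
    VACUUM FULL session;

    -- Refresh table statistics for the query planner
    ANALYZE session;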


Emulate Data Partitioning

  • If loading new data/dropping old data becomes too costly, the only solution proposed for Redshift is to use a different table for each partition.
  • A view centralizes access on top of them using a UNION ALL statement (see the sketch below).
  • Adding/dropping tables & recreating the view becomes easier, but the data loading process will have to insert data into the right table & maintain the tables & the view.
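A sketch of the emulation, assuming yearly partitions of a hypothetical event fact table:

    -- One physical table per partition (here, per year)
    CREATE TABLE event_2013 (
        event_date   DATE   NOT NULL,
        session_key  BIGINT NOT NULL,
        payload      VARCHAR(256)
    )
    DISTKEY (session_key)
    SORTKEY (event_date);

    -- LIKE copies the column definitions & inherits the dist/sort keys
    CREATE TABLE event_2014 (LIKE event_2013);

    -- The view unions the partitions; queries target the view
    CREATE OR REPLACE VIEW event AS
    SELECT * FROM event_2013
    UNION ALL
    SELECT * FROM event_2014;

    -- Dropping a partition: remove it from the view first, then drop the table
    CREATE OR REPLACE VIEW event AS
    SELECT * FROM event_2014;
    DROP TABLE event_2013;

The loading process has to route each batch to the table matching its dates & recreate the view whenever a partition table is added or removed.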