[Redshift Week] Data Distribution Rules

Everything you need to know about data distribution rules

Data Distribution Rules

Distributing data efficiently is key to ensuring good performance in an MPP database.

Data must be evenly distributed

  • With an uneven distribution, one slice may become a bottleneck for the whole query execution (a quick way to check is sketched below)
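
A quick way to check for skew is to count the rows stored on each slice, using the standard stv_tbl_perm system table. A minimal sketch (the table name is hypothetical):

    -- Rows per slice for one table: a skewed distribution shows up
    -- as one slice holding far more rows than the others.
    SELECT slice, SUM(rows) AS rows_on_slice
    FROM stv_tbl_perm
    WHERE name = 'fact_sessions'   -- hypothetical table name
    GROUP BY slice
    ORDER BY rows_on_slice DESC;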

Try to minimize data movement between slices when querying data

  • Data redistribution may occur when joining tables

Redshift decides, based on the value of the distribution key, which slice a given row will be stored on.

Worst case: using a column containing only NULLs or a single unique value as the distribution key. All rows will go to one single slice & no parallelization will occur. Always choose a non-null column to distribute data.
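
As an illustration, a minimal DDL sketch contrasting a bad and a good distribution key choice (table & column names are hypothetical):

    -- BAD: status has only a handful of values (and may be NULL),
    -- so all rows hash to very few slices.
    CREATE TABLE events_bad (
        event_id BIGINT,
        status   CHAR(1) DISTKEY,   -- low cardinality, nullable
        payload  VARCHAR(256)
    );

    -- GOOD: a non-null, high-cardinality key spreads rows evenly.
    CREATE TABLE events_good (
        event_id BIGINT DISTKEY NOT NULL,
        status   CHAR(1),
        payload  VARCHAR(256)
    );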

 

Data Redistribution

  • For example, when querying a fact table joined with a small dimension table, the planner may temporarily redistribute the whole dimension data to all slices (broadcast operation) & then join these small full local copies with each slice’s resident fact data.
  • If the planner chose the opposite way (broadcasting the fact data), the amount of data moved between slices would be enormous & costly (all fact data moved & joined on each slice). A broadcast is also performed when no distribution key is involved in a join clause
  • The planner can also choose to move only the subset of the local copy needed by each slice (redistribution operation); see the sketch after this list
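
To see which strategy the planner picked, run EXPLAIN on the join. In Redshift plans the join operator carries a data-movement attribute; the table names below are hypothetical:

    EXPLAIN
    SELECT f.visitor_id, d.label
    FROM fact_retrievals f
    JOIN dim_content d ON d.content_id = f.content_id;
    -- Look for the join's data-movement attribute, e.g.:
    --   DS_BCAST_INNER : the whole inner (dimension) table is
    --                    broadcast to every slice
    --   DS_DIST_INNER  : only the needed inner rows are
    --                    redistributed to the matching slices
    --   DS_DIST_NONE   : no movement, the join is collocated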

 

Distribution Key / FACT

 

FACT Tables:

  • Use the session id or, even better, the visitor id as the distribution key (a visitor meaning a person/computer).
  • By doing this you can expect a very good distribution due to the high cardinality of these keys, while ensuring that all data for a single session (or even better, all sessions’ data for a single visitor) remain on the same slice (collocation); a DDL sketch follows this list
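
A minimal sketch of two fact tables distributed this way (all names & columns are hypothetical):

    -- Both fact tables share the same distribution key, so all
    -- rows of a given session land on the same slice.
    CREATE TABLE fact_sessions (
        session_id BIGINT DISTKEY NOT NULL,
        visitor_id BIGINT NOT NULL,
        started_at TIMESTAMP
    );

    CREATE TABLE fact_searches (
        session_id   BIGINT DISTKEY NOT NULL,
        search_query VARCHAR(512),
        searched_at  TIMESTAMP
    );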

 

Example with 2 slices: the same dist key ensures search & session data for each individual session reside on the same slice. This is optimal as the database doesn’t have to move rows between slices.

[Figure: data distribution example 1 — same distribution key on both tables]

 

Example with 2 slices: different dist keys cause row movements between slices for the same individual session. This is suboptimal.

[Figure: data distribution example 2 — different distribution keys]
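
A sketch of both situations, reusing the hypothetical tables from above: with matching distribution keys the join is local to each slice; if one table were distributed on another column, rows would have to move.

    -- Optimal: both tables are distributed on session_id, so the
    -- plan should show DS_DIST_NONE (no rows moved between slices).
    EXPLAIN
    SELECT s.session_id, COUNT(*) AS searches
    FROM fact_sessions s
    JOIN fact_searches se ON se.session_id = s.session_id
    GROUP BY s.session_id;

    -- Suboptimal: if fact_searches were distributed on, say,
    -- searched_at instead, the same join would show DS_DIST_INNER
    -- or DS_BCAST_INNER, i.e. rows moving between slices.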

 

Distribution Key / Dimension

Dimension tables:

  • By default, the distribution key is the primary key of the dimension
  • But if a dimension is referenced by another dimension, it may be more efficient to distribute on its key & carry all the primary keys of the chain down to the fact table
    • Ex: we could distribute content (which references journals) by the journal id on both sides, even though content id has a higher cardinality than journal id.
    • In such a case, it is also more efficient to add the journal id in addition to the content id in the fact table: indeed, you should never join tables using conditions that do not involve the distribution key.
    • In our example, that means joining the retrievals fact & content (distributed by journal id) must use an equality on content id but also on journal id. Not including journal id will cause the planner to broadcast the whole local content data to every slice; including journal id allows the planner to redistribute only the needed part of the content data to the correct slices (see the sketch after this list).
  • ALWAYS include the full distribution key when joining tables!
  • NEW Dec 2013: the ALL distribution policy keeps a full copy of a table on every node: good for account data for example
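
A sketch of the journal/content example and of the ALL policy (all names are hypothetical; DISTKEY & DISTSTYLE are standard Redshift syntax):

    -- content references journals; both the dimension and the fact
    -- carry journal_id and are distributed on it.
    CREATE TABLE dim_content (
        content_id BIGINT NOT NULL,
        journal_id BIGINT DISTKEY NOT NULL,
        title      VARCHAR(512)
    );

    CREATE TABLE fact_retrievals (
        content_id   BIGINT NOT NULL,
        journal_id   BIGINT DISTKEY NOT NULL,
        retrieved_at TIMESTAMP
    );

    -- Join on the full distribution key: content_id alone would
    -- force a broadcast; content_id + journal_id allows a targeted
    -- redistribution.
    SELECT c.title, COUNT(*)
    FROM fact_retrievals r
    JOIN dim_content c
      ON c.content_id = r.content_id
     AND c.journal_id = r.journal_id
    GROUP BY c.title;

    -- ALL policy: a full copy on every node, handy for small,
    -- rarely updated tables such as account data.
    CREATE TABLE dim_account (
        account_id BIGINT NOT NULL,
        name       VARCHAR(256)
    ) DISTSTYLE ALL;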

Distribution / Troubleshooting

In case of doubt, the only solution consists in testing the queries you would like to improve against different distribution policies & looking at the explain plan to understand data movements between slices.
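
A minimal sketch of that workflow, reusing the hypothetical tables from above: create two copies of the same data under different distribution policies & compare the plans of the same query.

    -- Two copies of the same data under different policies
    -- (DISTKEY/DISTSTYLE are valid CREATE TABLE ... AS attributes).
    CREATE TABLE fact_searches_by_session
        DISTKEY(session_id)
    AS SELECT * FROM fact_searches;

    CREATE TABLE fact_searches_even
        DISTSTYLE EVEN
    AS SELECT * FROM fact_searches;

    -- Compare the data-movement attributes (DS_*) in both plans.
    EXPLAIN
    SELECT s.session_id, COUNT(*)
    FROM fact_sessions s
    JOIN fact_searches_by_session f ON f.session_id = s.session_id
    GROUP BY s.session_id;

    EXPLAIN
    SELECT s.session_id, COUNT(*)
    FROM fact_sessions s
    JOIN fact_searches_even f ON f.session_id = s.session_id
    GROUP BY s.session_id;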