We just imported the world – part II

In a previous article, we talked about OpenStreetMap and its database. Our goal is to make it as fast as possible, and we will do it in three ways:

  • Improve the OSM database structure (DDL)
  • Import it in a Greenplum cluster: the database will be distributed across 48 independent segments hosted in 8 servers, based on a shared-nothing, Massively Parallel Processing architecture. That way, each SQL statement will be performed by 48 parallel processes.
  • Build efficient SQL queries with our big-data-specialized API engine
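As a quick illustration of the shared-nothing layout: every Greenplum table carries a system column, gp_segment_id, so once the data is loaded, a query like the following (shown here against the nodes table we import below) reveals how rows are spread across the 48 segments:

```sql
-- Count rows per segment to check the distribution of the nodes table.
-- gp_segment_id is a Greenplum system column on every distributed table.
SELECT gp_segment_id, count(*)
FROM dataprocessing.nodes
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
```

With a well-chosen distribution key, the counts should be roughly equal across all segments.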

Defining a better database structure

During the extraction, we took the time to define our future database structure (or DDL). Our database schema is of course heavily inspired by OSM’s schema. We’ve made some improvements to benefit from the Greenplum architecture:

  • We created partitions in every table that has a date column. The oldest partition covers all of 2005, as the first changeset dates from April 2005; we assumed that contributions would grow over the years, so the next five years are split into quarters (4 partitions per year), and the last three years into months (12 partitions per year). This greatly speeds up queries that filter on a time period.
  • Every partition except the three most recent is “append only” and compressed. We won’t update or delete rows in those partitions, so we save 20 bytes per row; and since they are compressed, we also save disk space and I/O.
  • We changed the version columns’ type from BIGINT to INTEGER.
  • As data is distributed across several nodes, we chose each table’s id as the distribution key.
  • In a later enrichment step, we will link changesets to countries. That’s why we added a “country_code” column to the changesets table.
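The points above can be sketched in Greenplum DDL. The column list here is abridged and the partition names are illustrative, not the actual content of schema.sql:

```sql
-- Sketch of the DDL ideas above (abridged; see schema.sql for the real thing).
CREATE TABLE dataprocessing.nodes (
    id           BIGINT,
    latitude     INTEGER,
    longitude    INTEGER,
    changeset_id INTEGER,
    version      INTEGER,   -- INTEGER instead of BIGINT
    date         DATE
)
DISTRIBUTED BY (id)         -- the table's id as the distribution key
PARTITION BY RANGE (date)
(
    -- one partition for the whole of 2005, append-only and compressed
    PARTITION y2005 START (date '2005-01-01') END (date '2006-01-01')
        WITH (appendonly = true, compresstype = zlib),
    -- ... quarterly, then monthly partitions follow; the three most recent
    -- ones are left in the default (updatable, uncompressed) storage format
    PARTITION m2013_04 START (date '2013-04-01') END (date '2013-05-01')
);
```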

You can download the schema here: schema.sql.

Loading the data

After 30 hours of extraction, our script gave us this:

$ ls -1sh
total 477G
1.7G changeset.csv
1.2G changeset_tags.csv
222G node.csv
58G node_tags.csv
449M relation.csv
15G relation_members.csv
1.2G relation_tags.csv
19G way.csv
119G way_nodes.csv
42G way_tags.csv
$ wc -l *
14922523 changeset.csv
30463339 changeset_tags.csv
2790813127 node.csv
1200122107 node_tags.csv
7649388 relation.csv
511079950 relation_members.csv
36888667 relation_tags.csv
322810448 way.csv
5151526585 way_nodes.csv
1109648479 way_tags.csv

One of the few limitations of this extraction is that we did not count the number of objects involved in each changeset. That job is left to the SQL scripts used in the later transformation of the data.
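A sketch of how such a count could be back-filled in SQL later, assuming a hypothetical num_changes column on the changesets table and the table names from our schema:

```sql
-- Back-fill a per-changeset object count by summing nodes, ways and
-- relations per changeset (num_changes is a hypothetical column).
UPDATE dataprocessing.changesets c
SET num_changes = t.n
FROM (
    SELECT changeset_id, count(*) AS n
    FROM (
        SELECT changeset_id FROM dataprocessing.nodes
        UNION ALL
        SELECT changeset_id FROM dataprocessing.ways
        UNION ALL
        SELECT changeset_id FROM dataprocessing.relations
    ) objects
    GROUP BY changeset_id
) t
WHERE c.id = t.changeset_id;
```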

Now that we have a .csv file for each table to import, we use the fantastic gpload utility provided with Greenplum. Its very flexible YAML configuration – called a control file – allows us to perform basic ELT work during the import.

For example, OSM nodes have coordinates (latitude and longitude). They are of type DOUBLE in the XML export, but stored as INTEGER in the database for optimisation. With the gpload utility, we can transform data on the fly while importing. Here is the control file used during the nodes import:
DATABASE: *******
USER: *******
HOST: *******
PORT: *******
GPLOAD:
  INPUT:
    - SOURCE:
        FILE: [nodes*.csv.gz]
    - COLUMNS:
        - id: BIGINT
        - latitude: NUMERIC
        - longitude: NUMERIC
        - changeset_id: INTEGER
        - visible: BOOLEAN
        - timestamp: TIMESTAMP
        - date: DATE
        - tile: BIGINT
        - version: INTEGER
        - user_id: INTEGER
    - FORMAT: csv
    - ESCAPE: ''
    - NULL_AS: ''
    - HEADER: false
    - QUOTE: '"'
    - ERROR_TABLE: dataprocessing.errors_import
  OUTPUT:
    - TABLE: dataprocessing.nodes
    - MODE: insert
    - MAPPING:
        id: id
        latitude: CAST (latitude * 10000000 AS INTEGER)
        longitude: CAST (longitude * 10000000 AS INTEGER)
        changeset_id: changeset_id
        visible: visible
        timestamp: timestamp
        date: date
        tile: tile
        version: version
        user_id: user_id
  PRELOAD:
    - TRUNCATE: true

In this example, we have to define each input column and map it to its output column, because we need to transform the values of two columns on the fly. Of course, if you do not need this kind of transformation, you can skip the boring COLUMNS and MAPPING sections.

OK, but how fast is it? You’ll see in our last article about OpenStreetMap & SquidSolutions.