In a previous article, we talked about OpenStreetMap and its database. Our goal is to make querying it as fast as possible, and we will use three approaches to get there:
- Improve the OSM database structure (DDL)
- Import it into a Greenplum cluster: the database will be distributed across 48 independent segments hosted on 8 servers, following a shared-nothing, Massively Parallel Processing (MPP) architecture. That way, each SQL statement is executed by 48 parallel processes.
- Build efficient SQL queries with our big-data-specialized API engine
Defining a better database structure
During the extraction, we took the time to define our future database structure (DDL). Our database schema is of course heavily inspired by OSM’s schema, but we made some improvements to benefit from the Greenplum architecture:
- We created partitions in every table with a date column. The oldest partition covers all of 2005, since the first changeset dates from April 2005. We assumed that contributions grew over the years, so the next five years are cut into quarters (4 partitions per year), and the last three years are cut by month (12 partitions per year). This will greatly improve query speed when filtering data by period.
- Every partition except the three most recent is “append only” and compressed. We won’t update or delete rows in those partitions, so we can save 20 bytes per row; and since they are compressed, we also save disk space and I/O.
- We changed the version columns’ type from BIGINT to INTEGER.
- As data is distributed across several nodes, we chose each table’s id as the distribution key.
- In a later enrichment step, we’ll link changesets to countries. That’s why we added a “country_code” column to the changesets table.
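To make these choices concrete, here is a rough sketch of what the nodes DDL could look like in Greenplum. The partition names, exact date bounds, and compression settings below are assumptions for illustration only; the real definitions are in schema.sql:

```sql
-- Sketch only: illustrates the partitioning and distribution ideas above,
-- not the actual schema.sql. Dates and options are assumptions.
CREATE TABLE dataprocessing.nodes (
    id           INTEGER,
    latitude     INTEGER,   -- degrees * 10^7 (see the gpload MAPPING)
    longitude    INTEGER,
    changeset_id INTEGER,
    visible      BOOLEAN,
    "timestamp"  TIMESTAMP,
    date         DATE,
    tile         BIGINT,
    version      INTEGER,   -- INTEGER instead of BIGINT
    user_id      INTEGER
)
DISTRIBUTED BY (id)        -- the table's id as distribution key
PARTITION BY RANGE (date)
(
    -- one partition for all of 2005, append-only and compressed
    PARTITION y2005 START (date '2005-01-01') END (date '2006-01-01')
        WITH (appendonly = true, compresstype = zlib),
    -- the next five years: quarterly partitions (4 per year)
    PARTITION quarterly START (date '2006-01-01') END (date '2011-01-01')
        EVERY (INTERVAL '3 months')
        WITH (appendonly = true, compresstype = zlib),
    -- the last three years: monthly partitions (12 per year);
    -- the three most recent stay plain heap tables so rows remain updatable
    PARTITION monthly START (date '2011-01-01') END (date '2014-01-01')
        EVERY (INTERVAL '1 month')
);
```

With this layout, a query filtering on a date range only scans the partitions that overlap it, which is where the speed-up on period filters comes from.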
You can download the schema here: schema.sql.
Loading the data
After 30 hours of extraction, our script gave us this:
$ ls -1sh
$ wc -l *
One of the few limitations of this extraction is that we did not count the number of objects involved in each changeset. That job is left to the SQL used in the later data transformation step.
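That later step could look something like the sketch below. Note that the `num_objects` column and the exact table names are hypothetical here, chosen just to illustrate the idea:

```sql
-- Sketch only: count objects per changeset after import.
-- "num_objects" is a hypothetical column, not part of the article's schema.
UPDATE dataprocessing.changesets c
SET num_objects = s.cnt
FROM (
    SELECT changeset_id, count(*) AS cnt
    FROM dataprocessing.nodes
    GROUP BY changeset_id
) s
WHERE c.id = s.changeset_id;
```

On an MPP cluster, both the aggregation and the update are executed in parallel on every segment, so this scales with the number of segments rather than running on a single server.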
Now that we have a .csv file for each table to import, we use the fantastic gpload utility provided with Greenplum. Its very flexible YAML configuration, called a control file, allows us to perform basic ELT work during import.
For example, OSM nodes have coordinates (latitude and longitude). They are of type DOUBLE in the XML export, but of type INTEGER in the database for optimisation. With gpload, we can transform data on the fly while importing. Here is the control file used during the nodes import:
    GPLOAD:
      INPUT:
        - SOURCE:
            FILE: [nodes*.csv.gz]
        - COLUMNS:
            - id: INTEGER
            - latitude: NUMERIC
            - longitude: NUMERIC
            - changeset_id: INTEGER
            - visible: BOOLEAN
            - timestamp: TIMESTAMP
            - date: DATE
            - tile: BIGINT
            - version: INTEGER
            - user_id: INTEGER
        - FORMAT: csv
        - DELIMITER: ';'
        - ESCAPE: ''
        - NULL_AS: ''
        - HEADER: false
        - QUOTE: '"'
        - ERROR_LIMIT: 25
        - ERROR_TABLE: dataprocessing.errors_import
      OUTPUT:
        - TABLE: dataprocessing.nodes
        - MODE: insert
        - MAPPING:
            latitude: CAST (latitude * 10000000 as INTEGER)
            longitude: CAST (longitude * 10000000 as INTEGER)
      PRELOAD:
        - TRUNCATE: true
In this example, we have to define every input column and map it to its output column, because we need to transform two column values on the fly. Of course, if you do not need this type of transformation, you can skip the verbose COLUMNS and MAPPING entries.
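The MAPPING above stores coordinates as fixed-point integers (degrees × 10,000,000), which keeps roughly centimetre precision while still fitting in a signed 32-bit INTEGER (180 × 10^7 = 1.8 × 10^9 < 2^31). A quick round-trip check of the convention, runnable against any PostgreSQL-compatible database:

```sql
-- degrees -> fixed-point INTEGER, then back to degrees.
-- The literal is exact decimal (NUMERIC), so the CAST is deterministic.
SELECT CAST(48.8534951 * 10000000 AS INTEGER) AS stored,    -- 488534951
       488534951 / 10000000.0                 AS degrees;   -- 48.8534951
```

This is the same fixed-point convention OSM itself uses for node coordinates, and halving the column width from DOUBLE pays off on billions of rows.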
OK, but now, how fast is it? You’ll see in our last article about OpenStreetMap & SquidSolutions.