We just imported the World. And its history. In 47 minutes.
We, at SquidSolutions, like data. We like importing, structuring and analysing huge amounts of numbers and texts. In fact, we like it so much that we were frustrated not to be able to share more of it with the public. That’s the price to pay for working with clients like eBay, La Poste, www.sciencedirect.com: their data are so sensitive and so private, that we can’t play with them in a public manner. We run for them our stack of proprietary softwares able to perform in-database analytics on top of our hosted Greenplum appliance (shared-nothing, massively parallel processing database).
This is why we’ve decided to import the OpenStreetMap Rails port database. A fairly large data store – well, average for those who dream about Genome project! – OpenStreetMap (OSM) is a unique Open Data gathering project. Thousands of people around the world have decided to build a geographical map of our beloved planet.
They describe it with only 3 objects (what they call Data Primitives): Nodes (a place), Ways (a street) and Relations (that glue nodes and ways together). Each of them can be tagged so that contributors can easily enrich them (“this way is a highway”, “this node is a restaurant”).
Imagine a world described like this. How simple ! How huge! But that’s not all: these primitives are versioned. So in a single database, we can travel the world, and travel it through time. Sound pretty cool? Let’s check it out. We’ll extract OSM data, and then import it to our Greenplum database. It’s as simple as a Extract-Load-Transform (ELT) story.
OK, but why ?
Just for fun. Our goal is not to implement another OSM History Browser… it already exists. But OSM history database is a good way to show how our API can help reduce information flow from huge data to simple graphical representations.
Let’s import it – oh wait
Our main database here is a 8 nodes (384 GB RAM, 96 CPU cores, 96 hard drives and 96 database segments) Greenplum cluster. This cluster is fully redundant both in term of hardware and database segments. Half of the capacities are available for fail-over management. One of the greatest features of this cluster, is that all nodes can import CSV data in parallel. Give it one big text file, the cluster will split it and send it via a 10 Gb network to all nodes at a time. Our first problem is: OSM data is in XML format. Look for yourself by downloading it here. As we don’t have (yet) the latest Greenplum version, which allows to directly import XML data, we extract all data to various csv files. First, we tested osmconvert, which was very quick but … did not extract all the info we needed.
So we wrote our own Perl script, using XML::Parser. And thanks to the awesome OSM documentation, it can extract changesets, nodes, ways and relations in parallel. You can find it here: osm2csv.pl.
In a future article, we’ll see how powerful and how fast a Greenplum cluster can import massive data. Of course, good software is necessary but not sufficient for handling big data. Stay tuned !