Sunday 18 November 2012

How to minify GeoJSON files?

You can't do web mapping these days without knowing your GeoJSON. It's the vector format of choice among popular mapping libraries like Leaflet, D3.js and Polymaps. Size matters on the web, especially if you want to distribute complex geometries, like the world's countries. The challenge is even bigger if you want to target mobile users - or support web browsers with poor vector handling (IE < 9). This blog post will show you how to minify your GeoJSON files before sending them over the wire.

The first thing you should do is to generalize your vectors so they don't contain more detail than you need. In a previous blog post, I was able to remove 90% of the coordinates without losing too much detail for the map scale I wanted to use. This will of course have a great effect on the file size.
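If you need to do the generalization yourself, ogr2ogr has a -simplify option that takes a distance tolerance in the units of the source data (degrees for unprojected layers). Here is a minimal sketch with placeholder file names and an illustrative tolerance, not the exact command from that post:

ogr2ogr -f "GeoJSON" -simplify 0.1 countries_simplified.json countries.shp

Pick the tolerance by trial and error against the map scale you are targeting.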

Today, I'm going to use country borders from the Natural Earth dataset. These datasets are already generalized for different scales (1:10m, 1:50m and 1:110m), so I'll use them as they are. The 1:110m (small scale) and 1:50m (medium scale) shapefiles will cover my needs for the thematic world maps I plan to make:

The 110m and 50m country polygons shown in QGIS.

Let's open the datasets in QGIS. If you look at the attribute table you'll see that each dataset contains 63 attributes, which makes them very versatile. For your web maps, you probably need just a few of the attributes, and you should remove the ones you don't need. I'm keeping the country name and the ISO 3166-1 country codes (alpha-2, alpha-3, and numeric), which can be used to link country geometries to statistical data. 

Only keep the attributes you need.
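If you prefer the command line to QGIS, the attribute pruning can also be done with the -select option of ogr2ogr. A sketch, assuming the Natural Earth field names NAME, ISO_A2, ISO_A3 and ISO_N3 (check your attribute table, as the names can vary between Natural Earth versions):

ogr2ogr -f "ESRI Shapefile" -select NAME,ISO_A2,ISO_A3,ISO_N3 ne_110m_admin_0_countries_min.shp ne_110m_admin_0_countries.shp

You can also add the same -select option to the GeoJSON conversion in the next step and skip the intermediate shapefile.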

Next, we can convert the shapefiles to GeoJSON with ogr2ogr:

ogr2ogr -f "GeoJSON" -lco COORDINATE_PRECISION=1 ne_110m_admin_0_countries.json ne_110m_admin_0_countries.shp

ogr2ogr -f "GeoJSON" -lco COORDINATE_PRECISION=2 ne_50m_admin_0_countries.json ne_50m_admin_0_countries.shp

The important thing is that I'm keeping only one decimal (coordinate precision) for the 110m dataset, and two decimals for the 50m dataset, which is sufficient for my map scales (0.1 degrees is roughly 11 km at the equator, well below what you can see on a world map). This will reduce the size of the GeoJSON files by more than half. The size of the 110m GeoJSON is now 207 kB and the 50m version is 1,897 kB. But we can do better.

The files contain a lot of whitespace, which is a waste of space. I planned to use Sublime Text to remove it, but it was not able to handle the 50m GeoJSON file, so I switched to Notepad++. I used these regular expressions (a scripted alternative is sketched after the three replacements below):

Find: "([^a-z.]) "
Replace: "$1"
This removes every space that doesn't follow a letter or a dot, so the spaces inside country names are kept.

Find: "\n,"
Replace: ","
This removes the line breaks before the commas separating the features, keeping each country on its own line for readability.

Find: "\.0([,\]])"
Replace: "$1"
This removes the redundant ".0" from whole-number coordinates.
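If your text editor struggles with files this large, you can run the same three replacements from the command line instead. A sketch using perl one-liners with the patterns above (the second one needs slurp mode to match across lines; keep a backup of the original file, since -i edits it in place):

perl -i -pe 's/([^a-z.]) /$1/g' ne_50m_admin_0_countries.json

perl -0777 -i -pe 's/\n,/,/g' ne_50m_admin_0_countries.json

perl -i -pe 's/\.0([,\]])/$1/g' ne_50m_admin_0_countries.json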

These replacements reduce the file size of the 110m GeoJSON from 207 to 156 kB, without losing any data quality. More than 400,000 whitespace characters were removed from the 50m GeoJSON file, reducing the file size from 1,897 to 1,481 kB.

If your web server supports gzipping on the fly, the 110m GeoJSON ends up at 45 kB and the 50m version at 430 kB. Not bad!
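You can estimate the gzipped sizes yourself before deploying; the exact numbers depend on the compression level your server uses:

gzip -9 -c ne_110m_admin_0_countries.json | wc -c

gzip -9 -c ne_50m_admin_0_countries.json | wc -c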

And if this is too much work, you can always download the final GeoJSON files on thematicmapping.org.

NB! Mike Bostock's TopoJSON would allow us to compress the GeoJSON even more while preserving topology (shared borders between countries), but we would need a map client that supports the format. Looks promising!

9 comments:

Unknown said...

Using regular expressions like that can easily break your labels or attributes. I would suggest using a JSON parser that supports minification to remove whitespace.

You can further minify the GeoJSON by
- removing invisible geometries, if the simplification process did not already.
- reducing the output precision of float coordinates according to the desired zoom level.
- use shorter ids for all attributes.

Bjørn Sandvik said...

Hi unknown,

Thanks for your comments.

I agree that I could also use a JSON parser for minification, but I wanted more control so I could keep each country on a separate line for readability.

> removing invisible geometries, if the simplification process did not already.

The geometries are already simplified for the targeted map scale (maybe the 50m version could be simplified a bit more).

> reducing the output precision of float coordinates according to the desired zoom level.

I'm already doing this by only keeping one decimal for the 110m dataset, and two for 50m.

> use shorter ids for all attributes.

I agree!

Unknown said...

Thanks for the helpful post!

I've been playing with ogr2ogr to convert shapefiles to GeoJSON and I've used the -simplify option to reduce file size. Looking at the ogr2ogr reference I see the -lco option you've used, but where does the COORDINATE_PRECISION come from? Is there another reference I can use?

Also, the link you posted to TopoJSON is not properly formatted...

Arnie Shore said...

Have you looked into http://en.wikipedia.org/wiki/Ramer%E2%80%93Douglas%E2%80%93Peucker_algorithm for possibly reducing the number of useful points?

AS

Devdatta Tengshe said...

One more thing which will really help is gzipping the JSON as you serve it from your server. I've seen that gzipping in itself provides up to a 70% reduction in bandwidth.

Bjørn Sandvik said...

Eli: COORDINATE_PRECISION is documented here. I've fixed the TopoJSON url. Thanks!

Arnie: Yes, here.

Devdatta: Yes, I mention gzipping in the post, which also gives me a 70% reduction.

Jeff said...

Thanks for the link to the minified dataset! Much appreciated.

Anonymous said...

Hi, thanks for the useful article. I just wanted to say that I released a very simple JavaScript page for automatically removing attributes and whitespace from GeoJSON files.
It takes an input GeoJSON and removes every attribute except the country IDs and names.
You can find it on GitHub:
https://github.com/Pimentoso/GeoJSON-Attribute-Cleaner

Anonymous said...

Have you looked at this JavaScript (Node) module? geojson-mend reduces unnecessary precision and closely clustered coordinates.