Distributed rendering architecture for rendering lots and lots of tiles

January 18, 2014    architecture GEO mapnik Python scaling

A customer contacted me after seeing my nen1878reader repository (used to process GBKN data). He wanted to render lots and lots of tiles from several data sets, including GBKN data, for use in one of their applications. The solution proposed here uses Celery, a library that distributes tasks (rendering tiles, in this case) to several workers, and Mapnik to render the tiles. The solution we ended up with is horizontally scalable. Furthermore, we built an on-the-fly tile renderer as well, to render the deeper zoom levels on demand.

At first I looked at QGis. QGis makes it relatively easy to create maps, and its QTiles plugin can generate tiles from a project, for use with Google Maps, for example. Unfortunately, the QGis renderer is not multi-threaded. Also, in my tests, the rendered tiles were not anti-aliased and had some rendering errors. So, QGis + QTiles was quickly ditched.

Mapnik is a good alternative. It is an open-source library written in C++, able to render beautiful maps. Python bindings are also available, making Mapnik easy to use from Python. Mapnik is also thread safe, so it can keep all cores of a processor busy at the same time, improving rendering speed. In combination with Tilemill, one can easily make maps from all sorts of data.

Once we settled on the rendering method, we estimated the time and storage needs. After rendering one zoom level and extrapolating the required time, the number of tiles, and their size, we arrived at several months of rendering on a single quad-core machine and a whopping 8 TB of required storage (a bit over-estimated, but not by much).
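Such an estimate is easy to reproduce with the standard slippy-map tile math. The sketch below is illustrative only: the bounding box, the 200 ms per tile, and the 20 kB average tile size are placeholder assumptions, not the actual numbers from this project.

```python
import math

def deg2num(lat, lon, zoom):
    """Convert a WGS84 coordinate to slippy-map tile indices (OSM scheme)."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y

def tile_count(lat_min, lon_min, lat_max, lon_max, zoom):
    """Number of tiles needed to cover a bounding box at one zoom level."""
    x0, y0 = deg2num(lat_max, lon_min, zoom)  # top-left tile
    x1, y1 = deg2num(lat_min, lon_max, zoom)  # bottom-right tile
    return (x1 - x0 + 1) * (y1 - y0 + 1)

# Placeholder bounding box, roughly the Netherlands (assumption).
bbox = (50.75, 3.2, 53.7, 7.22)
total = sum(tile_count(*bbox, zoom) for zoom in range(8, 19))

# Placeholder averages: 200 ms render time, 20 kB per tile (assumptions).
days = total * 0.2 / 86400
terabytes = total * 20e3 / 1e12
print(total, round(days, 1), round(terabytes, 3))
```

Because each zoom level quadruples the tile count, the deepest levels dominate both totals, which is exactly what pushed us away from pre-rendering everything.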

Given these numbers, we went with a dual strategy: pre-render the tiles for the higher zoom levels and render the tiles for the deeper zoom levels on the fly. The tiles for the higher zoom levels (8-18, Google Maps levels) generally take more time to render, while the tiles for the deeper zoom levels (>18) take less time, because fewer features appear on each tile. On my 3-year-old laptop, rendering a tile at a deep zoom level takes about 200 ms. On a fast server, this time is greatly reduced. Add a caching mechanism, and we can drop this to a few milliseconds per tile on subsequent requests.

Architecture for distributed rendering

The architecture for the implementation of the distributed renderer heavily depends on Celery. Celery is a distributed task queue. One can create a task, which is then distributed to one of the attached Workers.

The Workers use Mapnik to render the tiles. A tile task consists of the zoom level and the X and Y coordinates. Each worker opens the given Mapnik configuration file on startup. This configuration file describes the data sources to render and the styles applied to them. One can create this configuration file manually, or generate it with Tilemill. Tilemill, in turn, provides a nice GUI and uses CartoCSS for styling.
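A minimal, untested sketch of what such a worker-side task could look like. The broker URL, the style.xml file name, the task and module names, and the batch format are all my assumptions, not the project's actual code; it also assumes a Celery result serializer that can carry binary data (e.g. pickle).

```python
import mapnik
from celery import Celery

# Placeholder broker URL (assumption); the real deployment is not shown here.
app = Celery('tiles', broker='redis://localhost:6379/0')

TILE_SIZE = 256
ORIGIN = 20037508.342789244  # half the Web Mercator world width, in metres

# Loaded once per worker process on startup, as described above.
tile_map = mapnik.Map(TILE_SIZE, TILE_SIZE)
mapnik.load_map(tile_map, 'style.xml')  # Mapnik XML, e.g. exported from Tilemill

def tile_bbox(z, x, y):
    """Web Mercator bounding box of slippy-map tile (z, x, y)."""
    size = 2 * ORIGIN / 2 ** z
    minx = -ORIGIN + x * size
    maxy = ORIGIN - y * size
    return mapnik.Box2d(minx, maxy - size, minx + size, maxy)

@app.task
def render_batch(tiles):
    """Render a batch of (z, x, y) tuples; return PNG bytes per tile."""
    rendered = {}
    for z, x, y in tiles:
        tile_map.zoom_to_box(tile_bbox(z, x, y))
        image = mapnik.Image(TILE_SIZE, TILE_SIZE)
        mapnik.render(tile_map, image)
        rendered[(z, x, y)] = image.tostring('png')
    return rendered
```

Loading the map once and rendering many tiles per task is what keeps the per-tile overhead low.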

A Coordinator creates a batch of tiles, which it sends to the task queue. Celery then distributes the batch to one of the Workers. Tiles are batched to reduce the per-tile Celery overhead. The Worker renders the tiles and returns them to the Coordinator. The Coordinator then stores the tiles in a Backend. A Backend is a module that provides a means to store and (later) retrieve a tile. For example, I built a MySQL Backend to store all the tiles in a MySQL database. Another Backend, the FileStorageBackend, stores tiles on the file system.
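The Backend contract is small. Here is a sketch of the interface and a file-system implementation; the method names and the z/x/y directory layout are my assumptions, not the project's actual code.

```python
import os

class Backend:
    """Minimal store/retrieve contract for rendered tiles (assumed interface)."""
    def store(self, z, x, y, data):
        raise NotImplementedError
    def retrieve(self, z, x, y):
        """Return the tile as bytes, or None if it is not stored."""
        raise NotImplementedError

class FileStorageBackend(Backend):
    """Stores tiles as <root>/<z>/<x>/<y>.png on the file system."""
    def __init__(self, root):
        self.root = root
    def _path(self, z, x, y):
        return os.path.join(self.root, str(z), str(x), '%d.png' % y)
    def store(self, z, x, y, data):
        path = self._path(z, x, y)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, 'wb') as f:
            f.write(data)
    def retrieve(self, z, x, y):
        try:
            with open(self._path(z, x, y), 'rb') as f:
                return f.read()
        except FileNotFoundError:
            return None
```

Keeping storage behind this interface is what makes swapping MySQL for the file system, or adding new Backends later, painless.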

Architecture for on the fly rendering

A simple HTTP server serves tiles from a Backend. The HTTP server queries the Backend for the requested tile. If the Backend has it, the server returns it. If not, it returns a 404. Nothing very special.
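The request handling boils down to a tiny function. Sketched here independently of any particular HTTP framework; the Backend's retrieve method and the stand-in DictBackend are my assumptions for illustration.

```python
def handle_tile_request(backend, z, x, y):
    """Return an (HTTP status, body) pair for a tile request."""
    tile = backend.retrieve(z, x, y)  # assumed Backend interface
    if tile is None:
        return 404, b'tile not found'
    return 200, tile

# A stand-in Backend holding a single tile, for illustration only.
class DictBackend:
    def __init__(self, tiles):
        self.tiles = tiles
    def retrieve(self, z, x, y):
        return self.tiles.get((z, x, y))

backend = DictBackend({(8, 1, 2): b'png-bytes'})
print(handle_tile_request(backend, 8, 1, 2))  # hit
print(handle_tile_request(backend, 8, 9, 9))  # miss: 404
```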

We render the tiles for the deeper zoom levels (>18, Google Maps zoom levels) on the fly. This way, not all tiles have to be pre-rendered, sparing us an expensive storage array and lots of rendering time up front. To implement this, I created a new Backend: the OnTheFly Backend. This Backend actually contains a Mapnik renderer, which renders the tiles on the fly. Tiles for the deeper levels render in under 200 ms on my 3-year-old laptop. A quick server is able to do it in less than 100 ms per tile.

Still, 100 ms per tile is very noticeable when the visible map consists of around 20 tiles: 20 tiles * 100 ms/tile = 2 seconds. To overcome this, the OnTheFly Backend can use another Backend as a cache. In our case, we used the MySQL Backend for caching. Now, when the OnTheFly Backend is asked for a tile, it first queries the caching MySQL Backend and, if the tile exists, serves it from there. If it does not, the tile is rendered, stored in the cache, and served to the user. Once a tile has been rendered and cached, a request takes just a few milliseconds instead of 100 ms.
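The composition can be sketched as a Backend that wraps a renderer and a caching Backend. The names are my assumptions; in the real system, render_func is where Mapnik does its ~100-200 ms of work, and the cache is the MySQL Backend.

```python
class OnTheFlyBackend:
    """Serves tiles from a cache Backend, rendering on a miss (sketch)."""
    def __init__(self, cache, render_func):
        self.cache = cache              # e.g. the MySQL Backend
        self.render_func = render_func  # e.g. a Mapnik-based renderer
    def retrieve(self, z, x, y):
        tile = self.cache.retrieve(z, x, y)
        if tile is None:
            tile = self.render_func(z, x, y)  # the slow, ~100-200 ms step
            self.cache.store(z, x, y, tile)   # subsequent requests are fast
        return tile

# Illustration with an in-memory cache and a counting fake renderer.
class MemoryBackend:
    def __init__(self):
        self.tiles = {}
    def store(self, z, x, y, data):
        self.tiles[(z, x, y)] = data
    def retrieve(self, z, x, y):
        return self.tiles.get((z, x, y))

renders = []
def fake_render(z, x, y):
    renders.append((z, x, y))
    return b'tile-%d-%d-%d' % (z, x, y)

backend = OnTheFlyBackend(MemoryBackend(), fake_render)
backend.retrieve(19, 5, 7)  # miss: rendered and cached
backend.retrieve(19, 5, 7)  # hit: served from the cache
print(len(renders))         # the expensive render ran only once
```

Because the OnTheFly Backend exposes the same retrieve interface as any other Backend, the HTTP server does not need to know whether a tile was pre-rendered or rendered on demand.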

Furthermore, to make full use of all cores in the server, I introduced HAProxy to load-balance the requests over multiple HTTP servers with OnTheFly Backends. Multiple cores now render tiles simultaneously, speeding everything up and improving the user experience. HAProxy can be used to scale out to multiple servers as well, effectively providing a horizontally scalable solution.
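The HAProxy side of this could look roughly as follows; the ports, server names, and balance algorithm here are illustrative placeholders, not the production config.

```
frontend tiles
    bind *:80
    default_backend tile_servers

backend tile_servers
    balance roundrobin
    server tiles1 127.0.0.1:8001 check
    server tiles2 127.0.0.1:8002 check
    server tiles3 127.0.0.1:8003 check
```

Adding capacity then means starting another HTTP server process (or machine) and adding one `server` line.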

In conclusion, we solved two problems: rendering a large number of tiles at once, and rendering tiles on the fly. The Mapnik library did the heavy lifting, and we used Celery and HAProxy to make both approaches scalable. The customer was happy with the outcome!