MSc thesis subject: Big data geospatial analytics with Apache Spark

Technologies are changing rapidly to cope with the ever-increasing volume and velocity of data. Big data methods and tools are a disruptive technology that drives the competitiveness of companies and helps governmental bodies set and reach policy targets. While geospatial data is pervasive, few tools are currently available for parsing and querying very large spatial datasets in a scalable way. This thesis aims to review existing tools and explore their use for common geospatial operations on very large datasets.

Apache Spark is a fast and general engine for large-scale data processing, but it does not support geo-information “out of the box”. A few libraries, such as Magellan and SparkGeo, add this support and are currently evolving very fast. The aim of this thesis is to explore the capabilities of such libraries and test their performance on common geospatial modelling tasks that involve scaling out open environmental, hydrological or urban models with global datasets (to be defined based on the student's interests and the tools' capabilities).
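To give a flavour of the kind of operation involved, the sketch below counts points per grid cell using a map step followed by a reduce-by-key step. It is plain Python rather than Spark, so it runs without a cluster. The sample points, the 1-degree cell size and all function names are assumptions made for illustration; they are not taken from Magellan, SparkGeo or any other library.

```python
import math
from collections import defaultdict

# Hypothetical sample of (longitude, latitude) points; a real job would read
# these from a very large source such as an OpenStreetMap extract.
POINTS = [
    (5.66, 51.97),
    (5.67, 51.98),
    (4.90, 52.37),
    (4.89, 52.36),
    (-0.13, 51.51),
]

CELL_SIZE_DEG = 1.0  # assumed grid resolution in degrees


def to_cell(point, cell_size=CELL_SIZE_DEG):
    """Map step: emit ((col, row), 1) for the grid cell containing the point."""
    lon, lat = point
    col = math.floor(lon / cell_size)
    row = math.floor(lat / cell_size)
    return (col, row), 1


def reduce_by_key(pairs):
    """Reduce step: sum the values of all pairs that share the same key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_points_per_cell(points):
    """Apply the map step to every point, then aggregate per grid cell."""
    return reduce_by_key(map(to_cell, points))


if __name__ == "__main__":
    for cell, count in sorted(count_points_per_cell(POINTS).items()):
        print(cell, count)
```

In Spark the same pattern is usually written with the RDD operations `map` and `reduceByKey`, which run in parallel across the partitions of a dataset. The open question for this thesis is how well the geospatial libraries handle such tasks once real geometries and global datasets are involved.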


  • Evaluate the capabilities of Apache Spark libraries for big geodata processing
  • Create geoscripting examples with big raster datasets (e.g. global elevation data)
  • Create geoscripting examples with big vector datasets (e.g. OpenStreetMap)
  • Demonstrate the “big data approach” for a more complex modelling task


  • H. Karau, A. Konwinski, P. Wendell and M. Zaharia, Learning Spark, O’Reilly, 2015.


  • Big data course
  • Geo-scripting course
  • Programming in Python course

Theme(s): Modelling & visualisation, Human – space interaction