MSc thesis subject: Big data habitat suitability modelling with Apache Spark

Today, technologies are changing very fast to cope with the ever-increasing volume and velocity of data. Big data methods and tools are a disruptive technology that drives the competitiveness of companies and helps governmental bodies set and reach policy targets. While geospatial data is pervasive, few tools are currently available for parsing and querying very large spatial datasets in a scalable way. This thesis aims to implement habitat suitability models in Apache Spark, a leading tool for big data processing.

Most habitat suitability maps are generated through ecological models such as MAXENT, ENFA and GARP. These models require spatial information as input, not only on species occurrence but also on environmental predictor variables. They are data-greedy, and most of the algorithms have O(n^2) complexity. How to scale these models to continental or global extents remains a challenge. This research will investigate Spark geo-processing libraries such as SpatialSpark and Magellan for preprocessing environmental predictor variables and species occurrence data from global datasets such as GBIF, GlobeLand30 and/or FAO.
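To make the preprocessing step concrete, the sketch below attaches an environmental predictor value to each species occurrence by keying both datasets on a grid cell. It is a minimal plain-Python illustration, not the SpatialSpark or Magellan API: the cell size, the occurrence records, the temperature grid and the helper names `cell_key` and `attach_predictor` are all hypothetical. Keying by cell turns the lookup into a hash join, so no point has to be compared against every cell.

```python
CELL_SIZE_DEG = 1.0  # hypothetical grid resolution in degrees


def cell_key(lon, lat, cell_size=CELL_SIZE_DEG):
    """Return the integer (column, row) index of the grid cell containing a point."""
    # Floor division keeps negative coordinates in the correct cell.
    return (int(lon // cell_size), int(lat // cell_size))


# Hypothetical occurrence records: (species, longitude, latitude)
occurrences = [
    ("species_a", 5.67, 51.97),
    ("species_a", 5.12, 51.40),
    ("species_b", 6.30, 52.10),
]

# Hypothetical predictor grid: cell index -> mean annual temperature
temperature = {(5, 51): 10.4, (6, 52): 9.8}


def attach_predictor(records, grid):
    """Join each occurrence with the predictor value of its grid cell."""
    joined_records = []
    for species, lon, lat in records:
        key = cell_key(lon, lat)
        if key in grid:  # occurrences outside the grid are dropped
            joined_records.append((species, key, grid[key]))
    return joined_records


joined = attach_predictor(occurrences, temperature)
```

The same keying idea carries over to a distributed setting, where both datasets would be partitioned by cell index before being joined.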


  • Acquire in-depth big geodata processing skills with Apache Spark
  • Re-design existing ecological models using the Map/Reduce framework
  • Develop geoscripting tools for harvesting and processing global open datasets
  • Evaluate the performance of the developed solution against existing implementations
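
As an illustration of what re-designing a computation for Map/Reduce can involve, the sketch below computes the mean predictor value per species from partial (count, sum) pairs. It is a plain-Python sketch under stated assumptions, not an implementation of MAXENT, ENFA or GARP: the sample values and the helper names `map_sample`, `combine`, `reduce_by_key` and `merge` are hypothetical. The point is that (count, sum) pairs can be combined in any order, so each data partition can be reduced independently and the partial results merged afterwards.

```python
# Hypothetical (species, predictor value) pairs, e.g. temperature at each occurrence
samples = [
    ("species_a", 10.4),
    ("species_a", 11.0),
    ("species_b", 9.8),
    ("species_a", 9.6),
]


def map_sample(sample):
    """Map step: emit (key, partial aggregate) with the aggregate as (count, sum)."""
    species, value = sample
    return species, (1, value)


def combine(a, b):
    """Associative, commutative combiner for (count, sum) partials."""
    return (a[0] + b[0], a[1] + b[1])


def reduce_by_key(pairs):
    """Reduce step: fold all partials that share a key into one partial."""
    totals = {}
    for key, partial in pairs:
        totals[key] = combine(totals[key], partial) if key in totals else partial
    return totals


def merge(left, right):
    """Merge the reduced results of two partitions."""
    return reduce_by_key(list(left.items()) + list(right.items()))


def means(totals):
    """Turn (count, sum) partials into per-key means."""
    return {key: total / count for key, (count, total) in totals.items()}


# Reduce two partitions independently, then merge the partial results.
partition_1 = reduce_by_key(map(map_sample, samples[:2]))
partition_2 = reduce_by_key(map(map_sample, samples[2:]))
mean_by_species = means(merge(partition_1, partition_2))
```

In PySpark, the same shape would typically be expressed with `RDD.map` followed by `RDD.reduceByKey`, with `combine` as the reducing function.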


  • H. Karau, A. Konwinski, P. Wendell and M. Zaharia, Learning Spark, O’Reilly, 2015.


  • Big data course
  • Geo-scripting course
  • Programming in Python course

Theme(s): Modelling & visualisation, Human – space interaction