Managing Big Geospatial Data with Apache Spark

Organised by Laboratory of Geo-information Science and Remote Sensing

Fri 9 March 2018 15:30 to 16:00

Venue Gaia, building number 101
Room 2

By Hector Muro (Spain)

Apache Spark is one of the most widely used and fast-evolving cluster-computing frameworks for big data. Because most environmental modeling applications involve spatial data, this research investigates the state of the art in managing big geospatial data. Apache Spark is a relatively new platform, and its geospatial extensions are mostly still works in progress. Three packages for handling geospatial data in Apache Spark were investigated: GeoSpark, GeoPySpark, and Magellan. The study first describes their functionality, then evaluates their performance on annoyingly big geospatial datasets, and finally compares that performance with a relational database management system. Conclusions are drawn about the maturity of the libraries and the scalability of solutions in Apache Spark, and opportunities for large-scale environmental modeling are discussed.