By Hector Muro (Spain)
Apache Spark is one of the most widely used and fastest-evolving cluster-computing frameworks for big data. Because most environmental modeling applications involve spatial data, this research investigates the state of the art in managing big geospatial data. As Apache Spark is a relatively new platform and its geospatial extensions are mostly still work in progress, three packages for handling geospatial data in Apache Spark were investigated, namely GeoSpark, GeoPySpark, and Magellan. Their functionality is first described, their performance is then evaluated on large geospatial datasets, and finally their performance is compared with that of a relational database management system. Conclusions are drawn about the maturity of the libraries and the scalability of solutions in Apache Spark, and opportunities for large-scale environmental modeling are discussed.
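The packages studied here differ in their APIs, but all distribute core spatial predicates, such as point-in-polygon tests used in spatial joins, across a cluster. As a toy, library-free sketch of one such predicate (the classic ray-casting algorithm; this code is illustrative only and is not taken from GeoSpark, GeoPySpark, or Magellan):

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting point-in-polygon test.

    `polygon` is a list of (x, y) vertices; edges connect consecutive
    vertices, with the last vertex closing back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal ray from (x, y) can cross it.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses the ray's height.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing flips in/out
    return inside

square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
print(point_in_polygon(2.0, 2.0, square))  # True: point inside the square
print(point_in_polygon(5.0, 2.0, square))  # False: point outside the square
```

In the libraries evaluated here, predicates like this are pushed down into partitioned, spatially indexed RDDs or DataFrames rather than evaluated point by point as above.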