Big spatial data technology

Use case

Big spatial data technology: from traditional systems to scalable computing

A vast amount of data with both spatial and temporal attributes is being generated by our increasing use of sensor-equipped devices such as drones, smartphones and Internet of Things (IoT) devices. At the same time, people are becoming accustomed to high-speed data processing, artificial intelligence applications and real-time visualisations from vendors such as Google, Microsoft, Apple and Facebook.

Their products depend largely on extremely fast data processing, driven by extensive hardware and scalable computing. If we want to do similar things in our own field, albeit on a smaller scale, we must consider this paradigm shift from the traditional systems we have grown accustomed to (e.g. in ICT, spatial data infrastructures and geographic information systems) to more scalable computing solutions.

Our approach: tackling the paradigm shift

Scalable computing means using an infrastructure that scales up when the demand for computing power increases and scales back down when that demand decreases. This ensures the effective use of the computing power available in a computer cluster (e.g. from a datacentre or a cloud service provider). WUR IT will provide such systems via Microsoft Azure and Red Hat OpenShift. These facilities can be used to build reactive systems, a widely adopted software architecture for creating applications that are responsive, resilient, elastic and message-driven. These four key properties are described in the Reactive Manifesto and ensure that the resulting system is flexible, loosely coupled and scalable. This makes it highly responsive and able to meet the requirements of modern users, even when processing large volumes of spatial and temporal data produced at high velocity.
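
The elasticity described above can be illustrated with a simple sizing rule. The sketch below is a minimal illustration and not the mechanism used by Azure or OpenShift: the function name `desired_workers`, the worker limits and the load figures are all assumptions made for this example. The proportional rule is similar in spirit to the one documented for the Kubernetes Horizontal Pod Autoscaler.

```python
import math


def desired_workers(current_workers: int,
                    load_per_worker: float,
                    target_load_per_worker: float,
                    min_workers: int = 1,
                    max_workers: int = 20) -> int:
    """Return a worker count that moves the load per worker towards the target.

    All names and limits here are hypothetical; a real cluster would take
    them from its own configuration and metrics.
    """
    if current_workers < 1:
        raise ValueError("current_workers must be at least 1")
    if target_load_per_worker <= 0:
        raise ValueError("target_load_per_worker must be positive")

    # Proportional rule: scale the worker count by the ratio of observed
    # load to target load, rounding up so the target is not exceeded.
    raw = math.ceil(current_workers * load_per_worker / target_load_per_worker)

    # Clamp to the allowed range so the cluster neither vanishes nor
    # grows without bound.
    return max(min_workers, min(max_workers, raw))


# Demand rises: 4 workers at 90 requests/s each, target 60 -> scale out.
scale_out = desired_workers(4, 90.0, 60.0)
# Demand drops: 6 workers at 10 requests/s each, target 60 -> scale in.
scale_in = desired_workers(6, 10.0, 60.0)
```

With these made-up numbers the rule asks for more workers when the load per worker is above the target and fewer when it is below, which is the behaviour the term "elastic" refers to.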

Tackling this paradigm shift first calls for acquiring essential knowledge about new technologies (e.g. Spark, Hadoop, Flink and NoSQL databases), new programming paradigms (functional programming) and the accompanying programming languages (e.g. Scala, Clojure and Haskell). These can then be combined with more domain-specific functionality for spatial data processing (e.g. GeoTrellis and GeoMesa) and machine learning (e.g. TensorFlow, Keras and SparkML). There are also new software and architecture design patterns to consider (Command/Query Responsibility Segregation, Event Sourcing, microservices), as well as packaging and deployment options (Docker containers, Kubernetes). Ultimately, systems should be made accessible to a broad audience of researchers and developers by providing easy-to-use application programming interfaces (APIs) and Python or R bindings.
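
To make two of the patterns named above more concrete, the following is a minimal sketch of Event Sourcing combined with Command/Query Responsibility Segregation, written in plain Python rather than in any of the frameworks listed. Every name in it (`ObservationRecorded`, `EventStore`, `record_observation`, `mean_value_per_cell`) and the one-degree grid are assumptions made for illustration only. The command side only appends immutable events, while the query side derives a read model by folding over the event log in a functional style.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservationRecorded:
    """Immutable event: a sensor reported a value at a location."""
    sensor_id: str
    lon: float
    lat: float
    value: float


class EventStore:
    """Append-only, in-memory event log (a stand-in for a real store)."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def replay(self):
        # Return a tuple so callers cannot modify the log.
        return tuple(self._events)


def record_observation(store, sensor_id, lon, lat, value):
    """Command side: validate the input, then append an event."""
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        raise ValueError("coordinates out of range")
    store.append(ObservationRecorded(sensor_id, lon, lat, value))


def mean_value_per_cell(events, cell_size_deg=1.0):
    """Query side: fold the events into a mean value per grid cell."""
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [sum, count]
    for e in events:
        cell = (math.floor(e.lon / cell_size_deg),
                math.floor(e.lat / cell_size_deg))
        totals[cell][0] += e.value
        totals[cell][1] += 1
    return {cell: total / count for cell, (total, count) in totals.items()}


store = EventStore()
record_observation(store, "s1", 5.66, 51.97, 10.0)
record_observation(store, "s2", 5.10, 51.20, 20.0)
record_observation(store, "s3", 4.90, 52.37, 30.0)
read_model = mean_value_per_cell(store.replay())
```

Because the read model is computed purely from the event log, it can be rebuilt at any time or with a different cell size, and in a distributed setting the same fold could in principle be split over many workers.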

(Expected) impact of the approach

Reactive systems and scalable computing are essential for true big data processing. One of the core aspects of big data is that its volume and velocity exceed what traditional IT solutions can handle. While we don’t have to invent new technologies, we do have to invest in learning how to use them and in understanding how to apply them to our specific domain. For example, these technologies could benefit subsequent iterations of our AgInfra+ use cases and subsequent versions of the AgroDataCube by handling sub-field data at scale and with the responsiveness needed for advanced analytics and visualisations.

Next steps

In the coming years, substantial developments in the handling of big spatio-temporal data are expected. As this expertise is essential in many WUR domains, we will continue to improve our skills and capacities and to apply them in innovative research.

Facts and figures

  • The NASA archive of satellite Earth images has more than 500 TB of available data and is increasing by 25 GB per day.
  • Twitter users post about 10 million geotagged tweets per day, which is 2% of the entire daily Twitter stream.
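
As a rough back-of-the-envelope check of what the figures above imply, the short calculation below derives the size of the full daily Twitter stream and the yearly growth of the NASA archive. It assumes a constant daily rate and a 365-day year; the variable names are chosen for this example only.

```python
# Figures taken from the list above.
geotagged_tweets_per_day = 10_000_000
geotagged_share_percent = 2
archive_growth_gb_per_day = 25

# If 10 million tweets are 2% of the stream, the full stream is 50 times larger.
total_tweets_per_day = geotagged_tweets_per_day * 100 // geotagged_share_percent

# Assumption: the archive grows at a constant rate over a 365-day year.
archive_growth_gb_per_year = archive_growth_gb_per_day * 365

print(total_tweets_per_day)        # total tweets per day
print(archive_growth_gb_per_year)  # archive growth in GB per year
```

Under these assumptions the stream amounts to 500 million tweets per day, and the archive grows by 9,125 GB per year.
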
Figure 1 GeoTrellis geographic data processing engine demo

Tools used

  • Container platforms (Docker, Kubernetes)
  • Computing clusters (SURFsara, Microsoft Azure, OpenShift)
  • Grid infrastructure software (D4Science, gCube)
  • Scalable computing frameworks (Apache Hadoop, Spark)
  • Streaming data processing (Apache Flink, Akka)
  • Spatio-temporal data processing (GeoTrellis, GeoMesa, GeoSpark)
  • Machine learning frameworks (TensorFlow, SparkML, Keras)
  • Functional programming languages (Scala, Haskell, Clojure)

Cooperation with partners

  • WUR Information Technology Group
  • WUR Laboratory of Geo-Information Science and Remote Sensing
  • WUR Wageningen Data Competence Centre
  • WUR IT Services
  • AgInfra+ and AgroDataCube partners