Colloquium

Multi-dimensional event detection on Twitter; CAS knowledge discovery using Big Data Analysis

Organiser Laboratory of Geo-information Science and Remote Sensing
Date

Tue 22 April 2014, 13:00 to 13:30

Location Gaia, building number 101
Droevendaalsesteeg 3
6708 PB Wageningen
+31 317 48 16 00
Room 1

By Boyen van Gorp

Abstract

Twitter is an easily accessible source of georeferenced data with great potential for geo-information science. Many researchers have attempted to extract events from Twitter, with mixed results. Ad hoc definitions are used to solve specific problems without attention to the overall problem, and a paradigm that can serve as a theoretical background for event detection from social media is missing. This research attempts to provide definitions of "events" and "Twitter", leading to an objective, polymorphic methodology that can be applied regardless of context.

A combination of cross-disciplinary definitions is used to form an overarching event definition. Events are identified as multi-dimensional collections of happenings with four dimensions: context, spatial, temporal and actor. Event detection therefore needs to account for the multi-dimensionality of events. Each dimension has three properties: centroid, volume and clustering. This research focuses on events in the digital reality and attempts to create a methodology analogous to the way the human brain detects events in the physical reality, in which happenings of similar dimensionality are clustered together to form events. Twitter is classified as a Complex Adaptive System (CAS), which has implications for its features and behaviours. Assumptions of homogeneity and linearity do not fit a CAS, which requires a complex analysis approach. Big Data analysis is especially suitable for analysing Twitter, as it allows complex events to be described in great detail. However, several steps are needed to reach a knowledge level at which the data can be described. A methodology is created for the first three steps of the BSID knowledge pyramid: procurement (Big Data), clustering (Small Data) and classification (Information) of the data.
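The three properties named above can be illustrated for the spatial dimension. The abstract does not say how they are computed, so the following is only a minimal sketch under stated assumptions: volume is taken as the number of happenings, and clustering as the mean distance from each happening to its k nearest neighbours. The names `DimensionProperties` and `spatial_properties` and the sample coordinates are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class DimensionProperties:
    centroid: tuple    # mean position of the happenings
    volume: int        # assumption: number of happenings
    clustering: float  # assumption: mean distance to the k nearest neighbours


def spatial_properties(points, k=2):
    """Summarise (x, y) points; a lower clustering value means a tighter group."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two points are needed")
    centroid = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
    knn_means = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        knn_means.append(sum(nearest) / len(nearest))
    return DimensionProperties(centroid, n, sum(knn_means) / n)


# Hypothetical projected coordinates (e.g. metres), not real tweet locations.
tight = spatial_properties([(0, 0), (0, 1), (1, 0), (1, 1)])
spread = spatial_properties([(0, 0), (0, 10), (10, 0), (10, 10)])
```

Under these assumptions the tightly grouped points yield a smaller clustering value than the widely spread ones, while both sets have the same volume.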

Big Data (tweets) is procured through the Twitter REST API from a Python interface. Tweets are stored in a MongoDB database, chosen for its flexible querying. A flexible query structure is essential because it allows efficient selections on the data, which saves computing resources. The step from Big Data to Small Data is made by clustering with the K-Means algorithm, using optimized initial centroids, in both the spatial and the context dimension. Small Data is turned into Information by classifying the resulting clusters using the properties of the dimensions, in line with the event definition. The clustering of a collection is measured with the k-Nearest Neighbour (kNN) algorithm for the spatial dimension and the k-Cosine Similarity (kCS) algorithm for the context dimension. The context dimension is normalized by building a term frequency (tf) inverse document frequency (idf) vector space model.
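The tf-idf weighting and cosine similarity used for the context dimension can be sketched as follows. This is an illustration only, not the implementation used in the research: it assumes the common weighting tf × log(N / df), works on hypothetical pre-tokenised tweets, and shows only the pairwise similarity, since the abstract does not specify how kCS aggregates over k. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict term -> weight) per tokenised document."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical, already tokenised tweets.
tweets = [
    ["flood", "wageningen", "rhine"],
    ["flood", "rhine", "dike"],
    ["concert", "amsterdam", "tonight"],
]
vectors = tfidf_vectors(tweets)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

Tweets that share terms get a positive similarity, while tweets with no terms in common score zero, which is what allows happenings with similar context to be grouped.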

Event collections are often built using Twitter's own structuring mechanisms (hashtags, mentions). Even when alternative methods are used, the data is not classified, which leaves the contents of the data uncertain. The proposed methodology addresses these issues through identification of event clusters in multiple dimensions, event clustering between and within collections, which improves the value of the data, and an objective multi-dimensional classification methodology.