Thesis subject

Automated Data Extraction of Scientific Literature

Level: MSc

Research area/discipline: Data Science, Information Systems

Prerequisites: Programming in Python (INF-22306), Big Data (INF-34306), Data Science Concepts (INF-34306) (or Machine Learning (FTE-35306))

Short description:

Unilever has recently been aiming to shift its focus from artificial to natural flavors, as this adds value to its products. Because most flavor research to date has concentrated on artificial flavors, existing literature reviews on this topic must be re-evaluated.

Systematic Literature Review (SLR) studies, one of the most robust review methods, aim to identify relevant primary papers, extract the required data, and analyze and synthesize the results to gain broader and deeper insight into the investigated domain. Conducting an SLR is time-consuming, laborious, and costly. For this reason, several researchers have developed techniques to automate the SLR process. Previous research has also shown that data extraction is one of the most time-consuming steps in that process. This thesis will therefore investigate automated data extraction from scientific literature, with a focus on food sciences. Examples of data to extract from scientific literature include ingredients, volumes, cooking times, tables, and images.
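To make the kind of target data concrete, the following is a minimal rule-based sketch in Python that pulls volumes and cooking times out of a sentence. It is only an illustrative baseline, not the method of this thesis: the patterns, the unit lists, the function name extract_measurements, and the example sentence are all assumptions made for illustration.

```python
import re

# Hypothetical patterns for illustration only; real articles use far more
# varied notation, which is one motivation for a learned approach.
VOLUME_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mL|ml|L|l)\b")
TIME_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(minutes|min|hours|h)\b")


def extract_measurements(sentence):
    """Return the volumes and durations found in a sentence as (value, unit) pairs."""
    volumes = [(float(value), unit) for value, unit in VOLUME_RE.findall(sentence)]
    times = [(float(value), unit) for value, unit in TIME_RE.findall(sentence)]
    return {"volumes": volumes, "times": times}


# Made-up example sentence, not taken from any article.
example = "The extract was diluted in 250 mL of water and heated for 30 min."
print(extract_measurements(example))
# {'volumes': [(250.0, 'mL')], 'times': [(30.0, 'min')]}
```

A rule set like this breaks down quickly on unseen phrasing, tables, and images, which is where the automated techniques studied in this thesis come in.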


The work in this master thesis entails:

  • To collect full-text articles or PDFs from SLRs in the food sciences field.
  • To assess the solutions available to extract data from the scientific literature in a scalable and efficient manner.
  • To design and develop a Machine Learning algorithm that extracts data from scientific literature.
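As a rough idea of what a first Machine Learning baseline for the last step might look like, the sketch below trains a small multinomial Naive Bayes classifier, written in plain Python, to flag sentences that are likely to contain extractable data. Naive Bayes is used here only as a simple stand-in; the class name, the labels "data" and "other", and the toy training sentences are hypothetical and do not come from any real SLR corpus.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


class NaiveBayesSentenceClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, sentences, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(tokenize(sentence))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, sentence):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for word in tokenize(sentence):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data for illustration only.
train = [
    ("Add 200 mL of milk and stir for 5 min", "data"),
    ("The mixture was heated for 30 min at 80 C", "data"),
    ("Samples were diluted in 50 mL of water", "data"),
    ("Natural flavors are of growing interest", "other"),
    ("Previous reviews focused on artificial flavors", "other"),
    ("We thank the reviewers for their comments", "other"),
]
model = NaiveBayesSentenceClassifier().fit([s for s, _ in train], [y for _, y in train])
print(model.predict("The sample was heated for 10 min in 100 mL of water"))  # data
print(model.predict("Reviews of natural flavors"))  # other
```

In practice, the labeled sentences would come from the collected full-text articles, and the classifier would be compared against the existing solutions assessed in the second step.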

Required skills/knowledge:

Programming in Python, basic data analytics, and machine learning techniques

Relevant literature:

  • D.D.A. Bui, G. Del Fiol, J.F. Hurdle, S. Jonnalagadda, Extractive text summarization system to aid data extraction from full text in systematic review development, Journal of Biomedical Informatics, 64 (2016a) 265-272.
  • D.D.A. Bui, G. Del Fiol, S. Jonnalagadda, PDF text classification to leverage information extraction from publication reports, Journal of Biomedical Informatics, 61 (2016b) 141-148.
  • M.B. Aliyu, R. Iqbal, A. James, The Canonical Model of Structure for Data Extraction in Systematic Reviews of Scientific Research Articles, in: 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), 2018, pp. 264-271.
  • C. Blake, A. Lucic, Automatic endpoint detection to support the systematic review process, Journal of Biomedical Informatics, 56 (2015) 42-56.
  • B.C. Wallace, J. Kuiper, A. Sharma, M. Zhu, I.J. Marshall, Extracting PICO sentences from clinical trial reports using supervised distant supervision, Journal of Machine Learning Research, 17 (2016) 4572-4596.

For more information:

To make an appointment to discuss the thesis topic, please send an email to: Dr.Ir. Tarek AlSkaif | Assistant Professor | Information Technology group (INF) | Wageningen University & Research (WUR)