Course

PhD course on Big data exploration and object-oriented programming with Python

Organised by Wageningen Institute for Environment and Climate Research WIMEK
Venue Wageningen Campus

Scope

Python is a dynamic, readable language that is well suited to many kinds of numerical problems, from simple one-off scripts to large, complex software projects. This workshop is aimed at people who already have a basic knowledge of Python and are interested in using the language to explore and visualize large datasets and to write more complex programs using object-oriented programming techniques.

The workshop will use examples and exercises drawn from various aspects of environmental and climate sciences, along with a variety of different datasets. We will focus on the use of the pandas and seaborn packages for data manipulation and visualization, as well as using parts of the standard library to write custom classes and integrate them with the rest of the language.

Learning goals

After completing the workshop, students will:

  • Be able to take advantage of the advanced object-oriented Python features in their own programs;
  • Have a deeper understanding of how Python works internally, which will be invaluable when making sense of existing code and packages;
  • Be confident in tackling very large datasets with pandas;
  • Have a good overview of the different options for data visualization with Python, and what kinds of data each is suitable for.

Contents

The course starts with the core concepts of object-oriented programming: classes, instances, methods vs. functions, constructors, and magic methods. We then introduce some more advanced ideas, looking at the concept behind each, the scenarios where it is useful, and the details of how it works in Python. These include inheritance and class hierarchies, method overriding, superclasses and subclasses, polymorphism, composition, and multiple inheritance.

Next, we look at the main difference between working with core Python objects and working with pandas. We then turn our attention from data analysis to data visualization, starting with an overview of the seaborn package before diving straight into the core chart types for looking at distributions and relationships. We also survey very common chart types like strip plots, box plots and bar plots, along with less common types like swarm, violin and boxen plots.

We finish the taught part of the course by looking at real-life datasets and the tools pandas gives us to overcome the difficulties they pose. We will also look at some best practices for making sure that code runs quickly, and some options for what to do when code is too slow.

Programme

Session 1 (Monday morning): Classes and objects

General In this session, we will introduce the core concepts of object-oriented programming, and see how the data types that we use all the time in Python are actually examples of classes. We'll take a very simple example and use it to examine how we can construct our own classes, moving from an imperative style of programming to an object-oriented style. As we do so, we'll discuss where and when object-orientation is a good idea. Core concepts introduced: classes, instances, methods vs. functions, self, constructors, magic methods.
Practical We will practise writing classes to solve simple (climate) problems, and familiarize ourselves with the division of code into library and client that object-oriented programming demands.
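As a rough illustration of the concepts listed for this session, the sketch below defines a small class with a constructor, an ordinary method and two magic methods. The class name, the station label and the temperature values are hypothetical and are not taken from the course material.

```python
class TemperatureSeries:
    """A hypothetical container for daily temperatures in degrees Celsius."""

    def __init__(self, station, readings):
        # The constructor stores the data on the instance via `self`.
        self.station = station
        self.readings = list(readings)

    def mean(self):
        # A method: a function that belongs to the class and acts on one instance.
        return sum(self.readings) / len(self.readings)

    def __len__(self):
        # Magic method: lets the built-in len() work on our objects.
        return len(self.readings)

    def __repr__(self):
        # Magic method: controls how the object is shown when printed.
        return f"TemperatureSeries({self.station!r}, n={len(self)})"


# Client code: uses the class without knowing how it stores its data.
series = TemperatureSeries("Wageningen", [11.0, 12.0, 13.0])
print(series)         # TemperatureSeries('Wageningen', n=3)
print(len(series))    # 3
print(series.mean())  # 12.0
```

The class definition plays the role of the library, while the last four lines play the role of the client.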

Session 2 (Monday afternoon): Object-oriented programming

General Following on from the previous session, we will go over some advanced ideas that are common to most object-oriented programming languages. For each idea we'll discuss the basic concept, the scenarios in which it is useful, and the details of how it works in Python. This overview will also allow us to consider the challenges involved in designing object-oriented code. Core concepts introduced: inheritance and class hierarchies, method overriding, superclasses and subclasses, polymorphism, composition, multiple inheritance.
Practical In practice, we will work on a simulation which will involve multiple classes working together.
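To give a feel for how several classes can work together, here is a minimal sketch of a toy simulation. It is an assumption-laden example, not the simulation used in the course: the class names, energy values and update rules are all made up.

```python
class Organism:
    """Hypothetical base class for a toy ecosystem simulation."""

    def __init__(self, name, energy):
        self.name = name
        self.energy = energy

    def step(self):
        # Default behaviour: lose one unit of energy per time step.
        self.energy -= 1


class Plant(Organism):
    def step(self):
        # Method overriding: plants gain energy instead of losing it.
        self.energy += 2


class Herbivore(Organism):
    def step(self):
        # super() reuses the superclass behaviour before adding to it.
        super().step()
        self.energy -= 1  # moving around costs extra energy


class Ecosystem:
    """Composition: an Ecosystem *has* organisms rather than *being* one."""

    def __init__(self, organisms):
        self.organisms = list(organisms)

    def step(self):
        # Polymorphism: the same call runs a different method per subclass.
        for organism in self.organisms:
            organism.step()

    def total_energy(self):
        return sum(organism.energy for organism in self.organisms)


world = Ecosystem([Plant("grass", 5), Herbivore("rabbit", 10)])
world.step()
print(world.total_energy())  # 15 (grass: 5 + 2, rabbit: 10 - 2)
```

Inheritance is used where a subclass is a kind of its superclass, and composition where one object merely contains others; choosing between the two is one of the design challenges mentioned above.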

Session 3 (Tuesday morning): Data models, Series objects and thinking in columns

General In this session, we will address the main difference between working with core Python objects and working with pandas: the need to think about operating on entire columns of values rather than one value at a time. Here, a large number of examples will help to make this clear. Once we start thinking in this way, we will find that we can do many common data processing tasks - filtering rows and columns, creating new columns, sorting, and summarizing columns - with very little code.
Practical After a look at some special types of filtering that require slightly different syntax, we are in a position to practise solving some fairly tricky data analysis questions that involve a mixture of selecting, filtering and aggregating columns.
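The column-oriented style described in this session can be sketched as follows. The dataframe is made up for illustration, and the column names are assumptions rather than those of the course datasets.

```python
import pandas as pd

# Made-up monthly measurements for two hypothetical stations.
df = pd.DataFrame(
    {
        "station": ["A", "A", "A", "B", "B", "B"],
        "month": [1, 2, 3, 1, 2, 3],
        "temp_c": [2.0, 4.0, 9.0, -1.0, 1.0, 6.0],
        "rain_mm": [60.0, 45.0, 30.0, 80.0, 70.0, 50.0],
    }
)

# Create a new column from a whole column at once, with no explicit loop.
df["temp_k"] = df["temp_c"] + 273.15

# Filter rows with a boolean mask built from column comparisons.
mild_and_dry = df[(df["temp_c"] > 3.0) & (df["rain_mm"] < 55.0)]

# Summarize: mean temperature per station, sorted from warmest to coldest.
mean_temp = df.groupby("station")["temp_c"].mean().sort_values(ascending=False)

print(mild_and_dry[["station", "month"]])
print(mean_temp)
```

Each step operates on entire columns, so the same few lines work unchanged whether the table has six rows or six million.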

Session 4 (Wednesday morning): Introducing seaborn

General In this session, we will turn our attention from data *analysis* (which normally produces tables of values as output) to data *visualization* (which produces figures as output). We will start with an overview of the seaborn package, then dive straight into the core chart types for looking at distributions and relationships. Histograms, kernel density plots and scatter plots are covered in this session, along with a few more exotic chart types like hex plots and contour plots, which can be useful alternatives to scatter plots when we have very large numbers of points to deal with. In this session we will also explore the power of seaborn's ability to map dataframe columns to things like marker size, shape and colour, and to easily make small multiple plots.
Practical Just like with pandas, by the end of this session we will understand how to make complex charts with only a small amount of code.

Session 5 (Wednesday afternoon): Categorical axes with seaborn

General In this session, we will survey the surprisingly large number of options we have for displaying categorical data. These include very common chart types like strip plots, box plots and bar plots, along with less common types like swarm, violin and boxen plots.
Practical The diversity of chart types covered in this and the previous session will be a good chance to discuss the trade-offs involved in creating visualizations of differing levels of detail. We will also look at a few more options for determining the style and appearance of charts, with a particular focus on the use of colour. We will learn about best practices for using colour effectively along with pitfalls to avoid.
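The trade-off between levels of detail can be sketched by drawing the same synthetic data three ways. The site names, values and the choice of a single colour are assumptions made for this example only.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen so the script also runs without a display

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic soil moisture values for three hypothetical sites.
rng = np.random.default_rng(seed=1)
sites = ["forest", "grassland", "wetland"]
df = pd.DataFrame(
    {
        "site": np.repeat(sites, 50),
        "soil_moisture": np.concatenate(
            [rng.normal(loc, 4.0, 50) for loc in (30.0, 20.0, 45.0)]
        ),
    }
)

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)

# Least detail: one bar per category showing the mean with an error bar.
sns.barplot(data=df, x="site", y="soil_moisture", color="steelblue", ax=axes[0])
# Medium detail: median, quartiles and outliers.
sns.boxplot(data=df, x="site", y="soil_moisture", color="steelblue", ax=axes[1])
# Most detail: every observation drawn as a point.
sns.stripplot(data=df, x="site", y="soil_moisture", color="steelblue", ax=axes[2])

for ax, title in zip(axes, ["bar", "box", "strip"]):
    ax.set_title(title)
```

A single colour is used here because the site is already encoded on the x axis; adding a separate colour per site would repeat that information rather than add to it.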

Session 6 (Thursday morning): Complex and large data files

General To make things easier when getting started, all of the data files we have used in the course up to this point have been designed to be straightforward to use. However, real life datasets will often not be so cooperative. In this session we will look at some common features of datasets that can be difficult to work with, and see what tools pandas gives us to overcome these difficulties.
Practical We will look at examples of datasets that have missing and invalid values; that are spread over multiple files; and that are too large to easily fit into memory. All of these challenges can be overcome with some careful use of the pandas API. We will also look at some best practices for making sure that code runs quickly, and some options for what to do when code is too slow. Some of these ideas are quite technical, but the principles are useful for many different types of programming. We will discuss the difference between loops and vectorized code, *caching* and *memoization*, sampling, and how pandas-specific tools like indices and categories influence performance.
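Two of these ideas can be sketched in a few lines: reading a file in chunks while treating invalid values as missing, and caching the results of a repeated lookup with `functools.lru_cache`. The file contents, the `-999` sentinel, the `missing` marker and the `station_elevation` function are all hypothetical.

```python
import io
from functools import lru_cache

import pandas as pd

# A small in-memory stand-in for a large CSV file; the values are made up.
raw = io.StringIO(
    "station,temp_c\n"
    "A,10.0\n"
    "A,missing\n"
    "B,12.0\n"
    "B,-999\n"
    "A,14.0\n"
    "B,16.0\n"
)

total = 0.0
count = 0
# chunksize makes read_csv return an iterator of small DataFrames,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv(raw, chunksize=2, na_values=["missing"]):
    temps = chunk["temp_c"]
    # Treat the sentinel -999 as missing (vectorized, no Python loop).
    temps = temps.mask(temps == -999)
    total += temps.sum()    # sum() skips missing values by default
    count += temps.count()  # count() counts only non-missing values

mean_temp = total / count
print(mean_temp)  # 13.0

lookups = []  # records how often the function body really runs


@lru_cache(maxsize=None)
def station_elevation(station):
    # Stand-in for an expensive lookup; results are cached per argument.
    lookups.append(station)
    return {"A": 7.0, "B": 45.0}[station]


elevations = [station_elevation(s) for s in ["A", "B", "A", "A", "B"]]
print(len(lookups))  # 2: the function body ran once per distinct station
```

Running totals work for a mean because sums and counts can be combined across chunks; statistics such as the median cannot be assembled this way and need a different approach, for example sampling.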

Session 7 (Friday morning): Wrap up / presentation / discussion

General information

Registration

Early bird registration deadline: 26 January 2024
Regular registration deadline: 12 February 2024

Target group

PhD candidates who wish to extend their programming knowledge using Python. The course is also open to postdocs and professionals who work with complex datasets.

Preparation info

Preparation material will be provided several weeks before the course starts.

Group size

Minimum: 10
Maximum: 15

Course duration

3.5 days

Credit points

1.2 EC

Language

English

Fee

WUR PhDs with TSP €220 (Early) / €270 (Regular)
SENSE PhDs with TSP €440 (Early) / €490 (Regular)
Other PhDs €480 (Early) / €530 (Regular)
Staff of WUR graduate schools €480 (Early) / €530 (Regular)
Others €520 (Early) / €570 (Regular)

The course fee includes coffee, tea and lunch on all 5 days, and drinks on day 5.

The fee does not include accommodation, breakfast or dinner, but there are several accommodation options in Wageningen. For information on B&Bs and hotels in Wageningen, please visit proefwageningen.nl/overnachten. Another option is Short Stay Wageningen. Furthermore, Airbnb offers several rooms in the area. Note that besides the restaurants in Wageningen, there are also options to have dinner at Wageningen Campus.

Cancellation conditions

  • Up to 4 (four) weeks prior to the start of the course, cancellation is free of charge.
  • Up to 2 (two) weeks prior to the start of the course, a fee of €150 will be charged.
  • In case of cancellation ten days or fewer prior to the start of the course, or if you do not show up at all, the full fee will be charged.

Note: If you would like to cancel your registration, ALWAYS inform us. By NOT paying the participation fee, your registration is NOT automatically cancelled (and do note that you will be held to the cancellation conditions).

Also note that when there are not enough participants, we can cancel the course. We will inform you if this is the case a week after the early bird deadline. Please take this into account when arranging your trip to the course (i.e. check the reimbursement policies).