A data-driven prediction of lifetime resilience of dairy cows using commercial sensor data collected during first lactation

Ouweltjes, Wijbrand; Spoelstra, Mirjam; Ducro, Bart; Haas, Yvette de; Kamphuis, Claudia


Reliable prediction of lifetime resilience early in life can help dairy farmers make better management decisions. Several studies have shown that time series sensor data can be used to predict lifetime resilience rankings. However, such predictions generally require translating sensor data into biologically meaningful sensor features, which demands proper feature definitions and extensive preprocessing. The objective of this study was to test the hypothesis that data-driven random forest algorithms can equal or improve on ordinal logistic regression in predicting lifetime resilience scores while requiring considerably less data preprocessing. To this end, we developed models that predict a cow's lifetime resilience early in her productive life from sensor data collected during first lactation. We used an existing data set from a Dutch experimental herd containing culled cows for which birth dates, insemination dates, calving dates, culling dates, and health treatments were available to calculate lifetime resilience scores. In addition, 4 types of first-lactation sensor data, converted to daily aggregated values, were available: milk yield, body weight, activity, and rumination. For each sensor, 14 sensor features were calculated, some based on absolute daily values and others on values relative to the herd average. First, we predicted lifetime resilience rank with stepwise ordinal logistic regression, using sensor features as predictors and a P-value of <0.2 as the cut-off. Second, we applied a random forest to the 6 features that remained in the final logistic regression model. Third, we applied a random forest to all sensor features. Finally, we applied a random forest that used the daily aggregated values directly as features. All models were validated with stratified 10-fold cross-validation, with 90% of the records in the training set and 10% in the validation set for each fold.
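The stratified 10-fold cross-validation described above can be illustrated with a minimal sketch. This is not the authors' code: the function names (`stratified_folds`, `train_validation_splits`), the three class labels, and the round-robin assignment are assumptions made for illustration, and only the Python standard library is used.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign record indices to k folds so that each class is spread
    as evenly as possible across the folds (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0  # continue the round-robin where the previous class stopped
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for i, idx in enumerate(members):
            folds[(offset + i) % k].append(idx)
        offset += len(members)
    return folds


def train_validation_splits(labels, k=10, seed=0):
    """Yield (training, validation) index lists; each fold serves once
    as the validation set while the other k - 1 folds form the training set."""
    folds = stratified_folds(labels, k, seed)
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Illustrative labels only; the class names and counts are assumptions.
labels = ["low"] * 30 + ["medium"] * 40 + ["high"] * 30
for training, validation in train_validation_splits(labels, k=10, seed=1):
    # With 100 records and k = 10, each split is 90% training, 10% validation.
    assert len(training) == 90 and len(validation) == 10
```

Stratifying keeps the class proportions in every validation fold close to those of the full data set, which matters when accuracy is compared across models on a modest number of cows.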
Model performance was expressed as the percentage of correctly classified cows (accuracy, ± standard deviation) and the percentage of critically misclassified cows (i.e., high classified as low and vice versa). These were 45.1 ± 8.1% and 10.8% for the ordinal logistic regression model, 45.7 ± 8.4% and 16.0% for the random forest using the same 6 features as the logistic regression model, 48.4 ± 6.7% and 10.0% for the random forest with all sensor features, and 50.5 ± 6.3% and 8.4% for the random forest with daily sensor values. The random forest with daily sensor values also indicated that data collected in the early and late stages of first lactation seem to be more important for the prediction than data collected in mid lactation. Accuracies of the models were not significantly different, but the percentage of critically misclassified cows was significantly higher for the second model (the random forest using the 6 logistic regression features) than for the other models. We concluded that a data-driven random forest algorithm with daily aggregated sensor data as input can be used to predict lifetime resilience classification with an overall accuracy of ∼50%, and that it predicts at least as well as models with sensor features as input.
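The two performance measures reported above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the class labels `"low"`, `"medium"`, and `"high"`, the function names, and the choice of all cows as the denominator for critical misclassification are assumptions.

```python
from statistics import mean, stdev


def accuracy(true_classes, predicted_classes):
    """Fraction of cows whose predicted class equals the true class."""
    pairs = list(zip(true_classes, predicted_classes))
    return sum(t == p for t, p in pairs) / len(pairs)


def critical_misclassification_rate(true_classes, predicted_classes,
                                    low="low", high="high"):
    """Fraction of all cows predicted at the opposite extreme
    (high as low, or low as high). Denominator is an assumption."""
    pairs = list(zip(true_classes, predicted_classes))
    critical = sum(
        (t == low and p == high) or (t == high and p == low)
        for t, p in pairs
    )
    return critical / len(pairs)


def summarize(per_fold_values):
    """Mean and sample standard deviation across cross-validation folds."""
    return mean(per_fold_values), stdev(per_fold_values)


# Toy example with made-up labels, for illustration only.
true_classes = ["low", "low", "medium", "high", "high"]
predicted_classes = ["low", "high", "medium", "high", "low"]
print(accuracy(true_classes, predicted_classes))                         # 0.6
print(critical_misclassification_rate(true_classes, predicted_classes))  # 0.4
```

Reporting critical misclassification separately from accuracy distinguishes errors between adjacent classes from errors that place a cow at the opposite extreme, which are likely to be the more costly ones for management decisions.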