File formats

A file format is the container in which data or information resides, such as .csv, .xlsx, .R, .mat, .sas, or .pdf. A file format may be specific to particular software or machines; when it is controlled and owned by a company, it is termed a proprietary format.

If you want to exchange data with others or use the data at a later stage, saving it in a proprietary format may cause problems. Software versions change and their associated formats change with them (think of .doc vs .docx incompatibility issues), and if you or others don’t have a licence for the required software, the files are unusable.

Therefore, when choosing a file format, consider:

  • whether you can save the data in a non-proprietary (i.e. open) file format. Open file formats can be read by many different open (and closed) software applications, which makes reuse, exchange, and integration easier. However, some functionality of proprietary formats may not be transferable to open formats. For example, figures, formulas, and multiple sheets in .xlsx cannot be transferred to .csv or .txt. It is important to determine what information needs to be retained and what might not be supported, and to decide how to preserve the data. For example, each .xlsx sheet can be stored as a separate .csv file, figures and formula results can be produced with scripts (such as .R or .py), figures can be stored separately as .jpg, and the formulas used should be described clearly in readme files. When specific functionality is required, it is recommended to store the data in both the proprietary and the open format.
  • that, when it is necessary to save files in a proprietary format, the documentation should state at least the name (and company) and the version of the software used to generate the files.
  • that some proprietary formats, such as PDF or ESRI Shapefiles, have become ad-hoc standards. However, be aware that ad-hoc standards may become obsolete and may not always be the best choice. For example, Adobe Flash was widely used for moving images, but Adobe ended support for it at the end of 2020 and it is now obsolete.
  • that data repositories provide lists of recommended or preferred open file formats for publishing data through them. The repositories are confident that these formats offer the best long-term guarantees in terms of usability, accessibility, and sustainability. The lists can also help you select a format for data exchange during your work. As an example, check the preferred file format lists of two data repositories supported by WUR: DANS-EASY and 4TU.ResearchData.
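
The advice above on splitting a workbook into open files and documenting the original software can be sketched in Python. This is a minimal sketch using only the standard library, not a prescribed workflow. It assumes the sheets have already been read into memory as lists of rows; reading the .xlsx file itself needs a third-party library, which is left out here. The function name export_sheets, the README.txt layout, and the example software name and version are illustrative assumptions.

```python
import csv
from pathlib import Path


def export_sheets(sheets, out_dir, software, version, formulas=None):
    """Write one CSV file per sheet plus a README describing the export.

    sheets   -- dict mapping a sheet name to a list of rows (lists of values)
    software -- name of the program that produced the original file
    version  -- version of that program
    formulas -- optional dict mapping a label to a plain-text description
                of the formula used in the original workbook
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for name, rows in sheets.items():
        # Keep file names portable: replace unusual characters with "_".
        # (A sketch only: two sheet names could map to the same file name.)
        safe = "".join(c if c.isalnum() or c in "-_" else "_" for c in name)
        path = out_dir / f"{safe}.csv"
        # newline="" lets the csv module control line endings itself.
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        written.append(path)

    # Record what cannot be stored in CSV: the source software and formulas.
    lines = [
        f"Original file created with: {software} {version}",
        "",
        "Sheets exported as CSV (UTF-8, comma-separated):",
    ]
    lines += [f"- {p.name}" for p in written]
    if formulas:
        lines += ["", "Formulas used in the original workbook:"]
        lines += [f"- {label}: {text}" for label, text in formulas.items()]
    readme = out_dir / "README.txt"
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return written + [readme]


if __name__ == "__main__":
    # Example values only; replace them with your own sheets and software.
    demo = {
        "Plot data": [["plot", "yield_kg"], ["A", 12.5], ["B", 9.8]],
        "Summary": [["mean_yield_kg"], [11.15]],
    }
    described = {"mean_yield_kg": "average of yield_kg over all plots"}
    for path in export_sheets(demo, "export_demo", "Microsoft Excel", "16.0", described):
        print(path)
```

Keeping the original proprietary file next to the exported folder, as recommended above, preserves any functionality that the CSV files and the README cannot capture.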

Support

Questions? Don't hesitate to contact data@wur.nl.