Research Data Management

Data description and collection or re-use of existing data

According to Science Europe, when developing a data management plan, the first topic researchers are required to address is "Data description and collection or re-use of existing data", which broadly encompasses two main questions:

 What data will be collected or produced?

  • Give details on the kind of data: for example numeric (databases, spreadsheets), textual (documents), image, audio, video, and/or mixed media.
  • What data formats will be used (for example pdf, csv, txt or rdf)?
  • Justify the use of certain formats. For example, decisions may be based on staff expertise within the host organisation, a preference for open formats, standards accepted by data repositories, widespread usage within the research community, or on the software or equipment that will be used.
  • Give preference to open and standard formats as they facilitate sharing and long-term re-use of data.
  • Give details on the volumes (they can be expressed in storage space required (bytes), and/or in numbers of objects or files).
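
As an illustration of the last point, volumes can be estimated from a pilot or sample dataset before writing the plan. The following is a minimal Python sketch, not part of the Science Europe guidance; the function name `summarise_volume` and the idea of pointing it at a sample folder are assumptions made for the example.

```python
from collections import Counter
from pathlib import Path


def summarise_volume(root):
    """Return (total size in bytes, number of files per extension) under root."""
    total_bytes = 0
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            total_bytes += path.stat().st_size
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return total_bytes, dict(counts)
```

Calling `summarise_volume("pilot_data")` on a hypothetical folder of pilot data gives both measures the guidance mentions: storage space in bytes and the number of files, here broken down by file extension. Scaling those figures to the full project remains a judgement call.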

 How will new data be collected or produced and/or how will existing data be re-used?

  • Explain which methodologies or software will be needed to create/process/visualise the data.
  • State any constraints on re-use of existing data if there are any.
  • Explain how data provenance will be documented.
  • Briefly state the reasons if the re-use of any existing data sources has been considered but discarded.
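
One possible way to document provenance for a re-used file is to store a small record next to it. The Python sketch below is only an illustration under stated assumptions: the field names (`source`, `licence`, `processing_steps`) and the `.provenance.json` sidecar naming are hypothetical choices, not a standard required by Science Europe or any repository.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Compute a SHA-256 checksum so later copies can be verified."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(path, source, licence, steps):
    """Build a dictionary describing where a file came from (hypothetical fields)."""
    path = Path(path)
    return {
        "file": path.name,
        "sha256": sha256_of(path),
        "size_bytes": path.stat().st_size,
        "source": source,
        "licence": licence,
        "processing_steps": list(steps),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def write_provenance(path, **details):
    """Write the record as JSON next to the data file and return its path."""
    record = provenance_record(path, **details)
    sidecar = Path(str(path) + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

The checksum lets anyone confirm later that the file is the same one that was originally obtained, while the free-text fields capture the source, any re-use constraints and the processing applied.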

What are Research Data?

Although research data can take many forms, and are often discipline-specific, at a basic level research data can be described as "any information that has been collected, observed, generated or created to validate original research findings" (University of Leeds). Common examples of research data include measurements, experimental results, fieldwork observations, interview recordings and images. Although usually digital, research data also include non-digital formats such as laboratory notebooks.

File Formats and Standards

When choosing file formats for research data it's important to consider whether the format is open and/or ubiquitous. File formats that are open or non-proprietary will tend to remain accessible, even if the software that created them is no longer available. Therefore, the use of closed proprietary formats will not normally be appropriate. However, formats which are ubiquitous or have become the default standard within a discipline, whether proprietary or not, are also more likely to be maintained into the future. It may be useful to keep your data in one format for collection and analysis, and to save a copy in a more open or accessible format for sharing or archiving once your project is complete. Many data archives and repositories will already have recommended file formats based on best practice within the disciplines they support.

When choosing a file format you should consider the following:

  • How you plan to analyse your data
  • Which software and file formats you and your colleagues have used in the past
  • Any discipline-specific norms or technical standards
  • Whether file formats are at risk of obsolescence because of their dependence on a particular technology
  • Which formats are best to use for the long-term preservation of data
  • Whether important information might be lost by converting between different formats
  • The possibility of embedding metadata that describes content within the file itself, e.g. creator information, variable names and labels
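
To illustrate the point about information being lost in conversion, one simple check is a round trip: write the data to the target format, read it back, and compare. The Python sketch below uses CSV as the target and only the standard library; the function names and the sample row are made up for the example.

```python
import csv
import io


def csv_round_trip(rows, fieldnames):
    """Write rows to CSV in memory, then read them back as dictionaries."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    buffer.seek(0)
    return list(csv.DictReader(buffer))


def changed_values(original, restored):
    """List (row index, column, before, after) for values that did not survive."""
    changes = []
    for index, (before, after) in enumerate(zip(original, restored)):
        for column, value in before.items():
            if after.get(column) != value:
                changes.append((index, column, value, after.get(column)))
    return changes


# Hypothetical sample: a text identifier, a number and a missing value.
rows = [{"id": "001", "score": 4.5, "note": None}]
restored = csv_round_trip(rows, ["id", "score", "note"])
print(changed_values(rows, restored))
```

In this sample the identifier survives, but the number comes back as text and the missing value comes back as an empty string, because CSV stores no type information. Differences of this kind are worth recording in the accompanying documentation, or avoiding by keeping a metadata file alongside the CSV.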

If you are unsure which format you should use, the UK Data Service provides the following guidelines:

  • Textual data: Rich Text Format (.rtf), eXtensible Mark-up Language (XML) text according to an appropriate Document Type Definition (DTD) or schema (.xml), plain text, ASCII (.txt), PDF/A (.pdf, Archival PDF)
  • Tabular data with extensive metadata: SPSS portable format (.por), delimited text and command ('setup') file (SPSS, Stata, SAS, etc.) and structured text or mark-up file of metadata information, e.g. DDI XML file
  • Tabular data with minimal metadata (including spreadsheets): Comma-separated values (CSV) file (.csv) and tab-delimited file (.tab)
  • Video: MPEG-4 (.mp4), OGG video (.ogv, .ogg) and motion JPEG 2000 (.mj2)
  • Images: TIFF version 6.0 uncompressed (.tif), JPEG (.jpeg, .jpg, .jp) (note: JPEG is a 'lossy' format that loses information each time a file is re-saved, so only use it if the data were created in this format or if you are not concerned about image quality)
  • Audio: Free Lossless Audio Codec (FLAC) (.flac), Waveform Audio Format (WAV) (.wav) and MPEG-1 Audio Layer 3 (.mp3) if the data were created in this format.
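
As a rough aid, the list above can be turned into a quick check of which project files fall outside the listed formats. The Python sketch below is an abridged transcription of the extensions named in the list, not an official UK Data Service tool, and the names `LISTED_EXTENSIONS`, `classify` and `unlisted_files` are invented for the example. A file extension is only a hint: for instance, it cannot tell PDF/A apart from ordinary PDF, or uncompressed TIFF from compressed TIFF.

```python
from pathlib import Path

# Abridged from the list above; extensions alone do not prove conformance.
LISTED_EXTENSIONS = {
    "textual": {".rtf", ".xml", ".txt", ".pdf"},
    "tabular": {".por", ".csv", ".tab"},
    "video": {".mp4", ".ogv", ".ogg", ".mj2"},
    "image": {".tif", ".jpeg", ".jpg"},
    "audio": {".flac", ".wav", ".mp3"},
}


def classify(filename):
    """Return the category for a listed extension, or None if it is not listed."""
    suffix = Path(filename).suffix.lower()
    for category, extensions in LISTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None


def unlisted_files(filenames):
    """Return the files whose extension is not in the list above."""
    return [name for name in filenames if classify(name) is None]
```

For example, `unlisted_files(["results.csv", "results.xlsx", "scan.tif"])` would flag only `results.xlsx`, which could then be exported to one of the listed formats before sharing or archiving.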

Reusing Existing Research Data

Contemporary research is often collaborative, and reusing existing research data has become common practice in many disciplines. Although reuse is convenient and cost-effective, it is essential that you take the time to familiarise yourself with the data and check the accompanying documentation for collection procedures, data cleaning procedures, usage licences and other technical information to make sure the data are suitable for your research. Reusing existing research data also does not exempt you from GDPR or any other relevant regulatory and ethical policies that researchers must comply with.

Data Collection and Description Resources