
Research Data Management

Introduction to data description

According to Science Europe, when developing a data management plan, the first area you should address is "Data description and collection or re-use of existing data", which broadly encompasses two main questions:


1. What data will be collected or produced?

  • Give details on the kind of data: for example numeric (databases, spreadsheets), textual (documents), image, audio, video, and/or mixed media.
  • What data formats will be used (for example pdf, csv, txt or rdf)?
  • Justify the use of certain formats. For example, decisions may be based on staff expertise within the host organisation, a preference for open formats, standards accepted by data repositories, widespread usage within the research community, or on the software or equipment that will be used.
  • Give preference to open and standard formats as they facilitate sharing and long-term re-use of data.
  • Give details on the volumes (they can be expressed as storage space required (bytes), and/or as numbers of objects or files).

2. How will new data be collected or produced and/or how will existing data be re-used? 

  • Explain which methodologies or software will be needed to create/process/visualise the data.
  • State any constraints on re-use of existing data if there are any.
  • Explain how data provenance will be documented.
  • Briefly state the reasons if the re-use of any existing data sources has been considered but discarded.
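
One way to document provenance for a reused dataset is to keep a small machine-readable record alongside the file. The sketch below is illustrative only: the field names, the `provenance_record` helper and the example values are assumptions, not a format prescribed by Science Europe or any repository.

```python
import hashlib
import json
from datetime import date


def provenance_record(data: bytes, source: str, licence: str, retrieved: date) -> dict:
    """Build a small provenance record for a reused dataset (illustrative fields only)."""
    return {
        "source": source,
        "licence": licence,
        "retrieved": retrieved.isoformat(),
        "size_bytes": len(data),
        # A checksum lets you confirm later that the file has not changed.
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# Hypothetical example values; replace with the details of the real dataset.
sample = b"participant_id,age\n001,34\n002,41\n"
record = provenance_record(
    sample,
    source="Example repository (hypothetical)",
    licence="CC BY 4.0",
    retrieved=date(2024, 1, 15),
)
print(json.dumps(record, indent=2))
```

Saving a record like this next to each reused file gives you the source, licence and retrieval date in one place when you come to write up the DMP.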

What data will be collected or produced?

 


In this section of the Data Management Plan you should summarise the data you will collect or create, noting their content, coverage and type. According to the guidance from DMPTool, data can come in many forms, including:

  • Text: field or laboratory notes, survey responses
  • Numeric: tables, counts, measurements
  • Audiovisual: images, sound recordings, video
  • Models, computer code
  • Discipline-specific formats, such as CIF in chemistry
  • Instrument-specific formats, such as outputs from lab equipment

Why do you need to think about file formats? When choosing an electronic file format to create and store data, it is important to consider whether the format is open and/or ubiquitous. The format you use determines how accessible the data are to other users, as some files can only be opened if you have a licence for the software that created them. File formats also determine how accessible the data will be to you and others in the future: technology evolves quickly, and the software that you use today will become obsolete in time.

Why use open file formats? File formats that are open or non-proprietary will tend to remain accessible, even if the software that created them is no longer available. However, formats which are ubiquitous or have become the default standard within a discipline, whether proprietary or not, are also likely to be maintained into the future.

What if you have a preferred software? If you find it necessary or convenient to work with a proprietary format, it may be useful to store your data using that format for data collection and analysis, while also storing a copy in an open or accessible format for sharing or archiving once your project is complete.

Which format is best for FAIR data? Many data archives and repositories will already have recommended file formats based on best practice within the disciplines they support. 


When choosing a file format you should consider the following:

  • How you plan to analyse your data
  • Which software and file formats you and your colleagues have used in the past
  • Any discipline specific norms or technical standards
  • Whether file formats are at risk of obsolescence because of their dependence on a particular technology.
  • Which formats are best to use for the long-term preservation of data
  • Whether important information might be lost by converting between different formats


File formats likely to be accessible into the future (from DMPTool Guidance):

  • Non-proprietary
  • Open, with documented standards
  • In common usage by the research community
  • Using standard character encodings (e.g., ASCII, UTF-8)
  • Uncompressed (space permitting)
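
As a rough illustration of the character-encoding point above, the sketch below checks whether a file's bytes decode cleanly as UTF-8. The helper name `is_valid_utf8` and the sample text are assumptions made for this example.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same text saved in two different encodings.
utf8_bytes = "Café résumé".encode("utf-8")
latin1_bytes = "Café résumé".encode("latin-1")

print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(latin1_bytes))  # False
```

Plain ASCII text always passes this check, because ASCII is a subset of UTF-8. For a real file, read it in binary mode (`open(path, "rb")`) and pass the bytes to the function.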

Examples of preferred format choices (from DMPTool Guidance):

  • Image: JPEG, JPEG 2000, PNG, TIFF
  • Text: plain text (TXT), HTML, XML, PDF/A
  • Audio: AIFF, WAVE
  • Containers: TAR, GZIP, ZIP
  • Databases: prefer XML or CSV to native binary formats
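
To illustrate the preference for CSV over native binary database formats, here is a minimal sketch that exports a database table to CSV using only the Python standard library. SQLite is used purely as a stand-in for a project database; the table, column names and the `export_table_to_csv` helper are hypothetical.

```python
import csv
import io
import sqlite3


def export_table_to_csv(conn: sqlite3.Connection, table: str, out) -> int:
    """Write every row of `table` to `out` as CSV with a header row; return the row count."""
    # Table names cannot be bound as SQL parameters, so only pass trusted names here.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


# Hypothetical in-memory database standing in for a project database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survey (participant_id TEXT, age INTEGER)")
conn.executemany("INSERT INTO survey VALUES (?, ?)", [("001", 34), ("002", 41)])

buffer = io.StringIO()
rows_written = export_table_to_csv(conn, "survey", buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file rather than an in-memory buffer, open it with `open(path, "w", newline="", encoding="utf-8")` so that the output uses a standard character encoding.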

For more information on recommended formats, see the UK Data Service guidance on recommended formats.

Library of Congress Recommended Formats Statement: The Library of Congress identified preferred and acceptable file formats for textual works and musical compositions, still image works, audio works, moving image works, software and electronic gaming and learning, datasets/databases and websites.

UK Data Service Recommended Formats: Guidance on file formats recommended and accepted by the UK Data Service for data sharing, reuse and preservation.

UCD Digital Library Preferred Formats for Data: Preferred formats identified by the UCD Digital Library and Repository which facilitate processing, storage, and dissemination of data, assuring both usability and longer-term durability of the data.

In the Data Management Plan, when you are asked to describe the data volume, you are describing how much data your project is likely to produce. Some research (e.g. research using wearable devices) will produce large amounts of data over an extended period and will require a lot of storage space. Other research will produce a relatively small amount of data, in terms of bytes, and the cloud storage provided by RCSI will be more than sufficient. There is more on cloud storage at RCSI later in this LibGuide.

In the DMP you are planning how much data you expect to be working with, and what storage capacity you’re going to need while the project is active (how big is the storage capacity for projects at RCSI?), and when the project concludes (how much data does the data repository accept?).


Some questions on data volume to consider (from DMPTool Guidance):

  • Are you manually collecting and recording data?
  • Are you using observational instruments and computers to collect data?
  • Is your data collection highly iterative?
  • How much data will you accumulate every month or every 90 days?
  • How much data do you anticipate collecting and generating by the end of your project?
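
The last two questions above amount to simple arithmetic: monthly accumulation multiplied by project length, and by the number of copies you keep. The sketch below shows this calculation. The function name and all the figures are hypothetical and should be replaced with your own estimates.

```python
def projected_volume_gb(gb_per_month: float, months: int, copies: int = 1) -> float:
    """Project total storage: monthly accumulation x project length x number of copies kept."""
    return gb_per_month * months * copies


# Hypothetical figures: 2.5 GB of new data a month over a 36-month project,
# kept as a working copy plus one backup.
total = projected_volume_gb(2.5, 36, copies=2)
print(f"{total:.0f} GB")  # 180 GB
```

An estimate like this only needs to be accurate enough to tell you whether the storage available during the project, and the limits of your chosen repository, will be sufficient.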

You might find it helpful to present this information about data type, format and volume in your DMP by using a table. This can help to really clarify the various data that you will be working with, and you can add as much information (columns) as you need to help you with your planning. Below is an example based on a real study at RCSI. 

Presenting the data description using a table
| Data type | Method | Data format | Data volume |
|---|---|---|---|
| Demographic data | Patient survey | Tabular data stored as Excel (.xls) and open format CSV (.csv) | <200GB (~90 variables x 300 participants x 1 data collection wave) |
| Developmental questionnaires | Standardised tests by hospital staff | Tabular data stored as Excel (.xls) and open format CSV (.csv) | <100GB (~10 variables x 300 participants x 2 data collection waves) |
| Clinical data | Clinical review | Tabular data stored as Excel (.xls) and open format CSV (.csv); text data stored as plain text files (.txt) | <100GB (25 variables x 300 participants x 2 data collection waves) |
| Qualitative data | Interviews with patients, care givers and staff | Audio data stored as .wav files; text data stored as plain text files (.txt) | <200GB (20 interviews x 1 hour in length) |
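
As a rough check on the audio figure in the example above (20 interviews of one hour each, stored as .wav), the size of uncompressed audio can be estimated from its recording settings. The settings used here (44.1 kHz, 16-bit, stereo) are assumptions made for illustration and are not taken from the study.

```python
def wav_size_bytes(seconds: float, sample_rate: int = 44_100,
                   bit_depth: int = 16, channels: int = 2) -> int:
    """Approximate size of uncompressed PCM audio (ignores the small file header)."""
    return int(seconds * sample_rate * (bit_depth // 8) * channels)


per_interview = wav_size_bytes(60 * 60)   # one hour of audio
total = 20 * per_interview                # 20 interviews
print(f"{per_interview / 1e9:.2f} GB per interview, {total / 1e9:.1f} GB in total")
# 0.64 GB per interview, 12.7 GB in total
```

Under these assumed settings the interviews come to roughly 13 GB, well inside the <200GB allowance in the table. Higher sample rates or bit depths would increase the figure proportionally.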

 

How will data be collected or produced?

According to the guidance from DMPTool, data can be grouped into four main categories, and these categories will affect the choices that you make throughout your data management plan.

1. Observational data

  • Captured in real-time, typically outside the lab
  • Usually irreplaceable and therefore the most important to safeguard
  • Examples: Sensor readings, telemetry, survey results, images

2. Experimental data

  • Typically generated in the lab or under controlled conditions
  • Often reproducible, but can be expensive or time-consuming
  • Examples: gene sequences, chromatograms, magnetic field readings

3. Simulation data

  • Machine generated from test models
  • Likely to be reproducible if the model and inputs are preserved
  • Examples: climate models, economic models

4. Derived / Compiled data

  • Generated from existing datasets
  • Reproducible, but can be very expensive and time-consuming
  • Examples: text and data mining, compiled database, 3D models

Contemporary research is often collaborative, and reusing existing research data has become common practice in many disciplines. Reuse is convenient and cost-effective, but it is essential that you take the time to familiarise yourself with the data and check the accompanying documentation for collection procedures, data cleaning procedures, usage licences and other technical information, to make sure the data are suitable for your research. Reusing existing research data does not exempt them from GDPR or from any other relevant regulatory and ethical policies that researchers must comply with.


If you plan to reuse data that you collected for a previous project, or to access or purchase data from a data repository or other source, you should explain in the DMP how issues such as ethics, copyright and intellectual property rights (IPR) have been addressed. Are there any restrictions on how you may reuse these data?