Research Data Management

How to describe your data

When you begin your DMP, the first thing you should provide is a description of the data you intend to use for your research project. This section is generally composed of two distinct but related questions: 

  1. How will new data be collected or produced and/or how will existing data be re-used? i.e. where are the data that you are using for this study coming from?
  2. What data will be collected or produced? i.e. what type of data are they, and what are the format and volume of these data?

1. How will new data be collected, how will existing data be re-used?

The National Library of Medicine define 'data collection' as: 

the process of gathering data, typically in the context of a research project or for ongoing surveillance and tracking. This involves measuring information on variables of interest in a pre-established, systematic way that enables researchers to address research questions, test hypotheses, and evaluate outcomes.

In this section you should briefly describe the source of data that you are using in your study. Examples of data collection include: 

  • Data captured using data collection instruments such as surveys, wearable devices, standardised scales.
  • Data generated by the processing of biological materials, such as blood and tissue samples, that were collected as part of routine care.
  • Text, audio or video data captured by observational or interview methodologies.
  • Images captured by microscopes, and other types of imaging hardware.

You should give clear details of how new data will be collected or produced. You can describe the methods and software that you will use to generate, capture or extract data. 

 

Sample answer 1: 

A digital questionnaire will be distributed using MS Forms to a sample of 600 people who are engaging with the service, at baseline and again 3 months after they complete treatment. Data will also be captured from a sub-sample of 200 people via a wearable device for the duration of their treatment (3 months).

Sample answer 2:

The study will involve conducting 20-25 semi-structured interviews with young adults (aged 16-24) who are participating in the [name] programme. Interviews will be conducted face to face and recorded for transcription using digital audio recording where interviewees permit. Where they do not, interviews will be undertaken in pairs to enable detailed note-taking. Interview recordings/notes will be transcribed by the research assistant for analysis.

Contemporary research is often collaborative, and using pre-existing research data has become common practice in many disciplines. Although re-use is convenient and cost effective, when using pre-existing data it is essential that you take the time to familiarise yourself with the data, and to check the accompanying documentation to make sure the data are suitable for your research. This includes documentation on how the data were collected and cleaned, under what conditions the data can be used (usage licenses) and other technical information.

In this section you should briefly describe the source of data that you are re-using in your study.

Examples of data re-use include: 

  1. Data that were collected for the provision of health care, that you are re-using for research purposes e.g., patient data from ambulatory or hospital medical records.
  2. Data that were captured or generated by an earlier study and which will be re-used by the current study.
  3. Data accessed from a national or international data repository or archive, or data service, e.g., open government data.
  4. Data from several datasets that have been compiled to create a new database or 3D model.

You should give details of how the data were collected or produced by the original study or data source. This should include information on the methods and software that were used to generate, capture or extract the data. Using pre-existing research data does not make them exempt from GDPR and other relevant regulatory and ethical requirements. If you plan to re-use data that you collected for a previous project, or access data from a hospital record system, or download data from a data repository, you should explain in the DMP how issues such as ethics, copyright and IPR have been addressed. Consider whether there are any restrictions on how you may use these pre-existing data to answer new research questions.

 

Sample answer:

The study will re-use data that have been extracted from hospital records via our clinical partner Dr Smith who runs the clinical department at St James' hospital in Dublin. These data are captured as part of routine care at the clinic. The team has been granted access by St James' hospital to use an anonymised version of these data for the current study, but may not share or reuse these data in any additional research unless explicitly permitted by St James' hospital. 

2. What data will be collected or produced?

 

Now that you've described where the data you are using for your study will be sourced from, the next step is to describe the attributes of that data. A good approach is to describe the type, format and volume of each type of data that you will be working with. 

You may find it clearer to present this information as a table in your DMP; there is an example of how this might look on the following tab.


Data type: You should clearly define the type of data that you are going to generate or work with. Research data can take many forms, and are often discipline-specific, but at a basic level, research data can be described as any information that has been collected, observed, generated or created to validate original research findings. 

Common examples of research data include measurements, experimental results, fieldwork, observations, interview recordings and images. Although usually digital, research data also includes non-digital formats such as hand-written laboratory notes. 

 

What type of data are you using in your study?

  • Text - such as field or laboratory notes, survey responses, transcription of interviews
  • Numeric or tabular (data in rows and columns) - such as tables, counts, measurements
  • Audiovisual (audio / video recordings) - such as sound recordings or videos of qualitative interviews, observations 
  • Images from microscopes, scans, ultrasound, x-ray or other equipment that generate images 
  • Models, computer code
  • Discipline-specific - such as CIF in chemistry
  • Instrument-specific - such as equipment outputs (source: DMPTool)

 

Data format: For each type of data you are using, list the file format in which the data will be collected and stored during the research process. The file format is usually reflected by the filename extension (such as .doc or .pdf). Not all data formats are electronic, and if you must hold some data in paper copy then your format will be listed as paper or hard copy, but of course this will limit the accessibility of the data. 

While open formats are recommended to ensure long-term usability of data, it is acceptable for you to store data in a proprietary format during the research process, if you require the functionality of that format. In the DMP, you should explain why you have chosen certain formats. The choice can be based on staff expertise, on a certain software tool being widely used within your community for this type of data, or on the format being the one accepted by the repository where you will store the data in the long term.

 

Below are some examples of common file formats for data:

Spreadsheets stored as:

  • comma-separated values file (.csv)
  • Excel file (.xls, .xlsx)
  • OpenDocument Spreadsheet file (.ods)

Text files stored as: 

  • plain text file (.txt)
  • PDF/A file (.pdf)
  • Word file (.docx)
  • XML file (.xml)
  • HTML file (.html)

Image files stored as: 

  • TIFF file (.tif)
  • JPEG file (.jpg)
  • Portable Network Graphics file (.png)

Video / audio stored as: 

  • Material Exchange Format file (.mxf)
  • FLAC file (.flac)
  • MPEG-4 file (.mp4)

Reports or images stored as: 

  • PDF/A file
  • PDF/X file

At the end of your study you should consider converting the data files to a more accessibility-friendly format that can be opened and used without the need of a specific software license. If you store your data in a preservation-recommended format, it will save you time at the end of the study in converting your files for long-term access. This topic is discussed in more detail in the 'Preserving Research Data' section of this guide. 

Formats for preserving text data

  • Plain text (.txt) format is generally the preferred format for both accessibility and preservation
  • Formatted text (such as .docx or PDF files, or well-structured HTML rendered in a browser) is usually accepted by data repositories for preservation.

Formats for preserving tabular data

  • Comma- or Tab-Separated Value (CSV/TSV) formats are generally preferred by data repositories.
  • Any proprietary format that is a de facto standard for a profession or supported by multiple tools (e.g. Excel .xls or .xlsx, Shapefile) is usually accepted by data repositories for preservation.

Formats for preserving image data: 

  • JPEG Image Encoding family (.jpeg, .jpg)
  • TIFF (.tiff, .tif)
  • Portable Network Graphics (.png)
  • Scalable Vector Graphics (.svg)

Formats for preserving audio data: 

  • Material Exchange Format (.mxf)
  • FLAC (.flac)

Format for preserving video data: 

  • Material Exchange Format (.mxf)
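The lists above can be turned into a simple lookup for checking the files you plan to deposit. The following is a minimal Python sketch, not an official tool: the table covers only a subset of the extensions listed above, and the function name `preservation_status` and the status labels are illustrative assumptions. The guide distinguishes "preferred" from "accepted" only for text and tabular data, so image, audio and video formats are simply marked "listed".

```python
from pathlib import Path

# Extension -> (data type, status). A subset of the formats named in the
# lists above; the status labels are an assumption made for this sketch.
PRESERVATION_FORMATS = {
    ".txt": ("text", "preferred"),
    ".docx": ("text", "accepted"),
    ".pdf": ("text", "accepted"),
    ".html": ("text", "accepted"),
    ".csv": ("tabular", "preferred"),
    ".tsv": ("tabular", "preferred"),
    ".xls": ("tabular", "accepted"),
    ".xlsx": ("tabular", "accepted"),
    ".jpg": ("image", "listed"),
    ".jpeg": ("image", "listed"),
    ".tif": ("image", "listed"),
    ".tiff": ("image", "listed"),
    ".png": ("image", "listed"),
    ".svg": ("image", "listed"),
    ".flac": ("audio", "listed"),
    ".mxf": ("audio/video", "listed"),
}


def preservation_status(filename: str) -> tuple[str, str]:
    """Look up a file's data type and preservation status by its extension."""
    suffix = Path(filename).suffix.lower()  # ".TIF" and ".tif" are treated alike
    return PRESERVATION_FORMATS.get(suffix, ("unknown", "check with repository"))


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for name in ["interview_01.txt", "survey.xlsx", "scan_001.TIF", "results.sav"]:
        print(name, preservation_status(name))
```

Any file reported as "unknown" is a prompt to check the guidance of the repository you intend to use, not a sign that the format is unsuitable.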

 

4TU.ResearchData have provided a 1-page guide on preferred file formats for long-term access to files, which is a useful reference point. For more detailed and updated guidance, visit the Library of Congress Recommended Formats Statement. Every year the LOC release a statement of recommendations for digital file formats, guiding the community on which text, tabular and audiovisual file formats are preferred and which are accepted for long-term access to the contents. There are further resources at the end of this page.


 

Data volume: For each type of data you are working with, you should give a clear estimate of the volume of that data in the DMP. This will help you to plan how much storage capacity you’re going to need while the project is active, and how much storage capacity you'll need to preserve the data at the end of the study. 

Some research (e.g. research using wearable devices) will produce lots of data over an extended period, and will require a lot of storage space. Other research will produce a relatively small amount of data, in terms of bytes, and the cloud storage provided by RCSI will be more than sufficient to store this volume of data. 


In general, data volume should be expressed in units of bytes, such as:

  • megabytes (MB)
  • gigabytes (GB)
  • terabytes (TB)

If you cannot express the volume in terms of bytes, alternative ways to express data volume include: 

  • Audio recordings: Number of expected interviews x average length of interview
  • Tabular files: Number of columns (variables) x rows (respondents/cases) 
  • Survey data: Number of survey questions or variables x expected sample size
  • Image data: Number of images x average size of an image produced by the imaging software
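These count-based descriptions can be converted into a rough byte estimate with simple multiplication. The Python sketch below illustrates the arithmetic only; the recording settings (uncompressed WAV at 44.1 kHz, 16-bit, stereo), the default of 10 bytes per table cell and the example figures are assumptions, not values from this guide, so substitute the settings of your own equipment and software.

```python
# Assumed recording settings (not from the guide): uncompressed WAV,
# 44.1 kHz sample rate, 16-bit (2-byte) samples, 2 channels.
SAMPLE_RATE_HZ = 44_100
BYTES_PER_SAMPLE = 2
CHANNELS = 2


def audio_bytes(n_interviews: int, minutes_each: float) -> int:
    """Number of interviews x average length, converted to bytes."""
    seconds = n_interviews * minutes_each * 60
    return int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


def tabular_bytes(n_rows: int, n_columns: int, bytes_per_cell: int = 10) -> int:
    """Rows x columns x an assumed average number of bytes per cell."""
    return n_rows * n_columns * bytes_per_cell


def image_bytes(n_images: int, average_image_bytes: int) -> int:
    """Number of images x average size of one image."""
    return n_images * average_image_bytes


def human_readable(n_bytes: float) -> str:
    """Format a byte count using decimal units (1 GB = 1000 MB)."""
    for unit in ("bytes", "KB", "MB", "GB"):
        if n_bytes < 1000:
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000
    return f"{n_bytes:.1f} TB"


if __name__ == "__main__":
    # Illustrative figures only.
    print(human_readable(audio_bytes(25, 60)))            # 25 one-hour interviews
    print(human_readable(tabular_bytes(600, 40)))         # 600 respondents, 40 variables
    print(human_readable(image_bytes(2_000, 5_000_000)))  # 2,000 images of ~5 MB each
```

Estimates like these are deliberately coarse. Their purpose is to show whether a project needs megabytes, gigabytes or terabytes of storage, which is usually enough for planning.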

You might find it helpful to present this information about type, format and volume in your DMP by using a table. This can help to really clarify the various data that you will be working with, and you can add as much information (columns) as you need to help you with your planning. 

 

Sample of how to represent data type, format and volume as a single table

Data source | Type | Format | Volume
Demographic data from survey of patients | Tabular data | MS Excel (.xls) | ~50GB
Images from experiments on cells | Image data | JPEG (.jpg) | 523GB
Clinical data from medical records | Tabular data | CSV (.csv) | 83GB
Qualitative data from interviews (recordings and transcripts) | Audio, Text data | WAV (.wav), Plain text (.txt) | ~200GB (10 interviews x 60mins)
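If some of your data already exist on disk, the format and volume columns of such a table can be filled in by tallying files by extension. The following Python sketch shows one possible approach using only the standard library; the function name `inventory` is illustrative, and the folder to scan is taken from the command line (defaulting to the current directory).

```python
import sys
from collections import defaultdict
from pathlib import Path


def inventory(folder: str) -> dict[str, tuple[int, int]]:
    """Return {extension: (file count, total bytes)} for all files under folder."""
    counts: dict[str, int] = defaultdict(int)
    sizes: dict[str, int] = defaultdict(int)
    for path in Path(folder).rglob("*"):  # walks sub-folders as well
        if path.is_file():
            ext = path.suffix.lower() or "(no extension)"
            counts[ext] += 1
            sizes[ext] += path.stat().st_size
    return {ext: (counts[ext], sizes[ext]) for ext in sorted(counts)}


if __name__ == "__main__":
    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    for ext, (n_files, n_bytes) in inventory(folder).items():
        print(f"{ext}\t{n_files} files\t{n_bytes / 1_000_000:.1f} MB")
```

The output reports what is stored now; volumes for data still to be collected need to be estimated separately.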

 

Further resources

File formats

  • Medical image file formats by Larobina and Murino (2014). This article presents a demystifying overview of the major file formats currently used in medical imaging.
  • 5star Open Data
    Clear and simple explanation of preferred formats for sharing files, and of why PDF is less preferable than Excel, which is less preferable than CSV, and so on.

  • Preferred File Formats from 4TU.ResearchData in the NL
    One-page list of recommended formats to ensure that research data remain usable into the future (published 2023)
