When you begin your DMP, the first thing you should provide is a description of the data you intend to use for your research project. This section is generally composed of two distinct but related questions:
The National Library of Medicine defines 'data collection' as:
the process of gathering data, typically in the context of a research project or for ongoing surveillance and tracking. This involves measuring information on variables of interest in a pre-established, systematic way that enables researchers to address research questions, test hypotheses, and evaluate outcomes.
In this section you should briefly describe the source of data that you are using in your study. Examples of data collection include:
You should give clear details of how new data will be collected or produced. You can describe the methods and software that you will use to generate, capture or extract data.
Sample answer 1:
A digital questionnaire will be distributed using MS Forms to a sample of 600 people who are engaging with the service, at baseline and again 3 months after they complete treatment. Data will also be captured from a sub-sample of 200 people via a wearable device for the duration of their treatment (3 months).
Sample answer 2:
The study will involve conducting 20-25 semi-structured interviews with young adults (aged 16-24) who are participating in [name] programme. Interviews will be conducted face to face and, where interviewees permit, recorded for transcription using digital audio recording. Where they do not, interviews will be undertaken in pairs to enable detailed note-taking. Interview recordings/notes will be transcribed by the research assistant for analysis.
Contemporary research is often collaborative, and using pre-existing research data has become common practice in many disciplines. Although re-using data is convenient and cost-effective, it is essential that you take the time to familiarise yourself with the data, and to check the accompanying documentation to make sure the data are suitable for your research. This includes documentation on how the data were collected and cleaned, under what conditions the data can be used (usage licenses) and other technical information.
In this section you should briefly describe the source of data that you are re-using in your study.
Examples of data re-use include:
You should give details of how the data were collected or produced by the original study or data source. This should include information on the methods and software that were used to generate, capture or extract the data. Using pre-existing research data does not make them exempt from GDPR and other relevant regulatory and ethical requirements. If you plan to re-use data that you collected for a previous project, or access data from a hospital record system, or download data from a data repository, you should explain in the DMP how issues such as ethics, copyright and IPR have been addressed. Consider whether there are any restrictions on how you may use these pre-existing data to answer new research questions.
Sample answer:
The study will re-use data that have been extracted from hospital records via our clinical partner Dr Smith who runs the clinical department at St James' hospital in Dublin. These data are captured as part of routine care at the clinic. The team has been granted access by St James' hospital to use an anonymised version of these data for the current study, but may not share or reuse these data in any additional research unless explicitly permitted by St James' hospital.
Now that you've described where the data you are using for your study will be sourced from, the next step is to describe the attributes of that data. A good approach is to describe the type, format and volume of each type of data that you will be working with.
You may find it clearer to present this information as a table in your DMP; an example of how this might look is given in the table later in this section.
Data type: You should clearly define the type of data that you are going to generate or work with. Research data can take many forms, and are often discipline-specific, but at a basic level, research data can be described as any information that has been collected, observed, generated or created to validate original research findings.
Common examples of research data include measurements, experimental results, fieldwork, observations, interview recordings and images. Although usually digital, research data also include non-digital formats such as hand-written laboratory notes.
What type of data are you using in your study?
Data format: For each type of data you are using, list the file format in which the data will be collected and stored during the research process. The file format is usually reflected by the filename extension (such as .doc or .pdf). Not all data formats are electronic, and if you must hold some data in paper copy then your format will be listed as paper or hard copy, but of course this will limit the accessibility of the data.
While open formats are recommended to ensure long-term usability of data, it is acceptable for you to store data in a proprietary format during the research process, if you require the functionality of that format. In the DMP, you should explain why you have chosen certain formats. Your reasons might include staff expertise, the fact that a certain software tool is widely used within your community for this type of data, or the fact that it is the format accepted by the repository where you will store the data in the long term.
Below are some examples of common file formats for data:
Spreadsheets stored as:
- comma-separated values file (.csv)
- Excel file (.xls, .xlsx)
- OpenDocument Spreadsheet file (.ods)
Text files stored as:
- plain text file (.txt)
- PDF/A file (.pdf)
- Word file (.docx)
- XML file (.xml)
- HTML file (.html)
Image files stored as:
- TIFF file (.tif)
- JPEG file (.jpg)
- Portable Network Graphics file (.png)
Video / audio stored as:
- Material Exchange Format file (.mxf)
- FLAC file (.flac)
- MPEG-4 file (.mp4)
Reports or images stored as:
- PDF/A file
- PDF/X file
At the end of your study you should consider converting the data files to a more accessible format that can be opened and used without the need for a specific software license. If you store your data in a preservation-recommended format from the outset, it will save you time at the end of the study, because you will not need to convert your files for long-term access. This topic is discussed in more detail in the 'Preserving Research Data' section of this guide.
Formats for preserving text data:
Formats for preserving tabular data:
Formats for preserving image data:
Formats for preserving audio data:
Formats for preserving video data:
4TU.ResearchData have provided a 1-page guide on preferred file formats for long-term access to files, which is a useful reference point. For more detailed and updated guidance, visit the Library of Congress Recommended Formats Statement. Every year the LOC releases a statement of recommendations for digital file formats, guiding the community on which text, tabular and audiovisual file formats are preferred and which are accepted for long-term access to the contents. There are further resources at the end of this page.
Data volume: For each type of data you are working with, you should give a clear estimate of the volume of that data in the DMP. This will help you to plan how much storage capacity you’re going to need while the project is active, and how much storage capacity you'll need to preserve the data at the end of the study.
Some research (e.g. research using wearable devices) will produce lots of data over an extended period, and will require a lot of storage space. Other research will produce a relatively small amount of data, in terms of bytes, and the cloud storage provided by RCSI will be more than sufficient to store this volume of data.
In general, data volume should be expressed in bytes such as:
If you cannot express the volume in terms of bytes, alternative ways to express data volume include:
Data source | Type | Format | Volume |
---|---|---|---|
Demographic data from survey of patients | Tabular data | MS Excel (.xls) | ~50GB |
Images from experiments on cells | Image data | JPEG (.jpg) | 523GB |
Clinical data from medical records | Tabular data | CSV (.csv) | 83GB |
Qualitative data from interviews - recordings and transcripts | Audio, Text data | WAV (.wav), Plain text (.txt) | ~200GB (10 interviews x 60mins) |
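If you already hold some of the data, a table like the one above could be drafted from an existing folder. The following is a minimal Python sketch, not part of the original guidance: the function names `summarise_formats` and `human_readable` and the folder name `project_data` are illustrative assumptions, and decimal units (1 KB = 1000 bytes) are assumed for the volume figures.

```python
from collections import defaultdict
from pathlib import Path


def summarise_formats(root):
    """Return {extension: (file_count, total_bytes)} for all files under root."""
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Group by lower-cased filename extension, e.g. ".csv".
            ext = path.suffix.lower() or "(no extension)"
            counts[ext] += 1
            sizes[ext] += path.stat().st_size
    return {ext: (counts[ext], sizes[ext]) for ext in sorted(counts)}


def human_readable(num_bytes):
    """Format a byte count using decimal units (assumption: 1 KB = 1000 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1000:
            return f"{size:.1f} {unit}"
        size /= 1000
    return f"{size:.1f} TB"


if __name__ == "__main__":
    # "project_data" is a placeholder; point this at your own data folder.
    for ext, (count, total) in summarise_formats("project_data").items():
        print(f"{ext}\t{count} files\t{human_readable(total)}")
```

The output gives one line per file format with a file count and total volume, which you can then match to the data sources in your DMP. It only measures data that already exist, so volumes for data yet to be collected still need to be estimated.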
Data description
File formats
5star Open Data
A clear and simple explanation of preferred formats for sharing files, and of why PDF is less preferable than Excel, which in turn is less preferable than CSV, and so on.
Preferred File Formats from 4TU.ResearchData in the NL
One-page list of recommended formats to ensure that research data remain usable into the future (published 2023)
File volume
Image file size calculator from Omni: use this converter to estimate the size of an image file as you adjust the on-screen image size (in pixels), bit depth (8 bits per byte) and printed dots per inch (dpi).
This image file size calculator from Northern Arizona University will help you estimate the file size of an uncompressed raster image file, provided that you know the image's resolution and its bit depth.
This audio file size calculator from Omni will help you estimate how much space an uncompressed audio file will take up on your computer's storage.
This video file size calculator from Omni helps you estimate how much space a video takes up on your disk.
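The arithmetic behind calculators like those listed above can be sketched in a few lines of Python for uncompressed image and audio files. This is an illustrative sketch only: the function names `image_bytes` and `audio_bytes` and the example values are assumptions, file headers and metadata are ignored, and compressed formats (such as JPEG or MP4) will usually be much smaller than these figures.

```python
def image_bytes(width_px, height_px, bits_per_pixel):
    """Approximate size in bytes of an uncompressed raster image."""
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


def audio_bytes(duration_s, sample_rate_hz, bits_per_sample, channels):
    """Approximate size in bytes of an uncompressed (PCM) audio recording."""
    total_bits = duration_s * sample_rate_hz * bits_per_sample * channels
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    # Example values are assumptions chosen for illustration.
    # A 4000 x 3000 pixel image at 24 bits per pixel:
    print(image_bytes(4000, 3000, 24))       # 36000000 bytes (36 MB)
    # A 60-minute mono recording at 44.1 kHz, 16 bits per sample:
    print(audio_bytes(3600, 44100, 16, 1))   # 317520000 bytes (about 318 MB)
```

Multiplying a per-file estimate by the expected number of files gives a rough total volume for that data type in your DMP.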
Advanced tools for format validation
OPF maintains a number of open source digital preservation tools that address common digital preservation challenges.