In the EU, "‘research data’ means documents in a digital form, other than scientific publications, which are collected or produced in the course of scientific research activities and are used as evidence in the research process, or are commonly accepted in the research community as necessary to validate research findings and results" (Article 2(9), Directive (EU) 2019/1024 on open data and the re-use of public sector information).
According to Science Europe, when developing a data management plan, the first area you should address is "Data description and collection or re-use of existing data", which broadly encompasses two main questions:
1. What data will be collected or produced?
2. How will new data be collected or produced and/or how will existing data be re-used?
In this section of the Data Management Plan you should give a summary of the data you will collect or create, noting their content, coverage and type. According to the guidance from DMPTool, data can come in many forms, including:
Why do you need to think about file formats? When choosing an electronic file format to create and store data, it's important to consider whether the format is open and/or ubiquitous. The format you use determines how accessible the data are to other users, as some files can only be opened with a license for the software that created them. File formats also determine how accessible the data will be to yourself and others in the future: technology evolves quickly, and the software that you use today will become obsolete in time.
Why use open file formats? File formats that are open or non-proprietary will tend to remain accessible, even if the software that created them is no longer available. However, formats which are ubiquitous or have become the default standard within a discipline, whether proprietary or not, are also likely to be maintained into the future.
What if you have preferred software? If you find it necessary or convenient to work with a proprietary format, it may be useful to store your data in that format for data collection and analysis, while also storing a copy in an open or accessible format for sharing or archiving once your project is complete.
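As a rough illustration of keeping an open copy alongside a proprietary working file, the sketch below lists proprietary files that have no same-named copy in an open format. The extension mapping and the file names are hypothetical examples, not a list recommended by DMPTool or any repository; adjust them to the formats your discipline and repository prefer.

```python
from pathlib import PurePath

# Hypothetical mapping from a proprietary working format to an open
# format for sharing or archiving. This is an assumption for
# illustration; check your repository's recommended formats.
OPEN_EQUIVALENT = {".xls": ".csv", ".xlsx": ".csv"}


def missing_open_copies(filenames):
    """Return proprietary files that lack a same-named open-format copy."""
    paths = [PurePath(name) for name in sorted(filenames)]
    # Record every (base name, extension) pair that is present.
    present = {(p.stem, p.suffix.lower()) for p in paths}
    missing = []
    for p in paths:
        open_ext = OPEN_EQUIVALENT.get(p.suffix.lower())
        if open_ext and (p.stem, open_ext) not in present:
            missing.append(p.name)
    return missing


# Hypothetical project files.
files = ["demographics.xls", "demographics.csv", "clinical.xls", "notes.txt"]
print(missing_open_copies(files))  # → ['clinical.xls']
```

A check like this only confirms that a copy exists; it does not verify that the open copy has the same content as the working file.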
Which format is best for FAIR data? Many data archives and repositories will already have recommended file formats based on best practice within the disciplines they support.
When choosing a file format you should consider the following:
File formats likely to be accessible into the future (from DMPTool Guidance):
Examples of preferred format choices (from DMPTool Guidance):
For more information on recommended formats, see the UK Data Service guidance on recommended formats.
Library of Congress Recommended Formats Statement: The Library of Congress identified preferred and acceptable file formats for textual works and musical compositions, still image works, audio works, moving image works, software and electronic gaming and learning, datasets/databases and websites.
UK Data Service Recommended Formats: Guidance on file formats recommended and accepted by the UK Data Service for data sharing, reuse and preservation.
UCD Digital Library Preferred Formats for Data: Preferred formats identified by the UCD Digital Library and Repository which facilitate processing, storage, and dissemination of data, assuring both usability and longer-term durability of the data.
In the Data Management Plan, when you are asked to describe the data volume, you are describing how much data your project is likely to produce. Some research (e.g. research using wearable devices) will produce a large amount of data over an extended period, and will require a lot of storage space. Other research will produce a relatively small amount of data, in terms of bytes, and the cloud storage provided by RCSI will be more than sufficient. There is more on cloud storage at RCSI later in this LibGuide.
In the DMP you are planning how much data you expect to be working with, and what storage capacity you’re going to need while the project is active (how big is the storage capacity for projects at RCSI?), and when the project concludes (how much data does the data repository accept?).
Some questions on data volume to consider (from DMPTool Guidance):
You might find it helpful to present this information about data type, format and volume in your DMP by using a table. This can help to really clarify the various data that you will be working with, and you can add as much information (columns) as you need to help you with your planning. Below is an example based on a real study at RCSI.
Data type | Method | Data format | Data volume |
---|---|---|---|
Demographic data | Patient survey | Tabular data stored as Excel (.xls) and open format CSV (.csv) | <200GB (~90 variables x 300 participants x 1 data collection wave) |
Developmental questionnaires | Standardised tests by hospital staff | Tabular data stored as Excel (.xls) and open format CSV (.csv) | <100GB (~10 variables x 300 participants x 2 data collection waves) |
Clinical data | Clinical review | Tabular data stored as Excel (.xls) and open format CSV (.csv); text data stored as plain text files (.txt) | <100GB (25 variables x 300 participants x 2 data collection waves) |
Qualitative data | Interviews with patients, care givers and staff | Audio data stored as .wav files; text data stored as plain text files (.txt) | <200GB (20 interviews x 1 hour in length) |
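To arrive at volume figures like those in the table above, a back-of-the-envelope calculation is usually enough. The sketch below is one way to do it, not the method used in the RCSI study: the recording settings (44.1 kHz, 16-bit, stereo) and the 20 bytes per tabular value are assumptions chosen for illustration, so substitute the settings of your own equipment and files.

```python
GB = 10**9  # decimal gigabyte
MB = 10**6  # decimal megabyte


def wav_bytes(hours, sample_rate=44_100, bit_depth=16, channels=2):
    """Approximate size of uncompressed PCM audio (file header ignored).

    The default recording settings are assumptions, not study values.
    """
    return int(hours * 3600 * sample_rate * (bit_depth // 8) * channels)


def tabular_bytes(variables, participants, waves, bytes_per_value=20):
    """Rough size of a table, assuming ~20 bytes per stored value."""
    return variables * participants * waves * bytes_per_value


# 20 interviews x 1 hour each, as in the qualitative data row.
interviews = 20 * wav_bytes(1)
# ~90 variables x 300 participants x 1 wave, as in the demographic row.
demographic = tabular_bytes(90, 300, 1)

print(f"Interviews:  {interviews / GB:.1f} GB")   # → Interviews:  12.7 GB
print(f"Demographic: {demographic / MB:.2f} MB")  # → Demographic: 0.54 MB
```

Estimates like these sit well below the upper bounds quoted in the table, which is expected: a DMP figure normally leaves headroom for working copies, backups and derived files.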
According to the guidance from DMPTool, data can be grouped into four main categories, and these categories will affect the choices that you make throughout your data management plan.
1. Observational data
2. Experimental data
3. Simulation data
4. Derived / compiled data
Contemporary research is often collaborative, and reusing existing research data has become common practice in many disciplines. Although reuse is convenient and cost effective, it is essential that you take the time to familiarise yourself with the data and check the accompanying documentation for collection procedures, data cleaning procedures, usage licenses and other technical information to make sure the data are suitable for your research. Reusing existing research data also does not exempt them from GDPR or from any other relevant regulatory and ethical policies that researchers must comply with.
If you plan to reuse data that you collected for a previous project, or data that you will access or purchase from a data repository or other source, you should explain in the DMP how issues such as ethics, copyright and IPR have been addressed. Are there any restrictions on how you may reuse these data?