At the outset of your study, even before any data collection has taken place, you should be planning what will happen to the data on completion of the study. For this reason, the DMP should address the following questions:
RCSI recognizes research data as a valuable institutional asset. Research data must be retained and disposed of securely according to the relevant retention and disposal schedule, in accordance with legal, ethical and research funder requirements, and with particular concern for the confidentiality and security of the data.
RCSI policy on preserving research data: Researchers are responsible for providing access to research data requested by third parties as freely and timely as possible, unless access to the data is restricted for legitimate reasons, which should be stated in the metadata description or research article.
View the RCSI Research Data Management Policy in full.
Who is responsible for preserving the research data from a project? The RCSI Research Data Management Policy applies to all college members engaged in research, including staff and research students, and those who are conducting research on behalf of the College, irrespective of funding. Researchers have the primary responsibility for ensuring research data will be managed in line with funder requirements as well as College policy and other relevant regulations and legislation.
Research data that underpins published results or is considered to have long-term value should be retained, subject to informed consent to do so, where relevant. The current RCSI REC guideline is that research data should be retained for 5-7 years and then destroyed. However, this retention time could be significantly less or more depending on the nature of the study being conducted.
The RCSI Research Data Management Policy states that, in the absence of other provisions, the default period for research data retention is 10 years from the date of last requested access. Retained data must also be deposited in an appropriate, reputable national or international data repository.
However, it is often advisable to retain research data/records for a longer period depending on the nature of the study and the data collected. For example, the Medical Research Council (UK) recommends the following retention schedule for various study designs.
- For basic research: Research data and related material should be retained for a minimum of 10 years after the study has been completed.
- For population health and clinical studies: Research data should be retained for 20 years after the study has been completed.
- For clinical studies: In some cases, such as for clinical studies involving pregnant participants and those who lack capacity to consent, it has been recommended that a minimum of 25 years may be more appropriate for data retention.
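As a rough illustration of the arithmetic behind these schedules, the sketch below computes the earliest date on which data could be disposed of, given the study completion date and a minimum retention period in years. The dictionary keys and function names are hypothetical labels chosen for this example, and the periods are the MRC minimums quoted above. Your own study may call for a different period under REC, funder or RCSI policy requirements.

```python
from datetime import date

# Minimum retention periods in years, copied from the MRC examples above.
# The keys are hypothetical labels invented for this sketch.
MIN_RETENTION_YEARS = {
    "basic": 10,
    "population_health_clinical": 20,
    "clinical_extended": 25,
}


def add_years(start: date, years: int) -> date:
    """Return the same calendar day `years` later (29 Feb falls back to 28 Feb)."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # start is 29 February and the target year is not a leap year
        return start.replace(year=start.year + years, day=28)


def earliest_disposal_date(study_completed: date, study_type: str) -> date:
    """Earliest date on which the minimum retention period has elapsed."""
    return add_years(study_completed, MIN_RETENTION_YEARS[study_type])


if __name__ == "__main__":
    completed = date(2024, 6, 30)
    for study_type in MIN_RETENTION_YEARS:
        print(study_type, earliest_disposal_date(completed, study_type))
```

The result is a minimum only. Recording the calculated date in the DMP alongside the reason for the chosen period makes the retention decision easier to review later.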
However, longer retention periods for both basic research and population health and clinical studies may be appropriate in some cases. For example:
Indicate where the data will be deposited. If no established repository is proposed, demonstrate in the DMP that the data can be curated effectively beyond the lifetime of the grant. It is recommended to demonstrate that the repository's policies and procedures (including any metadata standards and costs involved) have been checked.
RCSI policy on preserving data in a data repository:
There are many benefits to putting your data in a data repository, and the repository can provide you with many of the following services:
Persistent identifier (such as a Digital Object Identifier (DOI)) assigned to your data
Assistance with metadata, for example the data repository will usually provide recommendations or templates for creating metadata about your data
In certain cases, publishers or funders may specify which data repository you must use to deposit your data. However, in most cases you will have to identify a suitable home for your data. Selecting the right data repository is crucial. Consider the following factors:
Questions to ask when evaluating a data repository:
If you do not have a suitable, discipline-specific repository for your data you can deposit your data in a generalist data repository. This type of repository will accept a variety of data types and file types, and most have the facility to assign a persistent identifier (PID) to published data.
Examples of generalist data repositories:
Dryad Digital Repository The Dryad Digital Repository is a curated resource that makes the data underlying scientific publications discoverable, freely reusable, and citable. Dryad accepts data from any field and in any format, and has dedicated curators to check your files before they are released, and help you follow best practices.
Figshare Figshare is a repository where users can easily upload files up to 5GB to make all of their research outputs available in a citable, shareable and discoverable manner. Any file format is accepted and DOIs are provided. The RCSI Repository uses Figshare and all entries to the Repository are automatically included as part of Figshare, with a 25GB default storage limit.
Zenodo Zenodo was built and is operated by CERN and OpenAIRE to ensure that everyone can join in Open Science. It welcomes research from all over the world, and from every discipline. Every upload is assigned a DOI, to make them citable and trackable.
If your research involved the use or development of new software, you should make the source code available on a Version Control System (VCS) such as GitHub or BitBucket. However, these sites support neither the preservation nor the citation of your code, so you should also upload a permanent, archived version of the source code to an approved repository. For example, GitHub is integrated with Zenodo, and Zenodo can provide a DOI registration for the archived source code.
'Sensitive data' is data that must be protected against unwanted disclosure, for legal or ethical reasons, for issues pertaining to personal privacy, or for proprietary considerations. At RCSI, many of our research projects work with sensitive data.
If you are handling sensitive data, keep in mind that special attention should be given to collecting, processing, handling and storing data throughout the research process. If you wish to make these data available at the end of the project, you will need to consider this when you are designing your study. In particular, when you are collecting data you will need to ensure you are asking for informed consent to share the data at the end of the project. This might limit your data sharing opportunities. However, you can publish a description of your data (metadata) without making the data itself openly accessible, and you can place conditions around access to published data if necessary. Sensitive data that has been properly anonymised can be shared without breaching data protection regulations.
Anonymisation irreversibly destroys any way of identifying the data subject. Personal data that has been rendered anonymous in such a way that the individual is not or no longer identifiable is no longer considered personal data. For data to be truly anonymised, the anonymisation must be irreversible. OpenAIRE provides researchers with a tool to anonymise data, Amnesia, along with a guide to using it.
Pseudonymisation replaces any identifying characteristics of data with a pseudonym, a value which does not allow the data subject to be directly identified. The personal data can only be attributed to a specific data subject with the use of additional information, such as a decryption key. This key should be kept separately, and be subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable individual. Pseudonymisation provides only limited protection for the identity of data subjects, as in many cases it still allows identification using indirect means.
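To make the idea concrete, here is a minimal sketch of one possible approach to pseudonymisation, using only the Python standard library. It is not an RCSI-endorsed method. It assumes a keyed hash (HMAC-SHA256) rather than encryption: the secret key plays the role of the "additional information" described above, and anyone holding it can recompute the pseudonym for a known identifier and so re-link records. The function names and the `participant_id` field are hypothetical.

```python
import hashlib
import hmac
import secrets


def new_key() -> bytes:
    """Generate a random 32-byte secret key. Store it separately from the data."""
    return secrets.token_bytes(32)


def pseudonymise(identifier: str, key: bytes) -> str:
    """Map an identifier to a stable pseudonym using HMAC-SHA256.

    The same identifier and key always give the same pseudonym, so records
    for one participant can still be linked within the dataset.
    """
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability (assumption of this sketch)


def pseudonymise_records(records, key, id_field="participant_id"):
    """Return copies of the records with the direct identifier replaced."""
    result = []
    for record in records:
        copy = dict(record)  # leave the original record untouched
        copy[id_field] = pseudonymise(str(record[id_field]), key)
        result.append(copy)
    return result


if __name__ == "__main__":
    key = new_key()
    data = [
        {"participant_id": "P001", "age": 34},
        {"participant_id": "P002", "age": 51},
    ]
    for row in pseudonymise_records(data, key):
        print(row)
```

Note that the `age` column is left as it is. Indirect identifiers of this kind are exactly why pseudonymised data is still treated as personal data, and why it should not be shared as though it were anonymised.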
You must comply with Irish State Law. Please see the Data Protection Commission's Guidance on Anonymisation and Pseudonymisation for more information. Both the Australian National Data Service (ANDS) guidelines on Publishing and Sharing Sensitive Data and the OpenAIRE guide on How to Deal with Sensitive Data provide further information on dealing with and sharing sensitive data.
Open Data: Data that can be accessed by any user for any reason, including commercial. Data in this category should not contain personal information unless consent is given.
Safeguarded Data: Data that contain no personal information, but where the data owner considers there to be a risk of disclosure resulting from linkage to other data.
Controlled Data: Data that may be disclosive. Data are generally only available to users through a relevant Data Access Committee, which may mandate training or other protective measures as appropriate.
Additionally, most data repositories will allow you to place a temporary embargo on your data. During the embargo period, the description of the dataset is published, but not the actual data. The data themselves will become available to access after the embargo period ends.
Sometimes there are legitimate reasons for not sharing some or all research data generated by a project. Funders who require data sharing will generally ask that researchers justify this decision in their Data Management Plan (DMP).
It is generally possible to choose not to share research data using the following criteria, which have been adapted from the European Commission Guidelines on Open Access to Scientific Publications and Research Data in Horizon 2020.
Some reasons why it might not be possible to share data include:
Data are commercially sensitive
Data are confidential (due to third-party obligations)
Sharing data would break data protection regulations
Sharing would mean that the project's main aim might not be achieved
Data are generated under an industry funded or co-funded project
Sharing of the data may impact on future plans to protect intellectual property
Please see the sections on Ethical Considerations and Data protection for further information on the limitations of sharing research data, and the importance of informed consent and ethical approval.
Another mode of disseminating your research data is to publish an article in a data journal. Data journals are publications that allow researchers to document a published dataset. The primary purpose of a data journal is to showcase the data, allowing the author to use the entire article to describe the data, rather than the resulting analysis.
Data journals are useful as they:
Data journal articles are also a means to give credit to those who share data. They can be an important tool for crediting those who provide support to research work but may not qualify for authorship on a traditional article e.g. a data manager.
The journal Scientific Data is an example of a data journal. Scientific Data is a peer-reviewed open-access journal for descriptions of datasets and research that advances the sharing and reuse of research data. See: https://www.nature.com/sdata/
What does it mean to license your data? In the DMP template you might see a question such as,
How will other legal issues, such as intellectual property rights and ownership, be managed? (Science Europe DMP template)
A license agreement is a legal arrangement between the creator/depositor of the data and the data repository, stating clear re-use rights to help others understand what they are allowed to do with your data. To make re-use as likely as possible, it is recommended that you choose a licence which:
To answer this question you need to think about who is the owner of the research data from your study (the PI? the funder? a consortium? a third-party organisation?) and whether you have the ownership rights to make the data available to others. If you can make the data available to others, will you need to restrict how third parties can use this data? For example, maybe the data can only be used for non-commercial purposes. Maybe you'd like to be cited as the origin of that data every time someone uses the data in the future. All of this can be clarified in the license agreement.
It is imperative that the intellectual property rights (IPR) pertaining to the data are established before any licensing takes place. If your research contains data from third parties (e.g., data from a health or hospital system) you should ensure you have the permission of the rights holder to share this data, or that the data is covered by licences that permit the sharing of data, before you put it in the data repository.
Creative Commons licenses are commonly applied to research data because
There are six different types of Creative Commons license, ranging from the most to least permissive. Creative Commons licenses allow the copyright holder to retain copyright ownership of their works while allowing others to use the work under certain conditions specified by the chosen licence. See a full description of these licences here
An open-source licence is a set of conditions that grants the users of your software certain rights to use, copy, modify, and possibly redistribute the source code or content of the software. It also asserts your authorship. There are several licensing options for open source software, including:
Additional information is available from the Software Sustainability Institute and Open Source Initiative.
Specific software: If potential users of your research data would need access to specific tools to be able to reuse the data, you should indicate that in your DMP and provide sufficient details on what software and what version would be required. If the software is available for download, provide information on where they can access a copy.
File formats: The ability to read your data in the future depends on the file format, so you are strongly encouraged to use standard, exchangeable or open file formats. You can store data in a proprietary format where it is the de facto format within your disciplinary area, or where the format is supported across a range of software (so you are not locked into one type of software). The go-to guidance on file formats is the Library of Congress (LOC) Recommended Format Statement which is updated each year, as this is a constantly evolving topic.
When choosing an electronic file format to create and store data, it's important to consider whether the format is open and/or ubiquitous. The format you use determines how accessible these data are to other users, as some files can only be opened when you have a license to use the software that created them. File format also determines how accessible the data will be to yourself and others into the future: technology evolves quickly, and the software that you use today will become obsolete in time.
Why use open file formats? File formats that are open or non-proprietary will tend to remain accessible, even if the software that created them is no longer available. However, formats which are ubiquitous or have become the default standard within a discipline, whether proprietary or not, are also likely to be maintained into the future.
What if you have a preferred software? If you find it necessary or convenient to work with a proprietary format, it may be useful to store your data using that format for data collection and analysis, while also storing a copy in an open or accessible format for sharing or archiving once your project is complete.
Which format is best for FAIR data? Many data archives and repositories will already have recommended file formats based on best practice within the disciplines they support.
When choosing a file format you should consider the following:
File formats likely to be accessible into the future (from DMPTool Guidance):
Examples of preferred format choices (from DMPTool Guidance):
For more information on recommended formats, see the UK Data Service guidance on recommended formats.
Indicate whether a persistent identifier (PID) will be pursued for the data. Typically, a trustworthy, long-term repository will provide a persistent identifier.
Persistent identifiers, or PIDs, are the backbone of data citation. If someone wants to replicate your analysis they will need to be able to find the correct copy of the data that you used. By including a persistent identifier in your data citation you enable readers to identify and navigate to the exact version of the data that you used in your research. The persistent identifier is preferable to a less stable reference point such as a URL (website) address, as persistent identifiers are slow to expire and the data is more likely to be findable for many years. There are several types of persistent identifier used to identify datasets, but DOI numbers are most commonly used. Please find more information on DOIs below.
A DOI number is a string of numbers, letters and symbols used to permanently identify an article or document and link to it on the web. DOIs are commonly used to identify a research data resource online, and their strength is that they provide a unique identifier for the file or collection of files and an easy way to locate these files online. They are superior to web address links (URLs) because a web address might change, whereas the DOI will never change. They also tend to be shorter and easier to cite than a web address.
Here's an example of what a DOI looks like in a data citation:
Smith, J., and Jones, P. (2023). Environmental risk factors for autism [dataset]. Dryad Digital Repository [distributor]. doi: 10.1234/abcd123
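A DOI written in a citation can be turned into a working link by placing it after the resolver address https://doi.org/. The short sketch below shows one way to do this. The function name is hypothetical, and the DOI used is the placeholder from the example citation above, not a real dataset.

```python
def doi_to_url(doi: str) -> str:
    """Turn a DOI as written in a citation into a resolvable https://doi.org/ link."""
    cleaned = doi.strip()
    lowered = cleaned.lower()
    # Drop a leading "doi:" label or an existing resolver address, if present.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if lowered.startswith(prefix):
            cleaned = cleaned[len(prefix):].strip()
            break
    # Every DOI begins with the directory indicator "10."
    if not cleaned.startswith("10."):
        raise ValueError(f"Not a DOI: {doi!r}")
    return "https://doi.org/" + cleaned


if __name__ == "__main__":
    # Placeholder DOI taken from the example citation above
    print(doi_to_url("doi: 10.1234/abcd123"))
```

Giving the DOI in this link form in a reference list lets readers follow it directly to the dataset's landing page.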
Many data repositories will assign a persistent identifier to your data once you publish the dataset (or metadata about the dataset) on their platform. For example, on Zenodo, data uploads are made available online as soon as you hit publish, and your DOI is registered within seconds.
If your research involved the use or development of new software, you should include a software availability statement. Your software availability statement should include the name of the repository where the source code at the time of publication (the archived version) is available, a DOI number for the archived software, and details of the license under which the software can be used. If possible, you should use a licence approved by the Open Source Initiative (OSI), which allows software to be freely used, modified, and shared.
Now that you have reached the conclusion of your research study, to ensure your data are FAIR, you have:
Published your data in a repository / archive which has provided an identifier for the published data.
Added rich metadata about the data to the repository / archive.
Attached a license to your data, so it is clear how a new user can use the data in a new work.
Clearly explained any access restriction in the metadata and given clear guidance on how to request access.
Provided a data citation in the metadata, including the important identifier and used this citation in your publications.
Provided a data availability statement in all of your publications.
Together, your data citation and data availability statement should look something like this:
Smith, J., and Jones, P. (2023). Environmental risk factors for autism [dataset]. Dryad Digital Repository [distributor]. doi: 10.1234/abcd123. The data that support the findings of this study are openly available in Dryad Digital Repository at doi: 10.1234/abcd123 under the terms of the Creative Commons Attribution 4.0 (CC-BY 4.0) license.
File formats for preservation
Licensing data
There are several free-to-use tools online to help you find a suitable license for your research data and/or software.
Finding a data repository
There are several resources to help you locate a suitable data repository:
See also the RCSI guide on "Where to submit data" created in collaboration with the Consortium of National and University Librarians (CONUL) for more information. https://drive.google.com/file/d/1S8Qc3cDdfziDdwW5ACRA59y2FQuMjMsm/view