Library Guides: Research Data Management: Data sharing and long-term preservation

How to describe your data preservation and sharing plan

At the outset of your study, even before any data collection has taken place, you should be planning what will happen to the data on completion of the study. For this reason, the DMP should address the following questions:

How will data for preservation be selected and where will they be preserved?
How and when will data be shared? Will you need to restrict access to the data, or place an embargo on access to the data after you have deposited it in a data repository?
Will users need access to specific methods or software tools to re-use your data?
In order for the data to be findable, how will a unique and persistent identifier be attached to the data?

How will data for preservation be selected and where will they be preserved?

RCSI recognizes research data as a valuable institutional asset. Research data must be retained and disposed of securely according to the relevant retention and disposal schedule, in accordance with legal, ethical and research funder requirements, and with particular concern for the confidentiality and security of the data.

RCSI policy on preserving research data: Researchers are responsible for providing access to research data requested by third parties as freely and timely as possible, unless access to the data is restricted for legitimate reasons, which should be stated in the metadata description or research article.

View the RCSI Research Data Management Policy in full.

Who is responsible for preserving the research data from a project? The RCSI Research Data Management Policy applies to all college members engaged in research, including staff and research students, and those who are conducting research on behalf of the College, irrespective of funding. Researchers have the primary responsibility for ensuring research data will be managed in line with funder requirements as well as College policy and other relevant regulations and legislation.

Research data that underpins published results or is considered to have long-term value should be retained, subject to informed consent to do so, where relevant. The current RCSI REC guideline is that research data should be retained for 5-7 years and then destroyed. However, this retention time could be significantly less or more depending on the nature of the study being conducted.

The RCSI Research Data Management Policy states that, in the absence of other provisions, the default period for research data retention is 10 years from the date of last requested access. Retained data must also be deposited in an appropriate national or international reputable data repository.

However, it is often advisable to retain research data/records for a longer period depending on the nature of the study and the data collected. For example, the Medical Research Council (UK) recommends the following retention schedule for various study designs.

For basic research: Research data and related material should be retained for a minimum of 10 years after the study has been completed.

For population health and clinical studies: Research data should be retained for 20 years after the study has been completed.

For clinical studies: In some cases, such as for clinical studies involving pregnant participants and those who lack capacity to consent, it has been recommended that a minimum of 25 years may be more appropriate for data retention.

However, longer retention periods for both basic research and population health and clinical studies may be appropriate in some cases. For example:

For basic research – Retention periods of 10 years+ may be more appropriate where there is the potential for Intellectual Property to arise (e.g. laboratory notebooks could be retained indefinitely). Similarly, research data relating to studies which directly inform national policymaking should be considered for permanent preservation in an appropriate archive or repository.
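When you write the retention section of your DMP, it can help to work out the earliest date on which the data could be reviewed for disposal. The following is a minimal sketch based only on the Medical Research Council minimum periods listed above. The category names and the function name are hypothetical labels chosen for this example, and your funder, ethics approval or the RCSI policy may require a different period.

```python
from datetime import date

# Minimum retention periods in years, taken from the MRC schedule above.
# The category names are hypothetical labels used only in this sketch.
RETENTION_YEARS = {
    "basic": 10,
    "population_health_or_clinical": 20,
    # e.g. pregnant participants, or participants who lack capacity to consent
    "clinical_extended": 25,
}


def earliest_disposal_date(study_completed: date, category: str) -> date:
    """Return the earliest date on which the data could be reviewed for disposal."""
    target_year = study_completed.year + RETENTION_YEARS[category]
    try:
        return study_completed.replace(year=target_year)
    except ValueError:
        # 29 February does not exist in a non-leap target year, so use 1 March.
        return date(target_year, 3, 1)


print(earliest_disposal_date(date(2024, 6, 30), "basic"))
```

The result is a minimum, not a deadline: as noted above, longer retention or permanent preservation may be more appropriate for some studies.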

Indicate where the data will be deposited. If no established repository is proposed, demonstrate in the DMP that the data can be curated effectively beyond the lifetime of the grant. It is recommended that you demonstrate that the repository's policies and procedures (including any metadata standards and costs involved) have been checked.

RCSI policy on preserving data in a data repository:

Retained data must be deposited in an appropriate national or international reputable data repository or as mandated by the funder. This may be specified by the funder or publisher.
When depositing research data into external data repositories, repositories that support Open Researcher and Contributor ID (ORCID) should be chosen as far as is practical.
View the RCSI Research Data Management Policy in full

There are many benefits to putting your data in a data repository, and the repository can provide you with many of the following services:

Persistent identifier (such as a Digital Object Identifier (DOI)) assigned to your data
Assistance with metadata, for example the data repository will usually provide recommendations or templates for creating metadata about your data
Licensing of your data, for example the data repository will usually provide recommendations or options for selecting a data licence
Long-term access to the data and, in some cases, long-term preservation
Search and navigation tools, and sometimes visualisation tools for data, which can help with making your data findable.
Your data is more likely to reach a wide audience of new users from anywhere in the world
If the data requires controlled access, some repositories can manage access requests on behalf of the owner of the data
In other words, data repositories can help you to make your data more FAIR.

In certain cases, publishers or funders may specify which data repository you must use to deposit your data. However, in most cases you will have to identify a suitable home for your data. Selecting the right data repository is crucial. Consider the following factors:

Discipline Specificity: When possible, choose a repository that caters to your research discipline.
Data Types: Ensure the repository can accept and support the types of data you wish to deposit.
Community Standards: Select a repository that adheres to relevant community standards and best practices.
Certification: Look for repositories with certifications such as CoreTrustSeal, indicating trustworthiness and reliability.
Cost: Check if the repository charges fees for data deposit or access.
Funder/Publisher Requirements: Some funders or publishers may specify preferred or mandatory repositories. Always check their guidelines first.

Questions to ask when evaluating a data repository:

What metadata standards does the repository support?
What data formats are accepted?
What are the repository's policies on data preservation and access?
Does the repository provide persistent identifiers for datasets?
What type of licensing options are available?

If you do not have a suitable, discipline-specific repository for your data you can deposit your data in a generalist data repository. This type of repository will accept a variety of data types and file types, and most have the facility to assign a persistent identifier (PID) to published data.

Examples of generalist data repositories:

Dryad Digital Repository The Dryad Digital Repository is a curated resource that makes the data underlying scientific publications discoverable, freely reusable, and citable. Dryad accepts data from any field and in any format, and has dedicated curators to check your files before they are released, and help you follow best practices.

Figshare Figshare is a repository where users can easily upload files up to 5GB to make all of their research outputs available in a citable, shareable and discoverable manner. Any file format is accepted and DOIs are provided. The RCSI Repository uses Figshare and all entries to the Repository are automatically included as part of Figshare, with a 25GB default storage limit.

Zenodo Zenodo was built and is operated by CERN and OpenAIRE to ensure that everyone can join in Open Science. It welcomes research from all over the world, and from every discipline. Every upload is assigned a DOI, to make them citable and trackable.

If your research involved the use or development of new software, you should make the source code available on a Version Control System (VCS) such as GitHub or BitBucket. However, these sites do not support the preservation or citation of your code, so you should also upload a permanent, archived version of the source code to an approved repository. For example, GitHub is integrated with Zenodo, and Zenodo can provide a DOI registration for the archived source code.

How and when will data be shared?

'Sensitive data' is data that must be protected against unwanted disclosure, for legal or ethical reasons, for issues pertaining to personal privacy, or for proprietary considerations. At RCSI, many of our research projects work with sensitive data.

If you are handling sensitive data, special attention should be given to collecting, processing, handling and storing the data throughout the research process. If you wish to make these data available at the end of the project, you will need to consider this when you are designing your study. In particular, when you are collecting data you will need to ensure you ask for informed consent to share the data at the end of the project. This might limit your data sharing opportunities; however, you can publish a description of your data (metadata) without making the data itself openly accessible, and you can place conditions around access to published data if necessary. Sensitive data that has been properly anonymised can be shared without breaching data protection regulations.

Anonymisation irreversibly destroys any way of identifying the data subject. Personal data that has been rendered anonymous in such a way that the individual is not or no longer identifiable is no longer considered personal data. For data to be truly anonymised, the anonymisation must be irreversible. OpenAIRE provides researchers with a tool to anonymise data, Amnesia, together with a guide to using it.

Pseudonymisation replaces any identifying characteristics of data with a pseudonym, a value which does not allow the data subject to be directly identified. The personal data can only be attributed to a specific data subject with the use of additional information, such as a decryption key. This key should be kept separately, and be subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable individual. Pseudonymisation only provides limited protection for the identity of data subjects, as in many cases it still allows identification using indirect means.
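To make the idea of a separately held key more concrete, here is a minimal sketch of pseudonymisation using only the Python standard library. It replaces a direct identifier with a random pseudonym and returns the lookup table (the "additional information") as a separate object. The function name, the field name `patient_id` and the records are all made up for illustration, and this is not a substitute for the Data Protection Commission guidance or for a dedicated tool.

```python
import secrets


def pseudonymise(records, id_field):
    """Replace the direct identifier in each record with a random pseudonym.

    Returns the pseudonymised records and a separate key table that maps
    each pseudonym back to the original identifier.
    """
    pseudonym_for = {}  # original identifier -> pseudonym
    pseudonymised = []
    for record in records:
        original_id = record[id_field]
        if original_id not in pseudonym_for:
            # Random value, so the pseudonym reveals nothing about the identifier.
            pseudonym_for[original_id] = "P-" + secrets.token_hex(8)
        cleaned = dict(record)  # copy, so the input records are left untouched
        cleaned[id_field] = pseudonym_for[original_id]
        pseudonymised.append(cleaned)
    key_table = {pseudonym: original for original, pseudonym in pseudonym_for.items()}
    return pseudonymised, key_table


# Made-up example records; "patient_id" is a hypothetical field name.
records = [
    {"patient_id": "MRN-001", "age": 54, "visit": 1},
    {"patient_id": "MRN-002", "age": 61, "visit": 1},
    {"patient_id": "MRN-001", "age": 54, "visit": 2},
]
shareable, key_table = pseudonymise(records, "patient_id")
```

In this sketch `key_table` is what would need to be stored separately and under stricter controls than `shareable`. Note that indirect identifiers such as age are still present in the output, which is why pseudonymised data are not the same as anonymised data.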

You must comply with Irish State law. Please see the Data Protection Commission's Guidance on Anonymisation and Pseudonymisation for more information. Both the Australian National Data Service (ANDS) guidelines on Publishing and Sharing Sensitive Data and the OpenAIRE guide on How to Deal with Sensitive Data provide further information on dealing with and sharing sensitive data.

In your DMP you should indicate whether data will be shared via a repository, whether requests will be handled directly, or whether another mechanism will be used.

Sensitive and confidential data can be safeguarded by regulating or restricting access to and use of the data. Access controls should always be proportionate to the kind of data and level of confidentiality involved. When regulating access, consider who would be able to access your data, what they are able to do with it, whether any specific use restrictions are required, and for how long you want the data to be available.

The three levels of data access, according to the UK Data Service, are:

Open Data: Data that can be accessed by any user for any reason, including commercial. Data in this category should not contain personal information unless consent is given.

Safeguarded Data: Data that contain no personal information, but where the data owner considers there to be a risk of disclosure resulting from linkage to other data.

Controlled Data: Data that may be disclosive. These data are generally only available to users through a relevant Data Access Committee, which may mandate training or other protective measures as appropriate.

Additionally, most data repositories will allow you to place a temporary embargo on your data. During the embargo period, the description of the dataset is published, but not the actual data. The data themselves will become available to access after the embargo period ends.
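The combination of access level and embargo described above can be sketched as a small record with one check. This is only an illustration of the logic: the class, field and function names are hypothetical, the rule that only "open" data can be downloaded without a request is a simplification, and the embargo date is assumed to be the first day on which the data become available. Real repositories implement these controls in their own way.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DatasetRecord:
    title: str
    access_level: str  # "open", "safeguarded" or "controlled"
    embargo_until: Optional[date] = None  # first day the data are released


def description_is_public(record: DatasetRecord) -> bool:
    """The description (metadata) is published even while the data are embargoed."""
    return True


def data_downloadable_without_request(record: DatasetRecord, today: date) -> bool:
    """Return True if the data files can be downloaded without an access request."""
    if record.embargo_until is not None and today < record.embargo_until:
        return False  # still under embargo: only the description is visible
    return record.access_level == "open"


survey = DatasetRecord("Example survey", "open", embargo_until=date(2026, 1, 1))
print(data_downloadable_without_request(survey, date(2025, 6, 1)))
print(data_downloadable_without_request(survey, date(2026, 1, 1)))
```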

Sometimes there are legitimate reasons for not sharing some or all research data generated by a project. Funders who require data sharing will generally ask that researchers justify this decision in their Data Management Plan (DMP).

It is generally possible to choose not to share research data using the following criteria, which have been adapted from the European Commission Guidelines on Open Access to Scientific Publications and Research Data in Horizon 2020.

Some reasons why it might not be possible to share data include:

Data are commercially sensitive
Data are confidential (due to third party obligation)
Sharing data would break data protection regulations
Sharing would mean that the project's main aim might not be achieved
Data are generated under an industry funded or co-funded project
Sharing of the data may impact on future plans to protect intellectual property

Please see the sections on Ethical Considerations and Data protection for further information on the limitations of sharing research data, and the importance of informed consent and ethical approval.

Another mode of disseminating your research data is to publish an article in a data journal. Data journals are publications that allow researchers to document a published dataset. The primary purpose of a data journal is to showcase the data, allowing the author to use the entire article to describe the data, rather than the resulting analysis.

Data journals are useful as they:

can be used to raise awareness of a new dataset and highlight opportunities for that data to be reused in new ways
provide valuable context about the data that may not be covered in the data repository metadata or traditional publications associated with that data, e.g. they can provide much more detail about where the data came from and what it contains
can draw more attention to the publication containing the findings - research by McGillivray et al. (2022) shows that mean citation counts for research articles are higher if they have an associated data paper.

Data journal articles are also a means to give credit to those who share data. They can be an important tool for crediting those who provide support to research work but may not qualify for authorship on a traditional article e.g. a data manager.

The journal Scientific Data is an example of a data journal. Scientific Data is a peer-reviewed open-access journal for descriptions of datasets and research that advances the sharing and reuse of research data. See: https://www.nature.com/sdata/

What does it mean to license your data? In the DMP template you might see a question such as,

How will other legal issues, such as intellectual property rights and ownership, be managed? (Science Europe DMP template)

A license agreement is a legal arrangement between the creator/depositor of the data and the data repository, stating clear re-use rights to help others understand what they are allowed to do with your data. To make re-use as likely as possible, it is recommended that you choose a licence which:

Makes data available to the widest audience possible
Makes the widest range of uses possible

To answer this question you need to think about who is the owner of the research data from your study (the PI? the funder? a consortium? a third-party organisation?) and whether you have the ownership rights to make the data available to others. If you can make the data available to others, will you need to restrict how third parties can use this data? For example, maybe the data can only be used for non-commercial purposes. Maybe you'd like to be cited as the origin of that data every time someone uses the data in the future. All of this can be clarified in the license agreement.

It is imperative that the intellectual property rights (IPR) pertaining to the data are established before any licensing takes place. If your research contains data from third parties (e.g., data from a health or hospital system) you should ensure you have the permission of the rights holder to share this data, or that the data is covered by licences that permit the sharing of data, before you put it in the data repository.

Creative Commons licenses are commonly applied to research data because

a CC license gives you a way to grant others permission to use your data under copyright law, and
a CC license gives clarity to new users of the data what they are allowed to do with the data

There are six different types of Creative Commons license, ranging from the most to least permissive. Creative Commons licenses allow the copyright holder to retain copyright ownership of their works while allowing others to use the work under certain conditions specified by the chosen licence. A full description of these licences is available on the Creative Commons website.

CC-BY Attribution
Users can distribute, remix, tweak, and build upon a work, even commercially, as long as they give credit to the original creator of the work.
CC BY-SA Attribution-ShareAlike
Users can remix, tweak, and build upon a work even for commercial purposes, as long as they credit the original creator and license any new creations under identical terms.
CC BY-ND Attribution-NoDerivs
Users can copy and redistribute the material in any medium or format for any purpose, even commercially, as long as it is passed along unchanged and credit is given to the original creator.
CC BY-NC Attribution-NonCommercial
Users can copy and redistribute the material in any medium or format and remix, transform, and build upon the material but any new works must be non-commercial and give credit to the original creator.
CC BY-NC-SA Attribution-NonCommercial-ShareAlike
Users can copy and redistribute the material in any medium or format and remix, transform, and build upon the material but any new works must be non-commercial, give credit to the original creator and be licensed under identical terms.
CC BY-NC-ND Attribution-NonCommercial-NoDerivs
This licence is the most restrictive. Users can copy and redistribute the material, but they must credit the original creator and cannot change the work in any way or use it commercially.
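The six licences above differ on three conditions: whether commercial use is allowed, whether derivative works are allowed, and whether derivatives must be shared under identical terms. As a rough sketch, those conditions can be written as a lookup table and used to narrow down a choice. The licence names are the standard Creative Commons abbreviations; the dictionary keys and the function name are hypothetical, and the output is a starting point rather than legal advice.

```python
# Conditions of the six Creative Commons licences described above.
# Every licence in the table requires attribution (BY).
CC_LICENCES = {
    "CC BY":       {"commercial_use": True,  "derivatives": True,  "share_alike": False},
    "CC BY-SA":    {"commercial_use": True,  "derivatives": True,  "share_alike": True},
    "CC BY-ND":    {"commercial_use": True,  "derivatives": False, "share_alike": False},
    "CC BY-NC":    {"commercial_use": False, "derivatives": True,  "share_alike": False},
    "CC BY-NC-SA": {"commercial_use": False, "derivatives": True,  "share_alike": True},
    "CC BY-NC-ND": {"commercial_use": False, "derivatives": False, "share_alike": False},
}


def choose_licence(allow_commercial, allow_derivatives, require_share_alike=False):
    """Return the licence whose conditions match the three answers given."""
    if require_share_alike and not allow_derivatives:
        raise ValueError("share-alike only applies when derivatives are allowed")
    for name, terms in CC_LICENCES.items():
        if (terms["commercial_use"] == allow_commercial
                and terms["derivatives"] == allow_derivatives
                and terms["share_alike"] == require_share_alike):
            return name
    raise ValueError("no matching licence")


print(choose_licence(allow_commercial=True, allow_derivatives=True))
```

Answering "yes" to both commercial use and derivatives, with no share-alike requirement, selects the most permissive licence in the table, which fits the recommendation above to allow the widest audience and the widest range of uses.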

An open-source licence is a set of conditions that grants the users of your software certain rights to use, copy, modify, and possibly redistribute the source code or content of the software. It also asserts your authorship. There are several licensing options for open source software, including:

MIT License – permits any person to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software as long as a copy of the license notification is included with any reuse
GNU General Public License - users can copy, distribute, and modify the software as long as any modifications are also licensed under the GPL
Apache license 2.0 - allows users to use the software for any purpose, to distribute it, to modify it, and to distribute modified versions of the software as long as a copy of the license is redistributed with any modified software

Additional information is available from the Software Sustainability Institute and Open Source Initiative.

Will users need specific software to re-use the data?

Ensuring your data can be opened
Formats for preservation

Specific software: If potential users of your research data would need access to specific tools to be able to reuse the data, you should indicate that in your DMP and provide sufficient details on what software and what version would be required. If the software is available for download, provide information on where they can access a copy.

File formats: The ability to read your data in the future depends on the file format, so you are strongly encouraged to use standard, exchangeable or open file formats. You can store data in a proprietary format where it is the de facto format within your disciplinary area, or where the format is supported across a range of software (so you are not locked into one type of software). The go-to guidance on file formats is the Library of Congress (LOC) Recommended Formats Statement, which is updated each year, as this is a constantly evolving topic.

When choosing an electronic file format to create and store data, it's important to consider whether the format is open and/or ubiquitous. The format you use determines how accessible these data are to other users, as some files can only be opened when you have a license to use that software. File formats also determine how accessible the data will be to yourself and others into the future - technology evolves quickly, and the software that you use today will become obsolete in time.

Why use open file formats? File formats that are open or non-proprietary will tend to remain accessible, even if the software that created them is no longer available. However, formats which are ubiquitous or have become the default standard within a discipline, whether proprietary or not, are also likely to be maintained into the future.

What if you have a preferred software? If you find it necessary or convenient to work with a proprietary format, it may be useful to store your data using that format for data collection and analysis, while also storing a copy in an open or accessible format for sharing or archiving once your project is complete.

Which format is best for FAIR data? Many data archives and repositories will already have recommended file formats based on best practice within the disciplines they support.

When choosing a file format you should consider the following:

How you plan to analyse your data
Which software and file formats you and your colleagues have used in the past
Any discipline specific norms or technical standards
Whether file formats are at risk of obsolescence because of their dependence on a particular technology.
Which formats are best to use for the long-term preservation of data
Whether important information might be lost by converting between different formats

File formats likely to be accessible into the future (from DMPTool Guidance):

Non-proprietary
Open, with documented standards
In common usage by the research community
Using standard character encodings (e.g., ASCII, UTF-8)
Uncompressed (space permitting)

Examples of preferred format choices (from DMPTool Guidance):

Image: JPEG, JPEG 2000, PNG, TIFF
Text: plain text (TXT), HTML, XML, PDF/A
Audio: AIFF, WAVE
Containers: TAR, GZIP, ZIP
Databases: prefer XML or CSV to native binary formats

For more information on recommended formats, see the UK Data Service guidance on recommended formats.
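As an illustration of the guidance above (open format, standard character encoding, uncompressed), the following sketch writes a small table to CSV with UTF-8 encoding and reads it back using only the Python standard library. The records, field names and file name are made up for this example.

```python
import csv
import os
import tempfile

# Made-up example records; the field names are hypothetical.
rows = [
    {"participant": "P-001", "site": "Dún Laoghaire", "score": 12.5},
    {"participant": "P-002", "site": "Galway", "score": 9.0},
]


def export_csv(rows, path):
    """Write the records to an uncompressed, UTF-8 encoded CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def read_csv(path):
    """Read the CSV file back as a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


path = os.path.join(tempfile.mkdtemp(), "survey_scores.csv")
export_csv(rows, path)
restored = read_csv(path)
print(restored[0])
```

Stating the encoding explicitly keeps accented characters intact. Every value read back from the CSV file is a string, so the numeric type of `score` is not stored in the file itself. This is a small example of information that can be lost when converting between formats, and a reason to document variable types in your metadata.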

How will a unique and persistent identifier be attached to the data?

Identifying your data (PIDs)
Software availability statement

Indicate whether a persistent identifier (PID) will be pursued for the data. Typically, a trustworthy, long-term repository will provide a persistent identifier.

Persistent identifiers (PIDs) are the backbone of data citation. If someone wants to replicate your analysis they will need to be able to find the correct copy of the data that you used. By including a persistent identifier in your data citation you enable readers to identify and navigate to the exact version of the data that you used in your research. A persistent identifier is preferable to a less stable reference point such as a URL (website address), because it is designed to keep pointing to the data even if the web location changes, so the data is more likely to remain findable for many years. There are several types of persistent identifier used to identify datasets, but DOIs are the most commonly used. Please find more information on DOIs below.

A DOI number is a string of numbers, letters and symbols used to permanently identify an article or document and link to it on the web. DOIs are commonly used to identify a research data resource online, and their strength is that they provide a unique identifier for the file or collection of files and provide an easy way to locate these files online. They are superior to web address links (URLs) as while a web address (URL) might change, the DOI will never change, plus they tend to be shorter and easier to cite than a web address.

Here's an example of what a DOI looks like in a data citation:
Smith, J., and Jones, P. (2023). Environmental risk factors for autism [dataset]. Dryad Digital Repository [distributor]. doi: 10.1234/abcd123
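A DOI can be turned into a web link by placing it after the https://doi.org/ resolver address. The sketch below shows this, along with a helper that assembles a citation in the same pattern as the example above. The function names are hypothetical, the validity check is deliberately loose, and 10.1234/abcd123 is the placeholder DOI from the example rather than a real dataset.

```python
def doi_to_url(doi):
    """Turn a DOI written in a common form into an https://doi.org/ link."""
    cleaned = doi.strip()
    known_prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in known_prefixes:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):].strip()
            break
    # Loose sanity check only: DOIs start with "10." and contain a "/".
    if not cleaned.startswith("10.") or "/" not in cleaned:
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + cleaned


def format_data_citation(authors, year, title, repository, doi):
    """Assemble a data citation in the pattern used in this guide."""
    return f"{authors} ({year}). {title} [dataset]. {repository} [distributor]. doi: {doi}"


print(doi_to_url("doi: 10.1234/abcd123"))
print(format_data_citation(
    "Smith, J., and Jones, P.", 2023,
    "Environmental risk factors for autism",
    "Dryad Digital Repository", "10.1234/abcd123",
))
```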

Many data repositories will assign a persistent identifier to your data once you publish the dataset (or metadata about the dataset) on their platform. For example, on Zenodo, data uploads are made available online as soon as you hit publish, and your DOI is registered within seconds.

If your research involved the use or development of new software, you should include a software availability statement. Your software availability statement should include the name of the repository where the source code at the time of publication (the archived version) is available, a DOI for the archived software, and details of the license under which the software can be used. If possible, you should use an open-source license approved by the Open Source Initiative (OSI), which allows software to be freely used, modified, and shared.

Putting it all together

Now that you have reached the conclusion of your research study, to ensure your data are FAIR, you have:

Published your data in a repository / archive which has provided an identifier for the published data.

Added rich metadata about the data to the repository / archive.

Attached a license to your data, so it is clear how a new user can use the data in a new work.

Clearly explained any access restriction in the metadata and given clear guidance on how to request access.

Provided a data citation in the metadata, including the persistent identifier, and used this citation in your publications.

Provided a data availability statement in all of your publications.

Together, your data citation and data availability statement should look something like this:
Smith, J., and Jones, P. (2023). Environmental risk factors for autism [dataset]. Dryad Digital Repository [distributor]. doi: 10.1234/abcd123. The data that support the findings of this study are openly available in Dryad Digital Repository at doi: 10.1234/abcd123 under the terms of the Creative Commons Attribution 4.0 (CC-BY 4.0) license.

Further resources

File formats for preservation

Library of Congress Recommended Formats Statement The Library of Congress identified preferred and acceptable file formats for textual works and musical compositions, still image works, audio works, moving image works, software and electronic gaming and learning, datasets/databases and websites.
UK Data Service Recommended Formats Guidance on file formats recommended and accepted by the UK Data Service for data sharing, reuse and preservation.
UCD Digital Library Preferred Formats for Data Preferred formats identified by the UCD Digital Library and Repository which facilitate processing, storage, and dissemination of data, assuring both useability and longer-term durability of the data.

Licensing data

There are several free-to-use tools online to help you find a suitable license for your research data and/or software.

Creative Commons
Choosealicense.com
License Selector
The Digital Curation Centre (DCC) has also created a guide on "How to License Research Data"

Finding a data repository

There are several resources to help you locate a suitable data repository:

re3data.org The Registry of Research Data Repositories is a directory of more than 2,000 data repositories that meet established standards. It is recommended by Horizon Europe for locating an optimal repository for your data.
https://fairsharing.org/ FAIRSharing gathers details about repositories, which you can filter by subject, domain and taxonomy.
http://www.researchpipeline.com/ Research Pipeline is a privately-maintained list of repositories, including 140 disciplinary databases. This site is updated less often than the above.

See also the RCSI guide on "Where to submit data" created in collaboration with the Consortium of National and University Librarians (CONUL) for more information. https://drive.google.com/file/d/1S8Qc3cDdfziDdwW5ACRA59y2FQuMjMsm/view