At the outset of your study, even before any data collection has taken place, you should be planning what will happen to the data on completion of the study. For this reason, the DMP should address the following questions:
RCSI recognizes research data as a valuable institutional asset. Research data must be retained and disposed of securely according to the relevant retention and disposal schedule, in accordance with legal, ethical and research funder requirements, and with particular concern for the confidentiality and security of the data.
RCSI policy on preserving research data: Researchers are responsible for providing access to research data requested by third parties as freely and timely as possible, unless access to the data is restricted for legitimate reasons, which should be stated in the metadata description or research article.
View the RCSI Research Data Management Policy in full.
Who is responsible for preserving the research data from a project? The RCSI Research Data Management Policy applies to all college members engaged in research, including staff and research students, and those who are conducting research on behalf of the College, irrespective of funding. Researchers have the primary responsibility for ensuring research data will be managed in line with funder requirements as well as College policy and other relevant regulations and legislation.
How long should I retain research data? Research data that underpins published results or is considered to have long-term value should be retained, subject to informed consent to do so, where relevant. The current RCSI REC guideline is that research data should be retained for 5-7 years and then destroyed. However, this retention time could be significantly less or more depending on the nature of the study being conducted.
The RCSI Research Data Management Policy states that, in the absence of other provisions, the default period for research data retention is 10 years from the date of last requested access. Retained data must also be deposited in an appropriate, reputable national or international data repository.
However, it is often advisable to retain research data/records for a longer period depending on the nature of the study and the data collected. For example, the Medical Research Council (UK) recommends the following retention schedule for various study designs.
- For basic research: Research data and related material should be retained for a minimum of 10 years after the study has been completed.
- For population health and clinical studies: Research data should be retained for 20 years after the study has been completed.
- For clinical studies: In some cases, such as for clinical studies involving pregnant participants and those who lack capacity to consent, it has been recommended that a minimum of 25 years may be more appropriate for data retention.
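As a rough illustration, the minimum periods above can be turned into an earliest disposal date for a study. The Python sketch below is only an example: the category labels and the `earliest_disposal_date` helper are hypothetical names, the year counts are simply the MRC minimums listed above, and the actual schedule for your study should follow your funder's requirements and RCSI policy.

```python
from datetime import date

# Minimum retention periods in years, taken from the MRC schedule listed above.
# The dictionary keys are illustrative labels, not official category names.
MIN_RETENTION_YEARS = {
    "basic": 10,
    "population_health_or_clinical": 20,
    "clinical_special_case": 25,  # e.g. pregnant participants, lack of capacity
}


def earliest_disposal_date(study_end: date, study_type: str) -> date:
    """Return the earliest date on which the data could be disposed of."""
    years = MIN_RETENTION_YEARS[study_type]
    try:
        return study_end.replace(year=study_end.year + years)
    except ValueError:
        # study_end was 29 February and the target year is not a leap year.
        return study_end.replace(year=study_end.year + years, day=28)
```

Because these are minimum periods, the returned date is a lower bound and not a deadline for destroying the data.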
However, longer retention periods for both basic research and population health and clinical studies may be appropriate in some cases. For example:
Indicate where the data will be deposited. If no established repository is proposed, demonstrate in the DMP that the data can be curated effectively beyond the lifetime of the grant. It is recommended that you demonstrate that the repository's policies and procedures (including any metadata standards and costs involved) have been checked.
RCSI policy on preserving data in a data repository:
There are many benefits to putting your data in a data repository, and the repository can provide you with many of the following services:
Persistent identifier (such as a Digital Object Identifier (DOI)) assigned to your data
Assistance with metadata, for example the data repository will usually provide recommendations or templates for creating metadata about your data
When choosing a data repository, always start by looking for a broadly recognised, discipline-specific or certified repository in your scientific field. If you cannot find such a repository, or if you're unsure whether you've found a good home for your data, you can use the following assessment criteria, which we have adapted from Science Europe's Practical Guide to the International Alignment of Research Data Management - Extended Edition.
In certain cases, publishers or funders may specify which data repository you must use to deposit your data. However, in most cases you will have to identify a suitable home for your data.
There are several resources to help you locate a suitable data repository:
As you review potential repositories, ask the following questions to assess their suitability:
See also the RCSI guide on "Where to submit data" created in collaboration with the Consortium of National and University Librarians (CONUL) for more information. https://drive.google.com/file/d/1S8Qc3cDdfziDdwW5ACRA59y2FQuMjMsm/view
If you do not have a suitable, discipline-specific repository for your data you can deposit your data in a generalist data repository. This type of repository will accept a variety of data types and file types, and most have the facility to assign a persistent identifier (PID) to published data.
Examples of generalist data repositories:
Dryad Digital Repository: The Dryad Digital Repository is a curated resource that makes the data underlying scientific publications discoverable, freely reusable, and citable. Dryad accepts data from any field and in any format, and has dedicated curators to check your files before they are released, and help you follow best practices.
Figshare: Figshare is a repository where users can easily upload files up to 5GB to make all of their research outputs available in a citable, shareable and discoverable manner. Any file format is accepted and DOIs are provided. The RCSI Repository uses Figshare and all entries to the Repository are automatically included as part of Figshare, with a 25GB default storage limit.
Zenodo: Zenodo was built and is operated by CERN and OpenAIRE to ensure that everyone can join in Open Science. It welcomes research from all over the world, and from every discipline. Every upload is assigned a DOI, to make them citable and trackable.
If your research involved the use or development of new software, you should make the source code available on a Version Control System (VCS) such as GitHub or BitBucket. However, these sites do not support the preservation or citation of your code, so you should also upload a permanent, archived version of the source code to an approved repository. For example, GitHub is integrated with Zenodo, and Zenodo can provide a DOI registration for the archived source code.
Explain the foreseeable research uses (and/ or users) for the retained data
'Sensitive data' is data that must be protected against unwanted disclosure, for legal or ethical reasons, for issues pertaining to personal privacy, or for proprietary considerations. At RCSI, many of our research projects work with sensitive data.
If you are handling sensitive data, keep in mind that special attention should be given to collecting, processing, handling and storing data throughout the research process. If you wish to make these data available at the end of the project, you will need to consider this when you are designing your study. In particular, when you are collecting data you will need to ensure you are asking for informed consent to share the data at the end of the project. This might limit your data sharing opportunities; however, you can publish a description of your data (metadata) without making the data itself openly accessible, and you can place conditions around access to published data if necessary. Sensitive data that has been properly anonymised can be shared without breaching data protection regulations.
Anonymisation irreversibly destroys any way of identifying the data subject. Personal data that has been rendered anonymous in such a way that the individual is not or no longer identifiable is no longer considered personal data. For data to be truly anonymised, the anonymisation must be irreversible. OpenAIRE provides researchers with a tool to anonymise data, Amnesia, along with a guide to using it.
Pseudonymisation replaces any identifying characteristics of data with a pseudonym, a value which does not allow the data subject to be directly identified. The personal data can only be attributed to a specific data subject with the use of additional information, such as a decryption key. This key should be kept separately, and be subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable individual. Pseudonymisation provides only limited protection for the identity of data subjects, as in many cases it still allows identification using indirect means.
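To make the idea concrete, the sketch below shows one common way to pseudonymise a direct identifier in Python: a keyed hash (HMAC) from the standard library. This is an assumption chosen for illustration, not a method prescribed by RCSI or the Data Protection Commission, and the function and field names are hypothetical. The secret key plays the role of the "additional information" described above and must be stored separately from the data.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from a direct identifier using a keyed hash."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    # A shortened hex digest is used here purely for readability.
    return digest.hexdigest()[:16]


def pseudonymise_records(records, id_field, secret_key):
    """Return copies of the records with the identifier field replaced."""
    output = []
    for record in records:
        cleaned = dict(record)  # copy so the original record is not modified
        cleaned[id_field] = pseudonymise(str(record[id_field]), secret_key)
        output.append(cleaned)
    return output
```

The same identifier always maps to the same pseudonym under a given key, so records for one participant can still be linked. Other fields (age, location, rare diagnoses) are left untouched, which is why pseudonymised data can still allow indirect identification.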
You must comply with Irish State Law; please see the Data Protection Commission's Guidance on Anonymisation and Pseudonymisation for more information. Both the Australian National Data Service (ANDS) guidelines on Publishing and Sharing Sensitive Data and the OpenAIRE guide on How to Deal with Sensitive Data provide further information on dealing with and sharing sensitive data.
Open Data: Data that can be accessed by any user for any reason, including commercial. Data in this category should not contain personal information unless consent is given.
Safeguarded Data: Data that contain no personal information, but where the data owner considers there to be a risk of disclosure resulting from linkage to other data.
Controlled Data: for data that may be disclosive. Data are generally only available to users through a relevant Data Access Committee, which may mandate training or other protective measures as appropriate.
Additionally, most data repositories will allow you to place a temporary embargo on your data. During the embargo period, the description of the dataset is published, but not the actual data. The data themselves will become available to access after the embargo period ends.
Do you need exclusive use of the data while you finalise your publication? Will you need to embargo access to the data for a period of time?
Sometimes there are legitimate reasons for not sharing some or all research data generated by a project. Funders who require data sharing will generally ask that researchers justify this decision in their Data Management Plan (DMP).
It is generally possible to choose not to share research data using the following criteria, which have been adapted from the European Commission Guidelines on Open Access to Scientific Publications and Research Data in Horizon 2020.
Some reasons why it might not be possible to share data include:
Data are commercially sensitive
Data are confidential (due to third party obligation)
Sharing data would break data protection regulations
Sharing would mean that the project's main aim might not be achieved
Data are generated under an industry funded or co-funded project
Sharing of the data may impact on future plans to protect intellectual property
Please see the sections on Ethical Considerations and Data protection for further information on the limitations of sharing research data, and the importance of informed consent and ethical approval.
What does it mean to license your data? In the DMP template you might see a question such as,
How will other legal issues, such as intellectual property rights and ownership, be managed? (Science Europe DMP template)
A license agreement is a legal arrangement between the creator/depositor of the data and the data repository, stating clear re-use rights to help others understand what they are allowed to do with your data. To make re-use as likely as possible it is recommended that you choose a licence which:
To answer this question you need to think about who is the owner of the research data from your study (the PI? the funder? a consortium? a third-party organisation?) and whether you have the ownership rights to make the data available to others. If you can make the data available to others, will you need to restrict how third parties can use this data? For example, maybe the data can only be used for non-commercial purposes. Maybe you'd like to be cited as the origin of that data every time someone uses the data in the future. All of this can be clarified in the license agreement.
It is imperative that the intellectual property rights (IPR) pertaining to the data are established before any licensing takes place. If your research contains data from third parties (e.g., data from a health or hospital system) you should ensure you have the permission of the rights holder to share this data, or that the data is covered by licences that permit the sharing of data, before you put it in the data repository.
Creative Commons licenses are commonly applied to research data because
There are six different types of Creative Commons license, ranging from the most to least permissive. Creative Commons licenses allow the copyright holder to retain copyright ownership of their works while allowing others to use the work under certain conditions specified by the chosen licence. See a full description of these licences here
An open-source licence is a set of conditions that grants the users of your software certain rights to use, copy, modify, and possibly redistribute the source code or content of the software. It also asserts your authorship. There are several licensing options for open source software, including:
Additional information is available from the Software Sustainability Institute and Open Source Initiative.
Specific software: If potential users of your research data would need access to specific tools to be able to reuse the data, you should indicate that in your DMP and provide sufficient details on what software and what version would be required. If the software is available for download, provide information on where they can access a copy.
File formats: The ability to read your data in the future depends on the file format, so you are strongly encouraged to use standard, exchangeable or open file formats. You can store data in a proprietary format where it is the de facto format within your disciplinary area, or where the format is supported across a range of software (so you are not locked into one type of software). The go-to guidance on file formats is the Library of Congress (LOC) Recommended Format Statement which is updated each year, as this is a constantly evolving topic.
When choosing an electronic file format to create and store data, it's important to consider whether the format is open and/or ubiquitous. The format you use determines how accessible these data are to other users, as some files can only be opened when you have a license to use that software. File format also determines how accessible the data will be to yourself and others into the future: technology evolves quickly, and the software that you use today will become obsolete in time.
Why use open file formats? File formats that are open or non-proprietary will tend to remain accessible, even if the software that created them is no longer available. However, formats which are ubiquitous or have become the default standard within a discipline, whether proprietary or not, are also likely to be maintained into the future.
What if you have a preferred software? If you find it necessary or convenient to work with a proprietary format, it may be useful to store your data using that format for data collection and analysis, while also storing a copy in an open or accessible format for sharing or archiving once your project is complete.
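As a small illustration of keeping an open-format copy alongside a working file, the Python sketch below writes tabular data to CSV using only the standard library. The `write_open_copy` function and the column names in the usage note are hypothetical; how you read the rows out of your own proprietary format depends on the software involved and is not shown.

```python
import csv


def write_open_copy(rows, fieldnames, handle):
    """Write a list of dictionaries to an open file handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()    # first line records the column names
    writer.writerows(rows)  # one line per record
```

For example, opening the target with `open("results.csv", "w", newline="", encoding="utf-8")` and passing that handle together with `["participant", "score"]` as the field names would produce a plain-text file that can be read without any licensed software.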
Which format is best for FAIR data? Many data archives and repositories will already have recommended file formats based on best practice within the disciplines they support.
When choosing a file format you should consider the following:
File formats likely to be accessible into the future (from DMPTool Guidance):
Examples of preferred format choices (from DMPTool Guidance):
For more information on recommended formats, see the UK Data Service guidance on recommended formats.
Indicate whether a persistent identifier (PID) will be pursued for the data. Typically, a trustworthy, long-term repository will provide a persistent identifier.
Persistent identifiers, or PIDs, are the backbone of data citation. If someone wants to replicate your analysis they will need to be able to find the correct copy of the data that you used. By including a persistent identifier in your data citation you enable readers to identify and navigate to the exact version of the data that you used in your research. The persistent identifier is preferable to a less stable reference point such as a URL (website) address, as persistent identifiers are slow to expire and the data is more likely to be findable for many years. There are several types of persistent identifier used to identify datasets, but DOI numbers are most commonly used. Please find more information on DOIs below.
A DOI number is a string of numbers, letters and symbols used to permanently identify an article or document and link to it on the web. DOIs are commonly used to identify a research data resource online, and their strength is that they provide a unique identifier for the file or collection of files and provide an easy way to locate these files online. They are preferable to web addresses (URLs) because, while a URL might change, the DOI will never change. DOIs also tend to be shorter and easier to cite than a web address.
Here's an example of what a DOI looks like in a data citation:
Smith, J., and Jones, P. (2023). Environmental risk factors for autism [dataset]. Dryad Digital Repository [distributor]. doi: 10.1234/abcd123
Many data repositories will assign a persistent identifier to your data once you publish the dataset (or metadata about the dataset) on their platform. For example, on Zenodo data uploads are made available online as soon as you hit publish, and your DOI is registered within seconds.
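When you copy a DOI into a citation or data availability statement, it can help to check that it is well formed and to write it as a resolvable link. The Python sketch below is a minimal example under stated assumptions: the `doi_to_url` helper is hypothetical, and the pattern is a loose sanity check ("10.", a numeric prefix, a slash, then a suffix), not a full implementation of the DOI syntax. It relies on the public resolver at https://doi.org/.

```python
import re

# Loose pattern: "10.", a registrant code of 4-9 digits, "/", then a suffix.
# This is a sanity check only, not a complete validation of DOI syntax.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(raw: str) -> str:
    """Normalise a DOI string and return its https://doi.org/ resolver link."""
    doi = raw.strip()
    # Strip common prefixes such as "doi:" or an existing resolver address.
    doi = re.sub(
        r"^(?:doi:\s*|https?://(?:dx\.)?doi\.org/)", "", doi, flags=re.IGNORECASE
    )
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"Not a recognisable DOI: {raw!r}")
    return f"https://doi.org/{doi}"
```

Passing the placeholder DOI from the citation above, written as `doi: 10.1234/abcd123`, would give a link of the form `https://doi.org/10.1234/abcd123`. The check only looks at the shape of the string and cannot tell you whether the DOI is actually registered.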
RCSI policy on data availability statement:
A statement describing how and on what terms any supporting data may be accessed must be included in published research outputs.
View the RCSI Research Data Management Policy in full
A data availability statement is a short statement at the end of a research article that describes how, where, and under what conditions the data associated with the research article can be accessed. All research articles should include a data availability statement, even when there is no data associated with the article (more on this below), as this is an important step in giving credit to data creators, and in supporting the reproducibility of research.
In journal publications, the data availability statement usually appears at the end of a journal article before the ‘references’ section. The author(s) of the article write the data availability statement, and you should always include this statement in your article prior to submission for publication.
The data availability statement provides clear information on where the data can be accessed, and whether access to the data is open or restricted in some way. It should also provide a digital reference or link to where the data can be found online. Statements to the effect of "data available from authors" or "data will be made available on request" are not acceptable as a data availability statement, as they do not provide sufficient information to genuinely enable access to the data.
A data citation is an entry for a dataset within the reference list of an article, book, conference proceeding, or other document. Data citations are captured by standard citation counting methods if they are included in the reference list. However, it is unfortunately still common for researchers to cite data incorrectly in their reference list, or to omit sufficient information on the source of their research data.
It's important to cite data in your publications, in just the same way you would articles, books, images and websites, as a dataset is a source of evidence to support your argument. The UK Data Service have provided a useful video summarising why it is important to cite data correctly.
The UK Data Service highlights the following benefits of data citation to researchers and to science in general:
Transparency: Citing data is a way of clearly showing exactly which version of which dataset has underpinned or influenced research, as well as crediting those who have made the work possible by collecting the data.
Reproducibility: It helps future researchers to find out which data the researcher has used and enable the research to be reproduced to assess its integrity. Louise Corti, Director of Collections Development and Data Publishing for the UK Data Service, has written a great blog about research reproducibility in qualitative research: Show Me the Data.
Helping track the use of the data: Researchers who [share data] want to know that the data is being used, just like any other researchers want to know that their book or article has been used to support others’ research. In addition, bodies that fund the collection of this data want to know that their funding has produced value. It can also help researchers in gaining further funding for future data collection and analysis. Susan Noble wrote a great post looking at finding out what people have done with data we provide and its impact.
Measuring impact: Researchers want their books, articles and data to make a difference to others, whether this is on future research, influencing policy or positively changing the lives of individuals, communities or society. Citing data, like citing any other research, helps [repositories] in measuring and reporting on this impact.
Source: Spotlight on #CiteTheData: Make the data count – Data Impact blog (ukdataservice.ac.uk)
According to the ICPSR, the elements of a data citation are:
These are the minimum elements required for dataset identification and retrieval. Fewer or additional elements may be requested by author guidelines or style manuals. Be sure to include as many elements as needed to precisely identify the dataset you have used.
Example of published dataset citation with an archive number:
TILDA. (2019). The Irish Longitudinal study on Ageing (TILDA) Wave 4, 2016. [dataset]. Version 4.0. Irish Social Science Data Archive. SN:0053-05. www.ucd.ie/issda/data/tilda/wave3
Example of published dataset citation with a DOI number:
Smith, J., and Jones, P. (2023). Environmental risk factors for autism [dataset]. Dryad Digital Repository [distributor]. doi: 10.1234/abcd123
Example of an unpublished dataset:
Smith, J., and Jones, P. (2023). Environmental risk factors for autism [unpublished raw dataset]. Royal College of Surgeons in Ireland.
Example of published dataset citation from an organisation or research group:
Health Service Executive. (2019). General Referrals by Hospital, Department and Year 2019. [dataset]. HSE Open Data [distributor].
Example of published dataset from individual authors:
Smith, J., and Jones, P. (2023). Environmental risk factors for autism [dataset]. Dryad Digital Repository [distributor]. doi: 10.1234/abcd123
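If you cite many datasets, assembling the citation from its parts can help keep the reference list consistent. The Python sketch below mirrors the layout of the published dataset examples above (authors, year, title, distributor, identifier). The `format_data_citation` function and its parameter names are hypothetical, and your journal's author guidelines or style manual may require a different order or additional elements.

```python
from typing import Optional


def format_data_citation(
    authors: str,
    year: int,
    title: str,
    distributor: str,
    identifier: Optional[str] = None,
) -> str:
    """Assemble a dataset citation in the layout used in the examples above."""
    parts = [
        f"{authors} ({year}).",
        f"{title} [dataset].",
        f"{distributor} [distributor].",
    ]
    if identifier:
        # e.g. "doi: 10.1234/abcd123" or an archive study number
        parts.append(identifier)
    return " ".join(parts)
```

Leaving out the identifier produces a citation that ends at the distributor, as in the organisation example above.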
For a deep dive into Data Citation see: Ball, A. & Duke, M. (2015). ‘How to Cite Datasets and Link to Publications’. DCC How-to Guides. Edinburgh: Digital Curation Centre. Available online: /resources/how-guides
You should include the following three pieces of information in your data availability statement:
Use the following examples to guide you in constructing a data availability statement. Remember to include at a minimum the following three pieces of information:
How accessible are the data? | What to say in your data availability statement: | Example text: |
---|---|---|
Data are openly accessible in a data repository. | The data that support the findings of this study are openly available in [insert repository name] at http://doi.org/ [insert DOI number], dataset reference number [insert reference number]. | Example 1: The data that support the findings of this study are openly available in Zenodo.org at 10.5281/zenodo.3723939 under the terms of the Creative Commons Attribution 4.0 (CC-BY 4.0) license. |
Data are openly available in a repository that does not issue DOIs. | The data that support the findings of this study are openly available in [insert repository name] at [insert URL], reference number [insert reference number assigned to this dataset by the repository]. | Example 1: The data that support the findings of this study are openly available in GEO DataSets at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68849, GEO accession number GDS5660. Data are available under the terms of the Creative Commons Attribution 4.0 (CC-BY 4.0) license. Example 2: NCBI Gene: Ihe1 intestinal helminth expulsion 1 [Mus musculus (house mouse)]. Accession number 107537. Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication). |
Data are derived from public domain resources. | The data that support the findings of this study are available in [insert repository name] at [insert URL or DOI], reference number [insert reference number]. | Example: The datasets that support the findings of this study are openly available in Data.gov.ie under the terms of the Creative Commons Attribution 4.0 (CC-BY 4.0) license at the following locations: COVID-19 HSE Weekly Booster Vaccination Figures: https://data.gov.ie/dataset/covid-19-hse-weekly-booster-vaccination-figures2?package_type=dataset Pobal HP - Deprivation Index Scores - 2016: https://data.gov.ie/dataset/hp-deprivation-index-scores-2016/resource/6480bb69-023c-47f2-813f-8689bacafa54 |
Data were generated at a central, large-scale facility, available upon request. | Raw data were generated at [insert facility name]. Derived data supporting the findings of this study are available from [describe procedure for applying for access to the data]. | Example: Raw data were generated at FutureNeuro at RCSI and Trinity College Dublin. Derived data supporting the findings of this study are available from the corresponding author [G.C.] on request. |
Data are not publicly available, but available to researchers with appropriate credentials in line with consent agreed with respondents. | Due to confidentiality agreements, access to the data that support the findings of this study is restricted to bona fide researchers and is subject to a non-disclosure agreement. Details of the data and how to request access are available from [insert repository where data reside / name of data manager at host institution]. | Example: The Anonymised Microdata Files (AMF) for the Growing Up in Ireland Child Cohort (9 years) data are available via the Irish Social Science Data Archive (ISSDA) for bona fide research purposes only and are subject to an end user agreement. Details of the data and how to request access are available at https://www.ucd.ie/issda/data/growingupinirelandgui/ |
Data are not publicly available to protect anonymity of participants, although some controlled access is allowed. | The data that support the findings of this study are not publicly available due to [describe reason for access restriction, and procedure for applying for access to the data and the conditions under which access will be granted]. | Example: The data that support the findings of this study are not publicly available due to restrictions outlined in consent agreements with participants and the identifying nature of the data. Data can be made available upon reasonable request and in line with the consent agreed with participants, by contacting the authors [C.G. and P. O'H.]. |
Data are not publicly available but are available on request, due to privacy/ethical restrictions. | The data that support the findings of this study are not publicly available due to [describe reason for non-sharing of data]. | Example: Given the sensitive and identifying nature of the data, and in line with the consent agreed with participants, the data that support the findings of this study are not publicly available. |
Data are currently embargoed due to commercial restrictions (e.g. to allow time for commercialization). | The data that support the findings will be available in [repository name] at [URL / DOI link] following a [6 month] embargo from the date of publication to allow for commercialization of research findings. | Example: The data that support the findings of this study will be available in Zenodo.org at 10.5281/zenodo.3723939 from early 2023, following a 6 month embargo from the date of completion of the study, to allow for commercialization of research findings. |
Data are restricted by commercial, industry, patent, government policies, regulations, or laws. | Due to the nature of the research, and for [ethical/legal/commercial] reasons, supporting data is not available. [If known, describe procedure for applying for access to the data and the conditions under which access will be granted.] | Example: Due to commercial restrictions, the Drug Distribution Dataset used in this study is not publicly available. Access to the data can be requested by completing the Data Request form at www.allianceheathcaresample.com/data. |
Data are available within the article or its supplementary materials. | The authors confirm that the data supporting the findings of this study are available within the article [and/or] its supplementary materials. | Example 1: The data supporting the findings of this study are available in the supplementary material (Appendix A) of this article. |
Data are subject to third party restrictions. | The data that support the findings of this study are available from [third party]. Restrictions apply to the availability of these data, which were used under license for this study. Data are available from [the authors / at URL] [describe procedure you used to access the data]. | Example: The Health data from the Quarterly National Household Survey Q3-2010 are made available by the Central Statistics Office. Restrictions apply to the availability of QNHS data, which were used under license for this study. Data are available from the Irish Social Science Data Archive at https://www.ucd.ie/issda/data/qnhsmodules/, ISSDA study number 00041-00. Access can be requested by completing an ISSDA Data Request Form for Research. |
Publication did not use any data. | It's important to include this information, even if there is no data underpinning the article, for clarity. | Example 1: No data was used for the research described in the article. Example 2: No data are associated with this article. |
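The template wording for openly available data can also be filled in programmatically, which may be useful if you prepare statements for several outputs. The Python sketch below is a minimal example that follows the wording of the first Zenodo example in the table. The `open_data_statement` function and its parameter names are hypothetical, and the other access scenarios in the table would each need their own wording.

```python
def open_data_statement(repository: str, location: str, licence: str) -> str:
    """Build a data availability statement for openly available data."""
    return (
        "The data that support the findings of this study are openly available "
        f"in {repository} at {location} under the terms of the {licence} license."
    )
```

Calling it with "Zenodo.org", "10.5281/zenodo.3723939" and "Creative Commons Attribution 4.0 (CC-BY 4.0)" reproduces the Zenodo example text from the table.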
For advice on constructing the data availability statement for data types that are commonly used in the health sciences (e.g., 3D-printable models, chemical and macromolecular structures, neuroimaging data, sequence and 'omics data) please view the author guidance from Health Open Research: https://healthopenresearch.org/for-authors/data-guidelines
If your research involved the use or development of new software, you should include a software availability statement. Your software availability statement should include the name of the repository where the source code at the time of publication (the archived version) is available, a DOI number for the archived software, and details of the license under which the software can be used. You should use an open-source license approved by the Open Source Initiative (OSI) if possible, which allows software to be freely used, modified, and shared.
Now that you have reached the conclusion of your research study, to ensure your data are FAIR, you have:
Published your data in a repository / archive which has provided an identifier for the published data.
Added rich metadata about the data to the repository / archive.
Attached a license to your data, so it is clear how a new user can use the data in a new work.
Clearly explained any access restriction in the metadata and given clear guidance on how to request access.
Provided a data citation in the metadata, including the important identifier and used this citation in your publications.
Provided a data availability statement in all of your publications.
Together, your data citation and data availability statement should look something like this:
Smith, J., and Jones, P. (2023). Environmental risk factors for autism [dataset]. Dryad Digital Repository [distributor]. doi: 10.1234/abcd123. The data that support the findings of this study are openly available in Dryad Digital Repository at doi: 10.1234/abcd123 under the terms of the Creative Commons Attribution 4.0 (CC-BY 4.0) license.
File formats for preservation
Licensing data
There are several free-to-use tools online to help you find a suitable license for your research data and/or software.