LibGuides: Research Data Management: Ending your research project

Choosing a data repository

There are many benefits to putting your data in a data repository, and the repository can provide you with many of the following services:

Persistent identifier (such as a Digital Object Identifier (DOI)) assigned to your data
Assistance with metadata, for example the data repository will usually provide recommendations or templates for creating metadata about your data
Licensing of your data, for example the data repository will usually provide recommendations or options for selecting a data licence
Long-term access to the data and, in some cases, long-term preservation
Search and navigation tools, and sometimes visualisation tools for data, which can help with making your data findable.
Your data is more likely to reach a wide audience of new users from anywhere in the world
If the data requires controlled access, some repositories can manage access requests on behalf of the owner of the data
In other words, data repositories can help you to make your data more FAIR.

When choosing a data repository, always start by looking for a broadly recognised, discipline-specific or certified repository in your scientific field. If you cannot find such a repository, or if you're unsure whether you've found a good home for your data, you can use the following assessment criteria, which we have adapted from Science Europe's Practical Guide to the International Alignment of Research Data Management - Extended Edition.

In certain cases, publishers or funders may specify which data repository you must use to deposit your data. However, in most cases you will have to identify a suitable home for your data yourself.

There are several resources to help you locate a suitable data repository:

re3data.org The Registry of Research Data Repositories is a directory of more than 2,000 data repositories that meet established standards. It is recommended by Horizon Europe for locating an optimal repository for your data.
https://fairsharing.org/ FAIRsharing gathers details about repositories, which you can filter by subject, domain and taxonomy.
http://www.researchpipeline.com/ Research Pipeline is a privately-maintained list of repositories, including 140 disciplinary databases. This site is updated less often than the above.

As you review potential repositories, ask the following questions to assess their suitability:

Is it reputable? For example, is it listed in re3data, thereby meeting their conditions of inclusion?
Is it appropriate to my discipline?
Does it accept the type of data I want to deposit?
Is there a size limit on how much data I can deposit?
Is there a charge to deposit – even a one-off fee?
Will it provide a persistent identifier for my data such as a DOI number?
Does it provide guidance to new users on how the data should be cited?
Does it provide access control for my research data?
Does it ensure the data will be preserved long term (for the foreseeable future) or is there a time limit on the repository?
Does it provide expert help e.g. metadata provision, curation?

See also the RCSI guide on "Where to submit data" created in collaboration with the Consortium of National and University Librarians (CONUL) for more information. https://drive.google.com/file/d/1S8Qc3cDdfziDdwW5ACRA59y2FQuMjMsm/view

If you do not have a suitable, discipline-specific repository for your data you can deposit your data in a generalist data repository. This type of repository will accept a variety of data types and file types, and most have the facility to assign a Persistent Identifier (PID) to published data.

Examples of multidisciplinary data repositories:

Dryad Digital Repository The Dryad Digital Repository is a curated resource that makes the data underlying scientific publications discoverable, freely reusable, and citable. Dryad accepts data from any field and in any format, and has dedicated curators to check your files before they are released, and help you follow best practices.

Figshare Figshare is a repository where users can easily upload files up to 5GB to make all of their research outputs available in a citable, shareable and discoverable manner. Any file format is accepted and DOIs are provided. The RCSI Repository uses Figshare and all entries to the Repository are automatically included as part of Figshare, with a 25GB default storage limit.

Zenodo Zenodo was built and is operated by CERN and OpenAIRE to ensure that everyone can join in Open Science. It welcomes research from all over the world, and from every discipline. Every upload is assigned a DOI, to make it citable and trackable.

If your research involved the use or development of new software, you should make the source code available on a Version Control System (VCS) such as GitHub or Bitbucket. However, these sites do not guarantee the long-term preservation of your code or make it citable, so you should also upload a permanent, archived version of the source code to an approved repository. For example, GitHub is integrated with Zenodo, and Zenodo can register a DOI for the archived source code.
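One common way to make code in a GitHub repository easier to cite is to add a CITATION.cff file (Citation File Format) to the repository root, which GitHub uses to offer a "Cite this repository" option. The sketch below is illustrative only: the function name render_citation_cff and the example values are assumptions made for this guide, and you should check the Citation File Format documentation for the full list of fields.

```python
import json


def render_citation_cff(title, authors, version, doi, date_released, license_id):
    """Render a minimal CITATION.cff (Citation File Format 1.2.0) document.

    authors is a list of (family_names, given_names) tuples.
    Values are quoted with json.dumps, which produces strings that are
    also valid YAML double-quoted strings.
    """
    q = json.dumps
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f"title: {q(title)}",
        "authors:",
    ]
    for family, given in authors:
        lines.append(f"  - family-names: {q(family)}")
        lines.append(f"    given-names: {q(given)}")
    lines += [
        f"version: {q(version)}",
        f"doi: {q(doi)}",                      # DOI of the archived release
        f"date-released: {q(date_released)}",  # ISO date, YYYY-MM-DD
        f"license: {q(license_id)}",           # SPDX identifier, e.g. MIT
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values only; replace with the details of your own release.
    print(render_citation_cff(
        title="Example analysis code",
        authors=[("Smith", "J."), ("Jones", "P.")],
        version="1.0.0",
        doi="10.1234/abcd123",
        date_released="2023-01-15",
        license_id="MIT",
    ))
```

Save the printed text as CITATION.cff in the top-level folder of your repository, and update the version and DOI each time you archive a new release.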

Licensing your data

What does it mean to license your data? In the Data Management Plan you might see a question such as,

How will other legal issues, such as intellectual property rights and ownership, be managed? (Science Europe DMP template)

A license agreement is a legal arrangement between the creator/depositor of the data and the data repository, stating clear re-use rights to help others understand what they are allowed to do with your data. To make re-use as likely as possible, it is recommended that you choose a licence which:

Makes data available to the widest audience possible
Makes the widest range of uses possible

To answer this question you need to think about who is the owner of the research data from your study (the PI? the funder? a consortium? a third-party organisation?) and whether you have the ownership rights to make the data available to others. If you can make the data available to others, will you need to restrict how third parties can use this data? For example, maybe the data can only be used for non-commercial purposes. Maybe you'd like to be cited as the origin of that data every time someone uses the data in the future. All of this can be clarified in the license agreement.

It is imperative that the intellectual property rights (IPR) pertaining to the data are established before any licensing takes place. If your research contains data from third parties (e.g., data from a health or hospital system) you should ensure you have the permission of the rights holder to share this data, or that the data is covered by licences that permit the sharing of data, before you put it in the data repository.

Creative Commons licenses are commonly applied to research data because

a CC license gives you a way to grant others permission to use your data under copyright law, and
a CC license gives clarity to new users of the data what they are allowed to do with the data

There are six different types of Creative Commons license, ranging from the most to the least permissive. Creative Commons licenses allow the copyright holder to retain copyright ownership of their works while allowing others to use the work under certain conditions specified by the chosen licence. A full description of these licences is available on the Creative Commons website (https://creativecommons.org/).

CC BY Attribution
Users can distribute, remix, tweak, and build upon a work, even commercially, as long as they give credit to the original creator of the work.
CC BY-SA Attribution-ShareAlike
Users can remix, tweak, and build upon a work even for commercial purposes, as long as they credit the original creator and license any new creations under identical terms.
CC BY-ND Attribution-NoDerivs
Users can copy and redistribute the material in any medium or format for any purpose, even commercially, as long as it is passed along unchanged and credit is given to the original creator.
CC BY-NC Attribution-NonCommercial
Users can copy and redistribute the material in any medium or format and remix, transform, and build upon the material but any new works must be non-commercial and give credit to the original creator.
CC BY-NC-SA Attribution-NonCommercial-ShareAlike
Users can copy and redistribute the material in any medium or format and remix, transform, and build upon the material but any new works must be non-commercial, give credit to the original creator and be licensed under identical terms.
CC BY-NC-ND Attribution-NonCommercial-NoDerivs
This licence is the most restrictive. Users can copy and redistribute the material, but they must credit the original creator and cannot change the work in any way or use it commercially.
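
The conditions that separate the six licences can be read as three yes/no choices: may others use the data commercially, may they adapt it, and must adaptations carry the same licence? The short sketch below turns those choices into a licence code. It is an illustration only: the function name choose_cc_licence and its parameters are invented for this guide and are not part of any Creative Commons tool.

```python
def choose_cc_licence(allow_commercial=True, allow_derivatives=True, share_alike=False):
    """Return the Creative Commons licence code matching three reuse choices.

    Every one of the six licences requires attribution (BY), so the code
    always starts with "BY" and adds NC, ND or SA as needed.
    """
    if share_alike and not allow_derivatives:
        # ShareAlike governs adapted works, so it cannot be combined with NoDerivs.
        raise ValueError("share_alike requires allow_derivatives=True")

    parts = ["BY"]
    if not allow_commercial:
        parts.append("NC")
    if not allow_derivatives:
        parts.append("ND")
    elif share_alike:
        parts.append("SA")
    return "CC " + "-".join(parts)


if __name__ == "__main__":
    print(choose_cc_licence())                                    # CC BY
    print(choose_cc_licence(share_alike=True))                    # CC BY-SA
    print(choose_cc_licence(allow_commercial=False))              # CC BY-NC
    print(choose_cc_licence(allow_commercial=False,
                            allow_derivatives=False))             # CC BY-NC-ND
```

A helper like this only narrows down the options; the choice of licence still depends on who owns the data and on any funder or third-party conditions.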

There are several free-to-use tools online to help you find a suitable license for your research data and/or software:

The Digital Curation Centre (DCC) has also created a guide on "How to License Research Data"

An open-source licence is a set of conditions that grants the users of your software certain rights to use, copy, modify, and possibly redistribute the source code or content of the software. It also asserts your authorship. There are several licensing options for open source software, including:

MIT License – permits any person to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software as long as a copy of the license notice is included with any reuse
GNU General Public License – users can copy, distribute, and modify the software as long as any modifications are also licensed under the GPL
Apache License 2.0 – allows users to use the software for any purpose, to distribute it, to modify it, and to distribute modified versions of the software as long as a copy of the license is redistributed with any modified software

Additional information is available from the Software Sustainability Institute and Open Source Initiative.

Data citation

A data citation is an entry for a dataset within the reference list of an article, book, conference proceeding, or other document. Data citations are captured by standard citation counting methods if they are included in the reference list. However, it is still common for researchers to leave datasets out of their reference list, or to give too little information about the source of their research data.

Why does data citation matter?

It's important to cite data in your publications, in just the same way you would articles, books, images and websites, as a dataset is a source of evidence to support your argument. The UK Data Service have provided a useful video summarising why it is important to cite data correctly.

The UK Data Service highlights the following benefits of data citation to researchers and to science in general:

Transparency: Citing data is a way of clearly showing exactly which version of which dataset has underpinned or influenced research, as well as crediting those who have made the work possible by collecting the data.

Reproducibility: It helps future researchers to find out which data the researcher has used and enable the research to be reproduced to assess its integrity. Louise Corti, Director of Collections Development and Data Publishing for the UK Data Service, has written a great blog about research reproducibility in qualitative research: Show Me the Data.

Helping track the use of the data: Researchers who [share data] want to know that the data is being used, just like any other researchers want to know that their book or article has been used to support others’ research. In addition, bodies that fund the collection of this data want to know that their funding has produced value. It can also help researchers in gaining further funding for future data collection and analysis. Susan Noble wrote a great post looking at finding out what people have done with data we provide and its impact.

Measuring impact: Researchers want their books, articles and data to make a difference to others, whether this is on future research, influencing policy or positively changing the lives of individuals, communities or society. Citing data, like citing any other research, helps [repositories] in measuring and reporting on this impact.

Source: Spotlight on #CiteTheData: Make the data count – Data Impact blog (ukdataservice.ac.uk)

According to the ICPSR, the elements of a data citation are:

Author: Name(s) of each individual or organizational entity responsible for the creation of the dataset.
Date of Publication: Year the dataset was published or disseminated.
Title: Complete title of the dataset, including the edition or version number, if applicable.
Publisher and/or Distributor: Organizational entity that makes the dataset available by archiving, producing, publishing, and/or distributing the dataset.
Electronic Location or Identifier: Web address or unique, persistent, global identifier used to locate the dataset (such as a DOI). Append the date retrieved if the title and locator are not specific to the exact instance of the data you used.

These are the minimum elements required for dataset identification and retrieval. Fewer or additional elements may be requested by author guidelines or style manuals. Be sure to include as many elements as needed to precisely identify the dataset you have used.
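
To show how the elements above combine into a single reference, here is a small sketch that assembles a citation in the style of the examples used in this guide. The function name format_data_citation and the layout it produces are assumptions for illustration; always follow the citation style required by your journal or style manual.

```python
def format_data_citation(authors, year, title, publisher, identifier, version=None):
    """Assemble a dataset citation from its minimum elements.

    authors    -- list of names, e.g. ["Smith, J.", "Jones, P."] or ["TILDA"]
    identifier -- persistent identifier or web address, e.g. "doi: 10.1234/abcd123"
    version    -- optional edition or version number of the dataset
    """
    if not authors:
        raise ValueError("at least one author or organisation is required")

    # "A", "A, and B", or "A, B, and C"
    if len(authors) == 1:
        author_text = authors[0]
    else:
        author_text = ", ".join(authors[:-1]) + ", and " + authors[-1]
    # Organisation names do not end in an initial, so add the full stop.
    if not author_text.endswith("."):
        author_text += "."

    citation = f"{author_text} ({year}). {title} [dataset]. "
    if version:
        citation += f"Version {version}. "
    citation += f"{publisher} [distributor]. {identifier}"
    return citation


if __name__ == "__main__":
    print(format_data_citation(
        authors=["Smith, J.", "Jones, P."],
        year=2023,
        title="Environmental risk factors for autism",
        publisher="Dryad Digital Repository",
        identifier="doi: 10.1234/abcd123",
    ))
```

Run as a script, this prints: Smith, J., and Jones, P. (2023). Environmental risk factors for autism [dataset]. Dryad Digital Repository [distributor]. doi: 10.1234/abcd123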

Examples of data citations:

Example of published dataset citation with an archive number:	TILDA. (2019). The Irish Longitudinal study on Ageing (TILDA) Wave 4, 2016. [dataset]. Version 4.0. Irish Social Science Data Archive. SN:0053-05. www.ucd.ie/issda/data/tilda/wave3
Example of published dataset citation with a DOI number:	Smith, J., and Jones, P. (2023). Environmental risk factors for autism [dataset]. Dryad Digital Repository [distributor]. doi: 10.1234/abcd123
Example of an unpublished dataset:	Smith, J., and Jones, P. (2023). Environmental risk factors for autism [unpublished raw dataset]. Royal College of Surgeons in Ireland.
Example of published dataset citation from an organisation or research group:	Health Service Executive. (2019). General Referrals by Hospital, Department and Year 2019. [dataset]. HSE Open Data [distributor].
Example of published dataset from individual authors:	Smith, J., and Jones, P. (2023). Environmental risk factors for autism [dataset]. Dryad Digital Repository [distributor]. doi: 10.1234/abcd123

For a deep dive into data citation see: Ball, A. & Duke, M. (2015). ‘How to Cite Datasets and Link to Publications’. DCC How-to Guides. Edinburgh: Digital Curation Centre. Available online: /resources/how-guides

Persistent identifiers are the backbone of the data citation. If someone wants to replicate your analysis, they will need to be able to find the correct copy of the data that you used. By including a persistent identifier in your data citation, you enable readers to identify and navigate to the exact version of the data that you used in your research. A persistent identifier is preferable to a less stable reference point such as a URL (website address), because it is designed to keep pointing to the data even if the data moves to a new web location, so the data is more likely to remain findable for many years. There are several types of persistent identifier used to identify datasets, but DOIs are the most commonly used.
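
A DOI can be written in several ways (for example "doi: 10.1234/abcd123" or as a full web address), but each form can be turned into the same resolvable link by placing the DOI after https://doi.org/. The sketch below shows one way to do this. The function name doi_to_url is made up for this guide, and the pattern used to check the DOI is a simplified assumption that covers typical modern DOIs rather than every valid one.

```python
import re

# Simplified check: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Common ways a DOI is prefixed in citations and reference lists.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi):
    """Return the https://doi.org/ link for a DOI written in any common form."""
    cleaned = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):].strip()
            break
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"not a recognisable DOI: {doi!r}")
    return "https://doi.org/" + cleaned


if __name__ == "__main__":
    print(doi_to_url("doi: 10.1234/abcd123"))     # placeholder DOI used in this guide
    print(doi_to_url("10.5281/zenodo.3723939"))
```

Writing the DOI as a full https://doi.org/ link in your citation means readers can click straight through to the dataset.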

Linking your publication and your data

Once your article or report is published, you should update your repository record with the DOI of your publication. This way the published research article and the underlying data in the repository are linked to each other.

Data availability statement

A data availability statement is a short statement at the end of a research article that describes how, where, and under what conditions the data associated with the research article can be accessed. All research articles should include a data availability statement, even when there is no data associated with the article, as this is an important step in giving credit to data creators, and in supporting the reproducibility of research (more on this below).

In journal publications, the data availability statement usually appears at the end of a journal article before the ‘references’ section. The author(s) of the article write the data availability statement, and you should always include this statement in your article prior to submission for publication.

The data availability statement provides clear information on where the data can be accessed, and whether access to the data is open or restricted in some way. It should also provide a digital reference or link to where the data can be found online. Statements to the effect of "data available from authors" or "data will be made available on request" are not acceptable, as they do not provide sufficient information to genuinely enable access to the data.

You should include the following three pieces of information in your data availability statement:

Location of data: If your study involved collecting or producing new data, you should upload this data to a suitable online data repository. All of the data should be stored together as a single dataset, ideally in a domain-specific repository for your area of research. In your data availability statement, you then name the repository where the data is located. If your study involved re-using data that was collected or produced by a third party, you should provide information on where this data can be accessed.
Identifier for data: Ideally, you should provide a persistent identifier (PID), which is a long-lasting digital reference to a document, file, web page, or other object online, and is more stable than a URL. When you provide a persistent identifier, such as a DOI, it is much easier for the reader to locate your data online. Usually, once you upload your data to a data repository and hit the 'publish' button, a unique and persistent identifier is assigned to the dataset. It's important to include a persistent identifier in your data availability statement, as this helps the reader find the exact dataset you're referring to.
License information: It's important to apply a license to your research data, as this makes it clear what somebody else can do with this data. Data repositories often prompt you to choose from a range of Creative Commons license options. For example, if you want to enable others to use, adapt, or build on your work, while giving you appropriate credit for the data, then you might apply a Creative Commons Attribution (CC-BY) license. If you want to enable others to use your data, but don't want it to be used commercially, you might apply a Creative Commons Non Commercial (CC BY-NC) license. For the full list of options see https://creativecommons.org/
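
The three pieces of information above (location, identifier and license) can be slotted into a standard sentence for openly available data. The sketch below is a minimal illustration: the function name data_availability_statement is invented for this guide, and the wording is only one of many acceptable templates for open data with a DOI.

```python
def data_availability_statement(repository, doi, licence):
    """Build a statement for openly available data with a DOI.

    repository -- name of the repository holding the data (location)
    doi        -- the dataset DOI without any prefix, e.g. "10.1234/abcd123"
    licence    -- full licence name, e.g.
                  "Creative Commons Attribution 4.0 (CC-BY 4.0)"
    """
    # All three pieces of information are required for a complete statement.
    for label, value in (("repository", repository), ("doi", doi), ("licence", licence)):
        if not value or not value.strip():
            raise ValueError(f"missing {label}")

    return (
        "The data that support the findings of this study are openly available "
        f"in {repository.strip()} at https://doi.org/{doi.strip()} "
        f"under the terms of the {licence.strip()} license."
    )


if __name__ == "__main__":
    # Placeholder values taken from the examples in this guide.
    print(data_availability_statement(
        repository="Dryad Digital Repository",
        doi="10.1234/abcd123",
        licence="Creative Commons Attribution 4.0 (CC-BY 4.0)",
    ))
```

If access to your data is restricted, embargoed or not possible, use wording that explains the restriction and how to request access instead of this open-data sentence.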

How accessible are the data?	What to say in your data availability statement:	Example text:
Data are openly accessible in data repository.	The data that support the findings of this study are openly available in [insert repository name] at http://doi.org/ [insert DOI number], dataset reference number [insert reference number].	Example 1: The data that support the findings of this study are openly available in Zenodo.org at 10.5281/zenodo.3723939 under the terms of the Creative Commons Attribution 4.0 (CC-BY 4.0) license. Example 2: Repository: An atom-efficient, single-source precursor route to plasmonic CuS quantum dots. https://doi.org/10.5256/repository.4591.d34639. Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).
Data are openly available in a repository that does not issue DOIs.	The data that support the findings of this study are openly available in [insert repository name] at [insert URL], reference number [insert reference number assigned to this dataset by the repository].	Example 1: The data that support the findings of this study are openly available in GEO DataSets at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68849, GEO accession number GDS5660. Data are available under the terms of the Creative Commons Attribution 4.0 (CC-BY 4.0) license. Example 2: NCBI Gene: Ihe1 intestinal helminth expulsion 1 [Mus musculus (house mouse)]. Accession number 107537. Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).
Data are derived from public domain resources.	The data that support the findings of this study are available in [insert repository name] at [insert URL or DOI], reference number [insert reference number].	Example: The datasets that support the findings of this study are openly available in Data.gov.ie under the terms of the Creative Commons Attribution 4.0 (CC-BY 4.0) license at the following locations: COVID-19 HSE Weekly Booster Vaccination Figures: https://data.gov.ie/dataset/covid-19-hse-weekly-booster-vaccination-figures2?package_type=dataset Pobal HP - Deprivation Index Scores - 2016: https://data.gov.ie/dataset/hp-deprivation-index-scores-2016/resource/6480bb69-023c-47f2-813f-8689bacafa54
Data were generated at a central, large-scale facility, available upon request.	Raw data were generated at [insert facility name]. Derived data supporting the findings of this study are available from [describe procedure for applying for access to the data].	Example: Raw data were generated at FutureNeuro at RCSI and Trinity College Dublin. Derived data supporting the findings of this study are available from the corresponding author [G.C.] on request.
Data are not publicly available, but available to researchers with appropriate credentials in line with consent agreed with respondents.	Due to confidentiality agreements, access to the data that support the findings of this study is restricted to bona fide researchers and is subject to a non-disclosure agreement. Details of the data and how to request access are available from [insert repository where data reside / name of data manager at host institution].	Example: The Anonymised Microdata Files (AMF) for the Growing Up in Ireland Child Cohort (9 years) data is available via the Irish Social Science Data Archive, ISSDA for bona fide research purposes only and is subject to an end user agreement. Details of the data and how to request access are available at https://www.ucd.ie/issda/data/growingupinirelandgui/
Data are not publicly available to protect anonymity of participants, although some controlled access is allowed.	The data that support the findings of this study are not publicly available due to [describe reason for access restriction, and procedure for applying for access to the data and the conditions under which access will be granted].	Example: The data that support the findings of this study are not publicly available due to restrictions outlined in consent agreements with participants and the identifying nature of the data. Data can be made available upon reasonable request and in line with the consent agreed with participants, by contacting the authors [C.G. and P. O'H.]
Data are not publicly available but are available on request, due to privacy/ethical restrictions.	The data that support the findings of this study are not publicly available due to [describe reason for non-sharing of data].	Example: Given the sensitive and identifying nature of the data, and in line with the consent agreed with participants, the data that support the findings of this study are not publicly available.
Data are currently embargoed due to commercial restrictions (e.g. to allow time for commercialization).	The data that support the findings will be available in [repository name] at [URL / DOI link] following a [6 month] embargo from the date of publication to allow for commercialization of research findings.	Example: The data that support the findings of this study will be available in Zenodo.org at 10.5281/zenodo.3723939 from early 2023, following a 6 month embargo from the date of completion of the study, to allow for commercialization of research findings.
Data are restricted by commercial, industry, patent, government policies, regulations, or laws.	Due to the nature of the research and [ethical/legal/commercial] restrictions, supporting data are not available. [If known, describe procedure for applying for access to the data and the conditions under which access will be granted.]	Example: Due to commercial restrictions, the Drug Distribution Dataset used in this study is not publicly available. Access to the data can be requested by completing the Data Request form at www.allianceheathcaresample.com/data.
Data are available within the article or its supplementary materials.	The authors confirm that the data supporting the findings of this study are available within the article [and/or] its supplementary materials.	Example 1: The data supporting the findings of this study are available in the supplementary material (Appendix A) of this article. Example 2: All data underlying the results are available as part of the article and no additional source data are required.
Data are subject to third party restrictions.	The data that support the findings of this study are available from [third party]. Restrictions apply to the availability of these data, which were used under license for this study. Data are available from [the authors / at URL] [describe procedure you used to access the data]	Example: The Health data from the Quarterly National Household Survey Q3-2010 are made available by the Central Statistics Office. Restrictions apply to the availability of QNHS data, which were used under license for this study. Data are available from the Irish Social Science Data Archive at https://www.ucd.ie/issda/data/qnhsmodules/, ISSDA study number 00041-00. Access can be requested by completing an ISSDA Data Request Form for Research.
Publication did not use any data.	It's important to include this information, even if there is no data underpinning the article, for clarity	Example 1: No data was used for the research described in the article. Example 2: No data are associated with this article.

For advice on constructing the data availability statement for data types that are commonly used in the health sciences (e.g., 3D-printable models, chemical and macromolecular structures, neuroimaging data, sequence and 'omics data) please view the author guidance from Health Open Research: https://healthopenresearch.org/for-authors/data-guidelines

If your research involved the use or development of new software, you should include a software availability statement. Your software availability statement should include the name of the repository where the source code at the time of publication (the archived version) is available, a DOI for the archived software, and details of the license under which the software can be used. If possible, you should use an open-source license approved by the Open Source Initiative (OSI), which allows software to be freely used, modified, and shared.

Putting it all together

Now that you have reached the conclusion of your research study, to ensure your data are FAIR, you have:

published your data in a repository / archive which has provided an identifier for the published data
added rich metadata about the data to the repository / archive
attached a license to your data, so it is clear how a new user can use the data in a new work
clearly explained any access restriction in the metadata and given clear guidance on how to request access
provided a data citation in the metadata, including the important identifier, and used this citation in your publications
provided a data availability statement in all of your publications

Together, your data citation and data availability statement should look something like this:
Smith, J., and Jones, P. (2023). Environmental risk factors for autism [dataset]. Dryad Digital Repository [distributor]. doi: 10.1234/abcd123
The data that support the findings of this study are openly available in Dryad Digital Repository at doi: 10.1234/abcd123 under the terms of the Creative Commons Attribution 4.0 (CC-BY 4.0) license.