Next in your DMP, you should describe how the data will be organised and quality assured, so that they are genuinely usable. You should describe the metadata that will be stored with the data, which ensures the data can be understood and used into the future. There are two questions to be addressed in this section: what documentation and metadata will accompany the data, and how will the files be organised, named and versioned?
When you open a file containing data, there are several pieces of information that you would need to fully understand what you are looking at. For example, you might like to know who created the data, when and how they were collected, what each variable or field means, and what units of measurement were used.
You might also benefit from knowing a little about the context within which the data were created, why these data were captured, and what tools were used to generate them. All of the above information, and more, is typically captured in the metadata. Metadata is a structured way of describing something and is an invaluable tool for making digital resources findable and understandable.
Adding descriptive information to your research data ensures that others can understand the data and re-use it in new research, or replicate and validate your results. Metadata can also help you to navigate and understand your own data files, as a large amount of contextual information can be lost to the creators of the data in the months and years following data collection. So it is really important to capture and store this contextual information with your data, to ensure the accuracy of your own research.
To summarise, metadata provides a structured description of something, usually a digital resource. Metadata ensures data remain usable, as it:
makes the data findable
allows the data to be understood and interpreted correctly
supports re-use, replication and validation of results
What information should I capture as metadata? The key information to document about each dataset is outlined in the readme checklist later in this section (adapted from DMPTool).
When should I create the metadata? When it comes to adding metadata, it's a good idea to start early. Do not leave the metadata until the very end of your project; start planning how you’ll document the data as early as possible. Remember to include procedures for documentation in your data management planning.
Metadata should be open access
Even when you cannot share data openly in a data repository, the metadata should be openly accessible online. You can provide lots of descriptive information about the qualities of your dataset, without the need to share the data publicly. By sharing the metadata openly, even controlled data can be FAIR data.
Is data documentation the same as metadata?
Documentation is indeed one approach to creating metadata, as it involves describing or annotating your data. Below are two common examples of data documentation used to describe tabular research data.
Data dictionary: This document outlines the structure, content, and meaning of each variable in your dataset. This includes the type of data being collected (e.g. free text, numerical, categorical or grouped data), the full wording of the question used to generate the data, and the allowable values and what those values mean (e.g. 0 = no high blood pressure diagnosis, 1 = borderline high blood pressure, 2 = high blood pressure). For example, in REDCap, the data dictionary is a CSV file containing information on the variables and the structure of the REDCap database (adapted from NLM definition of data dictionary).
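For illustration, a few rows of a simple data dictionary for the blood pressure example above might look like this (the variable names and column layout are hypothetical, not a real REDCap export):

variable_name,field_type,field_label,allowed_values
bp_status,categorical,"Blood pressure diagnosis","0 = no high blood pressure diagnosis; 1 = borderline high blood pressure; 2 = high blood pressure"
age,numerical,"Age of participant in years","18-110"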
Data codebook: Similar to the data dictionary, the data codebook is a human-readable document that provides information on each data element, including: variable name, variable label, values, value labels, summary statistics and missing data. Many statistical software tools, like SPSS, can generate a codebook at the touch of a button.
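If your software does not offer this, a basic codebook can be assembled with a short script. The sketch below uses Python and pandas; the input file name is hypothetical, and the columns reported are a minimal subset of what a full codebook would contain.

import pandas as pd

def make_codebook(df):
    # One row per variable: name, type, missingness and example values
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "variable": col,
            "dtype": str(s.dtype),
            "n_missing": int(s.isna().sum()),
            "n_unique": int(s.nunique(dropna=True)),
            "example_values": "; ".join(map(str, s.dropna().unique()[:5])),
        })
    return pd.DataFrame(rows)

df = pd.read_csv("survey_data.csv")  # hypothetical dataset
make_codebook(df).to_csv("codebook.csv", index=False)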
A simple way to start creating metadata is to write a readme file, which is the most basic form of metadata. The readme file should be saved as a plain text file alongside your data.
What do I include in my readme file? The readme file should contain key information to enable the use and interpretation of the data, including provenance information and any license information on how the data may be used. DMPTool provides the following list of information to include as metadata (a skeleton readme based on this list follows the Access items below):
General Overview
Title: Name of the dataset or research project that produced it
Creator: Names and addresses of the organizations or people who created the data; preferred format for personal names is surname first (e.g., Smith, Jane)
Identifier: Unique number used to identify the data, even if it is just an internal project reference number
Date: Key dates associated with the data, including: project start and end date; release date; time period covered by the data; and other dates associated with the data lifespan, such as maintenance cycle, update schedule; preferred format is yyyy-mm-dd, or yyyy-mm-dd to yyyy-mm-dd for a range
Method: How the data were generated, listing equipment and software used (including model and version numbers), formulae, algorithms, experimental protocols, and other things one might include in a lab notebook
Processing: How the data have been altered or processed (e.g., normalized)
Source: Citations to data derived from other sources, including details of where the source data is held and how it was accessed
Funder: Organizations or agencies who funded the research
Content Description
Subject: Keywords or phrases describing the subject or content of the data
Place: All applicable physical locations
Language: All languages used in the dataset
Variable list: All variables in the data files, where applicable
Code list: Explanation of codes or abbreviations used in either the file names or the variables in the data files (e.g. "999 indicates a missing value in the data")
Technical Description
File inventory: All files associated with the project, including extensions (e.g. "NWPalaceTR.WRL", "stone.mov")
File formats: Formats of the data, e.g., FITS, SPSS, HTML, JPEG, etc.
File structure: Organization of the data file(s) and layout of the variables, where applicable
Version: Unique date/time stamp and identifier for each version
Checksum: A digest value computed for each file that can be used to detect changes; if a recomputed digest differs from the stored digest, the file must have changed (a short example of computing a checksum follows this list)
Necessary software: Names of any special-purpose software packages required to create, view, analyze, or otherwise use the data
Access
Rights: Any known intellectual property rights, statutory rights, licenses, or restrictions on use of the data
Access information: Where and how your data can be accessed by other researchers
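To show how these elements come together, here is a minimal skeleton readme based on the list above; all names, dates and values are hypothetical placeholders.

Title: Example Survey of Blood Pressure Outcomes
Creator: Smith, Jane (Example University)
Identifier: PROJ-2024-001 (internal project reference)
Date: 2023-01-15 to 2024-06-30 (time period covered by the data)
Method: Online questionnaire administered via REDCap
Processing: Responses anonymised; free-text fields removed
Subject: blood pressure; survey; public health
Language: English
File inventory: survey_data.csv; codebook.csv; readme.txt
File formats: CSV, plain text
Rights: CC BY 4.0
Access information: Available from the project's data repository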
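As a sketch of the checksum element, the short Python script below computes a SHA-256 digest for a file; re-running it later and comparing digests reveals whether the file has changed. The file name is the example from the file inventory item above.

import hashlib

def sha256_checksum(path):
    # Read the file in chunks so large data files do not exhaust memory
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_checksum("stone.mov"))  # example file name from the inventory above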
What about providing specific information about a single file? You might want to create a readme file to provide metadata at the file level or object level (for example, to describe a single database, text file, image or audio file).
Information to include about a tabular data file: data type, file format and software (including version); the number of variables and of cases or rows; variable names, labels and units of measurement; and any codes used for missing values.
Information to include about a text file:
Data type, file type, file format and software (including version); size; and information on processing or preparation of the file, such as the approach to anonymisation of text.
Relevant contextual information about the interview/focus group, with due consideration for participant confidentiality.
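For example, a file-level readme for an interview transcript might look like this (all details are hypothetical):

File: interview_p07_transcript.txt
Data type: qualitative text (interview transcript)
File format: plain text (UTF-8), exported from Microsoft Word
Size: 45 KB
Processing: transcript anonymised; participant and place names replaced with pseudonyms
Context: semi-structured interview, 60 minutes, conducted 2023-11-02; topic guide held in the project documentation folder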
Metadata is usually structured as a set of metadata elements - such as 'title', 'creator', 'date' and 'keywords'.
A metadata schema is a formalized collection of required and optional metadata elements that can help standardize how people and institutions describe information resources. Employing a metadata schema can help ease the process of searching for resources and sharing information about resources.
Knowing what metadata to capture to make your data as useful as possible can be a challenge. But many disciplinary communities have a formal standard, i.e. an agreed method for documenting data from that discipline. The value of following a metadata standard when you create your documentation is that you can be confident that you are providing the essential information with your data to maximise its re-use potential.
Researchers can document their data according to various metadata standards. Some metadata standards are designed for the purpose of documenting the contents of files, others for documenting the technical characteristics of files, and yet others for expressing relationships between files within a set of data. If you want to be able to share or publish your data, the DataCite metadata standard is of particular significance.
You can go one step further and prepare your metadata in a structured form so that it is "machine-readable". This means that computers can also understand and process the metadata. If you use a trusted data repository to store and share your data, in many cases, the repository will provide you with a template (or form) to complete. The content that you add to this template will be made available by the repository in a machine-readable format. At a minimum, the data repository will guide you on the required metadata standard to deposit data with them, so it can be a good place to start if you're not quite sure what standard to follow.
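As an illustration, a minimal machine-readable record using the mandatory DataCite properties might look like the following; the DOI, names and dates are hypothetical placeholders.

<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.1234/example.dataset</identifier>
  <creators>
    <creator><creatorName>Smith, Jane</creatorName></creator>
  </creators>
  <titles>
    <title>Example Survey of Blood Pressure Outcomes</title>
  </titles>
  <publisher>Example University</publisher>
  <publicationYear>2024</publicationYear>
  <resourceType resourceTypeGeneral="Dataset">Survey data</resourceType>
</resource>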
An alternative approach is to browse a directory of metadata standards by discipline. On the next tab there is a list of metadata directory resources.
You should always aim to prepare metadata in a standard that is suitable for the type of research data that you generate. This means that you will provide the right type of information with the data, and enough specificity, to ensure it is genuinely reusable.
The following tools can be useful for identifying a suitable metadata standard for the research you want to describe.
A detailed list of discipline-specific metadata standards has been compiled by the Digital Curation Centre (DCC).
The RDA Metadata Standards Directory contains widely used metadata standards in the Arts & Humanities, Engineering, Life Sciences, Physical Sciences & Mathematics, Social & Behavioral Sciences and General Research Data.
FAIRsharing is a curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies.
Good practice in file and folder naming is outlined in the UC Santa Barbara Data Management Best Practices Evaluation Checklist. In addition, it is recommended to use lowercase letters and avoid spaces when naming files. For more information see the CESSDA Data Management Expert Guide: File naming and folder structure.
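For example, file names that follow these recommendations might look like this (the project and content names are hypothetical):

2023-11-02_interview_p07_transcript_v02.txt
2024-01-15_survey_data_raw.csv
2024-02-01_survey_data_cleaned_v01.csv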
Similar to consistent file naming conventions, a meaningful folder structure is a key element of project and data management and will make it much easier for you to locate and organise relevant documents. This is particularly important if you are working as part of a larger research group where many people will be accessing the files over the course of the project.
The folder structure strategy you implement will depend on the plan and organisation of the project, in addition to your own personal preferences. All material relevant to the data should be entered into the data folders, including detailed information on the data collection and data processing procedures. It is recommended to limit folders to three or four levels deep and to limit the number of items in each folder to fewer than ten.
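A simple structure respecting these limits might look like the following (the folder names are illustrative only):

blood_pressure_study/
    data/
        raw/
        processed/
    documentation/
        protocols/
        consent_forms/
    analysis/
    outputs/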
Further good practices in folder organisation are described in the UC Santa Barbara Data Management Best Practices Evaluation Checklist.
For additional guidance see the CESSDA Data Management Expert Guide: File naming and folder structure.
Managing different versions of your data can be tricky, but version control is a key step in good research data management, and project management overall. You should always keep original versions of data files, or keep documentation that allows the reconstruction of original files.
All changes to the original versions should be documented, and this can be achieved in several ways - choose the options that work best for your research data. Common options include recording a version number and date in the file name, keeping a version log or table that notes what changed and why, or using version control software.
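For instance, versions recorded in file names, together with a simple log, might look like this (all names are hypothetical):

survey_data_v01_2024-01-15.csv   original export, kept unchanged
survey_data_v02_2024-02-01.csv   outliers recoded; see version_log.txt
version_log.txt                  records what changed in each version, when and by whom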
For more information see the CESSDA Data Management Expert Guide: File naming and folder structure.
Metadata
DataCite is a metadata schema designed specifically to describe data: http://schema.datacite.org
Guidance on creating a ReadMe file for metadata