When developing your Data Management Plan, the second topic you should address is "Documentation and data quality". This broadly encompasses two main questions:
1. What metadata and documentation will accompany data?
2. What data quality control measures will be used?
What is documentation? According to the National Library of Medicine:
Data description and annotation refers to the process of explaining, contextualizing, and documenting data. This process is vital to ensuring that other researchers can properly understand the data to use it for secondary analysis or replication. Data description and annotation can be accomplished in a series of ways. For example, one can describe data using metadata and formal metadata schemas, like the DataCite Metadata Schema. One can also describe or annotate data by creating documentation, such as a data dictionary.
According to the National Library of Medicine:
Metadata is information that describes, explains, locates, classifies, contextualizes, or documents an information resource. It is what enables you to search for books in your local library catalog, videos on YouTube, or find journal articles through PubMed. It is also what can help manage data, by tracking attributes like data provenance and versioning. Metadata can be used to describe all types of information sources.
In research data management, metadata is used to describe the content, formats, and internal relationships of a dataset. Metadata can also be used to make the data findable and citable by others.
What is metadata? Sarah Morgan, Scientific Training Coordinator at EMBL-EBI, discusses what metadata is and why it is important to keep track of this information in biological experiments. Watch here
The different types of metadata: There are many different types of metadata, and each type supports different use cases. Sometimes all of these types of metadata will be included with a single dataset. The table below, from the NISO Metadata Primer, provides a description of each.
| Metadata Type | Example Properties | Primary Uses |
| --- | --- | --- |
| Descriptive metadata | Title, Author, Subject, Genre, Publication date | Discovery, Display, Interoperability |
| Technical metadata | File type, File size, Creation date/time, Compression scheme | Interoperability, Digital object management, Preservation |
| Preservation metadata | Checksum, Preservation event | Interoperability, Digital object management, Preservation |
| Rights metadata | Copyright status, License terms, Rights holder | Interoperability, Digital object management |
| Structural metadata | Sequence, Place in hierarchy | Navigation |
| Markup languages | Paragraph, Heading, List, Name, Date | Navigation, Interoperability |
Metadata should be open: To comply with the FAIR Principles, metadata should be accessible, even if the data themselves cannot be shared openly. You can provide lots of descriptive information about the qualities of your dataset, without the need to share the data publicly.
Start early! Do not leave the metadata for the very end of your project but start to plan how you’ll document the data as early as possible. Remember to include procedures for documentation in your data management planning.
In the next tab there is some guidance on how to start creating your metadata.
A simple way to start creating metadata is to create a ReadMe file which you save alongside your data. ReadMe files are the most basic form of metadata and at a minimum should be included at folder level, and ideally at file level where appropriate. Where no appropriate standard exists, writing "readme" style metadata for internal use is a sensible strategy. You should save this file in an open text format, such as a plain text file (.txt), so that it remains accessible long-term. The ReadMe file should contain key information about the research study, including the provenance of the data and any licence information on how the data may be used. DMPTool provides a list of general information that you should document as metadata.
General Overview
Title: Name of the dataset or research project that produced it
Creator: Names and addresses of the organizations or people who created the data; preferred format for personal names is surname first (e.g., Smith, Jane)
Identifier: Unique number used to identify the data, even if it is just an internal project reference number
Date: Key dates associated with the data, including: project start and end date; release date; time period covered by the data; and other dates associated with the data lifespan, such as maintenance cycle, update schedule; preferred format is yyyy-mm-dd, or yyyy.mm.dd-yyyy.mm.dd for a range
Method: How the data were generated, listing equipment and software used (including model and version numbers), formulae, algorithms, experimental protocols, and other things one might include in a lab notebook
Processing: How the data have been altered or processed (e.g., normalized)
Source: Citations to data derived from other sources, including details of where the source data is held and how it was accessed
Funder: Organizations or agencies who funded the research
Content Description
Subject: Keywords or phrases describing the subject or content of the data
Place: All applicable physical locations
Language: All languages used in the dataset
Variable list: All variables in the data files, where applicable
Code list: Explanation of codes or abbreviations used in either the file names or the variables in the data files (e.g. "999 indicates a missing value in the data")
Technical Description
File inventory: All files associated with the project, including extensions (e.g. "NWPalaceTR.WRL", "stone.mov")
File formats: Formats of the data, e.g., FITS, SPSS, HTML, JPEG, etc.
File structure: Organization of the data file(s) and layout of the variables, where applicable
Version: Unique date/time stamp and identifier for each version
Checksum: A digest value computed for each file that can be used to detect changes; if a recomputed digest differs from the stored digest, the file must have changed
Necessary software: Names of any special-purpose software packages required to create, view, analyze, or otherwise use the data
Access
Rights: Any known intellectual property rights, statutory rights, licenses, or restrictions on use of the data
Access information: Where and how your data can be accessed by other researchers
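As a sketch of how the fields above might be captured in practice, the short script below writes a minimal plain-text ReadMe and computes the SHA-256 digest described under the Technical Description. The file names and field values are hypothetical; adapt the field list to your own project.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute a SHA-256 digest that can later be used to detect changes."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def write_readme(data_file: Path, readme: Path) -> None:
    """Write a minimal README.txt covering a few of the overview fields."""
    fields = {
        "Title": "Example survey dataset",   # hypothetical values throughout
        "Creator": "Smith, Jane",
        "Identifier": "PROJ-2024-001",
        "Date": "2024-01-01 - 2024-12-31",
        "File inventory": data_file.name,
        "Checksum (SHA-256)": sha256_of(data_file),
    }
    readme.write_text("\n".join(f"{k}: {v}" for k, v in fields.items()))

data = Path("survey.csv")
data.write_text("id,age\n1,34\n")  # stand-in data file for the example
write_readme(data, Path("README.txt"))
```

Recomputing the digest later and comparing it to the stored value tells you whether the file has changed since the ReadMe was written.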
You might also want to create metadata at the file-level, to provide key information about each data file. File-level or object-level metadata provides information at the level of individual files such as a database of tabular data, a text file of an interview transcript, or an image or audio file. The following provides an example of the content you might include with each research data file.
Quantitative data files e.g. spreadsheets:
Information about the file: Data type, file type, file format and software (including version), size, data processing scripts and algorithms used to transform data where relevant.
Information about the variables: Names, labels and descriptions of variables, their values, a description of derived variables etc. Variable labels should be brief and indicate the unit of measurement if appropriate.
Qualitative data files e.g. text transcripts:
Information about the file: Data type, file type, file format and software (including version), size, information on processing or preparation of the file such as approach to anonymisation of text
Information about the interview: relevant contextual information about the interview/focus group with due consideration for participant confidentiality.
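The variable-level information described above is often recorded as a data dictionary saved alongside the data. As a minimal sketch, with hypothetical variable names and codes (including the "999 = missing" convention mentioned earlier):

```python
import csv

# Hypothetical variable list for a spreadsheet of survey data.
variables = [
    {"name": "age", "label": "Age at interview (years)", "codes": "999 = missing"},
    {"name": "income", "label": "Gross annual income (GBP)", "codes": "999 = missing"},
]

with open("data_dictionary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "label", "codes"])
    writer.writeheader()
    writer.writerows(variables)
```

Keeping the dictionary as a simple CSV means it stays readable without special software, in line with the open-format advice for ReadMe files.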
Guidance on creating a ReadMe file
Knowing what metadata to capture to make your data as useful as possible can be a challenge. But many disciplinary communities have a formal standard, i.e. an agreed method for documenting data from that discipline. The value of following a metadata standard when you create your documentation is that you can be confident you are providing the essential information with your data to maximise its reuse potential.
Researchers can document their data according to various metadata standards. Some metadata standards are designed for documenting the contents of files, others for documenting the technical characteristics of files, and yet others for expressing relationships between files within a set of data. If you want to be able to share or publish your data, the DataCite metadata standard is of particular significance.
You can go one step further and prepare your metadata in a structured form so that it is "machine-readable". This means that computers can also understand and process the metadata. If you use a trusted data repository to store and share your data, in many cases, the repository will provide you with a template (or form) to complete. The content that you add to this template will be made available by the repository in a machine-readable format. At a minimum, the data repository will guide you on the required metadata standard to deposit data with them, so it can be a good place to start if you're not quite sure what standard to follow.
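As a sketch of what "machine-readable" can mean in practice, descriptive fields might be serialised as JSON. The property names below loosely follow the DataCite Metadata Schema, and all values are hypothetical; a repository deposit form would typically generate an equivalent record for you.

```python
import json

# Hypothetical descriptive record, loosely modelled on DataCite properties.
record = {
    "titles": [{"title": "Example survey dataset"}],
    "creators": [{"name": "Smith, Jane"}],
    "publicationYear": "2024",
    "types": {"resourceTypeGeneral": "Dataset"},
}

serialised = json.dumps(record, indent=2)
print(serialised)
```

Because the structure is predictable, a machine can extract the creator or title without any human interpretation, which is what makes the metadata indexable and searchable.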
An alternative approach is to browse a directory of metadata standards by discipline. On the next tab there is a list of metadata directory resources.
You should always aim to prepare metadata in a standard that is suitable for the type of research data that you generate. This means that you will provide the right type of information with the data, and enough specificity, to ensure it is genuinely reusable.
The following tools can be useful for identifying a suitable metadata standard for the research you want to describe.
A detailed list of discipline-specific metadata standards has been compiled by the Digital Curation Centre (DCC).
The RDA Metadata Standards Directory contains widely used metadata standards in the Arts & Humanities, Engineering, Life Sciences, Physical Sciences & Mathematics, Social & Behavioral Sciences and General Research Data.
FAIRsharing is a curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies.
It is also recommended to use lowercase letters and avoid spaces when naming files. For more information see the CESSDA Data Management Expert Guide: File naming and folder structure.
Good practice in file and folder naming (from the UC Santa Barbara Data Management Best Practices Evaluation Checklist)
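The lowercase/no-spaces advice can be enforced with a small helper like the sketch below. The exact convention (separator, date prefix) is a per-project choice rather than a fixed rule, and the example file name is hypothetical.

```python
import re

def to_safe_name(raw: str, date: str = "") -> str:
    """Lowercase a file name, replace spaces with hyphens, drop odd characters."""
    name = raw.lower().replace(" ", "-")
    name = re.sub(r"[^a-z0-9._-]", "", name)  # keep only safe characters
    name = re.sub(r"-+", "-", name)           # collapse repeated hyphens
    return f"{date}_{name}" if date else name

print(to_safe_name("Interview Transcript (Final).txt", date="2024-05-01"))
# 2024-05-01_interview-transcript-final.txt
```

Applying the same helper to every file keeps names consistent across a research group, which is the real point of a naming convention.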
Similar to consistent file naming conventions, a meaningful folder structure is a key element of project and data management and will make it much easier for you to locate and organise relevant documents. This is particularly important if you are working as part of a larger research group where many people will be accessing the files over the course of the project.
The folder structure strategy you implement will depend on the plan and organisation of the project, in addition to your own personal preferences. All material relevant to the data should be entered into the data folders, including detailed information on the data collection and data processing procedures. It is recommended to limit folders to three or four levels deep and to keep fewer than ten items in each folder.
Good practice in folder organisation (from the UC Santa Barbara Data Management Best Practices Evaluation Checklist)
For additional guidance see the CESSDA Data Management Expert Guide: File naming and folder structure.
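A minimal sketch of a shallow project skeleton of the kind described above (no more than three levels deep); the folder names are illustrative only and should follow your own project plan.

```python
from pathlib import Path

# Illustrative folder names; adapt to your own project's organisation.
folders = [
    "project/data/raw",
    "project/data/processed",
    "project/docs/protocols",
    "project/code",
]
for f in folders:
    Path(f).mkdir(parents=True, exist_ok=True)
```

Separating raw from processed data also supports the version-control advice that follows: originals stay untouched in one place while derived files live elsewhere.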
Managing different versions of your data can be tricky, but version control is a key step in good research data management, and project management overall. You should always keep original versions of data files, or keep documentation that allows the reconstruction of original files.
Good practice in version control: All changes to the original versions should be documented. This can be achieved in several ways, so choose the options that work best for your research data.
For more information see the CESSDA Data Management Expert Guide: File naming and folder structure.
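One simple way to keep original files intact while documenting changes is never to overwrite: save each edited copy under an incrementing version suffix. The `_v01`/`_v02` pattern below is one common convention, not a fixed standard; the helper is a hypothetical sketch.

```python
import re
from pathlib import Path

def next_version(path: Path) -> Path:
    """Return the next versioned file name, e.g. data_v01.csv -> data_v02.csv."""
    m = re.match(r"(.*)_v(\d+)$", path.stem)
    if m:
        stem, n = m.group(1), int(m.group(2)) + 1
    else:
        stem, n = path.stem, 1  # first saved version of an unversioned file
    return path.with_name(f"{stem}_v{n:02d}{path.suffix}")

print(next_version(Path("data.csv")))      # data_v01.csv
print(next_version(Path("data_v01.csv")))  # data_v02.csv
```

Pairing this with a short change log in your ReadMe file records not just that a file changed, but why.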