When developing your Data Management Plan, the second topic you should address is "Documentation and data quality". This broadly encompasses two main questions:
1. What metadata and documentation will accompany data?
2. What data quality control measures will be used?
What is documentation? According to the National Library of Medicine:
Data description and annotation refers to the process of explaining, contextualizing, and documenting data. This process is vital to ensuring that other researchers can properly understand the data to use it for secondary analysis or replication. Data description and annotation can be accomplished in a series of ways. For example, one can describe data using metadata and formal metadata schemas, like the DataCite Metadata Schema. One can also describe or annotate data by creating documentation, such as a data dictionary.
According to the National Library of Medicine:
Metadata is information that describes, explains, locates, classifies, contextualizes, or documents an information resource. It is what enables you to search for books in your local library catalog, videos on YouTube, or find journal articles through PubMed. It is also what can help manage data, by tracking attributes like data provenance and versioning. Metadata can be used to describe all types of information sources.
In research data management, metadata is used to describe the content, formats, and internal relationships of a dataset. Metadata can also be used to make the data findable and citable by others.
What is metadata? Sarah Morgan, Scientific Training Coordinator at EMBL-EBI, discusses what metadata is and why it is important to keep track of this information in biological experiments. Watch here
The different types of metadata: There are many different types of metadata, and each type supports different use cases. Sometimes all of these types of metadata will be included with a single dataset. The table below, from the NISO Metadata Primer, provides a description of each.
| Metadata Type | Example Properties | Primary Uses |
| --- | --- | --- |
| Descriptive metadata | Title, Author, Subject, Genre, Publication date | Discovery, Display, Interoperability |
| Technical metadata | File type, File size, Creation date/time, Compression scheme | Interoperability, Digital object management, Preservation |
| Preservation metadata | Checksum, Preservation event | Interoperability, Digital object management, Preservation |
| Rights metadata | Copyright status, License terms, Rights holder | Interoperability, Digital object management |
| Structural metadata | Sequence, Place in hierarchy | Navigation |
| Markup languages | Paragraph, Heading, List, Name, Date | Navigation, Interoperability |
Metadata should be open: To comply with the FAIR Principles, metadata should be accessible, even if the data themselves cannot be shared openly. You can provide lots of descriptive information about the qualities of your dataset, without the need to share the data publicly.
Start early! Do not leave the metadata for the very end of your project but start to plan how you’ll document the data as early as possible. Remember to include procedures for documentation in your data management planning.
In the next tab there is some guidance on how to start creating your metadata.
A simple way to start creating metadata is to create a ReadMe file which you save alongside your data. ReadMe files are the most basic form of metadata and at a minimum should be included at folder level, and ideally at file level where appropriate. Where no appropriate standard exists, writing "readme" style metadata for internal use is a sensible strategy. You should save this file in an open text format, such as a plain text file (.txt), so that it remains accessible long-term. The ReadMe file should contain key information about the research study, including the provenance of the data and any licence information on how the data may be used. DMPTool provides a list of general information that you should document as metadata.
General Overview
Title: Name of the dataset or research project that produced it
Creator: Names and addresses of the organizations or people who created the data; preferred format for personal names is surname first (e.g., Smith, Jane)
Identifier: Unique number used to identify the data, even if it is just an internal project reference number
Date: Key dates associated with the data, including: project start and end date; release date; time period covered by the data; and other dates associated with the data lifespan, such as maintenance cycle, update schedule; preferred format is yyyy-mm-dd, or yyyy.mm.dd-yyyy.mm.dd for a range
Method: How the data were generated, listing equipment and software used (including model and version numbers), formulae, algorithms, experimental protocols, and other things one might include in a lab notebook
Processing: How the data have been altered or processed (e.g., normalized)
Source: Citations to data derived from other sources, including details of where the source data is held and how it was accessed
Funder: Organizations or agencies who funded the research
Content Description
Subject: Keywords or phrases describing the subject or content of the data
Place: All applicable physical locations
Language: All languages used in the dataset
Variable list: All variables in the data files, where applicable
Code list: Explanation of codes or abbreviations used in either the file names or the variables in the data files (e.g. "999 indicates a missing value in the data")
Technical Description
File inventory: All files associated with the project, including extensions (e.g. "NWPalaceTR.WRL", "stone.mov")
File formats: Formats of the data, e.g., FITS, SPSS, HTML, JPEG, etc.
File structure: Organization of the data file(s) and layout of the variables, where applicable
Version: Unique date/time stamp and identifier for each version
Checksum: A digest value computed for each file that can be used to detect changes; if a recomputed digest differs from the stored digest, the file must have changed
Necessary software: Names of any special-purpose software packages required to create, view, analyze, or otherwise use the data
Access
Rights: Any known intellectual property rights, statutory rights, licenses, or restrictions on use of the data
Access information: Where and how your data can be accessed by other researchers
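As a sketch of how the fields above might be captured in practice, the short script below writes a minimal plain-text ReadMe and computes the SHA-256 digest described under the Technical Description. The file names and field values are hypothetical; adapt the field list to your own project.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute a SHA-256 digest that can later be used to detect changes."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def write_readme(data_file: Path, readme: Path) -> None:
    """Write a minimal README.txt covering a few of the overview fields."""
    fields = {
        "Title": "Example survey dataset",   # hypothetical values throughout
        "Creator": "Smith, Jane",
        "Identifier": "PROJ-2024-001",
        "Date": "2024-01-01 - 2024-12-31",
        "File inventory": data_file.name,
        "Checksum (SHA-256)": sha256_of(data_file),
    }
    readme.write_text("\n".join(f"{k}: {v}" for k, v in fields.items()))

data = Path("survey.csv")
data.write_text("id,age\n1,34\n")  # stand-in data file for the example
write_readme(data, Path("README.txt"))
```

Recomputing the digest later and comparing it to the stored value tells you whether the file has changed since the ReadMe was written.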
You might also want to create metadata at the file-level, to provide key information about each data file. File-level or object-level metadata provides information at the level of individual files such as a database of tabular data, a text file of an interview transcript, or an image or audio file. The following provides an example of the content you might include with each research data file.
Quantitative data files e.g. spreadsheets:
Information about the file: Data type, file type, file format and software (including version), size, data processing scripts and algorithms used to transform data where relevant.
Information about the variables: Names, labels and descriptions of variables, their values, a description of derived variables etc. Variable labels should be brief and indicate the unit of measurement if appropriate.
Qualitative data files e.g. text transcripts:
Information about the file: Data type, file type, file format and software (including version), size, information on processing or preparation of the file such as approach to anonymisation of text
Information about the interview: relevant contextual information about the interview/focus group with due consideration for participant confidentiality.
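The variable-level information described above is often recorded as a data dictionary saved alongside the data. As a minimal sketch, with hypothetical variable names and codes (including the "999 = missing" convention mentioned earlier):

```python
import csv

# Hypothetical variable list for a spreadsheet of survey data.
variables = [
    {"name": "age", "label": "Age at interview (years)", "codes": "999 = missing"},
    {"name": "income", "label": "Gross annual income (GBP)", "codes": "999 = missing"},
]

with open("data_dictionary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "label", "codes"])
    writer.writeheader()
    writer.writerows(variables)
```

Keeping the dictionary as a simple CSV means it stays readable without special software, in line with the open-format advice for ReadMe files.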
Guidance on creating a ReadMe file
Knowing what metadata to capture to make your data as useful as possible can be a challenge. But many disciplinary communities have a formal standard, i.e. an agreed method for documenting data from that discipline. The value of following a metadata standard when you create your documentation is that you can be confident you are providing the essential information with your data to maximise its reuse potential.
Researchers can document their data according to various metadata standards. Some metadata standards are designed for documenting the contents of files, others for documenting the technical characteristics of files, and yet others for expressing relationships between files within a set of data. If you want to be able to share or publish your data, the DataCite metadata standard is of particular significance.
You can go one step further and prepare your metadata in a structured form so that it is "machine-readable". This means that computers can also understand and process the metadata. If you use a trusted data repository to store and share your data, in many cases, the repository will provide you with a template (or form) to complete. The content that you add to this template will be made available by the repository in a machine-readable format. At a minimum, the data repository will guide you on the required metadata standard to deposit data with them, so it can be a good place to start if you're not quite sure what standard to follow.
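As a sketch of what "machine-readable" can mean in practice, descriptive fields might be serialised as JSON. The property names below loosely follow the DataCite Metadata Schema, and all values are hypothetical; a repository deposit form would typically generate an equivalent record for you.

```python
import json

# Hypothetical descriptive record, loosely modelled on DataCite properties.
record = {
    "titles": [{"title": "Example survey dataset"}],
    "creators": [{"name": "Smith, Jane"}],
    "publicationYear": "2024",
    "types": {"resourceTypeGeneral": "Dataset"},
}

serialised = json.dumps(record, indent=2)
print(serialised)
```

Because the structure is predictable, a machine can extract the creator or title without any human interpretation, which is what makes the metadata indexable and searchable.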
An alternative approach is to browse a directory of metadata standards by discipline. On the next tab there is a list of metadata directory resources.
You should always aim to prepare metadata in a standard that is suitable for the type of research data that you generate. This means that you will provide the right type of information with the data, and enough specificity, to ensure it is genuinely reusable.
The following tools can be useful for identifying a suitable metadata standard for the research you want to describe.
A detailed list of discipline-specific metadata standards has been compiled by the Digital Curation Centre (DCC).
The RDA Metadata Standards Directory contains widely used metadata standards in the Arts & Humanities, Engineering, Life Sciences, Physical Sciences & Mathematics, Social & Behavioral Sciences and General Research Data.
FAIRsharing is a curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies.
It is also recommended to use lowercase letters and avoid spaces when naming files. For more information see the CESSDA Data Management Expert Guide: File naming and folder structure.
Good practice in file and folder naming (from the UC Santa Barbara Data Management Best Practices Evaluation Checklist)
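The lowercase/no-spaces advice can be enforced with a small helper like the sketch below. The exact convention (separator, date prefix) is a per-project choice rather than a fixed rule, and the example file name is hypothetical.

```python
import re

def to_safe_name(raw: str, date: str = "") -> str:
    """Lowercase a file name, replace spaces with hyphens, drop odd characters."""
    name = raw.lower().replace(" ", "-")
    name = re.sub(r"[^a-z0-9._-]", "", name)  # keep only safe characters
    name = re.sub(r"-+", "-", name)           # collapse repeated hyphens
    return f"{date}_{name}" if date else name

print(to_safe_name("Interview Transcript (Final).txt", date="2024-05-01"))
# 2024-05-01_interview-transcript-final.txt
```

Applying the same helper to every file keeps names consistent across a research group, which is the real point of a naming convention.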
Similar to consistent file naming conventions, a meaningful folder structure is a key element of project and data management and will make it much easier for you to locate and organise relevant documents. This is particularly important if you are working as part of a larger research group where many people will be accessing the files over the course of the project.
The folder structure strategy you implement will depend on the plan and organisation of the project, in addition to your own personal preferences. All material relevant to the data should be entered into the data folders, including detailed information on the data collection and data processing procedures. It is recommended to limit folders to three or four levels deep and to keep fewer than ten items in each folder.
Good practice in folder organisation (from the UC Santa Barbara Data Management Best Practices Evaluation Checklist)
For additional guidance see the CESSDA Data Management Expert Guide: File naming and folder structure.
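A minimal sketch of a shallow project skeleton of the kind described above (no more than three levels deep); the folder names are illustrative only and should follow your own project plan.

```python
from pathlib import Path

# Illustrative folder names; adapt to your own project's organisation.
folders = [
    "project/data/raw",
    "project/data/processed",
    "project/docs/protocols",
    "project/code",
]
for f in folders:
    Path(f).mkdir(parents=True, exist_ok=True)
```

Separating raw from processed data also supports the version-control advice that follows: originals stay untouched in one place while derived files live elsewhere.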
Managing different versions of your data can be tricky, but version control is a key step in good research data management, and project management overall. You should always keep original versions of data files, or keep documentation that allows the reconstruction of original files.
Good practice in version control: All changes to the original versions should be documented. This can be achieved in several ways, so choose the options that work best for your research data.
For more information see the CESSDA Data Management Expert Guide: File naming and folder structure.
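One simple way to keep original files intact while documenting changes is never to overwrite: save each edited copy under an incrementing version suffix. The `_v01`/`_v02` pattern below is one common convention, not a fixed standard; the helper is a hypothetical sketch.

```python
import re
from pathlib import Path

def next_version(path: Path) -> Path:
    """Return the next versioned file name, e.g. data_v01.csv -> data_v02.csv."""
    m = re.match(r"(.*)_v(\d+)$", path.stem)
    if m:
        stem, n = m.group(1), int(m.group(2)) + 1
    else:
        stem, n = path.stem, 1  # first saved version of an unversioned file
    return path.with_name(f"{stem}_v{n:02d}{path.suffix}")

print(next_version(Path("data.csv")))      # data_v01.csv
print(next_version(Path("data_v01.csv")))  # data_v02.csv
```

Pairing this with a short change log in your ReadMe file records not just that a file changed, but why.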