Research Data Management

How to describe your documentation

Next in your DMP, you should describe how the data will be organised and quality assured, so that it is genuinely usable. You should describe the metadata that will be stored with the data, which ensures the data can be understood and used into the future. There are two questions to be addressed in this section:  

  1. What metadata and documentation will accompany data?
  2. What data quality control measures will be used?

1. What metadata and documentation will accompany data?

When you open a file containing data, there are several pieces of information that you would need to fully understand what you are looking at. For example, you might like to know: 

  • What is the file called? (title)
  • What type of file is it and what software do you need to open it? (format)
  • Who created the data inside? (author / creator)
  • When were these data created? (creation date)
  • Are you allowed to use these data and what can you do with them? (license)

You might also benefit from knowing a little about the context within which the data were created, why these data were captured, and what tools were used to generate them. All of the above information, and more, is typically captured in the metadata. Metadata is a structured way of describing something and is an invaluable tool for making digital resources findable and understandable. 

Adding descriptive information to your research data ensures that others can understand the data and re-use it in new research, or replicate or validate your results. Metadata can also help you to navigate and understand your own data files, because creators often lose a large amount of contextual information in the months and years following data collection. It is therefore important to capture and store this contextual information with your data, to safeguard the accuracy of your own research.


To summarise, metadata provides a structured description of something, usually a digital resource. Metadata ensures data remain usable, as it: 

  • captures information on the content, formats, and internal relationships of the data 
  • explains and provides context to the data so they can be properly understood, interpreted, and used 
  • enables others to find, use, and properly cite published data.

 

What information should I capture as metadata? The following list, adapted from DMPTool, gives some of the key information to capture as metadata about your research data:

  • Rationale and context for data collection
  • Data collection methods
  • Structure and organization of data files
  • Data sources used (see citing data)
  • Data validation and quality assurance
  • Transformations of data from the raw data through analysis
  • Information on confidentiality, access and use conditions

What to document about each dataset (adapted from DMPTool): 

  • Variable names and descriptions
  • Explanation of codes and classification schemes used
  • Algorithms used to transform data (may include computer code)
  • File format and software (including version) used

When should I create the metadata? When it comes to adding metadata it's a good idea to start early. Do not leave the metadata for the very end of your project but start to plan how you’ll document the data as early as possible. Remember to include procedures for documentation in your data management planning.

  • Think about the information that is needed in order to understand the data. What will other users of the data need in order to understand your data?
  • Create a metadata file that includes the basic information about the data. You can also create similar files for each dataset (see following guidance on project-level and data file-level metadata).
  • Plan where to deposit the data after the completion of the project. The data repository probably follows a specific metadata standard that you can adopt.
  • Document consistently throughout the project. Metadata gives contextual information about your dataset(s). It specifies the aims and objectives of the original project and harbours explanatory material including the data source, data collection methodology and process, dataset structure and technical information.

Metadata should be open access

Even when you cannot share data openly in a data repository, the metadata should be openly accessible online. You can provide lots of descriptive information about the qualities of your dataset, without the need to share the data publicly. By sharing the metadata openly, even controlled data can be FAIR data. 

Is data documentation the same as metadata? 

Documentation is indeed one approach to creating metadata, as it involves describing or annotating your data. Below are two common examples of data documentation used to describe tabular research data. 

Data dictionary: This document outlines the structure, content, and meaning of each variable in your dataset. This includes the type of data being collected (e.g. free text, numerical, categorical or group data), the full wording of the question used to generate the data, the allowable values, and what those values mean (e.g. 0 = no high blood pressure diagnosis, 1 = borderline high blood pressure, 2 = high blood pressure). For example, in REDCap, the data dictionary is a CSV file containing information on the variables and the structure of the REDCap database (adapted from NLM definition of data dictionary).

Data codebook: Similar to the data dictionary, the data codebook is a human-readable document that provides information on each data element, including variable name, variable label, values, value labels, summary statistics and missing data. Many statistical software tools like SPSS can generate a codebook at the touch of a button.
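To make the idea of a data dictionary concrete, here is a minimal Python sketch that writes one as CSV text. The column names and the participant_id variable are assumptions chosen for illustration (this is not REDCap's own format); the blood pressure codes are the ones from the example above, and the question wording is a placeholder.

```python
import csv
import io

# Hypothetical variables; only the bp_status codes come from the guide's example.
DATA_DICTIONARY = [
    {
        "variable": "participant_id",
        "label": "Participant identifier",
        "type": "text",
        "allowed_values": "",
        "question": "",
    },
    {
        "variable": "bp_status",
        "label": "Blood pressure diagnosis",
        "type": "categorical",
        "allowed_values": "0 = no high blood pressure diagnosis; "
                          "1 = borderline high blood pressure; "
                          "2 = high blood pressure",
        "question": "Have you ever been told that you have high blood pressure?",
    },
]


def data_dictionary_to_csv(rows):
    """Return the data dictionary as CSV text, one row per variable."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(data_dictionary_to_csv(DATA_DICTIONARY))
```

Keeping the dictionary as a plain CSV file means it can be opened in any spreadsheet or text editor and stored alongside the data it describes.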

A simple way to start creating metadata is to write a readme file, which is the most basic form of metadata. The readme file should be saved as a plain text file alongside your data.

  1. Create the readme file as a plain text file (.txt format), or as a .md file if writing in Markdown. If you need to retain text formatting in your readme file, you can save it as a PDF (.pdf format).
  2. Name the file README (not readme, read_me, ABOUT etc) and save it with the data (in the same folder).
  3. Create one readme file per dataset. Where possible, provide a readme file at file level describing the contents of that file. You can also write a readme file describing an entire collection of files that are logically grouped together for use, such as a folder of interview transcripts from a single study (adapted from the 4TU Guidelines for creating a README file).

 

What do I include in my readme file? The readme file should contain key information to enable the use and interpretation of the data, including provenance information, and any license information on how the data may be used. DMPTool provides the following list of information to include as metadata:

General Overview

  • Title: Name of the dataset or research project that produced it

  • Creator: Names and addresses of the organizations or people who created the data; preferred format for personal names is surname first (e.g., Smith, Jane)

  • Identifier: Unique number used to identify the data, even if it is just an internal project reference number

  • Date: Key dates associated with the data, including: project start and end date; release date; time period covered by the data; and other dates associated with the data lifespan, such as maintenance cycle, update schedule; preferred format is yyyy-mm-dd, or yyyy.mm.dd-yyyy.mm.dd for a range

  • Method: How the data were generated, listing equipment and software used (including model and version numbers), formulae, algorithms, experimental protocols, and other things one might include in a lab notebook

  • Processing: How the data have been altered or processed (e.g., normalized)

  • Source: Citations to data derived from other sources, including details of where the source data is held and how it was accessed

  • Funder: Organizations or agencies who funded the research

Content Description

  • Subject: Keywords or phrases describing the subject or content of the data

  • Place: All applicable physical locations

  • Language: All languages used in the dataset

  • Variable list: All variables in the data files, where applicable

  • Code list: Explanation of codes or abbreviations used in either the file names or the variables in the data files (e.g. "999 indicates a missing value in the data")

Technical Description

  • File inventory: All files associated with the project, including extensions (e.g. "NWPalaceTR.WRL", "stone.mov")

  • File formats: Formats of the data, e.g., FITS, SPSS, HTML, JPEG, etc.

  • File structure: Organization of the data file(s) and layout of the variables, where applicable

  • Version: Unique date/time stamp and identifier for each version

  • Checksum: A digest value computed for each file that can be used to detect changes; if a recomputed digest differs from the stored digest, the file must have changed

  • Necessary software: Names of any special-purpose software packages required to create, view, analyze, or otherwise use the data
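The checksum entry above can be illustrated with Python's standard hashlib module. This is a minimal sketch: the choice of SHA-256, the function names, and the throwaway file are assumptions made for the example, not requirements from DMPTool.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path, stored_digest, algorithm="sha256"):
    """True if the file's current digest differs from the stored one."""
    return file_checksum(path, algorithm) != stored_digest


# Demonstration on a throwaway file (hypothetical name and contents).
with tempfile.TemporaryDirectory() as folder:
    data_file = Path(folder) / "stone.csv"
    data_file.write_text("id,value\n1,10\n", encoding="utf-8")
    stored = file_checksum(data_file)       # record this value in the readme
    print(has_changed(data_file, stored))   # False: file untouched
    data_file.write_text("id,value\n1,11\n", encoding="utf-8")
    print(has_changed(data_file, stored))   # True: contents differ
```

Recording the algorithm name next to each digest in the readme lets a later reader recompute the value in the same way.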

Access

  • Rights: Any known intellectual property rights, statutory rights, licenses, or restrictions on use of the data

  • Access information: Where and how your data can be accessed by other researchers
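As a rough illustration of how the fields listed above might be laid out, the following Python sketch renders a plain text readme from a nested dictionary. The section and field names follow the list; every value is a placeholder, only a few fields are shown, and the layout itself is an assumption rather than a prescribed template.

```python
from pathlib import Path

# Section and field names follow the list above; all values are placeholders.
README_FIELDS = {
    "General Overview": {
        "Title": "Example survey dataset",
        "Creator": "Smith, Jane",
        "Identifier": "PROJ-001",
        "Date": "2024-01-15",
    },
    "Content Description": {
        "Subject": "example; survey",
        "Code list": "999 indicates a missing value in the data",
    },
    "Technical Description": {
        "File formats": "CSV",
    },
    "Access": {
        "Rights": "To be confirmed",
    },
}


def render_readme(sections):
    """Return readme text: one upper-case heading per section, one line per field."""
    lines = []
    for heading, fields in sections.items():
        lines.append(heading.upper())
        for name, value in fields.items():
            lines.append(f"{name}: {value}")
        lines.append("")  # blank line between sections
    return "\n".join(lines)


def write_readme(folder, sections):
    """Save the readme as README.txt in the same folder as the data."""
    path = Path(folder) / "README.txt"
    path.write_text(render_readme(sections), encoding="utf-8")
    return path


print(render_readme(README_FIELDS))
```

Generating the file from a dictionary is optional; writing the same headings by hand in a text editor gives an equivalent result.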


What about providing specific information about a single file? You might want to create a readme file to provide metadata at the file level or object level (for example, to describe a single database, text file, image or audio file).

Information to include about a tabular data file:

  • Data type, file type, file format and software (including version), size, data processing scripts and algorithms used to transform data where relevant.
  • Variable names, labels and descriptions of variables, their values, a description of derived variables etc. Variable labels should be brief and indicate the unit of measurement if appropriate.

Information to include about a text file: 

  • Data type, file type, file format and software (including version), size, information on processing or preparation of the file such as approach to anonymisation of text

  • Relevant contextual information about the interview/focus group with due consideration for participant confidentiality.

Metadata is usually structured as a set of metadata elements - such as 'title', 'creator', 'date' and 'keywords'. 

A metadata schema is a formalized collection of required and optional metadata elements that can help standardize how people and institutions describe information resources. Employing a metadata schema can help ease the process of searching for resources and sharing information about resources.

Knowing what metadata to capture to make your data as useful as possible can be a challenge. But many disciplinary communities have a formal standard, i.e. an agreed method for documenting data from that discipline. The value of following a metadata standard when you create your documentation is that you can be confident that you are providing the essential information with your data to maximise its re-use potential.

Researchers can document their data according to various metadata standards. Some metadata standards are designed for the purpose of documenting the contents of files, others for documenting the technical characteristics of files, and yet others for expressing relationships between files within a set of data. If you want to be able to share or publish your data, the DataCite metadata standard is of particular significance.

You can go one step further and prepare your metadata in a structured form so that it is "machine-readable". This means that computers can also understand and process the metadata. If you use a trusted data repository to store and share your data, in many cases, the repository will provide you with a template (or form) to complete. The content that you add to this template will be made available by the repository in a machine-readable format. At a minimum, the data repository will guide you on the required metadata standard to deposit data with them, so it can be a good place to start if you're not quite sure what standard to follow. 
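A small Python sketch can show what "machine-readable" means in practice: the same descriptive elements stored as JSON, which software can parse and check. The element names are the generic ones mentioned earlier ('title', 'creator', 'date', 'keywords') and are not taken from any specific metadata standard; the list of required elements and all values are assumptions for illustration.

```python
import json

# Generic element names, not a real schema; all values are placeholders.
record = {
    "title": "Example interview transcripts",
    "creator": ["Smith, Jane"],
    "date": "2024-01-15",
    "keywords": ["interviews", "example"],
}

# Hypothetical minimum set of elements for this sketch.
REQUIRED = ("title", "creator", "date")


def missing_elements(metadata, required=REQUIRED):
    """Return the required element names that are absent or empty."""
    return [name for name in required if not metadata.get(name)]


serialised = json.dumps(record, indent=2, sort_keys=True)  # text a repository could store
restored = json.loads(serialised)                          # a program reads it back

print(serialised)
print(missing_elements(restored))  # [] because every required element is present
```

A repository's deposit form does the equivalent work for you: it collects the elements its chosen standard requires and publishes them in a structured format.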

An alternative approach is to browse a directory of metadata standards by discipline. On the next tab there is a list of metadata directory resources. 


You should always aim to prepare metadata in a standard that is suitable for the type of research data that you generate. This means that you will provide the right type of information with the data, and enough specificity, to ensure it is genuinely reusable. 

The following tools can be useful for identifying a suitable metadata standard for the research you want to describe. 

2. What data quality control measures will be used?

In your DMP, state your method for keeping the data organised, for example the file naming conventions and file architecture you will use. This is especially helpful for group projects, where all of the team members should understand and take the same approach to file naming and folder organisation. The DMP can act as a record of this approach.

Research data files and folders should be labelled and organised in a systematic and consistent way so that they are easy to find, both for you and others. It’s generally recommended for file and folder names to be concise, but informative enough to detail the contents of the file. 


When it comes to choosing filenames, here are some common elements to consider:  

  • Project number or acronym
  • Description of content
  • Version number
  • Date of creation (date format should be YYYY-MM-DD)
  • Name or initials of creator
  • Status information (e.g. draft)

In addition, it is also recommended to use lowercase letters and avoid spaces when naming files. For more information see the CESSDA Data Management Expert Guide: File naming and folder structure.
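The elements listed above can be combined into a file name programmatically. The sketch below is one possible convention, not a rule from CESSDA: the order of the elements, the underscore separator, and the two-digit version number are all assumptions.

```python
from datetime import date


def build_filename(project, description, version, created, creator, status, extension):
    """Join the naming elements with underscores, in lowercase and without spaces."""
    parts = [
        project,
        description,
        f"v{version:02d}",      # zero-padded so v02 sorts before v10
        created.isoformat(),    # YYYY-MM-DD
        creator,
        status,
    ]
    stem = "_".join(part.lower().replace(" ", "-") for part in parts)
    return f"{stem}.{extension}"


name = build_filename("rdm", "interview transcripts", 2,
                      date(2004, 3, 7), "js", "draft", "txt")
print(name)  # rdm_interview-transcripts_v02_2004-03-07_js_draft.txt
```

Whatever order you choose, writing it down once (in the DMP or a readme) matters more than the specific order.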


Good practice in file and folder naming, according to the UC Santa Barbara Data Management Best Practices Evaluation Checklist, includes the following attributes:

  • Uniquely name each file.
  • Be consistent and include similar information in all file names of the same file type.
  • Consider sorting order (usually lexicographic) and logical hierarchies in file directories. 
  • Avoid ambiguous and confusing names, such as 'MyData' or 'sample'
  • Derivatives and versions should have similar (but differentiated) names to keep them co-located but still uniquely identified.
  • Names should reflect the contents of the file and/or the stage of development.
  • When using dates, if you want the files to sort chronologically, put the year first and use numerical two-digit months and days (YYYY-MM-DD). (Example: March 7, 2004 would be written '2004-03-07'.)
  • Use only alphanumeric characters but use dashes (-) or underscores (_) instead of spaces; avoid special characters such as colons (:) and slashes (/).
  • Avoid using case differences to distinguish between files: ‘Record’, ‘record’, and ‘RECORD’ may be three different file names or the same file name, depending on the operating system.
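Several items in the checklist above can be checked automatically. The following Python sketch covers three of them: permitted characters, names that differ only by case, and the sorting behaviour of YYYY-MM-DD dates. The function names and the regular expression are illustrative assumptions; the pattern allows a single extension only, so a name such as "data.tar.gz" would be rejected.

```python
import re

# Letters, digits, dashes and underscores, plus one optional extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)?")


def has_only_safe_characters(name):
    """True if the name avoids spaces and special characters."""
    return SAFE_NAME.fullmatch(name) is not None


def case_collisions(names):
    """Group names that would clash on a case-insensitive file system."""
    seen = {}
    for name in names:
        seen.setdefault(name.lower(), []).append(name)
    return [group for group in seen.values() if len(group) > 1]


print(has_only_safe_characters("2004-03-07_survey.csv"))  # True
print(has_only_safe_characters("my data: final.csv"))     # False
print(case_collisions(["Record", "record", "RECORD", "notes"]))
# [['Record', 'record', 'RECORD']]

# Year-first dates sort chronologically under plain lexicographic sorting.
print(sorted(["2004-03-07_survey.csv", "2003-12-25_survey.csv", "2004-01-15_survey.csv"]))
# ['2003-12-25_survey.csv', '2004-01-15_survey.csv', '2004-03-07_survey.csv']
```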

Similar to consistent file naming conventions, a meaningful folder structure is a key element of project and data management and will make it much easier for you to locate and organise relevant documents. This is particularly important if you are working as part of a larger research group where many people will be accessing the files over the course of the project.

The folder structure strategy you implement will depend on the plan and organisation of the project, in addition to your own personal preferences. All material relevant to the data should be entered into the data folders, including detailed information on the data collection and data processing procedures. It is recommended to limit the level of folders to three or four deep and to limit the number of items in each list to fewer than ten.
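If you want to check an existing project against the depth and item-count recommendation, a short Python sketch such as the one below can walk the folder tree and report exceptions. It assumes that "each list" means the contents of a single folder, and it treats the thresholds (four levels, nine items) as adjustable parameters; the function name is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def folder_report(root, max_depth=4, max_items=9):
    """Return (folder, problem) pairs for folders that exceed the limits."""
    root = Path(root)
    problems = []
    for current, dirnames, filenames in os.walk(root):
        # Depth counts folder levels below the project root (root itself is 0).
        depth = len(Path(current).relative_to(root).parts)
        if depth > max_depth:
            problems.append((current, f"nested {depth} levels deep"))
        count = len(dirnames) + len(filenames)
        if count > max_items:
            problems.append((current, f"contains {count} items"))
    return problems


# Demonstration on a throwaway tree with one folder nested five levels deep.
with tempfile.TemporaryDirectory() as project:
    os.makedirs(os.path.join(project, "a", "b", "c", "d", "e"))
    for folder, problem in folder_report(project):
        print(os.path.relpath(folder, project), "->", problem)
```

The limits are guidance rather than hard rules, so a report like this is best read as a prompt to reconsider the structure, not as an error list.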


Good practices in folder organisation according to the UC Santa Barbara Data Management Best Practices Evaluation Checklist are: 

  • A logical and organised folder structure can make it easier to keep track of project information.
  • Avoid complex directory hierarchies and consider that folder names will sort alphabetically.
  • Avoid keeping duplicate working copies of files (backup copies are not considered duplicates in this context).
  • Develop a file and folder naming convention and document it so all team members can follow it.

For additional guidance see the CESSDA Data Management Expert Guide: File naming and folder structure.

Managing different versions of your data can be tricky, but version control is a key step in good research data management, and project management overall. You should always keep original versions of data files, or keep documentation that allows the reconstruction of original files.

All changes to the original versions should be documented, and this can be achieved in several ways. Choose the options that work best for your research data:

  • Use a systematic naming convention to identify different file versions
  • Record the date within the file name: 20190902_documentation_for_my_data
  • Include a version number in the file name: Documentation_v2
  • Include information about the status of the file, e.g. "draft" or "final," as long as you don't end up with confusing names like "final2" or "final_revised".
  • Include information about what changes were made, e.g. "cropped" or "normalized".
  • Use version control facilities within the software you use
  • Use file-sharing services with incorporated version control e.g. GitHub
  • Design and use a version control table
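A version control table can be as simple as a CSV file with one row per version. The sketch below shows one possible layout in Python; the column names and the sample entries are assumptions for illustration, and you may need further columns (for example, the file name or who approved the change).

```python
import csv
import io

# Hypothetical columns for a minimal version control table.
COLUMNS = ["version", "date", "author", "changes"]


def add_version(table, version, date, author, changes):
    """Append one row; refuse to record the same version twice."""
    if any(row["version"] == version for row in table):
        raise ValueError(f"version {version} is already recorded")
    table.append(dict(zip(COLUMNS, (version, date, author, changes))))


def table_to_csv(table):
    """Return the table as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(table)
    return buffer.getvalue()


log = []
add_version(log, "v1", "2019-09-02", "JS", "original file as collected")
add_version(log, "v2", "2019-09-10", "JS", "normalized")
print(table_to_csv(log))
```

Storing the table next to the data files keeps the record of changes together with the versions it describes.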

For more information see the CESSDA Data Management Expert Guide: File naming and folder structure.

More resources

Metadata

  • What is metadata? Sarah Morgan, Scientific Training Coordinator at EMBL-EBI, discusses what metadata is and why it is important to keep track of this information in biological experiments. Watch here
  • A Christmas-themed exploration of metadata, published by Scientific American: What is metadata?
  • NIH definition of Data Description and Annotation
  • NISO Metadata Primer provides introductory overview of metadata tools, best practices, and resources.
  • Dublin Core is a widely used metadata schema for describing library resources: https://www.dublincore.org
  • DataCite is a metadata schema designed specifically to describe data: http://schema.datacite.org

Guidance on creating a ReadMe file for metadata

  • Short guide for creating ReadMe metadata from Cornell University
  • Guidance on how to create a ReadMe file from TU Delft
  • Markdown is a lightweight markup language that you can use to add formatting to plain text documents. This can be useful in documenting research data. Learn more and access a Markdown cheat sheet at markdownguide.org