Research Data Management

Introduction to documentation

When developing your Data Management Plan, the second topic you should address is "Documentation and data quality". This broadly encompasses two main questions: 


1. What metadata and documentation will accompany data?

  • What metadata will you provide with your data to help others identify and discover the data?
  • Researchers are strongly encouraged to use community metadata standards, where these exist. The data repository you are planning to deposit with may also provide guidance about appropriate metadata standards.
  • What other documentation is needed to enable reuse? This may include information on the methodology used to collect the data, analytical and procedural information, definitions of variables, units of measurement, any assumptions made, the format and file type of the data, and the software used to collect and/or process the data.
  • How will you capture this information, and where will it be recorded (e.g., in a database with links to each item, in a ‘readme’ text file, or in file headers)?

2. What data quality control measures will be used?

  • How will the consistency and quality of data collection be controlled and documented? This may include processes such as calibration, repeat samples or measurements, standardised data capture, data entry validation, peer review of data or representation with controlled vocabularies.
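
As a rough illustration of data entry validation, the sketch below rejects readings that are not numeric or that fall outside a plausible range. The variable (a temperature reading), the range limits and the function name are hypothetical and chosen only for this example; real limits should come from your own instrument or protocol documentation.

```python
from typing import Optional

# Hypothetical plausible range for an illustrative temperature variable (degrees Celsius).
VALID_RANGE = (-50.0, 60.0)


def validate_reading(raw: str) -> Optional[float]:
    """Return the reading as a float, or None if it fails validation."""
    try:
        value = float(raw)
    except ValueError:
        # Not a number at all (e.g. a typing error during data entry).
        return None
    low, high = VALID_RANGE
    # Out-of-range values (and NaN, which fails every comparison) are rejected.
    return value if low <= value <= high else None


print(validate_reading("21.5"))
print(validate_reading("abc"))
```

Rejected entries can then be logged and reviewed, which also documents how quality was controlled.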

 

What is documentation? According to the National Library of Medicine:

Data description and annotation refers to the process of explaining, contextualizing, and documenting data. This process is vital to ensuring that other researchers can properly understand the data to use it for secondary analysis or replication. Data description and annotation can be accomplished in a series of ways. For example, one can describe data using metadata and formal metadata schemas, like the DataCite Metadata Schema. One can also describe or annotate data by creating documentation, such as a data dictionary.

What metadata will you add?

According to the National Library of Medicine: 

Metadata is information that describes, explains, locates, classifies, contextualizes, or documents an information resource. It is what enables you to search for books in your local library catalog, videos on YouTube, or find journal articles through PubMed. It is also what can help manage data, by tracking attributes like data provenance and versioning. Metadata can be used to describe all types of information sources. 

In research data management, metadata is used to describe the content, formats, and internal relationships of a dataset. Metadata can also be used to make the data findable and citable by others.


What is metadata? Sarah Morgan, Scientific Training Coordinator at EMBL-EBI, discusses what metadata is and why it is important to keep track of this information in biological experiments. Watch here


The different types of metadata: There are many different types of metadata, and each type supports different use cases. Sometimes all of these types of metadata will be included with a single dataset. The table below, from the NISO Metadata Primer, provides a description of each.

 
Metadata Type | Example Properties | Primary Uses
Descriptive metadata | Title, Author, Subject, Genre, Publication date | Discovery, Display, Interoperability
Technical metadata | File type, File size, Creation date/time, Compression scheme | Interoperability, Digital object management, Preservation
Preservation metadata | Checksum, Preservation event | Interoperability, Digital object management, Preservation
Rights metadata | Copyright status, License terms, Rights holder | Interoperability, Digital object management
Structural metadata | Sequence, Place in hierarchy | Navigation
Markup languages | Paragraph, Heading, List, Name, Date | Navigation, Interoperability

Metadata should be open: To comply with the FAIR Principles, metadata should be accessible, even if the data themselves cannot be shared openly. You can provide lots of descriptive information about the qualities of your dataset, without the need to share the data publicly.


Start early! Do not leave the metadata for the very end of your project but start to plan how you’ll document the data as early as possible. Remember to include procedures for documentation in your data management planning.

  • Think about the information that is needed in order to understand the data. What will other users of the data need in order to understand your data?
  • Create a metadata file that includes the basic information about the data. You can also create similar files for each dataset (see following guidance on project-level and data file-level metadata).
  • Plan where to deposit the data after the completion of the project. The data repository probably follows a specific metadata standard that you can adopt.
  • Document consistently throughout the project. Metadata gives contextual information about your dataset(s). It specifies the aims and objectives of the original project and harbours explanatory material including the data source, data collection methodology and process, dataset structure and technical information.

The following guidance explains how to start creating your metadata.

A simple way to start creating metadata is to create a ReadMe file which you save alongside your data. ReadMe files are the most basic form of metadata and at a minimum should be included at folder level, ideally at file level if appropriate. Where no appropriate standard exists, for internal use, writing “readme” style metadata is an appropriate strategy. You should save this file in an open text format, such as a text file (.txt format), so that it remains accessible in the long term. The ReadMe file should contain key information about the research study, including the provenance of the data, and any license information on how the data may be used. DMPTool provides a list of general information that you should document as metadata.

 

General Overview

  • Title: Name of the dataset or research project that produced it

  • Creator: Names and addresses of the organizations or people who created the data; preferred format for personal names is surname first (e.g., Smith, Jane)

  • Identifier: Unique number used to identify the data, even if it is just an internal project reference number

  • Date: Key dates associated with the data, including: project start and end date; release date; time period covered by the data; and other dates associated with the data lifespan, such as maintenance cycle, update schedule; preferred format is yyyy-mm-dd, or yyyy.mm.dd-yyyy.mm.dd for a range

  • Method: How the data were generated, listing equipment and software used (including model and version numbers), formulae, algorithms, experimental protocols, and other things one might include in a lab notebook

  • Processing: How the data have been altered or processed (e.g., normalized)

  • Source: Citations to data derived from other sources, including details of where the source data is held and how it was accessed

  • Funder: Organizations or agencies who funded the research

 

Content Description

  • Subject: Keywords or phrases describing the subject or content of the data

  • Place: All applicable physical locations

  • Language: All languages used in the dataset

  • Variable list: All variables in the data files, where applicable

  • Code list: Explanation of codes or abbreviations used in either the file names or the variables in the data files (e.g. "999 indicates a missing value in the data")
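
A code list becomes more useful when it is applied consistently in analysis scripts. The sketch below assumes the "999 indicates a missing value" convention from the example above; the column names and the function name are hypothetical.

```python
# Codes documented in the ReadMe code list as meaning "missing value".
MISSING_CODES = {"999"}


def decode_missing(row: dict) -> dict:
    """Replace documented missing-value codes with None, leaving other values untouched."""
    return {key: (None if value in MISSING_CODES else value) for key, value in row.items()}


# Hypothetical row read from a data file as text.
print(decode_missing({"age": "34", "income": "999"}))
```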

 

Technical Description

  • File inventory: All files associated with the project, including extensions (e.g. "NWPalaceTR.WRL", "stone.mov")

  • File formats: Formats of the data, e.g., FITS, SPSS, HTML, JPEG, etc.

  • File structure: Organization of the data file(s) and layout of the variables, where applicable

  • Version: Unique date/time stamp and identifier for each version

  • Checksum: A digest value computed for each file that can be used to detect changes; if a recomputed digest differs from the stored digest, the file must have changed

  • Necessary software: Names of any special-purpose software packages required to create, view, analyze, or otherwise use the data
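
The checksum item above can be sketched with Python's standard `hashlib` module. SHA-256 is used here as one common choice of digest; the guidance itself does not prescribe an algorithm, and the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path: Path, stored_digest: str) -> bool:
    """Return True if the recomputed digest differs from the one recorded earlier."""
    return sha256_of_file(path) != stored_digest
```

A typical workflow would record the digest of each file in the ReadMe when the data are finalised, then recompute it after any transfer or restore from backup.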

 

Access

  • Rights: Any known intellectual property rights, statutory rights, licenses, or restrictions on use of the data

  • Access information: Where and how your data can be accessed by other researchers
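
One possible way to get started is to generate a ReadMe skeleton containing the fields listed above, so that none are forgotten. The sketch below is only an illustration: the placeholder text, the upper-case section headings and the function name are assumptions, not part of the DMPTool list.

```python
# Field names taken from the list above, grouped by section.
README_SECTIONS = {
    "General Overview": ["Title", "Creator", "Identifier", "Date", "Method",
                         "Processing", "Source", "Funder"],
    "Content Description": ["Subject", "Place", "Language", "Variable list", "Code list"],
    "Technical Description": ["File inventory", "File formats", "File structure",
                              "Version", "Checksum", "Necessary software"],
    "Access": ["Rights", "Access information"],
}


def build_readme(values: dict) -> str:
    """Return plain-text ReadMe content; fields without a value get a placeholder."""
    lines = []
    for section, fields in README_SECTIONS.items():
        lines.append(section.upper())
        for field in fields:
            lines.append(f"{field}: {values.get(field, '[to be completed]')}")
        lines.append("")  # blank line between sections
    return "\n".join(lines)


# Hypothetical values for illustration only.
print(build_readme({"Title": "Example survey", "Creator": "Smith, Jane"}))
```

The returned text can be written to a `.txt` file saved alongside the data, in line with the advice to use an open text format.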


You might also want to create metadata at the file-level, to provide key information about each data file. File-level or object-level metadata provides information at the level of individual files such as a database of tabular data, a text file of an interview transcript, or an image or audio file. The following provides an example of the content you might include with each research data file. 

 

Quantitative data files e.g. spreadsheets:

  • Information about the file: Data type, file type, file format and software (including version), size, data processing scripts and algorithms used to transform data where relevant.

  • Information about the variables: Names, labels and descriptions of variables, their values, a description of derived variables etc. Variable labels should be brief and indicate the unit of measurement if appropriate.
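
Information about variables is often kept in a data dictionary. As a minimal sketch, the example below writes one as CSV text using Python's standard `csv` module. The column headings and the two example variables are hypothetical; adapt them to whatever your discipline or repository expects.

```python
import csv
import io

# Hypothetical variables for illustration only.
VARIABLES = [
    {"name": "temp_c", "label": "Air temperature", "unit": "degrees Celsius",
     "values": "numeric; 999 = missing"},
    {"name": "site_id", "label": "Sampling site", "unit": "",
     "values": "text code, see code list"},
]


def data_dictionary_csv(rows: list) -> str:
    """Serialise variable descriptions as CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "label", "unit", "values"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(data_dictionary_csv(VARIABLES))
```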

 

Qualitative data files e.g. text transcripts:

  • Information about the file: Data type, file type, file format and software (including version), size, information on processing or preparation of the file such as approach to anonymisation of text

  • Information about the interview: relevant contextual information about the interview/focus group with due consideration for participant confidentiality.

Guidance on creating a ReadMe file

Knowing what metadata to capture to make your data as useful as possible can be a challenge. However, many disciplinary communities have a formal standard, i.e. an agreed method for documenting data from that discipline. The value of following a metadata standard when you create your documentation is that you can be confident that you are providing the essential information with your data to maximise its re-use potential.

Researchers can document their data according to various metadata standards. Some metadata standards are designed for the purpose of documenting the contents of files, others for documenting the technical characteristics of files, and yet others for expressing relationships between files within a set of data. If you want to be able to share or publish your data, the DataCite metadata standard is of particular significance.

You can go one step further and prepare your metadata in a structured form so that it is "machine-readable". This means that computers can also understand and process the metadata. If you use a trusted data repository to store and share your data, in many cases, the repository will provide you with a template (or form) to complete. The content that you add to this template will be made available by the repository in a machine-readable format. At a minimum, the data repository will guide you on the required metadata standard to deposit data with them, so it can be a good place to start if you're not quite sure what standard to follow. 
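
To give a feel for what "machine-readable" means in practice, the sketch below stores a small metadata record as JSON. The field names loosely echo properties commonly required for dataset citation (identifier, creator, title, publisher, publication year, resource type), but this is not a valid DataCite record, and all values are hypothetical. A repository's own template or schema documentation remains the authoritative reference.

```python
import json

# Hypothetical record for illustration; not validated against any metadata schema.
RECORD = {
    "identifier": "PROJECT-0001",
    "creators": ["Smith, Jane"],
    "title": "Example dataset",
    "publisher": "Example University",
    "publicationYear": "2024",
    "resourceType": "Dataset",
}


def to_json(record: dict) -> str:
    """Serialise the record so that software can parse it back unambiguously."""
    return json.dumps(record, indent=2, sort_keys=True)


print(to_json(RECORD))
```

Because the structure is explicit, a program can read the record back with `json.loads` and retrieve each field by name, which is not reliably possible with free-text notes.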

An alternative approach is to browse a directory of metadata standards by discipline.

You should always aim to prepare metadata in a standard that is suitable for the type of research data that you generate. This means that you will provide the right type of information with the data, and enough specificity, to ensure it is genuinely reusable. 

The following tools can be useful for identifying a suitable metadata standard for the research you want to describe. 

How will data be organised?

Research data files and folders should be labelled and organised in a systematic and consistent way so that they are easy to find, both for you and others in your research team. As research becomes more collaborative, it is essential to keep all file names consistent within a research project and to keep track of changes and edits to files via the file name. All researchers involved in a project should follow the same file naming conventions, and file names should be independent of the location of the file on a computer. It is generally recommended that file and folder names be concise, but informative enough to describe the contents of the file.


How to choose file names: Here are some common elements to consider when naming your files:

  • Project number or acronym
  • Description of content
  • Version number
  • Date of creation (date format should be YYYY-MM-DD)
  • Name or initials of creator
  • Status information (e.g. draft)

In addition, it is also recommended to use lowercase letters and avoid spaces when naming files. For more information see the CESSDA Data Management Expert Guide: File naming and folder structure.
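
The elements above can be combined programmatically so that every team member produces names in the same pattern. The sketch below is one possible convention, not a prescribed one: the order of elements, the underscore separator, the two-digit version number and the function name are all assumptions.

```python
from datetime import date


def build_file_name(project: str, description: str, version: int, created: date,
                    creator: str, status: str, extension: str) -> str:
    """Join the naming elements in a fixed order, lowercased and without spaces."""
    parts = [project, description, f"v{version:02d}", created.isoformat(), creator, status]
    # Lowercase each element and replace spaces with dashes, as recommended above.
    stem = "_".join(part.strip().lower().replace(" ", "-") for part in parts)
    return f"{stem}.{extension.lower()}"


# Hypothetical values for illustration only.
print(build_file_name("RDM", "interview transcript", 2, date(2004, 3, 7), "JS", "draft", "txt"))
# prints "rdm_interview-transcript_v02_2004-03-07_js_draft.txt"
```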


Good practice in file and folder naming (from the UC Santa Barbara Data Management Best Practices Evaluation Checklist):

  • Uniquely name each file.
  • Be consistent and include similar information in all file names of the same file type.
  • Consider sorting order (usually lexicographic) and logical hierarchies in file directories. 
  • Avoid ambiguous and confusing names, such as 'MyData' or 'sample'
  • Derivatives and versions should have similar (but differentiated) names to keep them co-located but still uniquely identified.
  • Names should reflect the contents of the file and/or the stage of development.
  • When using dates, if you want the files to sort chronologically, put the year first and use numerical two-digit months and days (YYYY-MM-DD). (Example: March 7, 2004 would be written '2004-03-07'.)
  • Use only alphanumeric characters but use dashes (-) or underscores (_) instead of spaces; avoid special characters such as colons (:) and slashes (/).
  • Avoid using case differences to distinguish between files: ‘Record’, ‘record’, and ‘RECORD’ may be three different file names or the same file name, depending on the operating system.
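
Some of the points above can be checked automatically. The following sketch flags names that contain characters other than letters, digits, dashes and underscores, and finds names that differ only by case. The regular expression is a simplifying assumption (it permits a single extension and no other dots), and the function names are illustrative.

```python
import re

# Letters, digits, dashes and underscores, followed by at most one extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)?")


def is_safe_name(name: str) -> bool:
    """Return True if the name avoids spaces and special characters."""
    return SAFE_NAME.fullmatch(name) is not None


def case_collisions(names: list) -> list:
    """Group names that would be identical on a case-insensitive file system."""
    groups = {}
    for name in names:
        groups.setdefault(name.lower(), []).append(name)
    return [group for group in groups.values() if len(group) > 1]


print(is_safe_name("2004-03-07_survey_v02.csv"))
print(is_safe_name("My Data:final.csv"))
print(case_collisions(["Record.txt", "record.txt", "notes.txt"]))
```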

Similar to consistent file naming conventions, a meaningful folder structure is a key element of project and data management and will make it much easier for you to locate and organise relevant documents. This is particularly important if you are working as part of a larger research group where many people will be accessing the files over the course of the project.

The folder structure strategy you implement will depend on the plan and organisation of the project, in addition to your own personal preferences. All material relevant to the data should be entered into the data folders, including detailed information on the data collection and data processing procedures. It is recommended to limit the level of folders to three or four deep and to limit the number of items in each list to less than ten.
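
The depth and item-count recommendations above can be checked with a short script. The sketch below walks a project folder with the standard `os.walk` function and reports folders that exceed the limits; the default thresholds (four levels, fewer than ten items) follow the recommendation above, while the function name and message wording are assumptions.

```python
import os


def structure_warnings(root: str, max_depth: int = 4, max_items: int = 9) -> list:
    """List folders that are nested too deeply or hold too many items."""
    warnings = []
    root = os.path.abspath(root)
    for current, dirs, files in os.walk(root):
        relative = os.path.relpath(current, root)
        # The root itself is depth 0; each sub-folder level adds one.
        depth = 0 if relative == "." else relative.count(os.sep) + 1
        if depth > max_depth:
            warnings.append(f"{relative}: nested {depth} levels deep")
        item_count = len(dirs) + len(files)
        if item_count > max_items:
            warnings.append(f"{relative}: holds {item_count} items")
    return warnings
```

Calling `structure_warnings("path/to/project")` on a project folder returns an empty list when the structure stays within both limits.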


Good practice in folder organisation (from the UC Santa Barbara Data Management Best Practices Evaluation Checklist):

  • A logical and organised folder structure can make it easier to keep track of project information.
  • Avoid complex directory hierarchies and consider that folder names will sort alphabetically.
  • Avoid keeping duplicate working copies of files (backup copies are not considered duplicates in this context).
  • Develop a file and folder naming convention and document it so all team members can follow it.

For additional guidance see the CESSDA Data Management Expert Guide: File naming and folder structure.

Managing different versions of your data can be tricky, but version control is a key step in good research data management, and project management overall. You should always keep original versions of data files, or keep documentation that allows the reconstruction of original files.


Good practice in version control: All changes to the original versions should be documented, and this can be achieved in several ways. Choose the options that work best for your research data:

  • Use a systematic naming convention to identify different file versions
  • Record the date within the file name: 20190902_documentation_for_my_data
  • Include a version number in the file name: Documentation_v2
  • Include information about the status of the file, e.g. "draft" or "final," as long as you don't end up with confusing names like "final2" or "final_revised".
  • Include information about what changes were made, e.g. "cropped" or "normalized".
  • Use version control facilities within the software you use
  • Use file-sharing services with incorporated version control e.g. GitHub
  • Design and use a version control table
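
A version control table can be as simple as a plain-text log kept next to the data. The sketch below shows one possible layout; the column choice (version, date, author, changes), the class name and the example entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VersionEntry:
    version: str
    date: str      # YYYY-MM-DD
    author: str
    changes: str


def render_version_table(entries: list) -> str:
    """Render the entries as a plain-text table, one row per version."""
    header = "Version | Date | Author | Changes"
    rows = [f"{e.version} | {e.date} | {e.author} | {e.changes}" for e in entries]
    return "\n".join([header, *rows])


# Hypothetical history for illustration only.
history = [
    VersionEntry("v1", "2019-09-02", "JS", "Original raw data"),
    VersionEntry("v2", "2019-09-10", "JS", "Normalized measurement units"),
]
print(render_version_table(history))
```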

For more information see the CESSDA Data Management Expert Guide: File naming and folder structure.