Research Data Management

Documentation and Data Quality

According to Science Europe, when developing a data management plan, the second topic researchers are required to address is "Documentation and data quality", which broadly encompasses two main questions:


 What metadata and documentation (e.g. data collection methodology and data organisation) will accompany the data?

  • Indicate which metadata will be provided to help others identify and discover the data.
  • Indicate which metadata standards (for example DDI, TEI, EML, MARC, CMDI) will be used.
  • Use community metadata standards where these are in place.
  • Indicate how the data will be organised during the project, mentioning for example naming conventions, version control, and folder structures. Consistent, well-ordered research data will be easier to find, understand, and re-use.
  • Consider what other documentation is needed to enable re-use. This may include information on the methodology used to collect the data, analytical and procedural information, definitions of variables, units of measurement, and so on.
  • Consider how this information will be captured and where it will be recorded (for example in a database with links to each item, a ‘readme’ text file, file headers, code books, or lab notebooks).

 What data quality control measures will be used?

  • Explain how the consistency and quality of data collection will be controlled and documented. This may include processes such as calibration, repeated samples or measurements, standardised data capture, data entry validation, peer review of data, or representation with controlled vocabularies.
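
As an illustration of data entry validation against a controlled vocabulary, the following is a minimal sketch only. The field names (`occupation`, `age`), the vocabulary, and the age range are hypothetical and would need to be replaced with the variables and rules of your own study.

```python
# Hypothetical controlled vocabulary and plausible value range; adapt to your study.
ALLOWED_OCCUPATIONS = {"nurse", "physician", "researcher"}
AGE_RANGE = (18, 100)


def validate_record(record: dict) -> list[str]:
    """Return a list of human-readable problems found in one data record."""
    problems = []

    # Controlled vocabulary check: only the agreed terms are accepted.
    occupation = record.get("occupation")
    if occupation not in ALLOWED_OCCUPATIONS:
        problems.append(f"occupation {occupation!r} not in controlled vocabulary")

    # Range check: catches typing errors such as 340 instead of 34.
    age = record.get("age")
    if not isinstance(age, int) or not AGE_RANGE[0] <= age <= AGE_RANGE[1]:
        problems.append(f"age {age!r} outside {AGE_RANGE[0]}-{AGE_RANGE[1]}")

    return problems
```

Running such a check at the point of data entry, and recording which checks were applied, is one way to document how quality was controlled.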

File Naming


Research data files and folders should be labelled and organised in a systematic and consistent way so that they are easy to find, both for you and for others in your research team. As research becomes more collaborative, it is essential to keep all file names consistent within a research project and to keep track of changes and edits to files via the file name. All researchers involved in a project should follow the same file naming conventions, and file names should be independent of the location of the file on a computer. It’s generally recommended that file and folder names be concise but informative enough to describe the contents of the file. Common elements that should be considered when naming files include:

  • Project number or acronym
  • Description of content
  • Version number
  • Date of creation (date format should be YYYY-MM-DD)
  • Name or initials of creator
  • Status information (e.g. draft)

It is also recommended to use lowercase letters and avoid spaces when naming files.
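
The naming elements listed above can be combined programmatically so that every team member produces names in the same pattern. The sketch below is one possible convention, not a prescribed one: the element order, the underscore separator, the two-digit version number, and the function name `build_file_name` are all assumptions made for illustration.

```python
import re
from datetime import date


def build_file_name(project, description, version, created, creator,
                    status="", extension="csv"):
    """Assemble a lowercase, space-free file name from the common naming elements."""
    parts = [project, description, f"v{version:02d}", created.isoformat(), creator]
    if status:
        parts.append(status)
    # Lowercase each element and replace runs of whitespace with a hyphen.
    cleaned = [re.sub(r"\s+", "-", part.strip().lower()) for part in parts]
    return "_".join(cleaned) + "." + extension


name = build_file_name("RDM", "interview transcripts", 2,
                       date(2020, 12, 1), "AM", status="draft", extension="txt")
print(name)  # rdm_interview-transcripts_v02_2020-12-01_am_draft.txt
```

Because `date.isoformat()` returns the date as YYYY-MM-DD, names built this way also sort chronologically within a version.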

 

Folder Structure


Similar to consistent file naming conventions, a meaningful folder structure is a key element of project and data management and will make it much easier for you to locate and organise relevant documents. This is particularly important if you are working as part of a larger research group where many people will be accessing the files over the course of the project.

The folder structure strategy you implement will depend on the plan and organisation of the project, in addition to your own personal preferences. All material relevant to the data should be placed in the data folders, including detailed information on the data collection and data processing procedures. It is recommended to limit folder nesting to three or four levels and to limit the number of items in each folder to fewer than ten.
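
The depth and size guidelines above can be checked automatically. The following is a rough sketch that assumes the limits are four levels below the project root and at most nine items per folder; the function name `check_folder_structure` and both thresholds are illustrative and can be changed to suit your project.

```python
from pathlib import Path

MAX_DEPTH = 4   # assumed limit: "three or four deep"
MAX_ITEMS = 9   # assumed limit: "fewer than ten" items per folder


def check_folder_structure(root, max_depth=MAX_DEPTH, max_items=MAX_ITEMS):
    """Return warnings for folders that are nested too deeply or hold too many items."""
    root = Path(root)
    warnings = []
    folders = [root, *sorted(p for p in root.rglob("*") if p.is_dir())]
    for folder in folders:
        # Depth is counted in folder levels below the project root (root itself is 0).
        depth = len(folder.relative_to(root).parts)
        if depth > max_depth:
            warnings.append(f"{folder}: nested {depth} levels deep (limit {max_depth})")
        n_items = sum(1 for _ in folder.iterdir())
        if n_items > max_items:
            warnings.append(f"{folder}: contains {n_items} items (limit {max_items})")
    return warnings
```

Calling `check_folder_structure("path/to/project")` returns an empty list when the structure stays within both limits.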

 

Version Control


Managing different versions of your data can be tricky, but version control is a key step in good research data management, and project management overall. You should always keep original versions of data files, or keep documentation that allows the reconstruction of original files. All changes to the original versions should be documented, and this can be achieved in several ways:

  • Using a systematic naming convention to identify different file versions:
    • Record the date in the file name: 20190902_documentation_for_my_data
    • Include a version number in the file name: Documentation_v2
    • Include information about the status of the file, e.g. "draft" or "final", as long as you don't end up with confusing names like "final2" or "final_revised".
    • Include information about what changes were made, e.g. "cropped" or "normalized".
  • Using version control facilities within the software you use
  • Using file-sharing services with incorporated version control, e.g. GitHub
  • Designing and using a version control table
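
A version number in the file name and a version control table can be maintained together. The sketch below assumes file names carry a trailing `_vN` marker (as in Documentation_v2) and that the table records the file, date, author, and changes; the column names and the functions `next_version_name` and `record_version` are hypothetical.

```python
import re
from datetime import date

# Matches a trailing "_v<number>" placed directly before the extension or the end of the name.
VERSION_PATTERN = re.compile(r"_v(\d+)(?=\.[^.]+$|$)")


def next_version_name(file_name: str) -> str:
    """Return file_name with its trailing _vN number increased by one."""
    match = VERSION_PATTERN.search(file_name)
    if match is None:
        raise ValueError(f"no _vN version marker found in {file_name!r}")
    digits = match.group(1)
    bumped = str(int(digits) + 1).zfill(len(digits))  # keep zero padding, e.g. v09 -> v10
    return file_name[:match.start(1)] + bumped + file_name[match.end(1):]


def record_version(table: list, file_name: str, author: str, changes: str, when: date) -> str:
    """Append a row to the version control table and return the new file name."""
    new_name = next_version_name(file_name)
    table.append({"file": new_name, "date": when.isoformat(),
                  "author": author, "changes": changes})
    return new_name


version_table = []
new_name = record_version(version_table, "documentation_v2.txt", "AM",
                          "normalized variable names", date(2019, 9, 2))
print(new_name)  # documentation_v3.txt
```

The original file is never overwritten in this scheme: each change produces a new name and a new row describing what was done.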

For more information please see the CESSDA Data Management Expert Guide: File naming and folder structure.

Tips for Creating Documentation


  • Start early! Careful planning of your documentation at the beginning of your project helps you save time and effort. Do not leave the documentation for the very end of your project. Remember to include procedures for documentation in your data management planning.
  • Think about the information that is needed in order to understand the data. What will other researchers and re-users need in order to understand your data?
  • Create a separate documentation file for the data that includes the basic information about the data. You can also create similar files for each data set.
  • Plan where to deposit the data after the completion of the project. The repository probably follows a specific metadata standard that you can adopt.
  • Document consistently throughout the project. Data documentation gives contextual information about your dataset(s). It specifies the aims and objectives of the original project and contains explanatory material including the data source, data collection methodology and process, dataset structure and technical information.

For more information please see the CESSDA Data Management Expert Guide: Documentation and metadata.

 

Project-level Documentation


The project-level documentation explains the aims of the study, the research questions or hypotheses, the methodologies used, the instruments and measures used, and so on. The questions that your project-level documentation should answer are:

  • For what purpose was the data created?
  • What does the dataset contain?
  • How was the data collected?
  • Who collected the data and when?
  • How was the data processed?
  • What possible manipulations were done to the data?
  • What were the quality assurance procedures?
  • How can the data be accessed?

 

Data-level Documentation


Data-level or object-level documentation provides information at the level of individual objects such as pictures, interview transcripts, or variables in a database. You can embed data-level information in data files. For example, for interviews it is best to write down the contextual and descriptive information about each interview at the beginning of each file. For quantitative data, variable and value names can be embedded within the data file itself.


For quantitative data, the following information is needed:

  • Information about the data file: data type, file type and format, size, and data processing scripts.
  • Information about the variables in the file: the names, labels and descriptions of variables, their values, a description of derived variables, etc. Variable labels should be brief and indicate the unit of measurement if appropriate.
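
Variable-level information of this kind is often kept in a code book. The following is a minimal sketch of how one might be generated; the `Variable` class, its fields, and the example variables are illustrative assumptions rather than a standard format.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    name: str                                   # variable name as used in the data file
    label: str                                  # brief human-readable label
    unit: str = ""                              # unit of measurement, if appropriate
    values: dict = field(default_factory=dict)  # value codes and their meanings


def codebook_lines(variables):
    """Render one line per variable, followed by indented lines for its value codes."""
    lines = []
    for var in variables:
        unit = f" ({var.unit})" if var.unit else ""
        lines.append(f"{var.name}: {var.label}{unit}")
        for code, meaning in var.values.items():
            lines.append(f"    {code} = {meaning}")
    return lines


variables = [
    Variable("age", "Age of participant", unit="years"),
    Variable("gender", "Gender of participant", values={1: "female", 2: "male"}),
]
print("\n".join(codebook_lines(variables)))
```

The resulting text can be saved alongside the data file or pasted into a readme file.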


For qualitative data, the following information is needed:

  • Textual data file (for example, interview)
    • Key information on participants, such as age, gender, occupation, location, and relevant contextual information.
    • For qualitative data collections, you may wish to provide a data list that enables others to identify and locate relevant items within the collection. The list contains key biographical characteristics and thematic features of participants, such as age, gender, occupation or location, and identifying details of the data items.

For more information please see the CESSDA Data Management Expert Guide: Documentation and metadata.

Metadata


According to the UK Data Service: "metadata can describe the content, context and provenance of datasets in a standardised and structured manner, typically describing the purpose, origin, temporal characteristics, geographic location, authorship, access conditions and terms of use of a dataset". Rich metadata enhance the findability, interoperability and reusability of your data. To comply with the FAIR Principles, metadata should be accessible wherever possible, even if the data themselves are not accessible. Metadata are intended to be machine-readable, but in many cases you do not need to generate them yourself. When you submit data to a trusted data repository or archive, the archive will often generate machine-readable metadata for you, or provide you with a template or required standard you must use. If not, you should follow relevant disciplinary standards and controlled vocabularies.

 

Finding a Metadata Standard for your Discipline


 

Dublin Core Metadata Standard (Example)


Dublin Core is a metadata standard comprising 15 “core” metadata elements (outlined below). It is one of the simplest and most widely used metadata schemas. Built into the Dublin Core standard are definitions of each metadata element that state what kinds of information should be recorded where and how. Associated with many of the data elements are suggested controlled vocabularies. You can create machine-readable metadata using the Dublin Core Metadata Generator. This useful tool can create both simple and advanced metadata, which are converted into a machine-readable *.xml file.

 

| Dublin Core Element | Definition | Example |
| --- | --- | --- |
| Title | The name given to the resource. Typically, a Title will be a name by which the resource is formally known. | A Nurse's Guide to Cancer Research |
| Creator | An entity primarily responsible for making the content of the resource. Examples of a Creator include a person, an organization, or a service. | Murphy, Aine |
| Date | A date associated with an event in the life cycle of the resource. Typically, Date will be associated with the creation or availability of the resource. Recommended best practice for encoding the date value is defined in a profile of ISO 8601 [Date and Time Formats, W3C Note] and follows the YYYY-MM-DD format. | 2020-12-01 |
| Description | An account of the content of the resource. Description may include but is not limited to: an abstract, table of contents, reference to a graphical representation of content or a free-text account of the content. | Illustrated guide to cancer research and funding, with particular reference to the role of nurses |
| Rights | Information about rights held in and over the resource. Typically a Rights element will contain a rights management statement for the resource, or reference a service providing such information. Rights information often encompasses Intellectual Property Rights (IPR), Copyright, and various Property Rights. | Access limited to members |
| Type | The nature or genre of the content of the resource. Type includes terms describing general categories, functions, genres, or aggregation levels for content. Recommended best practice is to select a value from a controlled vocabulary (for example, the DCMIType vocabulary). | Text |
| Language | A language of the intellectual content of the resource. Recommended best practice for the values of the Language element is defined by RFC 3066 which, in conjunction with ISO 639, defines two- and three-letter primary language tags with optional subtags. | en-GB |
| Contributor | An entity responsible for making contributions to the content of the resource. Examples of a Contributor include a person, an organization or a service. | Murphy, Aine |
| Relation | A reference to a related resource. Recommended best practice is to reference the resource by means of a string or number conforming to a formal identification system. | 2019 book "Oncology Nurse Navigation" by Lillie D. Shockney |
| Source | A reference to a resource from which the present resource is derived. The present resource may be derived from the Source resource in whole or part. Recommended best practice is to reference the resource by means of a string or number conforming to a formal identification system. | Interviews with Irish nurses between 2005-2015 in the Irish Social Science Data Archive (ISSDA) |
| Coverage | The extent or scope of the content of the resource. Coverage will typically include spatial location (a place name or geographic co-ordinates), temporal period (a period label, date, or date range) or jurisdiction (such as a named administrative entity). Recommended best practice is to select a value from a controlled vocabulary (for example, the Thesaurus of Geographic Names). | Dublin, Ireland. 2005-2015 |
| Subject | The topic of the content of the resource. Typically, a Subject will be expressed as keywords or key phrases or classification codes that describe the topic of the resource. Recommended best practice is to select a value from a controlled vocabulary or formal classification scheme. | Cancer research and funding |
| Identifier | An unambiguous reference to the resource within a given context. Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system. Examples of formal identification systems include the Uniform Resource Identifier (URI) (including the Uniform Resource Locator (URL)), the Digital Object Identifier (DOI) and the International Standard Book Number (ISBN). | ISBN:0385424728 |
| Format | The physical or digital manifestation of the resource. Typically, Format may include the media-type or dimensions of the resource. Format may be used to determine the software, hardware or other equipment needed to display or operate the resource. | Book, 1989 pages |
| Publisher | The entity responsible for making the resource available. Examples of a Publisher include a person, an organization, or a service. | RCSI |
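
To show what a machine-readable Dublin Core record might look like, the sketch below serialises a few of the example values from the table above to XML using Python's standard library. It is not the output format of the Dublin Core Metadata Generator: the wrapper element name `metadata` and the function name `dublin_core_xml` are assumptions, while the namespace URI is the one published for the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15-element Dublin Core set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_xml(record: dict) -> str:
    """Serialise a mapping of Dublin Core element names to values as an XML string."""
    root = ET.Element("metadata")  # wrapper element name is an arbitrary choice
    for element, value in record.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


record = {
    "title": "A Nurse's Guide to Cancer Research",
    "creator": "Murphy, Aine",
    "date": "2020-12-01",
    "language": "en-GB",
    "type": "Text",
}
xml_text = dublin_core_xml(record)
print(xml_text)
```

Each key becomes a `dc:`-prefixed element, so the record can be parsed by any XML-aware tool; a repository may still require its own wrapper or schema.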