LibGuides: Research Data Management: Data Collection and Description

Introduction to data description

According to Science Europe, when developing a data management plan, the first area you should address is "Data description and collection or re-use of existing data", which broadly encompasses two main questions:

1. What data will be collected or produced?

Give details on the kind of data: for example numeric (databases, spreadsheets), textual (documents), image, audio, video, and/or mixed media.
What data formats will be used (for example pdf, csv, txt or rdf)?
Justify the use of certain formats. For example, decisions may be based on staff expertise within the host organisation, a preference for open formats, standards accepted by data repositories, widespread usage within the research community, or on the software or equipment that will be used.
Give preference to open and standard formats as they facilitate sharing and long-term re-use of data.
Give details on the volumes (they can be expressed as the storage space required, in bytes, and/or as the number of objects or files).

2. How will new data be collected or produced and/or how will existing data be re-used?

Explain which methodologies or software will be needed to create/process/visualise the data.
State any constraints on re-use of existing data if there are any.
Explain how data provenance will be documented.
Briefly state the reasons if the re-use of any existing data sources has been considered but discarded.
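One possible way to document provenance for a reused file is to keep a small machine-readable record alongside it. The sketch below is only an illustration, not a required format: the function name `provenance_record`, the field names and the example values are all assumptions made for this example. It records where the data came from, the licence, the retrieval date and a SHA-256 checksum so that you can later confirm the file has not changed.

```python
import hashlib
import json
from datetime import date


def provenance_record(data: bytes, source: str, licence: str, retrieved: date) -> dict:
    """Build a small provenance record for a reused dataset (illustrative fields only)."""
    return {
        "source": source,
        "licence": licence,
        "retrieved": retrieved.isoformat(),
        "size_bytes": len(data),
        # The checksum lets you verify later that the file is unchanged.
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# Hypothetical example values, for illustration only.
example_bytes = b"participant_id,age\n1,34\n2,51\n"
record = provenance_record(
    example_bytes,
    source="Example repository (hypothetical)",
    licence="CC BY 4.0",
    retrieved=date(2024, 1, 15),
)
print(json.dumps(record, indent=2))
```

In practice you would read the bytes from the downloaded file and store the record next to it, for example as a JSON or plain text file.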

What data will be collected or produced?

In this section of the Data Management Plan you should give a summary of the data you will collect or create, noting their content, coverage and type. According to the guidance from DMPTool, data can come in many forms, including:

Text: field or laboratory notes, survey responses
Numeric: tables, counts, measurements
Audiovisual: images, sound recordings, video
Models, computer code
Discipline-specific: such as CIF in chemistry
Instrument-specific: such as outputs from lab equipment

Why do you need to think about file formats? When choosing an electronic file format to create and store data, it's important to consider whether the format is open and/or ubiquitous. The format you use determines how accessible the data are to other users, as some files can only be opened if you have a licence for the software that created them. File format also determines how accessible the data will be to yourself and others in the future: technology evolves quickly, and the software that you use today will become obsolete in time.

Why use open file formats? File formats that are open or non-proprietary will tend to remain accessible, even if the software that created them is no longer available. However, formats which are ubiquitous or have become the default standard within a discipline, whether proprietary or not, are also likely to be maintained into the future.

What if you have a preferred software? If you find it necessary or convenient to work with a proprietary format, it may be useful to store your data using that format for data collection and analysis, while also storing a copy in an open or accessible format for sharing or archiving once your project is complete.

Which format is best for FAIR data? Many data archives and repositories will already have recommended file formats based on best practice within the disciplines they support.

When choosing a file format you should consider the following:

How you plan to analyse your data
Which software and file formats you and your colleagues have used in the past
Any discipline-specific norms or technical standards
Whether file formats are at risk of obsolescence because of their dependence on a particular technology
Which formats are best to use for the long-term preservation of data
Whether important information might be lost by converting between different formats

File formats likely to be accessible into the future (from DMPTool Guidance):

Non-proprietary
Open, with documented standards
In common usage by the research community
Using standard character encodings (e.g., ASCII, UTF-8)
Uncompressed (space permitting)
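If you are unsure whether a text file uses one of these standard encodings, a quick check can help before you deposit it. The following is a minimal sketch using only the Python standard library; the function names `is_valid_utf8` and `is_plain_ascii` are made up for this example, and the byte strings are hypothetical sample data.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8 without errors."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_plain_ascii(raw: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range."""
    return raw.isascii()


# Hypothetical sample data: the same word saved with two different encodings.
utf8_bytes = "café".encode("utf-8")
latin1_bytes = "café".encode("latin-1")

print(is_valid_utf8(utf8_bytes))    # the UTF-8 copy decodes cleanly
print(is_valid_utf8(latin1_bytes))  # the Latin-1 copy is not valid UTF-8
print(is_plain_ascii(utf8_bytes))   # contains a non-ASCII character
```

For a real file you would read its contents in binary mode (`open(path, "rb")`) and pass the bytes to these functions.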

Examples of preferred format choices (from DMPTool Guidance):

Image: JPEG, JPEG 2000, PNG, TIFF
Text: plain text (TXT), HTML, XML, PDF/A
Audio: AIFF, WAVE
Containers: TAR, GZIP, ZIP
Databases: prefer XML or CSV to native binary formats
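As an illustration of the last point, the sketch below exports a table from a binary database file format (SQLite, chosen here only because it ships with Python) to CSV. The helper name `export_table_to_csv`, the table name `survey` and the rows are hypothetical; treat this as one possible approach rather than a recommended workflow.

```python
import csv
import io
import sqlite3


def export_table_to_csv(conn: sqlite3.Connection, table: str, out) -> int:
    """Write all rows of `table` to `out` as CSV with a header row; return the row count."""
    # Table names cannot be passed as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    writer = csv.writer(out)
    writer.writerow([column[0] for column in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


# Hypothetical in-memory database standing in for a real project database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survey (participant_id INTEGER, age INTEGER, notes TEXT)")
conn.executemany(
    "INSERT INTO survey VALUES (?, ?, ?)",
    [(1, 34, "none"), (2, 51, "follow-up, week 2")],
)

buffer = io.StringIO()
rows_written = export_table_to_csv(conn, "survey", buffer)
print(buffer.getvalue())
```

When writing to a real file rather than an in-memory buffer, open it with `open(path, "w", newline="", encoding="utf-8")` so that the CSV uses a standard character encoding and consistent line endings.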

For more information on recommended formats, see the UK Data Service guidance on recommended formats.

Library of Congress Recommended Formats Statement: The Library of Congress identified preferred and acceptable file formats for textual works and musical compositions, still image works, audio works, moving image works, software and electronic gaming and learning, datasets/databases and websites.

UK Data Service Recommended Formats: Guidance on file formats recommended and accepted by the UK Data Service for data sharing, reuse and preservation.

UCD Digital Library Preferred Formats for Data: Preferred formats identified by the UCD Digital Library and Repository which facilitate processing, storage, and dissemination of data, assuring both usability and longer-term durability of the data.

In the Data Management Plan, when you are asked to describe the data volume, you are describing how much data your project is likely to produce. Some research (e.g. research using wearable devices) will produce lots of data over an extended period, and will require a lot of storage space. Other research will produce a relatively small amount of data, in terms of bytes, and the cloud storage provided by RCSI will be more than sufficient. There is more on cloud storage at RCSI later in this LibGuide.

In the DMP you are estimating how much data you expect to be working with and what storage capacity you are going to need, both while the project is active (how big is the storage capacity for projects at RCSI?) and when the project concludes (how much data does the data repository accept?).

Some questions on data volume to consider (from DMPTool Guidance):

Are you manually collecting and recording data?
Are you using observational instruments and computers to collect data?
Is your data collection highly iterative?
How much data will you accumulate every month or every 90 days?
How much data do you anticipate collecting and generating by the end of your project?
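A rough calculation is often enough to answer the last two questions. The sketch below estimates the size of uncompressed WAV audio from its duration and then projects monthly and total volumes. All figures (CD-quality stereo recording settings, four one-hour interviews per month, a five-month collection period) are hypothetical assumptions for illustration, and the function names are invented for this example. The estimate ignores the small WAV file header.

```python
def wav_bytes(seconds: int, sample_rate_hz: int = 44_100,
              bit_depth: int = 16, channels: int = 2) -> int:
    """Approximate size of uncompressed PCM audio (file header ignored)."""
    return seconds * sample_rate_hz * (bit_depth // 8) * channels


def human_readable(n_bytes: int) -> str:
    """Format a byte count using decimal units (1 GB = 1000 MB)."""
    units = ["B", "kB", "MB", "GB", "TB"]
    value = float(n_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


# Hypothetical project: four one-hour interviews a month for five months.
bytes_per_interview = wav_bytes(seconds=3600)
bytes_per_month = bytes_per_interview * 4
bytes_total = bytes_per_month * 5

print("Per interview:", human_readable(bytes_per_interview))
print("Per month:", human_readable(bytes_per_month))
print("Whole project:", human_readable(bytes_total))
```

The same approach works for other data types: estimate the size of one item (a file, a record, a scan), multiply by how many you expect per month, then by the number of months, and add a margin for working copies and backups.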

You might find it helpful to present this information about data type, format and volume in your DMP by using a table. This can help to really clarify the various data that you will be working with, and you can add as much information (columns) as you need to help you with your planning. Below is an example based on a real study at RCSI.

Presenting the data description using a table
Data type	Method	Data format	Data volume
Demographic data	Patient survey	Tabular data stored as Excel (.xls) and open format CSV (.csv)	<200GB (~90 variables x 300 participants x 1 data collection wave)
Developmental questionnaires	Standardised tests by hospital staff	Tabular data stored as Excel (.xls) and open format CSV (.csv)	<100GB (~10 variables x 300 participants x 2 data collection waves)
Clinical data	Clinical review	Tabular data stored as Excel (.xls) and open format CSV (.csv); text data stored as plain text files (.txt)	<100GB (25 variables x 300 participants x 2 data collection waves)
Qualitative data	Interviews with patients, caregivers and staff	Audio data stored as .wav files; text data stored as plain text files (.txt)	<200GB (20 interviews x 1 hour in length)

According to the guidance from DMPTool, data can be grouped into four main categories, and these categories will affect the choices that you make throughout your data management plan.

1. Observational data

Captured in real-time, typically outside the lab
Usually irreplaceable and therefore the most important to safeguard
Examples: Sensor readings, telemetry, survey results, images

2. Experimental data

Typically generated in the lab or under controlled conditions
Often reproducible, but can be expensive or time-consuming
Examples: gene sequences, chromatograms, magnetic field readings

3. Simulation data

Machine generated from test models
Likely to be reproducible if the model and inputs are preserved
Examples: climate models, economic models

4. Derived or compiled data

Generated from existing datasets
Reproducible, but can be very expensive and time-consuming
Examples: text and data mining, compiled database, 3D models

Contemporary research is often collaborative, and reusing existing research data has become common practice in many disciplines. Although reuse is convenient and cost-effective, it is essential that you take the time to familiarise yourself with the data and check the accompanying documentation for collection procedures, data cleaning procedures, usage licences and other technical information to make sure the data are suitable for your research. Reusing existing research data also does not exempt them from GDPR or from any other relevant regulatory and ethical policies that researchers must comply with.

If you plan to reuse data that you collected for a previous project, or data that you access or purchase from a data repository or other source, you should explain in the DMP how issues such as ethics, copyright and intellectual property rights (IPR) have been addressed. Are there any restrictions on how you may reuse these data?