Research Data Management for the Health Sciences: Data Documentation

Research data is becoming increasingly important. This Medical Library guide will introduce you to research data management skills, and connect you to relevant services and resources across the Yale Campus.

Best Practices for Data Documentation

  • Describe the scientific context: why the data were collected (questions, hypotheses), the environmental conditions during collection, where and when the data were collected, the spatial and temporal resolution of the data, and any standards or calibrations used.
  • Include critical information, such as date or location, in the data table, not just as metadata embedded in the file name.
  • At the top of each data file, use one or more header rows that identify the parameters. Do not use spaces or special characters in headings, as many databases and applications do not allow them.
  • When creating datasets, also create a data dictionary. This is a document that describes the contents of your data files: variables used (including formats), units of measure, and definitions of coded values (including missing values). The data dictionary can be included as a separate tab within your spreadsheet file, or as a companion text file with a similar name.
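
As an illustration of the heading advice above, the short Python sketch below flags column headings that contain spaces or special characters and suggests a replacement. The rule it applies (letters, digits, and underscores only, starting with a letter) is an assumption chosen for this example, not a requirement of any particular database, and the function names and sample headings are hypothetical.

```python
import re

# Assumed rule for this sketch: letters, digits, and underscores only,
# and the heading must start with a letter.
_VALID_HEADER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def is_safe_header(name: str) -> bool:
    """Return True if the heading has no spaces or special characters."""
    return bool(_VALID_HEADER.match(name))


def clean_header(name: str) -> str:
    """Replace each run of disallowed characters with one underscore.

    Note: the result can still begin with a digit, so check it again
    with is_safe_header() before using it.
    """
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", name.strip())
    return cleaned.strip("_")


# Hypothetical headings, used only to show the behaviour.
headers = ["Sample ID", "collection_date", "Temp (C)"]
for header in headers:
    if not is_safe_header(header):
        print(f"{header!r} -> {clean_header(header)!r}")
```

Running a check like this before sharing a file is one way to catch problem headings early; the original and cleaned names can then be recorded in the data dictionary.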

Reasons to document your data with metadata:

  • Make your data findable by enabling database indexing
  • Increase understanding and reusability of data by providing descriptors and context
  • Make your data and associated research verifiable  

What is Metadata?

  • Metadata is simply data about data.
  • It is the information we create, store, and share to describe things, allowing us to interact with those things and render meaning from data.

Types of Metadata

  • Reagent Metadata: Information about the clinical samples, biological reagents (e.g. cell lines, antibodies, siRNAs), chemical reagents (e.g. drugs), etc. used to generate the data.
  • Technical Metadata: Information automatically generated by research instruments and associated software.
  • Experimental Metadata: Information about the experimental conditions (e.g. assay type, time points), the experimental protocol, and the equipment used to generate the data.
  • Analytical Metadata: Information about data analysis methods including software name and version, quality control parameters, and output file type details.
  • Dataset Level Metadata: Information about the objectives of the research project, participating investigators, relevant publications, and funding sources.

Where does Metadata Exist?

Where can metadata be collected?

  • Paper or electronic lab notebooks
  • Plain text README files describing the content of folders containing data files
  • Within the data file, e.g. the header information in a spreadsheet 
  • Web forms or Electronic Data Collection Systems

What Details Should I Include in My Metadata?

  • The data creator
  • Data file contents
  • Data creation times
  • Data creation locations 
  • Reasons why the data were created
  • Methods used to generate the data
  • Units
  • Instruments used
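
One possible way to capture the details listed above is a small structured record stored alongside the data file. The Python sketch below is only an illustration: the class name, field names, and the choice of JSON are assumptions made for this example rather than a required standard, and the values are placeholders to be replaced with your own.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DatasetMetadata:
    """Hypothetical record holding the details listed above."""
    creator: str
    file_contents: str
    created_on: str            # assumed format: ISO 8601 date (YYYY-MM-DD)
    location: str
    purpose: str               # why the data were created
    methods: str               # how the data were generated
    units: dict[str, str]      # column name -> unit of measure
    instruments: list[str]


def to_json(record: DatasetMetadata) -> str:
    """Serialize the record as indented JSON text."""
    return json.dumps(asdict(record), indent=2)


# Placeholder values only; replace each one with real project details.
example = DatasetMetadata(
    creator="<name of data creator>",
    file_contents="<what the file contains>",
    created_on="<YYYY-MM-DD>",
    location="<where the data were created>",
    purpose="<why the data were created>",
    methods="<how the data were generated>",
    units={"<column name>": "<unit>"},
    instruments=["<instrument name and model>"],
)
print(to_json(example))
```

The resulting text could be saved next to the data file, for example under a similar file name, so that the description travels with the data.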

README Files

Learn about README files and how to create them by following the link below:

Data Dictionaries

Data dictionaries should contain all or most of the following:

  • A complete list of the parameter names used in the dataset. Use standardized naming across files and projects, when possible. Include any abbreviations for those variables in codebooks. Keep abbreviations for variable names consistent, including capitalization.
  • Description of each parameter. What quantity does the parameter represent? How was each measured or produced? If relevant or not mentioned elsewhere, when and where was the quantity measured?
  • Units of measurement (e.g., number per m^3, deaths per 10,000 individuals, % increase per year). When possible, use standards. If using abbreviations in the dataset, spell out the complete units in the documentation.
  • Description of what a missing value signifies and how missing values are represented (e.g., -9999, n/a, FALSE, NULL, NaN, nodata, None). Leaving an entry blank may cause misregistration of the data in many applications.
  • An attribute/variable that describes data quality or certainty using coded values. Describe precision, accuracy, and uncertainty, and the quality control methods used. Some repositories may have standardized data quality levels.
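
The elements above can be kept in a simple table with one row per parameter. The Python sketch below shows one way to do that with the standard library only. The parameter names, descriptions, units, and the -9999 missing-value code are hypothetical examples, and the helper functions are assumptions made for illustration rather than an established tool.

```python
import csv
import io

# Hypothetical data dictionary: one entry per column in the dataset.
DATA_DICTIONARY = [
    {"name": "sample_id", "description": "Unique sample identifier",
     "units": "none", "missing_code": "-9999"},
    {"name": "temp_c", "description": "Temperature at time of collection",
     "units": "degrees Celsius", "missing_code": "-9999"},
    {"name": "qc_flag", "description": "Data quality code: 0 = good; 1 = suspect",
     "units": "none", "missing_code": "-9999"},
]


def dictionary_to_csv(dictionary):
    """Return the data dictionary as CSV text, one row per parameter."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(dictionary[0].keys()))
    writer.writeheader()
    writer.writerows(dictionary)
    return buffer.getvalue()


def undocumented_columns(header, dictionary):
    """List dataset columns that have no entry in the data dictionary."""
    documented = {entry["name"] for entry in dictionary}
    return [column for column in header if column not in documented]


def count_missing(rows, dictionary):
    """Count, per column, the values equal to the documented missing code."""
    codes = {entry["name"]: entry["missing_code"] for entry in dictionary}
    counts = {name: 0 for name in codes}
    for row in rows:
        for name, code in codes.items():
            if row.get(name) == code:
                counts[name] += 1
    return counts


print(dictionary_to_csv(DATA_DICTIONARY))
print(undocumented_columns(["sample_id", "temp_c", "notes"], DATA_DICTIONARY))
```

Keeping the dictionary in a machine-readable form makes it possible to check each data file against it, for instance to find columns that were added without being documented.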