Research Data Management for the Health Sciences: Data Quality

Research data is becoming increasingly important. This Medical Library guide will introduce you to research data management skills, and connect you to relevant services and resources across the Yale Campus.

Quality Assurance Guidelines and Frameworks

When searching for data created by others in repositories or databases, ask yourself the following questions:

  • Who created or funded the dataset?
  • What is included in the dataset - and what is not included?
  • What was the design of the research project that led to the creation of this data?
  • Is there information (metadata) included with the dataset that describes the exact meaning of the data fields?
  • How is this data licensed? Is it available for use or reference in other research?

Ensuring the Quality of your Collected Data

Quality Assurance

  • Secure adequate financial and logistical resources to ensure follow-through (include these resources in your grant application)
  • Document your processes and protocols completely 
  • Manually check 5 to 10% of your data records
  • Plot your data to find and assess out-of-range values
  • Map location data to find out-of-range values
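The manual check and out-of-range checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names (`age`, `lat`, `lon`), the acceptable ranges, and the helper names are hypothetical, not part of any particular data system.

```python
import random

def sample_for_manual_review(records, fraction=0.10, seed=None):
    """Return a random subset of records to check by hand."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be greater than 0 and at most 1")
    if not records:
        return []
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    k = max(1, round(len(records) * fraction))  # always review at least one record
    return rng.sample(records, k)

def find_out_of_range(records, field, low, high):
    """Return records whose value for `field` is present but outside [low, high]."""
    return [
        r for r in records
        if r.get(field) is not None and not (low <= r[field] <= high)
    ]

# Hypothetical dataset: 200 records with two deliberate data entry errors.
records = [{"id": i, "age": 30 + i % 50, "lat": 41.3, "lon": -72.9} for i in range(200)]
records[17]["age"] = 412    # impossible age
records[42]["lat"] = 413.0  # latitude must lie between -90 and 90

to_review = sample_for_manual_review(records, fraction=0.10, seed=1)
bad_ages = find_out_of_range(records, "age", 0, 120)
bad_lats = find_out_of_range(records, "lat", -90, 90)
print(len(to_review), [r["id"] for r in bad_ages], [r["id"] for r in bad_lats])
```

A range check like this complements plotting or mapping rather than replacing it: a plot can reveal values that are within the allowed range but still implausible.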

Quality Control

  • Use a data entry program to control vocabulary through multiple choice options
  • Enforce double entry
  • Use standardized data formats:
    • ISO 8601 standard for date and time - YYYY-MM-DDThh:mm:ss.sTZD
    • Spatial Coordinates for Latitude/Longitude - +/- DD.DDDDD
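As a rough illustration of the quality control steps above, the following Python sketch formats a timestamp in ISO 8601, formats coordinates as signed decimal degrees, and compares two independent entries of the same record. The function names and the example record fields (`dob`, `weight_g`) are assumptions made for this example; only the standard library is used.

```python
from datetime import datetime, timedelta, timezone

def to_iso8601(dt):
    """Format a timezone-aware datetime as YYYY-MM-DDThh:mm:ss+hh:mm."""
    if dt.tzinfo is None:
        raise ValueError("datetime must carry a time zone")
    return dt.isoformat(timespec="seconds")

def format_coordinate(value, limit):
    """Format a coordinate as signed decimal degrees with five decimals.

    Use limit=90 for latitude and limit=180 for longitude.
    """
    if not -limit <= value <= limit:
        raise ValueError(f"coordinate {value} is outside +/-{limit}")
    return f"{value:+.5f}"

def double_entry_mismatches(first, second):
    """Return the sorted field names where two entries of one record disagree."""
    return sorted(
        field for field in first.keys() | second.keys()
        if first.get(field) != second.get(field)
    )

# Hypothetical values for illustration.
eastern = timezone(timedelta(hours=-5))
stamp = to_iso8601(datetime(2021, 3, 4, 14, 30, 0, tzinfo=eastern))
lat = format_coordinate(41.30321, 90)
lon = format_coordinate(-72.9279, 180)

# The same paper form keyed in by two different people.
entry_a = {"id": 1, "dob": "2021-03-04", "weight_g": 3400}
entry_b = {"id": 1, "dob": "2021-03-04", "weight_g": 3040}
mismatches = double_entry_mismatches(entry_a, entry_b)
print(stamp, lat, lon, mismatches)
```

Fields reported as mismatches would then be resolved against the original source document, which is the point of double entry.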

Data Quality Criteria

Assessing how a dataset was created is necessary to ensure that you understand the data, to support the quality of your analysis, and ultimately to verify the quality of your research. You can use the factors in the table below to assess and describe the quality of your own data, or of data from an external source.

Completeness
  • Definition: The proportion of stored data against the potential of "100% complete"
  • Example(s): Percentage of patient records that have all minimum and core data elements populated with non-blank values

Uniqueness
  • Definition: Nothing will be recorded more than once based on how that thing is identified
  • Example(s): Percentage of unique (vs. duplicate) records within a data set

Timeliness
  • Definition: The degree to which data represent reality from the required point in time
  • Example(s): Time difference between an event and the recording of information about that event

Validity
  • Definition: Data are valid if they conform to the syntax (format, type, range) of their definition
  • Example(s):
    1. Validity at data item level: type and severity of hearing loss should be chosen from a given list of allowable values.
    2. Validity at record level: for any patient, the date/time of hearing screening should be after the date/time of birth.

Accuracy
  • Definition: The degree to which data correctly describes the "real world" object or event being described
  • Example(s): Date/time values should be formatted according to the parameters of the data system or the standards of the project. Assess the data against the actual thing it represents, e.g. visit the hospital and determine how birth and screening data are collected and entered into the system. Or assess the data against an authoritative reference data set.

Consistency
  • Definition: The absence of difference, when comparing two or more representations of a thing against a definition
  • Example(s): Data in a given field should be collected or calculated in the same way across all records.
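Several of these factors can be expressed as simple calculations. The sketch below, in Python, shows one possible way to compute completeness, uniqueness, record-level validity, and a timeliness lag. It is a minimal illustration, not a standard implementation: the field names (`patient_id`, `birth`, `screening`, `recorded`, `severity`), the list of allowed severity values, and the sample records are all hypothetical.

```python
from datetime import datetime

ALLOWED_SEVERITY = {"mild", "moderate", "severe", "profound"}  # assumed value list

def _is_filled(value):
    """A value counts as populated if it is not None and not blank."""
    return value is not None and str(value).strip() != ""

def completeness(records, required_fields):
    """Percentage of records with every required field populated."""
    if not records:
        raise ValueError("no records to assess")
    complete = sum(all(_is_filled(r.get(f)) for f in required_fields) for r in records)
    return 100 * complete / len(records)

def uniqueness(records, key_fields):
    """Percentage of records that are unique with respect to the identifying fields."""
    if not records:
        raise ValueError("no records to assess")
    keys = {tuple(r.get(f) for f in key_fields) for r in records}
    return 100 * len(keys) / len(records)

def validity_errors(record):
    """Item-level and record-level validity checks for one record."""
    errors = []
    if record.get("severity") not in ALLOWED_SEVERITY:
        errors.append("severity not in allowed list")
    birth, screening = record.get("birth"), record.get("screening")
    if birth and screening and screening <= birth:
        errors.append("screening is not after birth")
    return errors

def entry_lag_days(record):
    """Timeliness: whole days between the screening and its recording."""
    return (record["recorded"] - record["screening"]).days

# Hypothetical records: one duplicate, one blank field, one impossible date order.
first = {"patient_id": "A1", "birth": datetime(2023, 1, 10, 8, 0),
         "screening": datetime(2023, 1, 11, 9, 30),
         "recorded": datetime(2023, 1, 12, 9, 30), "severity": "mild"}
records = [
    first,
    {"patient_id": "A2", "birth": datetime(2023, 2, 1, 12, 0),
     "screening": datetime(2023, 1, 31, 12, 0),
     "recorded": datetime(2023, 2, 3, 12, 0), "severity": "moderate"},
    {"patient_id": "A3", "birth": datetime(2023, 3, 5, 6, 0),
     "screening": datetime(2023, 3, 6, 6, 0),
     "recorded": datetime(2023, 3, 6, 18, 0), "severity": ""},
    dict(first),  # duplicate of the first record
]

required = ["patient_id", "birth", "screening", "severity"]
print(completeness(records, required), uniqueness(records, ["patient_id"]))
print([validity_errors(r) for r in records], entry_lag_days(first))
```

Accuracy and consistency are harder to automate, since they require comparison with the real-world source or with an authoritative reference data set.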

The Six Dimensions of EHDI Data Quality Assessment