Data is the basic unit of information, and in the case of health and clinical data it comes in many shapes and sizes. Clinical data ranges from numerical measurements and quantities to images, documentation, digital codes, and narrative text containing facts and observations. This data becomes available to a learning or accountability system by various means, for example acquisition from paper records, direct entry into a computer system, or reuse of data collected by others. The other aspect is the high-throughput data generated by R&D. Thus, we can safely say that the data is big, multi-dimensional, and multimodal. Individual data elements can be grouped by a common criterion to form datasets, e.g. vital-sign measurements from a patient's electronic health record.

Data from the biological and health domains is growing at the fastest rate, accounting for more than 90% of the data generated each year. It needs little convincing that big data will increase efficiency and accountability in health care. However, the development of methods to analyse, link, and generate knowledge from this heterogeneous data lags behind data generation. This multimodal data from disparate sources becomes transformative when linked at the lowest item level, magnifying hidden signals and allowing holistic knowledge to be extracted from it. Common examples are how Google and Facebook personalise search results based on users' preferences, search histories, and past choices. Similarly, linking big data in healthcare will help clinicians and researchers generate more robust and specific therapeutic solutions after studying a biological system comprehensively and holistically. This integrative approach also allows new data-driven hypotheses to be tested. Data from both R&D and health care can be linked algorithmically to draw statistical correlations and merge information from multiple resources simultaneously, enhancing hidden biological signals. Findings from such approaches would suffer far less of the quantitative information loss that is common when data from individual modalities are analysed separately. Moreover, the results would carry more confidence because multiple levels of information from the biological system under study agree.
To merge data from disparate systems, interoperability (i.e. the ability of one system to receive and understand information from another) is a minimum requirement. For example, to link patient data between different departments, sites, or providers using a computer-based system, data exchange standards are required. These standards should retain the original information as-is and enable data exchange in real time. Another challenge in research and health care systems is the use of non-standard terminologies and units across institutions or nations. For example, disease terms may have hierarchical relationships, or there may be more than one way of writing a particular medical term (Fig 2). Similarly, a species name can be shortened (e.g. S. aureus), and we will have to provide this information to the algorithm we use to process the data.
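To make this concrete, here is a toy sketch (in Python) of the kind of normalisation step I mean: a few free-text variants of "blood pressure" and an abbreviated species name are mapped onto canonical forms before records are linked. The synonym table is invented for illustration and is of course tiny compared with a real clinical terminology.

```python
# Toy normalisation step: map free-text variants to a canonical term before
# linking records. The synonym table below is invented for illustration and
# is tiny compared with a real clinical terminology.
SYNONYMS = {
    "bp": "blood pressure",
    "b.p.": "blood pressure",
    "blood pressure": "blood pressure",
    "s. aureus": "Staphylococcus aureus",
    "staph aureus": "Staphylococcus aureus",
}

def normalise(term: str) -> str:
    """Return the canonical form of a term, or the cleaned input if unknown."""
    key = term.strip().lower()
    return SYNONYMS.get(key, term.strip())

records = ["BP", "b.p.", "S. aureus", "heart rate"]
print([normalise(r) for r in records])
# ['blood pressure', 'blood pressure', 'Staphylococcus aureus', 'heart rate']
```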


Figure 2: (Top) Hierarchical relationships among various cardiac disease terms. (Bottom) There are apparently more than 100 ways of writing "blood pressure". (sources:)
What do I mean by clinical data standards?
Standards can apply to pretty much all domains of health care and R&D, like instrument standards, templates for clinical information, document structure and format standards, and data standards. Here, I am going to talk about standards that apply to "data elements": what we collect, how we represent the information we are collecting, and how we encode this information for transmission. More specifically, I am talking about message format standards that facilitate interoperability through common encoding specifications and information models defining the relationships between data elements. The other aspect of data message standards is the normalisation of clinical terminology.
Therefore, data standards are the methods, protocols, terminologies, and specifications applied to the collection, exchange, storage, and retrieval of research and health care data.
I will specifically discuss these three sub-categories of data standards:
- Standardised terminologies
- Common data models (CDM)
- Data exchange formats
Standardised terminologies:
Data that already exists in the system can be reused to develop an integrated health and learning system to support primary care, keep people out of long-term care, and improve the overall quality of treatment. Data captured in standardised terminologies are more accessible for reuse. Without consistent terminologies, it is difficult to create an integrated picture of patient care. Some applications of standard clinical vocabularies are:
- Attaching appropriate clinical terms and medical concepts to data helps to identify patient cohorts and investigate them (see the sketch after this list).
- It helps to filter data on specific criteria for analytics purposes, e.g. to understand why people need the health care system so that appropriate steps can be taken to prevent or reduce that need in future.
- It gives us a pathway to attach new datasets, allowing analytical models developed from the data to generalise.
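As a small sketch of the cohort-identification point above, the snippet below tags locally worded diagnoses with standard concept codes and then pulls out a hypertension cohort. The patient rows and the term-to-code table are invented, and the SNOMED CT codes are shown only for illustration (they should be verified against an actual terminology release before use).

```python
# Sketch: map locally worded diagnoses to standard concept codes, then select
# a cohort. Patient rows and the term-to-code table are invented; the SNOMED CT
# codes shown are illustrative and should be verified against a real release.
TERM_TO_CODE = {
    "high blood pressure": "38341003",   # hypertensive disorder (illustrative)
    "hypertension": "38341003",
    "type 2 diabetes": "44054006",       # type 2 diabetes mellitus (illustrative)
}

records = [
    {"patient_id": 1, "diagnosis_text": "Hypertension"},
    {"patient_id": 2, "diagnosis_text": "High blood pressure"},
    {"patient_id": 3, "diagnosis_text": "Type 2 diabetes"},
]

# Attach a standard code to each record where the local wording is recognised.
for rec in records:
    rec["concept_code"] = TERM_TO_CODE.get(rec["diagnosis_text"].lower())

# Cohort identification: everyone coded with the hypertension concept.
hypertension_cohort = [r["patient_id"] for r in records
                       if r["concept_code"] == "38341003"]
print(hypertension_cohort)  # [1, 2]
```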
The following five core terminologies were recommended by the National Committee on Vital and Health Statistics (NCVHS) in 2003:
- SNOMED CT (Systematised Nomenclature of Human and Veterinary Medicine, Clinical Terms)
- LOINC (Logical Observation Identifiers Names and Codes)
- RxNorm (normalised names for clinical drugs)
- NDF-RT (National Drug File Reference Terminology)
- UMDNS (Universal Medical Device Nomenclature System)
There are other supplementary terminologies such as MedDRA and ICD.
Figure 3: UMLS Metathesaurus integrates over 2 million names for about 900,000 concepts from more than 60 vocabularies. This can be accessed at http://umlsks.nlm.nih.gov
Common Data Models (CDM)
Generally, patients' electronic health records (EHR) are stored as tables in a fixed relational database schema. These records contain information about patient demographics, medications, diagnoses, interventions, and associated administrative data. Different providers build these EHR data warehouses for different purposes, and hence they can differ in their logic, variable names, and the terminologies used to describe medical products and clinical conditions. Common data models were developed in order to merge such EHR data with other types of information from the same or a different site or institution. A CDM normalises how data are stored in tables.
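To give a flavour of what a CDM does in practice, here is a deliberately simplified sketch that maps rows from two differently structured source EHR extracts into one shared schema. The target layout is loosely inspired by widely used models such as the OMOP CDM but is heavily reduced, and all column names and rows are invented.

```python
# Simplified sketch of normalising two source EHR extracts into one common
# schema. The target layout is loosely inspired by common data models such as
# OMOP, but heavily reduced; all column names and rows here are invented.
site_a_rows = [
    {"pat_id": "A-17", "dob": "1980-04-02", "sex": "F", "dx": "hypertension"},
]
site_b_rows = [
    {"patientNumber": "B-203", "birth_date": "1975-11-30",
     "gender": "female", "diagnosis_term": "high blood pressure"},
]

def to_common(row, mapping):
    """Rename source columns to the shared CDM column names."""
    return {cdm_col: row[src_col] for src_col, cdm_col in mapping.items()}

SITE_A_MAP = {"pat_id": "person_id", "dob": "birth_date",
              "sex": "gender", "dx": "condition_source_value"}
SITE_B_MAP = {"patientNumber": "person_id", "birth_date": "birth_date",
              "gender": "gender", "diagnosis_term": "condition_source_value"}

common_table = ([to_common(r, SITE_A_MAP) for r in site_a_rows] +
                [to_common(r, SITE_B_MAP) for r in site_b_rows])
for row in common_table:
    print(row)
```

In a real pipeline the condition_source_value would additionally be mapped to a standard concept code, as in the terminology sketch earlier.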
Data exchange formats
Health Level Seven (HL7) is an international standards development organisation. Its HL7 V3 and Clinical Document Architecture (CDA) formats were developed for clinical data exchange and interoperability, but these standards had some limitations in terms of their implementation and interpretability.
Sometime in 2013-14, Grahame Grieve from Australia introduced FHIR (pronounced "fire"), which stands for Fast Healthcare Interoperability Resources. FHIR has several attractive features that make it a favourable data encoding format for transmission:
- FHIR consists of multiple linkable and extendable data structure specifications called resources, modelling concepts in healthcare scenarios such as patients, conditions, and clinical observations and reports.
- Resources refer to each other by URLs, allowing web-based exchange of information.
- Resources can be tailored to use cases by standardised sets of constraints and extensions called profiles.
- Resources can be exchanged between systems in multiple ways (see the sketch after this list):
- Using RESTful APIs (the web approach)
- As bundles of resources (messages and documents)
- It allows healthcare information to be shared electronically in "real time"
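To illustrate the RESTful exchange style mentioned in the list, the sketch below reads a Patient resource and searches for related Observations over HTTP. The base URL points at the public HAPI FHIR test server purely as an example, and the patient id is hypothetical; any FHIR endpoint of your own would work the same way.

```python
import requests

# Base URL of a FHIR server; the public HAPI FHIR test endpoint is used here
# only as an example (substitute your own server).
BASE = "https://hapi.fhir.org/baseR4"

# Read a single Patient resource by its logical id (GET [base]/Patient/[id]).
patient_id = "example"  # hypothetical id, for illustration only
resp = requests.get(f"{BASE}/Patient/{patient_id}",
                    headers={"Accept": "application/fhir+json"})
resp.raise_for_status()
patient = resp.json()
print(patient.get("resourceType"), patient.get("id"))

# Search for Observations that reference this patient
# (GET [base]/Observation?patient=[id]); the result is a Bundle resource.
resp = requests.get(f"{BASE}/Observation",
                    params={"patient": patient_id, "_count": 5},
                    headers={"Accept": "application/fhir+json"})
bundle = resp.json()
for entry in bundle.get("entry", []):
    obs = entry["resource"]
    print(obs.get("code", {}).get("text"), obs.get("valueQuantity"))
```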
FHIR covers a whole ecosystem of clinical concepts and associated administrative and financial "resources".

The format specification uses HTML, XML and JSON, with appropriate security protocols (HTTPS, OAuth, etc.) for secure data exchange. Each resource carries a human-readable summary (an XHTML narrative) alongside the standard structured data definitions, and supports locally defined extensions.
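Here is what a minimal Patient resource can look like when assembled as JSON, including the human-readable XHTML narrative in the text element alongside the structured fields. The name, identifier and dates are invented for illustration.

```python
import json

# A minimal FHIR Patient resource (R4) built as a plain dictionary.
# All names, identifiers, and dates below are invented for illustration.
patient = {
    "resourceType": "Patient",
    "id": "example-001",
    # Human-readable summary: every resource can carry an XHTML narrative.
    "text": {
        "status": "generated",
        "div": "<div xmlns=\"http://www.w3.org/1999/xhtml\">"
               "Jane Doe, born 1970-01-01</div>",
    },
    # Structured, machine-readable data definitions.
    "name": [{"use": "official", "family": "Doe", "given": ["Jane"]}],
    "gender": "female",
    "birthDate": "1970-01-01",
}

# Serialise to JSON for transmission (XML is equally valid in FHIR).
print(json.dumps(patient, indent=2))
```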
Image source: https://www.hl7.org/
FHIR combined with CDMs would not only facilitate interoperability between different units and systems, making it easier to generate high-quality and reproducible evidence from observational health and biomedical data; it would also allow real-time data exchange from multiple access points through an integrated clinical support platform.
Further reading on FHIR:
https://www.genomeweb.com/informatics/sync-genes-data-project-publishes-use-cases-launches-pilot-projects#.YTbJB9Mzb0o
https://www.biorxiv.org/content/10.1101/2021.01.31.429037v1.full.pdf
https://cloud.google.com/blog/topics/healthcare-life-sciences/google-cloud-improves-interoperability-on-fhir-via-hde-bq-looker