Using Big Healthcare Data to Accelerate Medical Discovery

This blog is an excerpt of an article that is currently available on CIO Review.

Metabolic syndrome, which increases the risk of heart disease, stroke and diabetes, is a medical condition that affects the health of nearly 34 percent of Americans. After more than ninety years of research, it’s now understood that the syndrome is caused by a cluster of conditions – increased blood pressure, high blood sugar, excess body fat around the waist and abnormal cholesterol levels.

This understanding has vastly impacted the treatment of metabolic syndrome, which has improved the quality of life of those who are affected. Ninety years is a long time. The process of medical discovery has historically been very slow. It typically starts with a small set of observations and many pre-clinical and clinical trials on different patient population cohorts. Heterogeneous environments, uncertainties in original hypotheses, the passage of time and accumulating costs make it a very complex process.

But the promise of big healthcare data is set to significantly pick up the pace, kicking off a new age of intelligent medicine where information from different medical resources will become integrated. When combined with clinical perspectives from medical care professionals, we will see the pace and reach of medical discovery change in ways that we can only now start to imagine.


Sharing Data Sets from Different Resources

Artificial intelligence and machine learning approaches hold the potential to reveal hidden information in biological and medical healthcare datasets. Combining observational data (medical records, biological data, physiological readings), medical and pharma trials data, medical literature, knowledge of key functions like metabolic and genetic pathways, and more will change the pace and outcome of medical discovery in the near future. This will lead to the development of novel diagnostic and prognostic tests as well as descriptive, predictive and prescriptive analytics that guide hypothesis generation.

By putting an infrastructure or a platform in place, it will be possible to achieve better treatment plans and more efficient prevention, explore new medications, and more. This would also drive the need for innovative business models around knowledge discovery, reshaping the healthcare landscape.

It’s clear that inpatient electronic medical records (EMR) enable scientific discovery to some extent. But imagine the impact on medical discovery if, for example, we could combine ambulatory outpatient data and data from quantified-self (QS) devices with inpatient EMR data. Data collected through QS devices provide a more reliable source of information than the often inaccurate transfer of knowledge through conversations between patients and physicians. Sooner or later, many of those conversations will be initiated and facilitated by patient-generated data. Integrating these data sets with inpatient and outpatient EMR data improves the efficiency of care delivery. The large amounts of de-identified, integrated data can be used to implement population health modeling and to develop new drugs and treatments.

While the possibilities are exciting, it’s important to note the limitations big healthcare data poses to the process of knowledge discovery. Rich healthcare datasets exist, including electronic medical records, large collections of complex physiological information, medical imaging data, genomics, and other socio-economic and behavioral data, but it’s not easy for artificial intelligence and machine learning researchers to extract knowledge from them. For example, no matter how good EMR datasets are, medical data is noisy and biased in most cases. The complexity of transforming the verbal information exchange between patients and physicians into written information on medical charts, and from there into the International Classification of Diseases (ICD) codes used in EMR data, leads to a large number of coding errors. In addition, coding standards are not universal, as different hospitals apply their own coding quality standards. Medical claims are the backbone of EMR data, but they are collected for billing purposes, which brings yet another source of bias and noise into the data.

In order to perform data-driven analysis or build causal models using these datasets, challenges such as integrating multiple data types, dealing with missing data, and handling noisy and irregularly sampled data must be addressed.

Additionally, clinical perspectives from medical care professionals are required to ensure that advancements in healthcare data analysis have a positive impact on eventual point-of-care and outcome-based systems.

Case Study: Knowledge Discoveries in Comorbidity Analysis of Autism

Comorbidities, cases in which patients have two or more chronic conditions, were named the 21st century challenge for healthy aging by the “White House Conference on Aging” in 2014. In developed nations, about one in four adults have at least two chronic conditions, and more than half of older adults (65+ years old) have three or more chronic conditions. In the United States, the $2 trillion healthcare industry spends 71¢ of every dollar on treating individuals with comorbidities. In Medicare spending, the amount rises to 93¢ of every dollar.

Costs related to autism spectrum disorder (ASD) were estimated to be $268 billion during 2015 in the United States. ASD refers to a group of complex brain-based developmental disorders characterized by challenges in behavior, social skills, and communication. The communication impairments, combined with an ambiguous presentation of symptoms, create a climate in which comorbidities in patients with autism are not always discovered or treated.

Clinicians who treat patients with comorbidities must be cognizant of many layers of care and complexities. An automated platform designed to integrate information from different medical resources could help clinicians better address this challenge, especially in patients with autism. It would benefit society by improving patient treatment and reducing the financial and emotional burden carried by families and caregivers.

What Big Data Allowed Me to See

As a scientist, I am always eager to get hold of interesting data sets. As part of my work on population health solutions, I recently gained access to a rich longitudinal inpatient EMR data set with more than nine million unique patients. I began researching comorbidities associated with autism, and how they evolve over time.

I considered age groups from 0 to 35 years old, and divided them into buckets of five years.
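The bucketing step can be sketched in a few lines of Python. This is only an illustration: the function name `age_bucket`, the half-open intervals (0-4, 5-9, ..., 30-34) and the choice to leave ages of 35 and above unassigned are assumptions, since the exact bucket boundaries used in the study are not stated here.

```python
def age_bucket(age, width=5, max_age=35):
    """Return a label such as '5-9' for the bucket containing `age`.

    Assumption: buckets are half-open intervals [0, 5), [5, 10), ...,
    [30, 35). Ages outside 0 <= age < max_age return None.
    """
    if age < 0 or age >= max_age:
        return None
    start = (int(age) // width) * width
    return f"{start}-{start + width - 1}"


# Hypothetical ages, for illustration only
for example_age in (0, 7, 34):
    print(example_age, age_bucket(example_age))
```

Grouping patient records by this label is one simple way to compare how comorbidities differ from one age group to the next.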

The next step was to choose the methodology. I chose to apply Apriori, an algorithm used for frequent item set mining and association rule learning over transactional databases. The algorithm was invented in 1994 and has demonstrated its practicality in many different industries. We applied the Apriori algorithm to our data set and were able to confirm some of the already-known knowledge about comorbidities of autism, such as a higher prevalence of epilepsy. We also discovered some interesting insights that are not as well known, such as an association of diabetes with autism. We observed how digestive problems in people with autism evolve from early age to adulthood, how obesity and diabetes as comorbid conditions of autism change over time, and how epilepsy and mental disorders progress in patients with autism over time. We plan to publish the results of this analysis in a scientific paper.
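For readers unfamiliar with Apriori, here is a minimal sketch of the idea in plain Python. It is not the code used in our analysis: the toy records, the 0.4 support threshold and the helper names `apriori` and `confidence` are all made up for illustration. Each "transaction" is the set of diagnoses recorded for one hypothetical patient.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: keep a candidate only if all (k-1)-subsets are frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


# Toy, made-up records; not real patient data
records = [
    {"autism", "epilepsy"},
    {"autism", "epilepsy", "diabetes"},
    {"autism", "diabetes"},
    {"autism"},
    {"epilepsy"},
]
frequent = apriori(records, min_support=0.4)
rule_confidence = confidence(frequent, {"autism"}, {"epilepsy"})
```

The prune step is what makes the approach practical on large data sets: an item set is only counted if all of its smaller subsets were already found to be frequent, so most combinations are never examined.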

As mentioned earlier, big healthcare data is noisy and biased in most cases. Because of these limitations, it’s clear that integrating different data resources – such as pharma data, clinical data, inpatient and outpatient data, quantified-self device readings, and insurance info – could add value and different perspectives to the knowledge discovery process. We used inpatient data sets, so our analysis focused on the very sick population of patients with autism who visited the hospital because of a serious medical condition. Patients with autism appear more often in ambulatory settings than in hospitals, so more information could be captured if ambulatory data were also available.

While this research is still in progress, we believe the results will prove invaluable to practicing physicians. The process of compiling, combining, researching, analyzing and distributing the data is complex, and the road to get there is undetermined at this point. But the implications of sharing data are far-reaching and can positively impact patient diagnosis and care, so the road is worth traveling.

Marzieh Nabi, a researcher at PARC, a Xerox company, offers insight on how electronic medical records (EMRs) can shed light on autism and its related chronic conditions.
Other PARC Collaborators include: Dr. Saeid Shahraz, Razieh Nabi, Gaurang Gavai.
