Why Big Data Needs a Unified Theory of Everything
This blog is an excerpt of an article that currently appears on VentureBeat.com
As I learned from my work in flight dynamics, to keep an airplane flying safely, you have to predict the likelihood of equipment failure. And today we do that by combining various data sets with real-world knowledge, such as the laws of physics.
Integrating these two sets of information — data and human knowledge — automatically is a relatively new idea and practice. It involves combining human knowledge with a multitude of data sets via data analytics and artificial intelligence to potentially answer critical questions (such as how to cure a specific type of cancer). As a systems scientist who has worked in areas such as robotics and distributed autonomous systems, I see how this integration has changed many industries. And I believe there is a lot more we can do.
Take medicine, for example. The immense amount of patient data, trial data, medical literature, and knowledge of key functions like metabolic and genetic pathways could give us tremendous insight if it were available for mining and analysis. If we could overlay all of this data and knowledge with analytics and artificial intelligence (AI) technology, we could solve challenges that today seem out of our reach.
I’ve been exploring this frontier for quite a few years now – both personally and professionally. During my years of training and continuing into my early career, my father was diagnosed with a sequence of chronic conditions, starting with a brain tumor when he was only 40 years old. Later, a small but unfortunate car accident injured the same area of scalp that had been weakened by radio- and chemotherapy. Then he developed cardiovascular issues resulting from repeated use of anesthesia, and lastly he was diagnosed with chronic lymphocytic leukemia. This unique combination of conditions (comorbidities) meant it was extremely difficult to get insight into his situation. My family and I desperately wanted to learn more about his medical issues and to understand how others have dealt with similar diagnoses; we wanted to completely immerse ourselves in the latest medications and treatment options, learn the potential adverse and side effects of the medications, understand the interactions among the comorbidities and medications, and understand how new medical discoveries could be relevant to his conditions.
But the information we were looking for was difficult to source and didn’t exist in a form that could be readily analyzed.
Each of my father’s conditions was being treated in isolation, with no insight into drug interactions. A phenytoin-warfarin interaction was just one of the many potential dangers of this lack of insight. And doctors were unsure of how to adjust the dosages of each of my father’s medications to minimize their adverse and side effects, which turned out to be a big problem.
We also had no predictive knowledge of what to expect next.
My father’s situation is a frighteningly common one. Comorbidities (cases in which patients have two or more chronic conditions) were named the 21st-century challenge for healthy aging by the “White House Conference on Aging” in 2014. In developed nations, about one in four adults has at least two chronic conditions, and more than half of older adults have three or more chronic conditions. In the United States, the $2 trillion healthcare industry spends 71¢ of every dollar on treating individuals with comorbidities. In Medicare spending, the amount rises to 93¢ of every dollar.
And comorbidities pose huge challenges to clinicians, who must be cognizant of many layers of care and complexities involved in treating these patients. These cohorts of patients are excluded from most clinical trials. In particular, it is quite difficult to design hypothesis tests due to the heterogeneity and diverse set of possibilities, and it is expensive to run the trials. So even the medical community must rely heavily on observational data and analytical tools from data mining and machine learning algorithms.
But what if we were able to form deep partnerships across medicine and data science in order to bring together the vast set of medical knowledge, patient data, and analytics? I wanted to find out.
As my family struggled to learn more about and track my father’s medical conditions, I was able to get hold of some public medical data. Putting my science hat on, I started to mine these data sets in my after-work hours and weekends using data analytics techniques. Before I noticed it, this had become my full-time profession at PARC. My work on comorbidities provides a view of how this new field of data analytics works, the partnerships that could arise, and the disruptive changes it will bring.
AI can integrate medical knowledge with data analytics
With help from new regulations and incentive programs as well as new technological advancements, we have access to more digital healthcare records than at any time in the past. Healthcare data sets consist of both structured and unstructured information. Rich Electronic Medical Record (EMR) data sets exist, which include personal and family medical history, treatments, procedures, laboratory tests, large collections of complex physiological information, medical imaging data, genomics, and socio-economic and behavioral data. The data captures a variety of layers — from molecular information and genomics to pathophysiologic responses to diagnoses and procedures to data from self-quantified devices.
Recently, I was fortunate enough to get access to a rich longitudinal inpatient EMR data set with more than nine million unique patients. I started by looking at what clusters of comorbidities co-occur, why, and how these clusters vary as a function of different patient populations and other covariates such as age, gender, ethnicity, environment, and socio-economic factors. I applied advanced statistical methods to create a map of causal relationships between different diseases. Leveraging temporal data led to developing mathematical disease progression models. But something was not quite right.
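To make the first step above concrete, here is a minimal sketch of how co-occurring conditions might be scored. It is not the method used in the work described here: it uses lift, a simple association measure that says nothing about causation, and the function name, patient IDs, and toy records are all invented for illustration.

```python
from collections import Counter
from itertools import combinations


def comorbidity_lift(patients):
    """Return {(a, b): lift} for every pair of conditions seen together.

    `patients` maps a (hypothetical) patient ID to that patient's set of
    chronic conditions. A lift above 1 means the pair appears together more
    often than it would if the two conditions were independent.
    """
    n = len(patients)
    single = Counter()
    pair = Counter()
    for conditions in patients.values():
        conds = sorted(set(conditions))  # sort so each pair has one canonical order
        single.update(conds)
        pair.update(combinations(conds, 2))
    return {
        (a, b): (count / n) / ((single[a] / n) * (single[b] / n))
        for (a, b), count in pair.items()
    }


# Toy records, made up for this example only.
toy_patients = {
    "p1": {"diabetes", "kidney disease"},
    "p2": {"diabetes", "kidney disease", "hypertension"},
    "p3": {"hypertension"},
    "p4": {"diabetes"},
}
lift = comorbidity_lift(toy_patients)
```

Even a simple measure like this inherits whatever noise and bias the underlying records contain.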
First, no matter how good EMR data is, medical data is noisy and biased in most cases. The complex nature of the factors involved in transforming the verbal information exchange between patients and physicians into written information on medical charts and from there to International Classification of Diseases (ICD) codes used in EMR data leads to enormous coding errors. In addition, different hospitals have different coding quality standards. The medical claims are the backbone of EMR data, but they are collected for billing purposes, which brings yet another source of bias and noise into the data. Coders, hospital administrators, health providers, payers, and patients have different perspectives and expectations when it comes to the medical data. This multi-faceted nature of medical data has a big impact on the way data is collected and how it will be mined. Inventing algorithms that measure and quantify the quality of data from different resources and filtering noise and bias from the data will be an inevitable part of working with medical data.
Besides the quality of data, there was something more fundamentally wrong about using only EMR data. For example, my causal inference algorithms resulted in noisy and often invalidated relationships between comorbidities. I tried to validate and explain the results by talking to medical doctors and researchers as well as reviewing the extensive medical knowledge in literature and other databases.
Going through this process led me to a “eureka moment”: If we can automatically integrate our accumulated experiences around the globe together with the long history of medicine, we could:
1. Identify interesting and yet non-intuitive insights that would help health providers to effectively choose appropriate treatment plans
2. Generate hypotheses for medical researchers that would expedite knowledge discovery
3. Develop actionable information for patients and family members to efficiently manage the comorbidities.
Medicine has perhaps one of the longest histories among the different branches of science. The accumulated knowledge in literature and medical and pharma trials is enormous today. Medical knowledge will continue to expand. Big data in medicine can give us interesting insights only if it goes hand-in-hand with the medical knowledge. Looking for causal relationships between different diseases in big EMR data will lead to robust results only if the existing medical knowledge, e.g. causal relationships between diabetes and kidney diseases, is incorporated into our machine learning algorithms. This is all fantastic, but the challenge is that medical knowledge is captured in different ontologies and representations (text, pathways, images, etc.). Additionally, combining medical knowledge is complicated because each source describes a different level of the human system. Some may describe high-level functions, others may describe organ-level functions, and others may focus on the subcell level, describing DNA, RNA, and proteins. So an important part of this process is inventing AI machinery that can assimilate all of this disparate information.
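One very simple way to picture "incorporating existing knowledge" is to demand less data evidence for a relationship that medicine already documents than for one it does not. The sketch below assumes exactly that and nothing more: the function, the two thresholds, and the edge scores are hypothetical toy values, not output from any real algorithm or data set.

```python
def select_edges(data_scores, known_edges, base_threshold=0.8, known_threshold=0.5):
    """Keep candidate cause -> effect edges, trusting documented ones sooner.

    `data_scores` maps (cause, effect) to a data-driven confidence in [0, 1].
    `known_edges` is a set of (cause, effect) pairs taken from prior knowledge.
    Both thresholds are illustrative assumptions.
    """
    kept = []
    for edge, score in data_scores.items():
        # Edges backed by prior knowledge need less support from the data.
        threshold = known_threshold if edge in known_edges else base_threshold
        if score >= threshold:
            kept.append(edge)
    return sorted(kept)


# Toy scores, made up for this example only.
data_scores = {
    ("diabetes", "kidney disease"): 0.6,
    ("hypertension", "diabetes"): 0.6,
    ("hypertension", "kidney disease"): 0.9,
}
known_edges = {("diabetes", "kidney disease")}
selected = select_edges(data_scores, known_edges)
```

With these toy values, the documented diabetes-to-kidney-disease edge survives at a score of 0.6, while the undocumented edge with the same score is dropped. A real system would need a far richer representation of knowledge than a set of pairs.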
Consider the types of problems we could address from both a patient’s and scientist’s perspective:
Patient perspective: A tremendous amount of data from patient histories combined with medical knowledge can be used to identify clusters of comorbidities and their past and future progression trajectories. Then, patients can be classified based on the comorbidities and the trajectories they follow. This approach will help both patients and doctors to summarize experiences and figure out what to expect next and which treatment plan is the most effective.
Scientist perspective: We can exploit commonalities in trajectories to provide evidence for interactions between comorbidities and generate scientific hypotheses. The goal is to achieve meaningful and actionable insights through a successful marriage of artificial intelligence/machine learning and medicine. In order to perform data-driven analysis, we need to address such challenges as integrating multiple data types, dealing with missing data, and handling irregularly sampled and biased data. Automatic integration of data and medical knowledge is a challenging and yet promising scientific question. While these challenges need to be taken into account by computational scientists working with healthcare data, a larger problem involves how best to ensure the hypotheses posed and types of knowledge discoveries sought are relevant to the healthcare community.
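The "what to expect next" idea from the patient perspective can be sketched by counting which condition most often follows a given diagnosis history. This is only an illustration under strong assumptions: each trajectory is a clean, ordered list of diagnoses, and the function name and toy sequences are invented. It ignores timing, missing data, and bias, which are the hard parts described above.

```python
from collections import Counter, defaultdict


def next_condition_counts(trajectories):
    """Map each observed history (a tuple of diagnoses) to a Counter of what came next."""
    followers = defaultdict(Counter)
    for seq in trajectories:
        for i in range(1, len(seq)):
            history = tuple(seq[:i])  # everything diagnosed before position i
            followers[history][seq[i]] += 1
    return followers


# Toy trajectories, made up for this example only.
toy_trajectories = [
    ["diabetes", "kidney disease", "hypertension"],
    ["diabetes", "kidney disease"],
    ["diabetes", "hypertension"],
]
followers = next_condition_counts(toy_trajectories)
most_likely_next = followers[("diabetes",)].most_common(1)[0]
```

Grouping patients who share a history is also a crude form of the trajectory-based classification mentioned above, and histories that many patients share are natural starting points for hypotheses.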
Given the expanding reach of medical data, we are entering a new age of intelligent medicine. Machine learning is the core technology enabling this development, but it will be critical for domain experts to understand and trust the results of machine learning algorithms. Current machine learning techniques produce models that are opaque, non-intuitive, and difficult for experts to rely on in their decision processes. But if we can integrate medical data and human knowledge, we can deliver explainable/interpretable intelligence to health providers and medical researchers.
I hope we can start to leverage the power of the experiences of all patients combined with the long history of medical knowledge to improve the quality of care for individual patients. The process would have to begin with a new generation of partnerships between data science and the keepers of knowledge.
As I said above, this approach isn’t just relevant to the medical world. It could be used to solve complex problems across a variety of fields. When this happens, a new wave of disruption will take place in the form of data analytics married to human knowledge.