Turning a Sparse Data Problem into a Big Data Solution
Infectious disease epidemics can cause significant human and economic harm. The second-largest Ebola epidemic on record (with 2,309 cases reported by the Democratic Republic of the Congo Ministry of Health as of 27 May 2019) continues to rage, with no end in sight. In the United States, measles cases have reached their highest level in 25 years, according to the US Centers for Disease Control and Prevention. The United Nations estimates that Zika virus caused $18 billion in economic losses in Latin America and the Caribbean. Governments and corporations continue to face disease threats and need innovative strategies to prepare for, mitigate, and manage the risk. Traditionally, epidemics have posed a sparse data problem, but today’s technologies make it possible to mine big data insights, even when starting out with little information.
During an epidemic, the data available typically include officially reported counts of cases and deaths at different time points and locations. However, many cases and deaths go undetected and unreported. As a result, a large portion of the epidemic risk space remains unmeasured and unknown. This gap is even more acute for very rare but highly catastrophic scenarios, which can devastate the livelihoods of countries and companies.
While electronic data sources are extremely valuable, several challenges currently stand in the way of realizing their full potential. Electronic health records capture a wide variety of clinical and utilization data but can run afoul of privacy concerns. Similarly, cell phone location records can be invaluable for estimating how people move, which is an important predictor of where an outbreak might go next. But these data often cover only a subset of the population and are typically proprietary. So while the potential exists to use many different data sources, in practice they are very difficult to obtain and use.
To fill the knowledge gaps in the epidemic risk space, the sparse data problem can be converted into a big data solution through synthetic data generation. The term “synthetic data” refers to data that are generated rather than observed, for example by complex mathematical simulation models. The term is achieving broader usage in the field of data science and is expected to gain even more traction in the future.
At Metabiota, we generate these data by feeding sparse data from real-world datasets, such as statistical distributions of disease-spread parameters, into computationally intensive epidemic simulation models. The models replicate the entire world (all 7.5 billion people) and estimate where an epidemic starts, how it spreads from person to person, and how it moves from place to place. Each simulated epidemic is tracked on a daily time step until it burns itself out or is successfully contained by intervention measures such as quarantines and vaccines. We track hundreds of thousands of simulated epidemics in this way, producing a very large dataset from which to derive insights about potential impacts, such as numbers of infections, hospitalizations, deaths, employee absences, and monetary losses.
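To make the idea concrete, here is a minimal sketch of synthetic epidemic generation. It is not Metabiota's model: it is a textbook stochastic SIR (susceptible, infectious, recovered) simulation of a single, well-mixed population of 500 people, written in Python purely for illustration. The parameter ranges, the population size, and the function names (`simulate_epidemic`, `run_catalog`) are all assumptions. It shows only the overall pattern: draw disease parameters from distributions, step each epidemic forward one day at a time until it burns out, and collect the results in a catalog.

```python
import math
import random


def binomial(rng, n, p):
    """Draw from Binomial(n, p) by summing n Bernoulli trials (fine for small n)."""
    return sum(1 for _ in range(n) if rng.random() < p)


def simulate_epidemic(rng, population=500, initial_infected=1, max_days=365):
    """Run one stochastic SIR epidemic on a daily time step.

    Returns (total_infections, duration_days).
    """
    # Hypothetical parameter distributions standing in for sparse real-world estimates.
    r0 = rng.uniform(0.8, 3.0)                # basic reproduction number
    infectious_days = rng.uniform(3.0, 10.0)  # mean infectious period in days
    gamma = 1.0 / infectious_days             # daily recovery rate
    beta = r0 * gamma                         # daily transmission rate

    susceptible = population - initial_infected
    infected = initial_infected
    total_infections = initial_infected
    day = 0

    # Step forward one day at a time until the epidemic burns itself out.
    while infected > 0 and day < max_days:
        p_infect = 1.0 - math.exp(-beta * infected / population)
        p_recover = 1.0 - math.exp(-gamma)
        new_infections = binomial(rng, susceptible, p_infect)
        recoveries = binomial(rng, infected, p_recover)
        susceptible -= new_infections
        infected += new_infections - recoveries
        total_infections += new_infections
        day += 1

    return total_infections, day


def run_catalog(n_runs=100, seed=42):
    """Simulate many epidemics and return a list of (total_infections, days)."""
    rng = random.Random(seed)
    return [simulate_epidemic(rng) for _ in range(n_runs)]


catalog = run_catalog()
largest = max(total for total, _ in catalog)
print(f"{len(catalog)} simulated epidemics; largest infected {largest} people")
```

A production model would add geography, travel between locations, and interventions, but the structure of the loop is similar: sparse inputs go in, and a large catalog of simulated outcomes comes out.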
This rich synthetic dataset allows us to explore a wider set of realistic scenarios, which can provide insights about just how bad an epidemic could be, how likely an epidemic of a given size is, and which types of intervention measures are most effective. Ultimately, these data can help countries and companies decide on optimal risk mitigation plans for future outbreaks.
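One simple way to read "the likelihood of seeing an epidemic of such a size" from a catalog of simulations is an empirical exceedance probability: the fraction of simulated epidemics that reach or exceed a given size. The sketch below assumes a small, made-up list of infection totals and hypothetical function names; it illustrates the calculation only and does not reproduce any real results.

```python
def exceedance_probability(outcomes, threshold):
    """Fraction of simulated epidemics whose outcome is at least `threshold`."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(1 for x in outcomes if x >= threshold) / len(outcomes)


def exceedance_curve(outcomes, thresholds):
    """Return (threshold, probability) pairs for a list of thresholds."""
    return [(t, exceedance_probability(outcomes, t)) for t in thresholds]


# Hypothetical infection totals from ten simulated epidemics (illustrative only).
simulated_infections = [1, 2, 2, 5, 9, 40, 120, 800, 3500, 12000]

for threshold, prob in exceedance_curve(simulated_infections, [10, 100, 1000, 10000]):
    print(f"P(infections >= {threshold}) = {prob:.1f}")
```

With these ten made-up values, half of the simulated epidemics reach at least 10 infections and one in ten reaches at least 10,000. The same calculation applied to deaths or monetary losses gives the kind of curve used to reason about rare but catastrophic scenarios.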
Selecting the right tools, infrastructure, and resources
Synthetic data generation can be a very computationally intensive process. A data science department running large-scale computer simulations requires a significant amount of computing resources. For example, running the tens of millions of epidemic and pandemic simulations to date has required over 90,000 compute hours and 11.4 billion I/O requests, and has produced over 100 terabytes of uncompressed data. This magnitude of computing power is made possible through the use of high-performance cloud computing. To reduce costs, we use multiple cloud providers, select the storage class that best matches how often the data are accessed, and use spot/preemptible instances whenever possible.
Once the simulations are completed, the next step is to mine the output and derive insights from the massive amounts of data. To do this, we need the right team and the right tools. First, we have assembled a high-functioning team featuring top talent and a collaborative team structure. Our team includes data analysts who do much of the initial data collection and structuring, which is used to inform the models. Our data scientists are strong in three key areas: programming, statistics, and subject-matter expertise (for example, epidemiology or actuarial science). Second, in-house tools developed in R are used to generate and analyze the large datasets. However, it is also important to build in flexibility and be open to experimentation so as not to get locked into a single approach. We have experimented with other approaches and tools, although R continues to win out due to its flexibility and open-source nature. We also continue to explore new data sources, models, and collaboration opportunities.
Our business environment has been transformed by the ability to use large-scale computing resources and simulation modeling to generate massive quantities of simulated epidemic data. These data can be mined to gain deep insights about how epidemics can affect us all. With these insights, countries and companies can more effectively mitigate and manage epidemic risk and can improve the world’s resilience to epidemics.