LLMLinelist: Enabling rapid outbreak analytics with large language models
Effective outbreak response requires rapid analysis of messy data. Advances in AI may be able to assist.
Infectious disease outbreaks require the rapid collection, processing, and analysis of data to determine the outbreak source, understand spread, and subsequently implement control measures to prevent further cases and mitigate the impact on public health. This is typically a slow, labour-intensive process: lengthy interviews with cases and contacts must be distilled down and manually digitised into spreadsheets before epidemiological conclusions can be drawn from statistical analysis. Such a process takes time, a scarce resource in an outbreak situation, hindering effective disease control and situational awareness. However, advances in generative artificial intelligence may offer a solution to this problem, streamlining the transcription and processing of epidemiological data to accelerate outbreak analytics and enable prompt response.
Lots of data, little time
Outbreak investigations typically begin with field epidemiologists and health workers interviewing cases to determine when they fell ill, where they’ve been and who they’ve been in contact with, with the aim of identifying who may have infected them and whom they may have gone on to infect in turn. In its rawest form, this is free-form prose, either recorded as speech by a dictaphone or transcribed by hand by the interviewer. To enable an epidemiological overview of the outbreak, this data must be processed and aggregated into a machine-readable form known as a linelist (in other words, a spreadsheet where each case is a row).
This requires someone listening to or reading such interviews one by one and manually picking out relevant pieces of information to type into a spreadsheet. This data often lacks structure, making it difficult to extract essential information quickly and accurately. Given the variation in the questions asked and responses given, interviews may fail to capture all relevant details, leading to incomplete or inaccurate data. Moreover, this method is susceptible to recall bias and subjective interpretations, potentially affecting data accuracy; the names of contacts or geographic locations may be misremembered or misspelt; times and dates may be confused, especially if events occurred some time ago. All of these can lead to inconsistencies and complicate data processing.
Response time is a critical concern in outbreak management, and this process isn’t fast. The cumulative delays in data collection, translation, and processing can substantially slow the response, limiting public health officials' ability to make informed decisions and act promptly.
The advent of large language models (LLMs) such as ChatGPT - artificial intelligence models trained on huge text datasets which learn the relationships between words, allowing users to receive bespoke generated output based on a given input by “predicting the next word” - may offer a solution. Given a set of specific instructions known as a “prompt” (e.g. “process this interview into JSON format, giving Name, Age, Sex, Address, Onset date”, etc.) and a report to be processed (“Elena (F, 28) met her friends Miguel (30, M), Sofia (26, F), and Diego (M, 29)...”), models such as ChatGPT can automatically pick out relevant data and record it in a specific machine-readable format, such as JSON or CSV. Following sense-checks, the data can then be immediately available for upload to a database and rapid analysis by epidemiologists, whether they are in the field or based internationally, substantially shortening response times.
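To make this concrete, here is a minimal sketch of the prompt-and-parse workflow described above. The field names and the example model reply are illustrative, and the actual model call (to whichever API or local model you use) is deliberately left out - this only shows how a report might be wrapped in extraction instructions and how a JSON reply could be turned into linelist rows.

```python
import json

# Fields we want extracted for each case in the linelist (illustrative)
FIELDS = ["Name", "Age", "Sex", "Address", "Onset date"]

def build_prompt(report: str) -> str:
    """Combine extraction instructions with a raw interview report."""
    return (
        "Process this interview into JSON format, returning a list of "
        f"objects with the keys: {', '.join(FIELDS)}. "
        'If a value is missing or unclear, return "NA".\n\n'
        f"Report: {report}"
    )

def parse_reply(reply: str) -> list[dict]:
    """Parse the model's JSON reply into linelist rows (one dict per case)."""
    rows = json.loads(reply)
    # Keep only the expected fields, filling any gaps with "NA"
    return [{f: row.get(f, "NA") for f in FIELDS} for row in rows]

# An illustrative reply a model might return for the example report
example_reply = """[
  {"Name": "Elena", "Age": 28, "Sex": "F", "Address": "NA", "Onset date": "NA"},
  {"Name": "Miguel", "Age": 30, "Sex": "M", "Address": "NA", "Onset date": "NA"}
]"""
linelist = parse_reply(example_reply)
```

Each dict in `linelist` then maps directly onto one row of the spreadsheet.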
There are challenges and drawbacks. One is input data quality. Data collected in interviews still suffers from the quality issues described above, and garbage in will always equal garbage out. If specific pieces of data are omitted, models tend to “hallucinate” detail where there is none, such as inventing a symptom onset date, or a link between two cases. This can be somewhat mitigated through prompting (e.g. “if data is missing or unclear, return NA”), though it is best addressed at source - one way could be for interviewers to ask a list of pre-specified questions, more like a survey, to ensure key data is collected.
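The sense-checks mentioned earlier could take a shape like the sketch below: simple automated rules that flag suspicious values for manual review before the data reaches the database. The field names and rules here are assumptions for illustration, not a prescribed validation scheme.

```python
from datetime import datetime

def sense_check(row: dict) -> list[str]:
    """Return a list of warnings for one linelist row; empty means it passed."""
    warnings = []
    # Missing data should appear as an explicit "NA", never as a blank
    for field, value in row.items():
        if value in ("", None):
            warnings.append(f"{field}: empty value, expected 'NA'")
    # Onset dates should parse and should not lie in the future
    onset = row.get("Onset date", "NA")
    if onset != "NA":
        try:
            if datetime.strptime(onset, "%Y-%m-%d") > datetime.now():
                warnings.append("Onset date: in the future, possible hallucination")
        except ValueError:
            warnings.append("Onset date: unparseable, flag for manual review")
    # Ages outside plausible bounds suggest a transcription or model error
    age = row.get("Age", "NA")
    if age != "NA" and not (0 <= int(age) <= 120):
        warnings.append("Age: outside plausible range")
    return warnings
```

A row with an onset date far in the future, for instance, would be flagged rather than silently accepted - catching exactly the kind of invented detail discussed above.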
However, the most significant challenge is data protection and privacy. Anything entered into the closed-source ChatGPT chatbox is uploaded to OpenAI servers to train future models; entering sensitive data into these models could compromise vulnerable individuals and violate data protection rules. OpenAI state that this is not the case for their paid API and that nothing is stored, though this still requires pinging OpenAI servers. The alternative is open-source LLMs - run locally and securely - which are improving by the day. These are currently trained on much smaller datasets than those of the models offered by OpenAI, Google and the like, though are capable enough for many tasks. I found that the open-source model Mistral 7B was more than capable for the task of converting free-text outbreak reports to a table, returning results in around 20 seconds on a 2021 MacBook Pro, depending on the report's length - not quite as fast as ChatGPT, but not too shabby. Local models also have the benefit of not requiring an internet connection, meaning they could be used in remote locations close to the outbreak epicentre.
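As one possible way to run such a local model, the sketch below assumes a locally running Ollama server hosting a Mistral model on its default port (11434); the endpoint, model name, and prompt wording are assumptions about one particular setup, not the only way to do this. No data leaves the machine.

```python
import json
import urllib.request

def build_request(report: str, model: str = "mistral") -> dict:
    """Build the JSON payload for Ollama's /api/generate endpoint."""
    prompt = (
        "Convert this outbreak report to a CSV table with columns "
        "Name, Age, Sex. If a value is missing, write NA.\n\n" + report
    )
    return {"model": model, "prompt": prompt, "stream": False}

def run_local(report: str,
              url: str = "http://localhost:11434/api/generate") -> str:
    """Send the report to the local model and return its raw text reply."""
    payload = json.dumps(build_request(report)).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Swapping the default URL for a machine on a field laptop's own network keeps the whole pipeline offline, in line with the remote-deployment point above.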
Here I’ve described this method applied to one specific use case - infectious disease outbreaks - though there are many other potential applications. Another could be the processing of doctors’ notes and medical charts detailing the treatments and care given to patients in hospitals, facilitating evidence-based medicine and improving patient care.
Outbreaks require messy data to be made clean and usable as soon as possible to allow for prompt response and to ensure containment. LLMs provide a way to transform such data into structured, machine-readable formats, significantly reducing processing time and facilitating rapid analysis by epidemiologists. Despite their potential, challenges remain, such as ensuring input data quality and addressing data protection and privacy concerns, particularly when sensitive information is involved. Nevertheless, innovations such as local open-source models indicate that such challenges are unlikely to hinder their usefulness for long.
You can try out the app here. Note that this is experimental, and may not always behave as expected, given the stochastic nature of large language models. Feedback welcome!
Thanks to: Sam Clifford, Noel Kennedy, Adam Kucharski, Stefan Flasche, Roz Eggo, Pratik Gupte, Chrissy Roberts, Michael Marks and others in CMMID/ LSHTM for their useful input and feedback.