Name: Brandon Cummings
UMich E-mail: cummingb@umich.edu
I certify that the following paper represents my own independent work and conforms with the guidelines of academic honesty described in the UMich student handbook.
Author's note: The work contained in this document represents the final submission for the HS650 course at the University of Michigan, Ann Arbor. Copyright Brandon Cummings, 2018. No part of this work may be reproduced without the author's written permission.
Libraries Used:
# Data I/O
library('RPostgreSQL') # access MIMIC-III
library('reshape2') # melting dataframes
# Data preparation
library('psych') # descriptive stats
library('tm') # text mining
library('SnowballC') # text stemming
# Analytics
library('rpart') # rpart decision tree
library('caret') # confusion matrices
library('e1071') # naive Bayes
# Plots and tables
library('knitr') # knitting Rmd to HTML; kable() function
library('kableExtra') # extra formatting options for knitr tables
library('DT') # dynamic tables
library('ggplot2') # 2d plotting
library('ggpubr') # extra formatting options for ggplot
library('wordcloud') # visualizing free text data
library('rattle') # fancyRPartPlot
Random Seed Set
set.seed(123456)
Introduction: Sepsis has recently been defined as a “dysregulated host response to infection” (Singer et al., 2016). Despite sepsis being a “leading cause of death and long-term disability in developed countries” (Abe et al., 2018), the medical and scientific community has struggled to enact a serviceable operational definition since the time of Hippocrates (Majno, 1991). Recent consensus definitions have attempted to implement scoring rubrics to guide diagnosis and treatment, and are increasingly focused on predicting mortality among septic patients (Bone et al., 1992; Singer et al., 2016).
Materials and Methods: Using data from the Medical Information Mart for Intensive Care (MIMIC-III), I extract many of the physiologic parameters present in these scoring rubrics and subject them to predictive analytic models in an effort to predict which patients were able to leave the hospital. In addition, physician and nursing notes underwent text mining to reveal commonly used words. An `rpart` decision tree and a Naive Bayes classifier were selected as initial models for investigation due to their opposing classification strategies.
Results: In almost all cases, the text-derived predictors outperformed the quantitative predictors, with accuracy approaching 80%. In general, the resulting models had higher sensitivity than specificity, indicating that such models may be of more use as an initial screening tool than as a definitive predictor.
Conclusion: While further investigation is required to refine the data extraction and the predictive models, it appears that there may be a signal present which allows for binary classification between survivors and non-survivors. Additionally, text mining and natural language processing may be under-utilized tools for predicting mortality in a population of septic ICU patients.
Modern sepsis is widely defined as a “dysregulated host response to infection” (Singer et al., 2016). It is commonly cited as a “leading cause of death and long-term disability in developed countries,” and recent international multicenter cohort studies have estimated the mortality rate of severe sepsis to range from 23% to as high as 49% (Abe et al., 2018). Those who do survive often suffer long-term physical and cognitive impairments (Iwashyna, Ely, Smith & Langa, 2010). It is one of the four most costly conditions in the United States, and in 2011 accounted for 5.2% of total healthcare expenditures with a price tag of $20 billion (Torio & Andrews, 2013). And it is a growing issue: while estimates of sepsis prevalence can vary widely based on the criteria used, annual increases in severe sepsis are consistently estimated to be as high as 13% nationwide (Gaieski, Edwards, Kallan, & Carr, 2013).
The heterogeneity of sepsis makes it a difficult condition to define (Gül, Arslantaş, Cinel, & Kumar, 2017). Early definitions dating back to Hippocrates “claimed that sepsis (σήψις) was the process by which flesh rots, swamps generate foul airs, and wounds fester” (Angus & van der Poll, 2013; Majno, 1991). Later, work by Louis Pasteur and others led to the notion of “blood poisoning” (Angus & van der Poll, 2013). This notion of sepsis as a systemic infection in which pathogenic microbes invade the bloodstream lasted until a 1991 meeting of the Critical Care Medicine Consensus Conference (Bone et al., 1992). Newer, more potent antibiotics led to the discovery that much sepsis-associated pathology could occur in patients who had been completely cleared of infection. Because of this discovery, the focus shifted away from the microorganism and onto the host response to infection, which the conference referred to as the “Systemic Inflammatory Response Syndrome” or SIRS (Table 1).
Table 1: SIRS Criteria. From (Singer et al., 2016). In conjunction with a plausible source of infection, the systemic inflammatory response syndrome (SIRS) criteria formed the backbone of the sepsis definition until 2016.
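The SIRS rubric is simple enough to sketch as a scoring function. The following is an illustrative sketch only, not part of the original analysis: the function name `sirs_score` is mine, and the thresholds follow the Bone et al. (1992) criteria (temperature > 38 °C or < 36 °C; heart rate > 90/min; respiratory rate > 20/min or PaCO2 < 32 mmHg; WBC > 12,000/mm³ or < 4,000/mm³).

```r
# Illustrative only: count how many of the four SIRS criteria a patient meets.
# A score of two or more, with a suspected infection, historically defined sepsis.
sirs_score <- function(temp_c, hr, rr, paco2, wbc) {
  sum(c(temp_c > 38 | temp_c < 36,   # temperature criterion (Celsius)
        hr > 90,                     # heart rate criterion (beats/min)
        rr > 20 | paco2 < 32,        # respiratory criterion (breaths/min, mmHg)
        wbc > 12 | wbc < 4))         # WBC criterion (10^3 cells/mm^3)
}
sirs_score(temp_c=38.5, hr=95, rr=18, paco2=40, wbc=13)  # meets 3 of 4 criteria
```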
The 1991 consensus definition was revised by a task force in 2001 but was left largely unchanged (Levy et al., 2003). This operational definition (Sepsis-II) was the paradigm in use during the data collection period of the MIMIC-III data set, and as such is particularly relevant. The condition of sepsis was conceptualized as a three-tiered hierarchy in which each tier represents an increased risk of mortality (Figure 1).
In 2016, the definition was again drastically revised (Sepsis-III). This new paradigm defines sepsis as “life-threatening organ dysfunction caused by a dysregulated host response to infection” (Singer et al., 2016). The model adapts the previously established Sequential Organ Failure Assessment (SOFA) score to form an operational definition of sepsis (Table 2), and further characterizes septic shock as a “subset of sepsis in which particularly profound circulatory, cellular, and metabolic abnormalities are associated with a greater risk of mortality than with sepsis alone” (Singer et al., 2016). The authors further introduce a “quick SOFA” (qSOFA) score for rapid clinical identification (Table 3).
Table 2: SOFA Criteria. From (Singer et al., 2016). The Sequential [Sepsis-Related] Organ Failure Assessment (SOFA) score. In addition to a suspected source of infection, diagnosis under Sepsis-III requires a score of at least two. Increases in SOFA score are associated with an approximately 10% increase in mortality.
Table 3: qSOFA Criteria. From (Singer et al., 2016). The “Quick SOFA” (qSOFA) score is a distilled version of the full organ dysfunction system, designed for quick evaluation by clinical staff at the point of care. A score of two or more qualifies the patient for further assessment.
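As a sketch (not part of the original analysis), the qSOFA rubric can be expressed as a small scoring function. The name `qsofa_score` is illustrative, and the cutoffs follow Singer et al. (2016): respiratory rate ≥ 22/min, systolic blood pressure ≤ 100 mmHg, and altered mentation (Glasgow Coma Scale < 15).

```r
# Illustrative only: one point per qSOFA criterion; a score of two or more
# flags the patient for a full SOFA assessment.
qsofa_score <- function(rr, sbp, gcs) {
  sum(c(rr >= 22,    # respiratory rate (breaths/min)
        sbp <= 100,  # systolic blood pressure (mmHg)
        gcs < 15))   # altered mentation
}
qsofa_score(rr=24, sbp=95, gcs=15)  # scores 2: qualifies for further assessment
```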
Interestingly, one of the major focal points of this model is the use of the SOFA score as a predictor of mortality among septic patients. Even though the SOFA score was not used to evaluate patients during the period in which the MIMIC-III data set was collected, and thus not all of the requisite information is readily available for every patient, this new-found emphasis on mortality prediction pushed me to explore the issue further. Existing models under both Sepsis-II and -III have been limited to what is “clinically feasible” - that is, not every test can be run on every patient. The MIMIC-III database, however, is greatly expanded: in addition to a great many physiologic parameters, the data set includes a wealth of free-text data stored as notes, which may be underutilized. Using this expanded data set, I wish to create a model (or rather, models) which can differentiate between survivors and non-survivors, and to compare its predictive value to a gold standard of SIRS.
Figure 1: Definitions Associated with Sepsis-II and Sepsis-III. Adapted from (Bone et al., 1992) and (Singer et al., 2016).
The Medical Information Mart for Intensive Care (MIMIC-III) is a data set developed by the Massachusetts Institute of Technology (MIT) which contains de-identified healthcare data from more than 40,000 intensive care unit (ICU) patients over the years 2001-2012 (Johnson et al., 2016). It includes most of the information found in an electronic health record (EHR), including demographics, diagnosis codes, procedures, lab values, vital signs, admission and discharge information, physician notes, and optionally waveforms. It is available free of charge; however, accessing the database requires completion of an online course on research ethics.
A local copy of the data was stored as a PostgreSQL database. This was my first time interacting with any SQL system, and it was a fun challenge. To build the database, I followed the tutorial listed on the PhysioNet website and used the tools available in the `mimic-code` GitHub repository (“Install MIMIC (Unix/Mac),” n.d.; Johnson, Stone, Celi, & Pollard, 2018).
Once the database was constructed, I used the `RPostgreSQL` library for data import and manipulation. Note that, due to the large size of the data set, querying the database can be computationally intensive. For some of the larger queries, the results are cached as an `.RData` file and loaded in separately; the same is done when processing the free-text document corpus, which is also computationally intensive.
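The caching pattern described here can be sketched as a small helper. This is an illustrative sketch only - the helper name `cached_query` and the cache file path are mine, not part of the original code - and it assumes `dbGetQuery()` from the `RPostgreSQL` library loaded above:

```r
# Illustrative only: run an expensive query once and cache the result as an
# .RData file; subsequent calls load the cache instead of hitting the database.
cached_query <- function(con, query, cache_fn) {
  if (file.exists(cache_fn)) {
    load(cache_fn)                    # restores `result` from the cache file
  } else {
    result <- dbGetQuery(con, query)  # expensive database round-trip
    save(result, file=cache_fn)
  }
  result
}
```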
The first step is loading the database connection (stored in the variable `con`) using the functions `dbConnect()` and `dbExecute()`. This connection remains open until it is closed using the function `dbDisconnect()`. Note that the `invisible()` wrapper around the `dbExecute()` function serves only to silence the output.
con <- dbConnect(dbDriver('PostgreSQL'),
dbname = 'mimic',
host = '127.0.0.1',
port = '5432',
user = 'postgres',
password = 'postgres')
invisible(dbExecute(con, paste("SET search_path TO ",
schema = 'mimiciii',
sep=" ")))
After acquiring the data, the first step was to extract septic patients. To do this, I use the International Classification of Diseases (ICD-9) codes associated with each patient’s admission. The MIMIC-III database contains a table, `D_ICD_DIAGNOSES`, which contains the interpretation of these codes.
d_icd_diagnosis <- dbGetQuery(con, "SELECT *
FROM D_ICD_DIAGNOSES")
kable(head(d_icd_diagnosis), caption="Sample of `D_ICD_DIAGNOSES` SQL table") %>%
kable_styling(bootstrap_options='striped')
row_id | icd9_code | short_title | long_title |
---|---|---|---|
174 | 01166 | TB pneumonia-oth test | Tuberculous pneumonia [any form], tubercle bacilli not found by bacteriological or histological examination, but tuberculosis confirmed by other methods [inoculation of animals] |
175 | 01170 | TB pneumothorax-unspec | Tuberculous pneumothorax, unspecified |
176 | 01171 | TB pneumothorax-no exam | Tuberculous pneumothorax, bacteriological or histological examination not done |
177 | 01172 | TB pneumothorx-exam unkn | Tuberculous pneumothorax, bacteriological or histological examination unknown (at present) |
178 | 01173 | TB pneumothorax-micro dx | Tuberculous pneumothorax, tubercle bacilli found (in sputum) by microscopy |
179 | 01174 | TB pneumothorax-cult dx | Tuberculous pneumothorax, tubercle bacilli not found (in sputum) by microscopy, but found by bacterial culture |
From within these diagnosis identifiers, I search for the three diagnoses specified under the Bone et al. Sepsis-II criteria: `Sepsis`, `Severe sepsis`, and `Septic shock`.
sepsis_codes <- d_icd_diagnosis[d_icd_diagnosis$short_title %in% c('Sepsis', 'Severe sepsis', 'Septic shock'), ]
kable(sepsis_codes, caption='ICD9 codes of interest') %>%
kable_styling(bootstrap_options='striped')
 | row_id | icd9_code | short_title | long_title |
---|---|---|---|---|
10305 | 11403 | 99591 | Sepsis | Sepsis |
10306 | 11404 | 99592 | Severe sepsis | Severe sepsis |
13143 | 12991 | 78552 | Septic shock | Septic shock |
Under the MIMIC-III schema, encounters are identified both by a patient identifier (`subject_id`) and by a hospital admission identifier (`hadm_id`). In this format, each hospital admission is associated with one unique patient, but a given patient may have multiple admissions in which they were diagnosed with sepsis. Since the primary variable of interest is mortality, I am mostly concerned with the last admission during which a patient was diagnosed with one of the above three ICD-9 codes. For the purposes of this analysis, I assume that patients who are discharged by the hospital and do not return within the study period have survived the disease. One limitation of this selection process is that patients may have been discharged to an outside system such as another hospital or a long-term care facility, in which case our knowledge of the outcome is limited.
The MIMIC-III database contains a `DIAGNOSES_ICD` table which lists the `subject_id`, `hadm_id`, and ICD-9 codes for each encounter. First, a list of all admissions associated with the sepsis ICD-9 codes is generated. I then add a `severity` column to the `sepsis_hadm` dataframe where:
diagnosis_icd <- dbGetQuery(con, "SELECT *
FROM DIAGNOSES_ICD")
sepsis_hadm <- diagnosis_icd[diagnosis_icd$icd9_code %in% sepsis_codes$icd9_code,]
sepsis_hadm$severity <- factor(sepsis_hadm$icd9_code,
levels = c(99591, 99592, 78552),
labels=c(1, 2, 3))
kable(head(sepsis_hadm), caption="Sample septic patients in `DIAGNOSES_ICD` SQL table") %>%
kable_styling(bootstrap_options='striped')
 | row_id | subject_id | hadm_id | seq_num | icd9_code | severity |
---|---|---|---|---|---|---|
88 | 1547 | 117 | 164853 | 16 | 99592 | 2 |
145 | 1604 | 124 | 138376 | 6 | 99592 | 2 |
277 | 505 | 64 | 172056 | 3 | 99591 | 1 |
451 | 679 | 85 | 112077 | 18 | 99591 | 1 |
739 | 131 | 21 | 111970 | 2 | 78552 | 3 |
748 | 140 | 21 | 111970 | 11 | 99592 | 2 |
Due to the aggregate nature of the sepsis diagnoses, patients may have multiple sepsis-related ICD-9 codes in the database (e.g. both sepsis and septic shock). Therefore, I filter this list to include only the most severe diagnosis: e.g. if a patient has been diagnosed with both sepsis and septic shock, only the “septic shock” row is preserved, as it is assumed that all “septic shock” patients are also suffering from sepsis. Practically speaking, this is done by sorting the dataframe by `severity` and then removing `hadm_id` duplicates as they appear.
sepsis_hadm_sorted <- sepsis_hadm[order(sepsis_hadm$severity, decreasing=T),]
sepsis_hadm <- sepsis_hadm_sorted[!duplicated(sepsis_hadm_sorted$hadm_id),]
The next step is to remove duplicate patients so that only the last available encounter is preserved. To do this, the `sepsis_hadm` identifiers filtered above are sorted by admission time, and duplicate `subject_id` entries are removed. The result is a list of unique `hadm_id` identifiers representing each patient’s last sepsis-associated encounter.
admissions <- dbGetQuery(con, sprintf("SELECT *
FROM ADMISSIONS
WHERE hadm_id IN (%s)",
paste(as.character(sepsis_hadm[,'hadm_id']), collapse=", ")))
admissions_sorted <- admissions[order(admissions$admittime, decreasing=T),]
hadm_list <- admissions_sorted[!duplicated(admissions_sorted$subject_id), 'hadm_id']
The final admissions list contains 4689 unique hospital admission identifiers which can be used to extract data from each patient’s last sepsis-associated encounter.
That this process worked as intended can be verified using subject 353, who was admitted on three separate occasions and, during one of those admissions, was diagnosed with both severe sepsis and septic shock.
kable(sepsis_hadm_sorted[sepsis_hadm_sorted$subject_id == 353, ], caption='Example: Patient 353 before sorting') %>%
kable_styling(bootstrap_options = 'striped')
 | row_id | subject_id | hadm_id | seq_num | icd9_code | severity |
---|---|---|---|---|---|---|
4369 | 4145 | 353 | 112976 | 5 | 78552 | 3 |
4368 | 4144 | 353 | 112976 | 4 | 99592 | 2 |
4357 | 4133 | 353 | 108923 | 6 | 99591 | 1 |
4387 | 4163 | 353 | 131488 | 3 | 99591 | 1 |
Next, we see that after the additional filtering, only one row remains for admission `112976` - septic shock (severity level 3). Please note that the times, while internally consistent within patients, have been randomized to protect confidentiality, which is why dates can fall beyond the year 2100.
kable(admissions[admissions$subject_id == 353, 1:5], caption='Example: Patient 353 after removal of duplicate admissions') %>%
kable_styling(bootstrap_options = 'striped')
 | row_id | subject_id | hadm_id | admittime | dischtime |
---|---|---|---|---|---|
479 | 448 | 353 | 108923 | 2151-03-28 16:01:00 | 2151-04-13 16:10:00 |
694 | 449 | 353 | 112976 | 2151-06-23 22:18:00 | 2151-07-04 13:22:00 |
1727 | 450 | 353 | 131488 | 2151-10-01 20:42:00 | 2151-10-20 15:20:00 |
Lastly, we confirm that only the latest identifier remains in the final `hadm_list`:
kable(cbind(admissions[admissions$subject_id == 353, 1:5],
admissions[admissions$subject_id == 353, 'hadm_id'] %in% hadm_list),
col.names=append(colnames(admissions)[1:5], 'in_hadm_list'),
caption='Admission identifiers and times included in final admission list') %>%
kable_styling(bootstrap_options = 'striped')
 | row_id | subject_id | hadm_id | admittime | dischtime | in_hadm_list |
---|---|---|---|---|---|---|
479 | 448 | 353 | 108923 | 2151-03-28 16:01:00 | 2151-04-13 16:10:00 | FALSE |
694 | 449 | 353 | 112976 | 2151-06-23 22:18:00 | 2151-07-04 13:22:00 | FALSE |
1727 | 450 | 353 | 131488 | 2151-10-01 20:42:00 | 2151-10-20 15:20:00 | TRUE |
Before delving into the predictive modeling aspect, I wanted to provide some insight into the selected patient cohort via exploratory data analysis. To aid this, I first define several plotting functions which will make the following sections more convenient.
Pie Chart:
plotPieChart <- function(d, legend, title){
fig <- ggplot(d, aes(x=factor(1), fill=legend)) +
geom_bar(width=1) +
coord_polar('y') +
xlab('') + ylab('') + ggtitle(sprintf('Pie Chart of %s', title)) +
scale_fill_discrete(name=title) +
theme(axis.title.x=element_blank(), axis.title.y=element_blank(),
axis.text.x=element_blank(), axis.text.y=element_blank(),
axis.ticks.x=element_blank(), axis.ticks.y=element_blank(),
panel.background=element_blank(), panel.border=element_blank(),
panel.grid.major=element_blank(), panel.grid.minor=element_blank(),
plot.background=element_blank())
  return(fig)
}
Bar Chart:
plotBarChart <- function(d, legend, title){
fig <- ggplot(d, aes(
x=factor(legend, levels=names(sort(table(legend), decreasing=TRUE))),
fill=factor(legend, levels=names(sort(table(legend), decreasing=TRUE))))) +
scale_fill_discrete(name=title) +
ggtitle(sprintf('Bar Chart of %s', title)) + xlab(title) +
geom_bar()
  return(fig)
}
Histogram:
plotHistogram <- function(d, xdata, title){
fig <- ggplot(data=d, aes(x=xdata)) +
geom_histogram() +
ggtitle(sprintf('Histogram of %s', title)) + xlab(title)
  return(fig)
}
First, and perhaps most pressingly, I wanted to examine the mortality rate in my cohort. To do this, I used the presence or absence of the `deathtime` attribute in the `ADMISSIONS` table.
admissions <- dbGetQuery(con, sprintf("SELECT *
FROM ADMISSIONS
WHERE HADM_ID IN (%s)",
paste(as.character(hadm_list), collapse=", ")))
print(plotPieChart(admissions, !is.na(admissions$deathtime), 'Patient Mortality'))
Of the 4689 patients in the selected cohort, 3027 (65%) survived to discharge while the remaining 1662 (35%) did not. This falls within the range cited by Abe et al.’s 2018 study and review, which reported mortality rates ranging from 23% to 49%. This high mortality rate is concerning.
Length of stay (LoS) was calculated for each patient as the difference between discharge and admission times. Note that for patients who did not survive, the `dischtime` attribute is equal to `deathtime`. The LoS is then visualized via histogram, with summary statistics provided by the `describe()` function in the `psych` package. Additionally, LoS was calculated independently for patients who survived until discharge and those who did not.
admissions$length_of_stay <- difftime(admissions$dischtime, admissions$admittime, units='days')
ggarrange(
plotHistogram(admissions,
admissions$length_of_stay,
'LoS: All Patients (days)'),
plotHistogram(admissions[is.na(admissions$deathtime),],
admissions$length_of_stay[is.na(admissions$deathtime)],
'LoS: Discharged Patients (days)'),
plotHistogram(admissions[!is.na(admissions$deathtime),],
admissions$length_of_stay[!is.na(admissions$deathtime)],
'LoS: Deceased Patients (days)'))
Each of the above distributions is unimodal and heavily right-skewed. Summary statistics are presented below.
tbl <- data.frame(All_Patients=t(describe(as.numeric(admissions$length_of_stay))),
Discharged_Patients=t(describe(as.numeric(admissions$length_of_stay[is.na(admissions$deathtime)]))),
Deceased_Patients=t(describe(as.numeric(admissions$length_of_stay[!is.na(admissions$deathtime)]))))
kable(tbl, caption='LoS: All Patients (days)', digits=2, col.names=c("All Patients", "Discharged Patients", "Deceased Patients")) %>%
kable_styling(bootstrap_options = 'striped')
 | All Patients | Discharged Patients | Deceased Patients |
---|---|---|---|
vars | 1.00 | 1.00 | 1.00 |
n | 4689.00 | 3027.00 | 1662.00 |
mean | 14.71 | 15.16 | 13.87 |
sd | 16.28 | 15.25 | 17.99 |
median | 9.79 | 10.24 | 8.40 |
trimmed | 11.82 | 12.41 | 10.44 |
mad | 8.28 | 7.70 | 9.67 |
min | -0.32 | 0.16 | -0.32 |
max | 294.62 | 294.62 | 206.43 |
range | 294.94 | 294.46 | 206.75 |
skew | 4.18 | 4.59 | 3.70 |
kurtosis | 36.24 | 49.18 | 22.39 |
se | 0.24 | 0.28 | 0.44 |
Even though sepsis is a disease characterized by the host’s response rather than by a physical etiologic agent, it may be interesting to examine the types of organisms found during sepsis. To extract this information, I use the `MICROBIOLOGYEVENTS` table, which contains the results of microbiology blood cultures. Additionally, `NA` values are interpreted as negative test results, as per the MIMIC-III documentation. The 15 most common pathogens are reported below, along with their counts and relative frequencies.
microbiology_events <- dbGetQuery(con, sprintf("SELECT *
FROM MICROBIOLOGYEVENTS
WHERE hadm_id IN (%s) AND spec_type_desc = 'BLOOD CULTURE'",
paste(as.character(hadm_list), collapse=", ")))
microbiology_events$org_name[is.na(microbiology_events$org_name)] <- 'Negative'
tbl <- data.frame(name = sort(table(microbiology_events$org_name), decreasing=T),
count = sort(prop.table(table(microbiology_events$org_name)), decreasing=T) )
kable(tbl[1:15, c(1, 2, 4)], col.names=c("Organism", "Count", "Frequency"), caption="Most common microorganisms") %>%
kable_styling(bootstrap_options = 'striped')
Organism | Count | Frequency |
---|---|---|
Negative | 29568 | 0.5825059 |
ESCHERICHIA COLI | 4829 | 0.0951340 |
STAPHYLOCOCCUS, COAGULASE NEGATIVE | 4105 | 0.0808708 |
STAPH AUREUS COAG + | 3756 | 0.0739953 |
KLEBSIELLA PNEUMONIAE | 2071 | 0.0407998 |
ENTEROCOCCUS FAECIUM | 1053 | 0.0207447 |
PSEUDOMONAS AERUGINOSA | 720 | 0.0141844 |
PROTEUS MIRABILIS | 524 | 0.0103231 |
ENTEROBACTER CLOACAE | 421 | 0.0082939 |
ENTEROCOCCUS FAECALIS | 404 | 0.0079590 |
STREPTOCOCCUS PNEUMONIAE | 319 | 0.0062845 |
SERRATIA MARCESCENS | 279 | 0.0054965 |
KLEBSIELLA OXYTOCA | 247 | 0.0048660 |
ENTEROBACTER CLOACAE COMPLEX | 142 | 0.0027975 |
STAPHYLOCOCCUS EPIDERMIDIS | 140 | 0.0027581 |
According to these statistics, of the 116 unique microbiology results found, the most common finding is a negative test result. This isn’t surprising, given that only a fraction of bacteria can be cultured in the first place and that sepsis can continue long after the initial etiologic pathogen has been eradicated.
While some patients present to the hospital with full-blown sepsis, conditions such as pneumonia or urinary tract infections can also lead to sepsis. By examining the initial chief complaint in the `ADMISSIONS` table, we can evaluate some of the most common conditions preceding sepsis.
tbl <- data.frame(name = sort(table(admissions$diagnosis), decreasing=T),
count = sort(prop.table(table(admissions$diagnosis)), decreasing=T) )
kable(tbl[1:15, c(1, 2, 4)], col.names=c("Initial Diagnosis", "Count", "Frequency"), caption="Most common initial diagnoses") %>%
kable_styling(bootstrap_options = 'striped')
Initial Diagnosis | Count | Frequency |
---|---|---|
SEPSIS | 712 | 0.1518447 |
PNEUMONIA | 369 | 0.0786948 |
FEVER | 150 | 0.0319898 |
ABDOMINAL PAIN | 127 | 0.0270847 |
HYPOTENSION | 114 | 0.0243122 |
ALTERED MENTAL STATUS | 87 | 0.0185541 |
UROSEPSIS | 60 | 0.0127959 |
CHOLANGITIS | 57 | 0.0121561 |
URINARY TRACT INFECTION;PYELONEPHRITIS | 54 | 0.0115163 |
CONGESTIVE HEART FAILURE | 51 | 0.0108765 |
CELLULITIS | 44 | 0.0093837 |
ACUTE RENAL FAILURE | 41 | 0.0087439 |
LIVER FAILURE | 39 | 0.0083173 |
PANCREATITIS | 33 | 0.0070377 |
GASTROINTESTINAL BLEED | 30 | 0.0063980 |
In addition to initial diagnosis, patients can enter the hospital system in a variety of ways. Among our cohort, by far the most common ingress point is through the emergency department. Elective procedures and urgent care also represented a fair number of patients. There was a small number of newborn patients in the cohort.
tbl <- data.frame(name = sort(table(admissions$admission_type), decreasing=T),
count = sort(prop.table(table(admissions$admission_type)), decreasing=T) )
kable(tbl[, c(1, 2, 4)], col.names=c("Admission Type", "Count", "Frequency"), caption="Most common admission types") %>%
kable_styling(bootstrap_options = 'striped')
Admission Type | Count | Frequency |
---|---|---|
EMERGENCY | 4463 | 0.9518021 |
ELECTIVE | 150 | 0.0319898 |
URGENT | 72 | 0.0153551 |
NEWBORN | 4 | 0.0008531 |
After exploring the selected cohort, I begin extracting features from the database. This was somewhat difficult and very time-consuming, but the result was a data set containing 12 numerical predictors collected at the beginning and end of each patient’s stay. First, I create a clean slate by removing all of the excess data involved in the exploratory data analysis section.
rm(list=setdiff(ls(), c('hadm_list', 'con', 'plotHistogram', 'plotPieChart', 'plotBarChart')))
I then pre-allocate a dataframe with the admission and subject ID numbers, as well as the admission time, discharge time, and a Boolean value representing whether or not the patient was deceased.
admissions <- dbGetQuery(con, sprintf("SELECT *
FROM ADMISSIONS
WHERE HADM_ID IN (%s)",
paste(as.character(hadm_list), collapse=", ")))
data <- cbind(admissions[, c('subject_id', 'hadm_id', 'admittime', 'dischtime')], !is.na(admissions$deathtime))
colnames(data) <- c('subject_id', 'hadm_id', 'admittime', 'dischtime', 'death')
I also want to add an index of severity (sepsis vs. severe sepsis vs. septic shock), and I plot a pie chart showing the relative proportions of each severity level (1 = sepsis, 2 = severe sepsis, 3 = septic shock). As can be seen, almost half of the patients in the cohort have been diagnosed with septic shock.
diagnosis_icd <- dbGetQuery(con, sprintf("SELECT *
FROM DIAGNOSES_ICD
WHERE HADM_ID IN (%s)",
paste(as.character(hadm_list), collapse=", ")))
diagnosis_icd_sepsis <- diagnosis_icd[diagnosis_icd$icd9_code %in% c(99591, 99592, 78552),c('subject_id', 'hadm_id', 'icd9_code')]
diagnosis_icd_sepsis$severity <- factor(diagnosis_icd_sepsis$icd9_code,
levels = c(99591, 99592, 78552),
labels=c(1, 2, 3))
diagnosis_icd_sorted <- diagnosis_icd_sepsis[order(diagnosis_icd_sepsis$severity, decreasing=T),]
diagnosis_icd_sorted <- diagnosis_icd_sorted[!duplicated(diagnosis_icd_sorted$hadm_id),]
data <- merge(data, diagnosis_icd_sorted, by=c('hadm_id', 'subject_id'), all.x=T, all.y=F)
print(plotPieChart(data, data$severity, 'Severity'))
After preallocating the data table, I define a function to extract data from the `LABEVENTS` and `CHARTEVENTS` portions of the database. Then, I can search for each variable and append it to the data frame constructed above. Due to the large size of the data set, querying can be computationally intensive. Thus, I set up a rudimentary caching system which stores each query result as a more optimized `.RData` file. As subsequent changes are made to the document, these caches are loaded in rather than re-querying the database.
Additionally, many of the parameters are collected repeatedly during each patient’s stay. Because of the vast differences in LoS among patients (from 0 to almost 300 days) and the irregularity of the sampling times, I chose to set aside the longitudinal structure and instead focus on cross-sectional analytics. To this end, I extract the first data point recorded after admission for each parameter and the last point before discharge or death. Models built using the set of parameters nearest admission would be the most clinically useful, while I expect models built using data from near discharge or death to have better performance.
parse_data <- function(con, table, itemid, hadm_list, cache_fn, isCached, keepRange){
## Check if data is cached, and if not make the request to the database
if(isCached == F){
var_data <- dbGetQuery(con, sprintf("SELECT SUBJECT_ID, HADM_ID, CHARTTIME, VALUENUM, VALUEUOM
FROM %s
WHERE HADM_ID IN (%s)
AND ITEMID IN (%s)",
table,
paste(as.character(hadm_list), collapse=", "),
paste(as.character(itemid), collapse=", ")))
save(var_data, file=cache_fn)
}
if(isCached == T){
load(cache_fn)
}
# Remove NA and impossible values
var_data <- na.omit(var_data)
var_data <- var_data[(var_data$valuenum > keepRange[1]) & (var_data$valuenum < keepRange[2]),]
# Get first reading
var_sorted <- var_data[order(var_data$charttime, decreasing=F),]
var_first <- var_sorted[match(unique(var_sorted$hadm_id), var_sorted$hadm_id),]
# Get last reading
var_sorted <- var_data[order(var_data$charttime, decreasing=T),]
var_last <- var_sorted[match(unique(var_sorted$hadm_id), var_sorted$hadm_id),]
return(cbind(var_first, var_last))
}
Next, each of the SIRS parameters is isolated from the larger database. These are:
* Temperature
* Heart rate
* Respiratory rate
* Partial pressure of CO₂ in arterial blood (PaCO₂)
* White blood cell count (WBC)
The first item in the SIRS score table is temperature. This, along with most vital signs, is stored in the `CHARTEVENTS` table. However, entries in this table are coded by an `ITEMID` code. To parse these codes, I load the `D_ITEMS` table, search for the requisite entries, and then use the codes to search `CHARTEVENTS`. Because the data was collected using two different systems (CareVue and MetaVision), there are multiple `ITEMID` codes for the same measurement.
d_items <- dbGetQuery(con, 'SELECT *
FROM D_ITEMS')
d_labitems <- dbGetQuery(con, 'SELECT *
FROM D_LABITEMS')
kable(d_items[d_items$itemid %in% c(676,677,3654,3655,6643,224027,223762),], caption='Temperature item identifiers of interest') %>%
kable_styling(bootstrap_options = 'striped')
 | row_id | itemid | label | abbreviation | dbsource | linksto | category | unitname | param_type | conceptid |
---|---|---|---|---|---|---|---|---|---|---|
1447 | 627 | 676 | Temperature C | NA | carevue | chartevents | NA | NA | NA | NA |
1448 | 628 | 677 | Temperature C (calc) | NA | carevue | chartevents | NA | NA | NA | NA |
3896 | 2818 | 3654 | Temp Rectal [F] | NA | carevue | chartevents | NA | NA | NA | NA |
3897 | 2819 | 3655 | Temp Skin [C] | NA | carevue | chartevents | NA | NA | NA | NA |
5036 | 4298 | 6643 | Temp Rectal | NA | carevue | chartevents | NA | NA | NA | NA |
9307 | 13042 | 224027 | Skin Temperature | Skin Temp | metavision | chartevents | Skin - Assessment | NA | Text | NA |
12368 | 12758 | 223762 | Temperature Celsius | Temperature C | metavision | chartevents | Routine Vital Signs | ?C | Numeric | NA |
Next, the `parse_data()` function is used to extract temperature data from the data set using the `ITEMID` codes. Temperatures recorded in Fahrenheit are converted to Celsius. Then, these values are integrated into the `data` object. The distributions of temperature on admission and at discharge/death are plotted, and descriptive statistics are presented below.
# Load data
temperature <- parse_data(con, "CHARTEVENTS", c(676,677, 223761, 223762), hadm_list, '../cache/temperature.RData', T, c(10, 130))
temperature_first <- temperature[,1:5]
temperature_last <- temperature[,6:10]
# Convert F --> C
temperature_first[temperature_first$valueuom == '°F', 'valuenum'] <- (as.numeric(temperature_first[temperature_first$valueuom == '°F', 'valuenum']) - 32)*5/9
temperature_last[temperature_last$valueuom == '°F', 'valuenum'] <- (as.numeric(temperature_last[temperature_last$valueuom == '°F', 'valuenum']) - 32)*5/9
# Remove values outside range after conversion
temperature_first <- temperature_first[(temperature_first$valuenum > 20) & (temperature_first$valuenum < 50),]
temperature_last <- temperature_last[(temperature_last$valuenum > 20) & (temperature_last$valuenum < 50),]
# Add to data object
data <- merge(data, temperature_first[,!names(temperature_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, temperature_last[,!names(temperature_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'temperature_t1', 'temperature_1', 'temperature_t2', 'temperature_2')
# Plot histograms
data_melt <- melt(data[,c('temperature_1', 'temperature_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) +
geom_density(alpha=0.5) +
scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
ggtitle('Distribution of Temperature in Septic Patients') +
ylab('Kernel Density Estimate') + xlab('Temperature (C)')
print(fig)
# Print summary statistics
kable(t(rbind(describe(data$temperature_1), describe(data$temperature_2))),
caption='Descriptive statistics of temperature',
col.names=c('Admission', 'Discharge/Death'),
digits=2) %>%
kable_styling(bootstrap_options = 'striped')
 | Admission | Discharge/Death |
---|---|---|
vars | 1.00 | 1.00 |
n | 4612.00 | 4607.00 |
mean | 36.83 | 36.67 |
sd | 1.13 | 0.84 |
median | 36.78 | 36.67 |
trimmed | 36.82 | 36.66 |
mad | 0.99 | 0.74 |
min | 26.67 | 30.00 |
max | 40.89 | 41.00 |
range | 14.22 | 11.00 |
skew | -0.35 | -0.08 |
kurtosis | 3.43 | 2.82 |
se | 0.02 | 0.01 |
After the filtering and sorting, a full 4612 out of 4689 patients had temperature in the data set!
The next item is heart rate. A similar process is used to glean this information.
# Load data
heartrate <- parse_data(con, "CHARTEVENTS", c(211, 220045), hadm_list, '../cache/heartrate.RData', T, c(20, 400))
heartrate_first <- heartrate[,1:5]
heartrate_last <- heartrate[,6:10]
# Add to data object
data <- merge(data, heartrate_first[,!names(heartrate_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, heartrate_last[,!names(heartrate_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'heartrate_t1', 'heartrate_1', 'heartrate_t2', 'heartrate_2')
# Plot histograms
data_melt <- melt(data[,c('heartrate_1', 'heartrate_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) +
geom_density(alpha=0.5) +
scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
ggtitle('Distribution of Heart Rate in Septic Patients') +
ylab('Kernel Density Estimate') + xlab('Heart Rate (bpm)')
print(fig)
# Print summary statistics
kable(t(rbind(describe(data$heartrate_1), describe(data$heartrate_2))),
caption='Descriptive statistics of heart rate',
col.names=c('Admission', 'Discharge/Death'),
digits=2) %>%
kable_styling(bootstrap_options = 'striped')
 | Admission | Discharge/Death |
---|---|---|
vars | 1.00 | 1.00 |
n | 4635.00 | 4635.00 |
mean | 95.04 | 88.63 |
sd | 21.28 | 19.84 |
median | 94.00 | 87.00 |
trimmed | 94.32 | 87.65 |
mad | 22.24 | 19.27 |
min | 21.00 | 21.00 |
max | 180.00 | 200.00 |
range | 159.00 | 179.00 |
skew | 0.32 | 0.55 |
kurtosis | -0.19 | 1.03 |
se | 0.31 | 0.29 |
4635 patients had heart rate in the data set.
The pattern is again repeated for respiratory rate.
# Load data
resprate <- parse_data(con, "CHARTEVENTS", c(618, 220210), hadm_list, '../cache/resprate.RData', T, c(0, 100))
resprate_first <- resprate[,1:5]
resprate_last <- resprate[,6:10]
# Add to data object
data <- merge(data, resprate_first[,!names(resprate_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, resprate_last[,!names(resprate_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'resprate_t1', 'resprate_1', 'resprate_t2', 'resprate_2')
# Plot histograms
data_melt <- melt(data[,c('resprate_1', 'resprate_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) +
geom_density(alpha=0.5) +
scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
ggtitle('Distribution of Respiratory Rate in Septic Patients') +
ylab('Kernel Density Estimate') + xlab('Respiratory Rate (bpm)')
print(fig)
# Print summary statistics
kable(t(rbind(describe(data$resprate_1), describe(data$resprate_2))),
caption='Descriptive statistics of respiratory rate',
col.names=c('Admission', 'Discharge/Death'),
digits=2) %>%
kable_styling(bootstrap_options = 'striped')
 | Admission | Discharge/Death |
---|---|---|
vars | 1.00 | 1.00 |
n | 4631.00 | 4631.00 |
mean | 21.22 | 21.14 |
sd | 6.43 | 6.56 |
median | 20.00 | 21.00 |
trimmed | 20.80 | 20.82 |
mad | 5.93 | 5.93 |
min | 2.00 | 2.00 |
max | 65.00 | 99.00 |
range | 63.00 | 97.00 |
skew | 0.82 | 1.02 |
kurtosis | 1.81 | 5.83 |
se | 0.09 | 0.10 |
4631 patients had respiratory rate in the data set.
An alternative to respiratory rate is the partial pressure of CO_2_ in the arterial blood (PaCO_2_).
# Load data
paco2 <- parse_data(con, "LABEVENTS", c(50818), hadm_list, '../cache/paco2.RData', T, c(1, 150))
paco2_first <- paco2[,1:5]
paco2_last <- paco2[,6:10]
# Add to data object
data <- merge(data, paco2_first[,!names(paco2_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, paco2_last[,!names(paco2_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'paco2_t1', 'paco2_1', 'paco2_t2', 'paco2_2')
# Plot histograms
data_melt <- melt(data[,c('paco2_1', 'paco2_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) +
geom_density(alpha=0.5) +
scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
ggtitle('Distribution of PaCO_2_ in Septic Patients') +
ylab('Kernel Density Estimate') + xlab('PaCO_2_ (mmHg)')
print(fig)
# Print summary statistics
kable(t(rbind(describe(data$paco2_1), describe(data$paco2_2))),
caption='Descriptive statistics of PaCO_2_',
col.names=c('Admission', 'Discharge/Death'),
digits=2) %>%
kable_styling(bootstrap_options = 'striped')
 | Admission | Discharge/Death |
---|---|---|
vars | 1.00 | 1.00 |
n | 3856.00 | 3856.00 |
mean | 41.20 | 41.55 |
sd | 13.59 | 12.46 |
median | 39.00 | 39.00 |
trimmed | 39.67 | 40.23 |
mad | 10.38 | 8.90 |
min | 11.00 | 12.00 |
max | 124.00 | 141.00 |
range | 113.00 | 129.00 |
skew | 1.52 | 1.79 |
kurtosis | 4.05 | 6.85 |
se | 0.22 | 0.20 |
3856 patients had PaCO_2_ in the data set.
Finally, the same pattern is repeated for the white blood cell count.
# Load data
wbccount <- parse_data(con, "LABEVENTS", c(51301), hadm_list, '../cache/wbccount.RData', T, c(0, 100))
wbccount_first <- wbccount[,1:5]
wbccount_last <- wbccount[,6:10]
# Add to data object
data <- merge(data, wbccount_first[,!names(wbccount_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, wbccount_last[,!names(wbccount_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'wbccount_t1', 'wbccount_1', 'wbccount_t2', 'wbccount_2')
# Plot histograms
data_melt <- melt(data[,c('wbccount_1', 'wbccount_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) +
geom_density(alpha=0.5) +
scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
ggtitle('Distribution of White Blood Cell Count in Septic Patients') +
ylab('Kernel Density Estimate') + xlab('WBC Count (10^9^/L)')
print(fig)
# Print summary statistics
kable(t(rbind(describe(data$wbccount_1), describe(data$wbccount_2))),
caption='Descriptive statistics of WBC count',
col.names=c('Admission', 'Discharge/Death'),
digits=2) %>%
kable_styling(bootstrap_options = 'striped')
 | Admission | Discharge/Death |
---|---|---|
vars | 1.00 | 1.00 |
n | 4683.00 | 4683.00 |
mean | 14.02 | 11.97 |
sd | 9.45 | 8.96 |
median | 12.30 | 9.70 |
trimmed | 12.97 | 10.52 |
mad | 7.26 | 5.04 |
min | 0.10 | 0.10 |
max | 99.80 | 95.80 |
range | 99.70 | 95.70 |
skew | 2.23 | 2.98 |
kurtosis | 11.04 | 14.53 |
se | 0.14 | 0.13 |
After extracting each of the five SIRS variables, we are left with 3787 subjects with all five variables collected! Below, these are computed into SIRS scores.
Data is scored iteratively based on the SIRS parameters listed above.
# preallocate SIRS
data$sirs_1 <- matrix(0, nrow=dim(data)[1], ncol=1)
data$sirs_2 <- matrix(0, nrow=dim(data)[1], ncol=1)
# iteratively score SIRS
for(i in which(complete.cases(data))){
if(data[i, 'temperature_1'] < 36 | data[i, 'temperature_1'] > 38){data[i, 'sirs_1'] <- data[i, 'sirs_1']+1}
if(data[i, 'temperature_2'] < 36 | data[i, 'temperature_2'] > 38){data[i, 'sirs_2'] <- data[i, 'sirs_2']+1}
if(data[i, 'heartrate_1'] > 90){data[i, 'sirs_1'] <- data[i, 'sirs_1']+1}
if(data[i, 'heartrate_2'] > 90){data[i, 'sirs_2'] <- data[i, 'sirs_2']+1}
if(data[i, 'resprate_1'] > 20){data[i, 'sirs_1'] <- data[i, 'sirs_1']+1}
if(data[i, 'resprate_2'] > 20){data[i, 'sirs_2'] <- data[i, 'sirs_2']+1}
if(data[i, 'paco2_1'] < 32){data[i, 'sirs_1'] <- data[i, 'sirs_1']+1}
if(data[i, 'paco2_2'] < 32){data[i, 'sirs_2'] <- data[i, 'sirs_2']+1}
if(data[i, 'wbccount_1'] > 12 | data[i, 'wbccount_1'] < 4){data[i, 'sirs_1'] <- data[i, 'sirs_1']+1}
if(data[i, 'wbccount_2'] > 12 | data[i, 'wbccount_2'] < 4){data[i, 'sirs_2'] <- data[i, 'sirs_2']+1}
}
# turn non-complete cases to NA
data[!complete.cases(data), c('sirs_1', 'sirs_2')] <- NA
# display a sample of the dataset so far
kable(head(data), caption='Sample of the extracted dataset so far') %>%
kable_styling(bootstrap_options = 'striped')
hadm_id | subject_id | admittime | dischtime | death | icd9_code | severity | temperature_t1 | temperature_1 | temperature_t2 | temperature_2 | heartrate_t1 | heartrate_1 | heartrate_t2 | heartrate_2 | resprate_t1 | resprate_1 | resprate_t2 | resprate_2 | paco2_t1 | paco2_1 | paco2_t2 | paco2_2 | wbccount_t1 | wbccount_1 | wbccount_t2 | wbccount_2 | sirs_1 | sirs_2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100028 | 53456 | 2142-12-23 18:06:00 | 2142-12-30 15:45:00 | FALSE | 99591 | 1 | 2142-12-23 | 37.11111 | 2142-12-25 | 37.94444 | 2142-12-23 | 87 | 2142-12-25 | 73 | 2142-12-23 | 13 | 2142-12-25 | 18 | 2142-12-25 | 48 | 2142-12-25 | 48 | 2142-12-23 | 31.3 | 2142-12-30 | 11.5 | 1 | 0 |
100074 | 26885 | 2176-04-09 21:03:00 | 2176-04-12 14:30:00 | TRUE | 78552 | 3 | 2176-04-10 | 38.40000 | 2176-04-12 | 37.50000 | 2176-04-10 | 92 | 2176-04-12 | 106 | 2176-04-10 | 25 | 2176-04-12 | 22 | 2176-04-09 | 28 | 2176-04-12 | 39 | 2176-04-09 | 19.2 | 2176-04-12 | 48.4 | 5 | 3 |
100104 | 73946 | 2201-06-21 19:08:00 | 2201-07-03 16:15:00 | FALSE | 78552 | 3 | 2201-06-21 | 37.77778 | 2201-06-29 | 36.88889 | 2201-06-21 | 123 | 2201-06-29 | 114 | 2201-06-21 | 24 | 2201-06-29 | 30 | 2201-06-21 | 37 | 2201-06-29 | 41 | 2201-06-21 | 7.4 | 2201-07-03 | 18.8 | 2 | 3 |
100117 | 14297 | 2166-04-30 19:05:00 | 2166-05-09 15:56:00 | FALSE | 99592 | 2 | 2166-04-30 | 37.61111 | 2166-05-04 | 37.77778 | 2166-04-30 | 110 | 2166-05-04 | 127 | 2166-04-30 | 28 | 2166-05-04 | 25 | 2166-05-01 | 37 | 2166-05-01 | 37 | 2166-04-30 | 23.8 | 2166-05-09 | 10.9 | 3 | 2 |
100118 | 53749 | 2198-09-07 15:34:00 | 2198-09-11 18:24:00 | FALSE | 78552 | 3 | 2198-09-07 | 36.33333 | 2198-09-09 | 36.72222 | 2198-09-07 | 69 | 2198-09-09 | 64 | 2198-09-07 | 14 | 2198-09-09 | 18 | NA | NA | NA | NA | 2198-09-07 | 19.1 | 2198-09-11 | 9.2 | NA | NA |
100141 | 61215 | 2173-10-01 01:25:00 | 2173-10-20 12:15:00 | FALSE | 99591 | 1 | 2173-10-01 | 37.50000 | 2173-10-10 | 36.11111 | 2173-10-01 | 87 | 2173-10-10 | 83 | 2173-10-01 | 17 | 2173-10-10 | 17 | 2173-09-30 | 46 | 2173-10-07 | 54 | 2173-09-30 | 11.5 | 2173-10-12 | 10.2 | 0 | 0 |
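The row-by-row scoring loop above can also be expressed in vectorized form, which is faster and less error-prone on a data frame of this size. The helper below is an illustrative sketch (not part of the original analysis): logical comparisons are coerced to 0/1 and summed, and any row containing an NA correctly yields an NA score.

```r
# Illustrative vectorized SIRS scoring (hypothetical helper, same cutoffs as the loop above)
score_sirs <- function(temp, hr, rr, paco2, wbc) {
  rowSums(cbind(temp < 36 | temp > 38,   # temperature criterion
                hr > 90,                 # heart rate criterion
                rr > 20,                 # respiratory rate criterion
                paco2 < 32,              # PaCO2 criterion
                wbc > 12 | wbc < 4))     # WBC criterion
}
# A febrile, tachycardic patient with normal RR, PaCO2, and WBC scores 2:
score_sirs(temp=39, hr=110, rr=16, paco2=40, wbc=9)  # 2
```

Applied to the full data frame, this would be called as score_sirs(data$temperature_1, data$heartrate_1, ...) for each time point.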
Many of the parameters in the SOFA score were not as regularly collected, nor as straightforward to extract. Even so, I gathered two markers of organ dysfunction: bilirubin, which is used as an index of liver function, and creatinine, which reflects kidney function. Other potential predictors, such as the Glasgow Coma Scale and mean arterial pressure, would have been convenient to have, but were so fragmented within the data set that reliably isolating them was impractical.
# Load data
bilirubin <- parse_data(con, "LABEVENTS", c(50885), hadm_list, '../cache/bilirubin.RData', T, c(0, 100))
bilirubin_first <- bilirubin[,1:5]
bilirubin_last <- bilirubin[,6:10]
# Add to data object
data <- merge(data, bilirubin_first[,!names(bilirubin_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, bilirubin_last[,!names(bilirubin_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'bilirubin_t1', 'bilirubin_1', 'bilirubin_t2', 'bilirubin_2')
# Plot histograms
data_melt <- melt(data[,c('bilirubin_1', 'bilirubin_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) +
geom_density(alpha=0.5) +
scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
ggtitle('Distribution of Bilirubin Levels in Septic Patients') +
ylab('Kernel Density Estimate') + xlab('Bilirubin (mg/dL)')
print(fig)
# Print summary statistics
kable(t(rbind(describe(data$bilirubin_1), describe(data$bilirubin_2))),
caption='Descriptive statistics of bilirubin',
col.names=c('Admission', 'Discharge/Death'),
digits=2) %>%
kable_styling(bootstrap_options = 'striped')
 | Admission | Discharge/Death |
---|---|---|
vars | 1.00 | 1.00 |
n | 4336.00 | 4336.00 |
mean | 2.09 | 2.68 |
sd | 4.52 | 6.53 |
median | 0.70 | 0.60 |
trimmed | 1.03 | 1.01 |
mad | 0.59 | 0.44 |
min | 0.10 | 0.10 |
max | 60.50 | 66.00 |
range | 60.40 | 65.90 |
skew | 5.05 | 4.45 |
kurtosis | 31.95 | 22.67 |
se | 0.07 | 0.10 |
The same process is repeated for creatinine.
# Load data
creatinine <- parse_data(con, "LABEVENTS", c(50912), hadm_list, '../cache/creatinine.RData', T, c(0, 100))
creatinine_first <- creatinine[,1:5]
creatinine_last <- creatinine[,6:10]
# Add to data object
data <- merge(data, creatinine_first[,!names(creatinine_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, creatinine_last[,!names(creatinine_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'creatinine_t1', 'creatinine_1', 'creatinine_t2', 'creatinine_2')
# Plot histograms
data_melt <- melt(data[,c('creatinine_1', 'creatinine_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) +
geom_density(alpha=0.5) +
scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
ggtitle('Distribution of Creatinine Levels in Septic Patients') +
ylab('Kernel Density Estimate') + xlab('Creatinine (mg/dL)')
print(fig)
# Print summary statistics
kable(t(rbind(describe(data$creatinine_1), describe(data$creatinine_2))),
caption='Descriptive statistics of creatinine',
col.names=c('Admission', 'Discharge/Death'),
digits=2) %>%
kable_styling(bootstrap_options = 'striped')
 | Admission | Discharge/Death |
---|---|---|
vars | 1.00 | 1.00 |
n | 4682.00 | 4682.00 |
mean | 1.98 | 1.71 |
sd | 1.77 | 1.62 |
median | 1.40 | 1.10 |
trimmed | 1.62 | 1.38 |
mad | 0.89 | 0.74 |
min | 0.10 | 0.10 |
max | 17.70 | 14.60 |
range | 17.60 | 14.50 |
skew | 2.83 | 2.47 |
kurtosis | 11.15 | 7.91 |
se | 0.03 | 0.02 |
While not directly in the SIRS or SOFA score rubric, lactate is an important marker in determining metabolic dysfunction, especially in shock (both septic and otherwise).
# Load data
lactate <- parse_data(con, "LABEVENTS", c(50813), hadm_list, '../cache/lactate.RData', T, c(0, 100))
lactate_first <- lactate[,1:5]
lactate_last <- lactate[,6:10]
# Add to data object
data <- merge(data, lactate_first[,!names(lactate_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, lactate_last[,!names(lactate_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'lactate_t1', 'lactate_1', 'lactate_t2', 'lactate_2')
# Plot histograms
data_melt <- melt(data[,c('lactate_1', 'lactate_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) +
geom_density(alpha=0.5) +
scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
ggtitle('Distribution of Lactate Levels in Septic Patients') +
ylab('Kernel Density Estimate') + xlab('Lactate (mmol/L)')
print(fig)
# Print summary statistics
kable(t(rbind(describe(data$lactate_1), describe(data$lactate_2))),
caption='Descriptive statistics of lactate',
col.names=c('Admission', 'Discharge/Death'),
digits=2) %>%
kable_styling(bootstrap_options = 'striped')
 | Admission | Discharge/Death |
---|---|---|
vars | 1.00 | 1.00 |
n | 4539.00 | 4539.00 |
mean | 2.89 | 2.61 |
sd | 2.30 | 2.99 |
median | 2.20 | 1.60 |
trimmed | 2.46 | 1.92 |
mad | 1.33 | 0.89 |
min | 0.30 | 0.07 |
max | 24.20 | 28.80 |
range | 23.90 | 28.73 |
skew | 2.67 | 3.58 |
kurtosis | 10.64 | 16.28 |
se | 0.03 | 0.04 |
The same process is repeated for potassium.
# Load data
potassium <- parse_data(con, "LABEVENTS", c(50971), hadm_list, '../cache/potassium.RData', T, c(0, 100))
potassium_first <- potassium[,1:5]
potassium_last <- potassium[,6:10]
# Add to data object
data <- merge(data, potassium_first[,!names(potassium_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, potassium_last[,!names(potassium_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'potassium_t1', 'potassium_1', 'potassium_t2', 'potassium_2')
# Plot histograms
data_melt <- melt(data[,c('potassium_1', 'potassium_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) +
geom_density(alpha=0.5) +
scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
ggtitle('Distribution of Potassium Level in Septic Patients') +
ylab('Kernel Density Estimate') + xlab('Potassium (mEq/L)')
print(fig)
# Print summary statistics
kable(t(rbind(describe(data$potassium_1), describe(data$potassium_2))),
caption='Descriptive statistics of potassium',
col.names=c('Admission', 'Discharge/Death'),
digits=2) %>%
kable_styling(bootstrap_options = 'striped')
 | Admission | Discharge/Death |
---|---|---|
vars | 1.00 | 1.00 |
n | 4680.00 | 4680.00 |
mean | 4.35 | 4.18 |
sd | 0.93 | 0.68 |
median | 4.20 | 4.10 |
trimmed | 4.26 | 4.11 |
mad | 0.74 | 0.59 |
min | 1.80 | 2.00 |
max | 9.10 | 9.10 |
range | 7.30 | 7.10 |
skew | 1.25 | 1.39 |
kurtosis | 2.93 | 4.35 |
se | 0.01 | 0.01 |
Finally, calcium is extracted in the same manner.
# Load data
calcium <- parse_data(con, "LABEVENTS", c(50893), hadm_list, '../cache/calcium.RData', T, c(0, 100))
calcium_first <- calcium[,1:5]
calcium_last <- calcium[,6:10]
# Add to data object
data <- merge(data, calcium_first[,!names(calcium_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, calcium_last[,!names(calcium_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'calcium_t1', 'calcium_1', 'calcium_t2', 'calcium_2')
# Plot histograms
data_melt <- melt(data[,c('calcium_1', 'calcium_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) +
geom_density(alpha=0.5) +
scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
ggtitle('Distribution of Calcium Level in Septic Patients') +
ylab('Kernel Density Estimate') + xlab('Calcium (mg/dL)')
print(fig)
# Print summary statistics
kable(t(rbind(describe(data$calcium_1), describe(data$calcium_2))),
caption='Descriptive statistics of calcium',
col.names=c('Admission', 'Discharge/Death'),
digits=2) %>%
kable_styling(bootstrap_options = 'striped')
 | Admission | Discharge/Death |
---|---|---|
vars | 1.00 | 1.00 |
n | 4656.00 | 4656.00 |
mean | 8.18 | 8.38 |
sd | 1.04 | 0.91 |
median | 8.20 | 8.40 |
trimmed | 8.18 | 8.35 |
mad | 0.89 | 0.74 |
min | 2.90 | 3.80 |
max | 14.60 | 19.90 |
range | 11.70 | 16.10 |
skew | 0.05 | 1.23 |
kurtosis | 1.98 | 12.07 |
se | 0.02 | 0.01 |
The MIMIC-III data set also contains a wealth of free-text data. In light of this, I will practice some text mining (TM) and natural language processing (NLP) techniques to gather potentially relevant data in addition to the physiological data collected above.
To start with a clean slate, I again remove the intermediate objects that were generated while assembling the physiologic data (stored in the object data
).
rm(list=setdiff(ls(), c('hadm_list', 'con', 'data', 'plotHistogram', 'plotPieChart', 'plotBarChart')))
Then, I read in all of the notes from the NOTEEVENTS
table. Because this is such a large data set, the following code block is not evaluated.
notes <- dbGetQuery(con, sprintf("SELECT *
FROM NOTEEVENTS
WHERE HADM_ID IN (%s)",
paste(as.character(hadm_list), collapse=", ")))
save(file='../cache/notes.RData', notes)
Instead, the cache file is read in.
load('../cache/notes.RData')
Virtually every patient has a multitude of different notes in their chart, from radiology and ultrasonography reports to nurses' notes and discharge instructions. My initial thought was to examine the admission and discharge notes separately, as with the physiologic data above; however, there did not appear to be many "admission" notes present. Instead, I opted to pool all of the nursing and physician notes for a given patient into a single document in order to maximize the amount of text fed into the TM/NLP model. I also included the discharge summary, since it ensures that every patient has at least one note.
pooled_notes <- data.frame(matrix(NA, nrow=length(hadm_list), ncol=2))
colnames(pooled_notes) <- c('doc_id', 'text')
for(i in 1:length(hadm_list)){
pooled_notes[i, 'doc_id'] <- hadm_list[i]
pooled_notes[i, 'text'] <- paste(notes[notes$hadm_id == hadm_list[i] &
notes$category %in%
c('Nursing/other', 'Nursing', 'Physician', 'Discharge summary'), 'text'],
collapse=" ")
}
We can visualize what the combined notes from a single patient look like. I display the first 500 characters. As can be seen, this text will require a significant amount of cleaning later on to remove the many numbers, punctuation marks, and white space.
print(substr( pooled_notes[1,'text'], 1, 500 ))
## [1] "Admission Date: [**2209-7-31**] Discharge Date: [**2209-8-20**]\n\nDate of Birth: [**2146-1-30**] Sex: M\n\nService: MEDICINE\n\nAllergies:\nAspirin\n\nAttending:[**First Name3 (LF) 2195**]\nChief Complaint:\n\"sepsis, pneumonia\"\n\n\nMajor Surgical or Invasive Procedure:\nnone\n\nHistory of Present Illness:\nThis is a 63 yo Cantonese-only speaking male with h/o type 2\ndiabetes, hypertension, medullary sponge kidney (with chronic\nrenal insufficiency) who now presents minimally respon"
The VCorpus()
function creates a “volatile corpus,” which is kept in memory and is thus easier to create and manipulate when using large datasets.
corpus <- VCorpus(DataframeSource(pooled_notes))
Next, we can convert all of the text within the corpus to lower-case. Note that due to the large size of the corpus document, the steps in this section were evaluated once, cached, and then the cache loaded at the end.
corpus <- tm_map(corpus, content_transformer(tolower))
When cleaning, we can also remove stop words - common words such as “and” and “the” which may skew any frequency-based analytics attempted later on. A list of common stop words:
print(stopwords('english'))
They can be removed from the corpus using the same tm_map()
function used above for case conversion. Punctuation, numbers, and extra white space are also removed.
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
As a last cleaning step, we can "stem" the corpus. Stemming refers to isolating the main root (or "stem") of a word while ignoring common suffixes. For example, the word "analytics" might be considered equal to "analysis," depending on the algorithm used, since both share the root "analy." For the purposes of this assignment, I use Porter's algorithm as implemented in the package SnowballC
.
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, PlainTextDocument)
save(corpus, file='../cache/corpus.RData')
As a reminder, the above steps were evaluated once and then the results saved. Evaluation resumed starting with this next code chunk:
load('../cache/corpus.RData')
print(substr( corpus[[1]], 1, 500 ))
## [1] "admiss date discharg date date birth sex m servic medicin allergi aspirin attendingfirst name lf chief complaint sepsi pneumonia major surgic invas procedur none histori present ill yo cantones speak male ho type diabet hypertens medullari spong kidney chronic renal insuffici now present minim respons letharg home patient seen weekend famili unusu went check today found fetal posit bed call em blood glucos s patient co ill prior found minim respons ed initi vs left ij place access pt fever ed pt"
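To see concretely what the stemmer does, SnowballC's wordStem() function (the routine that stemDocument() calls under the hood) can be applied directly to a few example words:

```r
library('SnowballC')
# Porter stemming collapses inflected forms onto a common root
wordStem(c("running", "runs", "analysis", "analytics", "infection", "infections"))
# "run" "run" "analysi" "analyt" "infect" "infect"
```

Note that stems need not be dictionary words; they only need to be consistent across inflections of the same root.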
Now that the documents within the corpus have been cleaned, I create a Document Term Matrix (DTM). A DTM is an \(n\)-by-\(w\) matrix, with \(n\) rows representing the \(n\) documents and \(w\) columns representing unique words. Each element holds the frequency of a given word within a given document. A DTM can be trimmed to isolate only those words that occur a given number of times in the corpus, thus ignoring rare words that might appear only a couple of times in the entire data set.
dtm <- DocumentTermMatrix(corpus)
save(dtm, file='../cache/dtm.RData')
Additionally, since we cleared the metadata of the corpus when we converted it to plain-text, we can label it again using the hadm_id
.
load('../cache/dtm.RData')
dtm$dimnames$Docs <- as.character(hadm_list)
print(dtm)
## <<DocumentTermMatrix (documents: 4689, terms: 172797)>>
## Non-/sparse entries: 4310549/805934584
## Sparsity : 99%
## Maximal term length: 81
## Weighting : term frequency (tf)
This summary of the DTM informs us that there are 172797 unique words in the corpus. Additionally, the sparsity of the DTM tells us how many "empty" elements are in the array; in this case, such high sparsity means that a large number of words occur very infrequently. Words with sparsity above a given cutoff can be removed using the removeSparseTerms() function.
dtm_sparse <- removeSparseTerms(dtm, 0.9)
dtm_matrix <- data.frame(as.matrix(dtm_sparse))
By removing terms which have a sparsity > 0.9, I only retain those words that occur in > 10% of the documents. This effectively reduces our feature set from 172797 terms to 2081.
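As a small, self-contained illustration of how DocumentTermMatrix() and removeSparseTerms() behave, consider a hypothetical three-document corpus:

```r
library('tm')
toy <- VCorpus(VectorSource(c("fever and sepsis",
                              "fever resolved",
                              "renal failure")))
toy_dtm <- DocumentTermMatrix(toy)
# "fever" appears in 2 of 3 documents (sparsity 1/3);
# every other term appears in 1 of 3 documents (sparsity 2/3)
as.matrix(toy_dtm)
# Dropping terms with sparsity above 0.5 retains only "fever"
Terms(removeSparseTerms(toy_dtm, 0.5))  # "fever"
```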
After creating and simplifying the document term matrix, we can visualize the 100 most common words using a word cloud.
wordcloud(corpus, max.words=100, scale = c(3, 0.2), random.order=F, colors=brewer.pal(6, "Spectral"), rot.per=0.35)
After both the physiological and text-derived predictors have been isolated, I split the data into training and testing data sets. Before splitting, I first separate the physiological predictors by time points (phy1_data
and phy2_data
). Then, I add the death
column into the text-derived predictors. Finally, I pool the two datasets to create grp1_data
and grp2_data
.
# Clean slate
rm(list=setdiff(ls(), c('hadm_list', 'con', 'data', 'dtm_matrix', 'plotHistogram', 'plotPieChart', 'plotBarChart')))
# turn death into a factor
data$death = as.factor(data$death)
# Extract physiologic data collected near admission
phy1_data <- data[,c('hadm_id', 'death', 'severity', 'temperature_1', 'heartrate_1', 'resprate_1', 'paco2_1', 'wbccount_1', 'sirs_1', 'bilirubin_1', 'creatinine_1', 'lactate_1', 'potassium_1', 'calcium_1')]
# Extract physiologic data data collected near discharge
phy2_data <- data[,c('hadm_id', 'death', 'severity', 'temperature_2', 'heartrate_2', 'resprate_2', 'paco2_2', 'wbccount_2', 'sirs_2', 'bilirubin_2', 'creatinine_2', 'lactate_2', 'potassium_2', 'calcium_2')]
# Add hadm_id and death to dtm_matrix
dtm_matrix$hadm_id = row.names(dtm_matrix)
dtm_matrix <- merge(dtm_matrix[,!names(dtm_matrix) == 'death'], data[,c('hadm_id', 'death')], by='hadm_id', all.x=T)
# Merge admission/discharge physiologic data to the dtm matrix
grp1_data <- merge(phy1_data, dtm_matrix, by=c('hadm_id', 'death'), all=T)
grp2_data <- merge(phy2_data, dtm_matrix, by=c('hadm_id', 'death'), all=T)
Since the number of subjects included in the cohort is so large, I chose a 75/25% training/testing split. First, the list of admission identifiers (hadm_list
) is subset using the sample()
function. Then, the physiologic data (data
) and text-derived data (dtm_matrix
) are subset based on the admission identifiers.
A brief summary of the different constructed datasets:
* phy1: physiologic (quantitative) data near hospital admission
* phy2: physiologic (quantitative) data near discharge or death
* text: text-derived word frequency data from the DTM
* grp1: grouped physiologic (near admission) and text-derived data
* grp2: grouped physiologic (near discharge) and text-derived data
subset_indices <- sample(length(hadm_list), 0.25*length(hadm_list))
hadm_test <- hadm_list[subset_indices]
hadm_train <- hadm_list[-subset_indices]
phy1_test <- phy1_data[phy1_data$hadm_id %in% hadm_test,]
phy1_train <- phy1_data[phy1_data$hadm_id %in% hadm_train,]
phy2_test <- phy2_data[phy2_data$hadm_id %in% hadm_test,]
phy2_train <- phy2_data[phy2_data$hadm_id %in% hadm_train,]
text_test <- dtm_matrix[dtm_matrix$hadm_id %in% hadm_test,]
text_train <- dtm_matrix[dtm_matrix$hadm_id %in% hadm_train,]
grp1_test <- grp1_data[grp1_data$hadm_id %in% hadm_test,]
grp1_train <- grp1_data[grp1_data$hadm_id %in% hadm_train,]
grp2_test <- grp2_data[grp2_data$hadm_id %in% hadm_test,]
grp2_train <- grp2_data[grp2_data$hadm_id %in% hadm_train,]
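As a quick sanity check on the partitioning (a sketch, assuming the hadm_test, hadm_train, and hadm_list objects created above), we can confirm that the two sets of admission identifiers are disjoint and that the test set holds roughly a quarter of the cohort:

```r
# No admission identifier should appear in both partitions
stopifnot(length(intersect(hadm_test, hadm_train)) == 0)
# Every admission should fall into exactly one partition
stopifnot(length(hadm_test) + length(hadm_train) == length(hadm_list))
# The test fraction should be approximately 25%
length(hadm_test) / length(hadm_list)
```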
To recap, in the previous section I built the SQL database, isolated the target population of sepsis patients, combed the database to collect as many physiologic parameters from the SIRS & SOFA scoring system as was feasible, and used a variety of free-text notes to generate a Document Term Matrix containing word frequency in the physician and nursing notes. In this section I will attempt to use these variables to classify subjects by survival, thus framing the outcome as a binary classification model.
I chose to explore two types of classification models: recursive classification via decision tree and Naive Bayesian classification. These two methods were chosen due to their seemingly opposing nature: while decision trees select variables which allow the highest information gain in order to split (DSPA Lecture 8), the Naive Bayes model assumes equal importance of each variable but is often used as a “powerful text classifier” (DSPA Lecture 7).
Because of the large feature set derived from the free-text data, I chose to first evaluate the performance of decision tree models, as they may provide insight into which terms from the DTM are most useful in differentiating between subjects.
Decision tree models are also an attractive starting point, as the resulting model is a human-readable algorithm (tree) that is similar to many of the goal-directed clinical decision algorithms currently in place (for example, the advanced cardiovascular life support algorithm for cardiac arrest described by Neumar et al., 2010).
While decision tree models work through recursive partitioning - i.e. splitting aggregated data into smaller and smaller groups based on some optimization function - there are many different methods for doing so.
I chose the rpart package because it implements the CART (Classification And Regression Tree) algorithm, a supervised learning method that is limited to making binary splits (Therneau & Atkinson, 2018). This is in contrast to approaches such as CHAID (Chi-square Automatic Interaction Detector), which permit multi-way splits.
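As a quick illustration of CART's binary splitting, consider a tree fit to the small kyphosis dataset that ships with rpart (this example is unrelated to the MIMIC data):

```r
library('rpart')
# Toy illustration of CART on the 'kyphosis' dataset bundled with rpart:
# every internal node of the printed tree splits into exactly two children.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = 'class')
print(fit)
```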
The first step was to investigate the physiologic predictors. I removed hadm_id and severity from the feature set, then created two rpart decision trees.
First, I input the physiologic data collected nearest admission. Recall that this data represents the first available result during a patient’s stay. Models that can predict mortality at this stage would be the most clinically useful; however, any signal present is likely at its weakest, since much can change between the test results at this stage and the end of a patient’s stay.
The model is built using the rpart() function and the resulting tree is plotted. The fitted model is then applied to the testing dataset using the predict() function. Note that in the visualization of the tree, FALSE indicates that the patient survived to discharge and TRUE indicates that the patient died.
phy1_rpart <- rpart('death~.', data=phy1_train[, -c(1, 3)])
fancyRpartPlot(phy1_rpart)
phy1_rpart_pred <- predict(phy1_rpart, phy1_test, type='class')
cm <- confusionMatrix(phy1_rpart_pred, phy1_test$death)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 694 320
## TRUE 77 81
##
## Accuracy : 0.6613
## 95% CI : (0.6334, 0.6884)
## No Information Rate : 0.6578
## P-Value [Acc > NIR] : 0.4159
##
## Kappa : 0.1195
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9001
## Specificity : 0.2020
## Pos Pred Value : 0.6844
## Neg Pred Value : 0.5127
## Prevalence : 0.6578
## Detection Rate : 0.5922
## Detection Prevalence : 0.8652
## Balanced Accuracy : 0.5511
##
## 'Positive' Class : FALSE
##
Based on this model, the accuracy is reported as 0.66 with a sensitivity of 0.90 and a specificity of 0.20. Note that this accuracy does not significantly exceed the no-information rate of 0.66 (p = 0.42). Considering the model splits solely on lactate and temperature, this outcome seems reasonable.
We can also create a model for those values recorded shortly before discharge or death.
phy2_rpart <- rpart('death~.', data=phy2_train[, -c(1, 3)])
fancyRpartPlot(phy2_rpart)
phy2_rpart_pred <- predict(phy2_rpart, phy2_test, type='class')
cm <- confusionMatrix(phy2_rpart_pred, phy2_test$death)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 664 118
## TRUE 107 283
##
## Accuracy : 0.808
## 95% CI : (0.7843, 0.8302)
## No Information Rate : 0.6578
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5707
## Mcnemar's Test P-Value : 0.505
##
## Sensitivity : 0.8612
## Specificity : 0.7057
## Pos Pred Value : 0.8491
## Neg Pred Value : 0.7256
## Prevalence : 0.6578
## Detection Rate : 0.5666
## Detection Prevalence : 0.6672
## Balanced Accuracy : 0.7835
##
## 'Positive' Class : FALSE
##
As might be expected, this model performs much better, with a reported accuracy of 0.81! It is considerably more complicated, using many more predictors across a number of splits. However, this model would be of limited clinical usefulness, since parameters measured shortly before discharge or death come too late to guide early treatment.
We can also try to construct an rpart model based solely on the features derived from the collected notes.
text_rpart <- rpart('death~.', data=text_train[, -c(1, 3)])
fancyRpartPlot(text_rpart)
text_rpart_pred <- predict(text_rpart, text_test, type='class')
cm <- confusionMatrix(text_rpart_pred, text_test$death)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 748 21
## TRUE 23 380
##
## Accuracy : 0.9625
## 95% CI : (0.9499, 0.9726)
## No Information Rate : 0.6578
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9167
## Mcnemar's Test P-Value : 0.8802
##
## Sensitivity : 0.9702
## Specificity : 0.9476
## Pos Pred Value : 0.9727
## Neg Pred Value : 0.9429
## Prevalence : 0.6578
## Detection Rate : 0.6382
## Detection Prevalence : 0.6561
## Balanced Accuracy : 0.9589
##
## 'Positive' Class : FALSE
##
At first glance, the text-based model appears to perform spectacularly, with an accuracy of 0.96! However, the resulting tree reveals several sources of outcome leakage. For example, the presence of the stemmed term expir (likely from some form of expire) is associated with death, while discharg and followup are associated with survival. These terms appear in discharge summaries precisely because the outcome is already known, so this clearly isn’t the intended result.
To remedy this issue, I repeated the DTM processing above without including the discharge notes. This will be the data used for the remainder of the assignment.
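The reprocessing step amounts to filtering the notes table by note category before building the corpus and DTM. A hedged sketch of that filtering, using a hypothetical notes data frame with category and text columns as a stand-in for the real notes table:

```r
library('tm')
library('SnowballC')
# Hedged sketch of rebuilding the DTM without discharge summaries.
# 'notes' is a hypothetical stand-in for the real MIMIC notes table.
notes <- data.frame(
  category = c('Nursing', 'Physician', 'Discharge summary'),
  text     = c('patient resting in chair',
               'started vasopressin for hypotension',
               'patient expired prior to discharge'),
  stringsAsFactors = FALSE)
kept   <- notes[notes$category != 'Discharge summary', ]  # drop discharge notes
corpus <- VCorpus(VectorSource(kept$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)
dtm    <- DocumentTermMatrix(corpus)                      # leakage terms excluded
```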
rm(list=c('text_test', 'text_train', 'grp1_test', 'grp1_train', 'grp2_test', 'grp2_train'))
load('../cache/test_train_nodischarge.RData')
text_rpart <- rpart('death~.', data=text_train[, -c(1, 3)])
fancyRpartPlot(text_rpart)
text_rpart_pred <- predict(text_rpart, text_test, type='class')
cm <- confusionMatrix(text_rpart_pred, text_test$death)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 730 197
## TRUE 41 204
##
## Accuracy : 0.7969
## 95% CI : (0.7727, 0.8196)
## No Information Rate : 0.6578
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5025
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9468
## Specificity : 0.5087
## Pos Pred Value : 0.7875
## Neg Pred Value : 0.8327
## Prevalence : 0.6578
## Detection Rate : 0.6229
## Detection Prevalence : 0.7910
## Balanced Accuracy : 0.7278
##
## 'Positive' Class : FALSE
##
After removing the discharge notes, the model still performs fairly well, with an accuracy of 0.80 (sensitivity = 0.95, specificity = 0.51). Two of the branches that I found especially interesting were “vasopressin” and “chair”: having vasopressin (a drug often used in sepsis to treat hypotension) in the chart was associated with worse outcomes, while having “chair” in the note was associated with increased survival, presumably indicating that strictly bedridden patients do not fare as well.
The combined model - physiologic and text-based predictors together - can also be entered into the rpart decision tree algorithm. First, using the physiologic data from admission (the text-based corpus remains the same):
grp1_rpart <- rpart('death~.', data=grp1_train[, -c(1, 3)])
fancyRpartPlot(grp1_rpart)
grp1_rpart_pred <- predict(grp1_rpart, grp1_test, type='class')
cm <- confusionMatrix(grp1_rpart_pred, grp1_test$death)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 705 167
## TRUE 66 234
##
## Accuracy : 0.8012
## 95% CI : (0.7772, 0.8237)
## No Information Rate : 0.6578
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.53
## Mcnemar's Test P-Value : 5.707e-11
##
## Sensitivity : 0.9144
## Specificity : 0.5835
## Pos Pred Value : 0.8085
## Neg Pred Value : 0.7800
## Prevalence : 0.6578
## Detection Rate : 0.6015
## Detection Prevalence : 0.7440
## Balanced Accuracy : 0.7490
##
## 'Positive' Class : FALSE
##
The accuracy of this combined model is 0.80, similar to the text-based model. However, several important physiologic predictors filter in, including lactate and potassium. Additionally, the sensitivity and specificity of this model are a bit more balanced, at 0.91 and 0.58 respectively.
Next, using the physiologic data from near discharge:
grp2_rpart <- rpart('death~.', data=grp2_train[, -c(1, 3)])
fancyRpartPlot(grp2_rpart)
grp2_rpart_pred <- predict(grp2_rpart, grp2_test, type='class')
cm <- confusionMatrix(grp2_rpart_pred, grp2_test$death)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 678 87
## TRUE 93 314
##
## Accuracy : 0.8464
## 95% CI : (0.8245, 0.8666)
## No Information Rate : 0.6578
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6601
## Mcnemar's Test P-Value : 0.7094
##
## Sensitivity : 0.8794
## Specificity : 0.7830
## Pos Pred Value : 0.8863
## Neg Pred Value : 0.7715
## Prevalence : 0.6578
## Detection Rate : 0.5785
## Detection Prevalence : 0.6527
## Balanced Accuracy : 0.8312
##
## 'Positive' Class : FALSE
##
This combined model has the highest accuracy of any model so far (besides the one that included the discharge notes!), at 0.85. However, it is probably the least clinically feasible model, since it incorporates all of the notes through the duration of the stay as well as the physiologic parameters measured shortly before discharge or death.
Naive Bayesian classification is a classification algorithm for categorical data. Unlike rpart decision trees, which recursively split the data to differentiate between defined classes, Naive Bayesian classification returns a probability of each class for each patient given the entire feature set.
According to Lecture 7, the Naive Bayes algorithm typically depends on the assumption that “all of the features are equally important and independent.” However, the model can still be useful when these assumptions do not hold, especially when the number of features is large, which is why Naive Bayes is “frequently used for text classification.”
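To make the mechanics concrete, here is a toy example (entirely separate from the MIMIC data) showing how naiveBayes combines an independent conditional probability from each feature into a posterior probability per class:

```r
library('e1071')
# Toy Naive Bayes example: each feature contributes an independent
# conditional probability toward the posterior for each class.
train <- data.frame(
  fever  = factor(c('yes', 'yes', 'no', 'no')),
  chills = factor(c('yes', 'no',  'no', 'yes')),
  flu    = factor(c('yes', 'yes', 'no', 'no')))
nb <- naiveBayes(flu ~ ., data = train)
new_pt <- data.frame(fever  = factor('yes', levels = c('no', 'yes')),
                     chills = factor('yes', levels = c('no', 'yes')))
predict(nb, new_pt, type = 'raw')   # posterior probability for each class
```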
As was the case with the rpart models, Naive Bayesian models are constructed with both of the physiologic datasets (admission and discharge/death), with the text-derived features alone, and with the two combined datasets (admission + text, discharge/death + text).
# Physiologic predictors after admission
phy1_nbayes <- naiveBayes(x=phy1_train[, -c(1, 2, 3)], y=phy1_train$death)
phy1_nbayes_pred <- predict(phy1_nbayes, phy1_test, type='class')
cm <- confusionMatrix(phy1_nbayes_pred, phy1_test$death)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 678 265
## TRUE 93 136
##
## Accuracy : 0.6945
## 95% CI : (0.6673, 0.7208)
## No Information Rate : 0.6578
## P-Value [Acc > NIR] : 0.004179
##
## Kappa : 0.2436
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8794
## Specificity : 0.3392
## Pos Pred Value : 0.7190
## Neg Pred Value : 0.5939
## Prevalence : 0.6578
## Detection Rate : 0.5785
## Detection Prevalence : 0.8046
## Balanced Accuracy : 0.6093
##
## 'Positive' Class : FALSE
##
Similar to the previous rpart model of initial physiologic data, the accuracy of this model was 0.69. The low specificity (0.34) is particularly concerning. Part of the reason the Naive Bayesian model did not achieve particularly good results may be that there are relatively few features (12) and they are unlikely to be equally important in determining the patient’s outcome.
# Physiologic predictors before discharge
phy2_nbayes <- naiveBayes(x=phy2_train[, -c(1, 2, 3)], y=phy2_train$death)
phy2_nbayes_pred <- predict(phy2_nbayes, phy2_test, type='class')
cm <- confusionMatrix(phy2_nbayes_pred, phy2_test$death)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 716 128
## TRUE 55 273
##
## Accuracy : 0.8439
## 95% CI : (0.8218, 0.8642)
## No Information Rate : 0.6578
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6373
## Mcnemar's Test P-Value : 1.024e-07
##
## Sensitivity : 0.9287
## Specificity : 0.6808
## Pos Pred Value : 0.8483
## Neg Pred Value : 0.8323
## Prevalence : 0.6578
## Detection Rate : 0.6109
## Detection Prevalence : 0.7201
## Balanced Accuracy : 0.8047
##
## 'Positive' Class : FALSE
##
The model using late-stage physiologic data was, unsurprisingly, considerably better, with an accuracy of 0.84 and a sensitivity and specificity of 0.93 and 0.68, respectively.
Next, I use the Naive Bayes model to classify subjects based on the corpus of notes for each patient.
# Text-derived predictors
text_nbayes <- naiveBayes(x=text_train[, -c(1, 2, 3)], y=text_train$death)
text_nbayes_pred <- predict(text_nbayes, text_test, type='class')
cm <- confusionMatrix(text_nbayes_pred, text_test$death)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 701 329
## TRUE 70 72
##
## Accuracy : 0.6596
## 95% CI : (0.6316, 0.6867)
## No Information Rate : 0.6578
## P-Value [Acc > NIR] : 0.4645
##
## Kappa : 0.105
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9092
## Specificity : 0.1796
## Pos Pred Value : 0.6806
## Neg Pred Value : 0.5070
## Prevalence : 0.6578
## Detection Rate : 0.5981
## Detection Prevalence : 0.8788
## Balanced Accuracy : 0.5444
##
## 'Positive' Class : FALSE
##
This model did not perform particularly well: the majority of patients were classified as survivors regardless of their actual mortality status, as indicated by the high sensitivity but low specificity.
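One possible improvement, which is my own suggestion rather than something tested above, is to binarize the word counts before fitting: naiveBayes treats numeric columns as Gaussian-distributed, whereas converting each term to a present/absent factor lets it build categorical conditional tables, a common convention in text classification. A sketch with a toy stand-in for the DTM columns:

```r
# Hedged sketch (not tested above): binarize term counts into
# present/absent factors so naiveBayes builds categorical conditional
# tables rather than assuming Gaussian-distributed counts.
to_factor <- function(x) factor(ifelse(x > 0, 'Yes', 'No'), levels = c('No', 'Yes'))
counts    <- data.frame(vasopressin = c(0, 2, 1, 0),  # toy stand-in for DTM columns
                        chair       = c(3, 0, 0, 1))
binarized <- as.data.frame(lapply(counts, to_factor))
```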
# Grouped: physiologic parameters after admission + text
grp1_nbayes <- naiveBayes(x=grp1_train[, -c(1, 2, 3)], y=grp1_train$death)
grp1_nbayes_pred <- predict(grp1_nbayes, grp1_test, type='class')
cm <- confusionMatrix(grp1_nbayes_pred, grp1_test$death)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 699 329
## TRUE 72 72
##
## Accuracy : 0.6578
## 95% CI : (0.6299, 0.685)
## No Information Rate : 0.6578
## P-Value [Acc > NIR] : 0.5136
##
## Kappa : 0.1018
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9066
## Specificity : 0.1796
## Pos Pred Value : 0.6800
## Neg Pred Value : 0.5000
## Prevalence : 0.6578
## Detection Rate : 0.5964
## Detection Prevalence : 0.8771
## Balanced Accuracy : 0.5431
##
## 'Positive' Class : FALSE
##
# Grouped: physiologic parameters before discharge + text
grp2_nbayes <- naiveBayes(x=grp2_train[, -c(1, 2, 3)], y=grp2_train$death)
grp2_nbayes_pred <- predict(grp2_nbayes, grp2_test, type='class')
cm <- confusionMatrix(grp2_nbayes_pred, grp2_test$death)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 697 319
## TRUE 74 82
##
## Accuracy : 0.6647
## 95% CI : (0.6368, 0.6917)
## No Information Rate : 0.6578
## P-Value [Acc > NIR] : 0.323
##
## Kappa : 0.1271
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9040
## Specificity : 0.2045
## Pos Pred Value : 0.6860
## Neg Pred Value : 0.5256
## Prevalence : 0.6578
## Detection Rate : 0.5947
## Detection Prevalence : 0.8669
## Balanced Accuracy : 0.5543
##
## 'Positive' Class : FALSE
##
Both of the grouped models performed approximately as well as the text-based model alone. This is likely because all parameters are assumed to play an equal role and the number of text-based predictors (1251) far exceeds the number of physiologic features (11).
While the intent of this project was to investigate sepsis patients in the MIMIC-III database and use predictive analytic tools to predict patient mortality from physiologic and text-derived features, there were several limitations. The cohort selection process, discussed previously, is one such limitation: patients who are discharged to long-term care or other outside facilities are “lost to follow-up” and are assumed to have survived their encounter. Additionally, due to the immense size and complex nature of the MIMIC dataset, only a small fraction of the physiologic data collected could be leveraged. With regard to text mining, the data represent only a single hospital over a finite number of years, so the precise decision trees and classification models may not generalize to other hospital systems. Lastly, future work should better utilize the longitudinal nature of this data, as pooling measurements into “near admission” and “near discharge” groups and aggregating all notes into a single document is sub-optimal. Despite these limitations, this project was able to demonstrate that there are important physiologic and text-derived predictors present in this dataset that merit further investigation.
Of the approximately 4700 sepsis encounters isolated in the MIMIC-III dataset, roughly one-third of these patients died while the remainder survived at least until discharge from the hospital. Predicting mortality is a major focus of new sepsis screening tests, and adequately assessing risk is an important part of early, directed treatment.
Using 11 commonly cited sepsis predictors, in addition to the free-text data entered as physician and nursing notes, I attempted to build models that correctly classify the patients at highest risk. While much attention has historically been placed on physiologic parameters, the text-based models using natural language processing outperformed the physiologic parameters in nearly every case. In particular, a decision tree can differentiate between patients using a limited number of words and features, making it easier to deploy, use, and understand. Notably, many of these models have higher sensitivity than specificity, indicating that future iterations of this work may be better suited as a screening tool than as a specific predictor.
I would first like to thank Dr. Ivo Dinov of the University of Michigan Statistics Online Computational Resource (SOCR) and Michigan Institute for Data Science (MIDAS), whose excellent Data Analysis and Predictive Analytics course provided the foundational analytic tools for this project. I also owe much to three of my mentors, Dr. Hakam Tiba, Mr. Brendan McCracken and Dr. Kevin Ward of the Michigan Center for Integrative Research in Critical Care. Lastly, I would like to thank Dr. Ashwin Belle of Fifth-Eye Inc., who introduced me to the MIMIC-III dataset, and Dr. Robert Dickson of the Division of Pulmonary and Critical Care Medicine, who first pointed me in the direction of further exploring mortality prediction in sepsis.
# Disconnect from database
invisible(dbDisconnect(con))