Author's note: The work contained in this document represents the final submission for the HS650 course at the University of Michigan, Ann Arbor. Copyright Brandon Cummings, 2018. No part of this work may be reproduced without the author's written permission.

Libraries Used:

# Data I/O
library('RPostgreSQL')    # access MIMIC-III
library('reshape2')       # melting dataframes

# Data preparation
library('psych')          # descriptive stats
library('tm')             # text mining
library('SnowballC')      # text stemming

# Analytics
library('rpart')          # rpart decision tree
library('caret')          # confusion matrices
library('e1071')          # naive Bayes

# Plots and tables
library('knitr')          # knitting Rmd to HTML; kable() function
library('kableExtra')     # extra formating options for knitr tables
library('DT')             # dynamic tables
library('ggplot2')        # 2d plotting
library('ggpubr')         # extra formatting options for ggplot
library('wordcloud')      # visualizing free text data
library('rattle')         # fancyRPartPlot

Random Seed Set

set.seed(123456)

1 Abstract

Introduction: Sepsis has recently been defined as a “dysregulated host response to infection” (Singer et al., 2016). Despite being a “leading cause of death and long-term disability in developed countries” (Abe et al., 2018), the medical and scientific community has struggled to enact a serviceable operational definition since the time of Hippocrates (Majno, 1991). Recent consensus definitions have attempted to implement scoring rubrics to guide diagnosis and treatment, and are increasingly focused on predicting mortality among septic patients (Bone et al., 1992; Singer et al., 2016).

Materials and Methods: Using data from the Medical Information Mart for Intensive Care (MIMIC-III), I extract many of the physiologic parameters present in these scoring rubrics and subject them to predictive analytic models in an effort to predict which patients survived to leave the hospital. In addition, physician and nursing notes were text-mined to reveal commonly used words. An rpart classification tree and a naive Bayes classifier were selected as initial models for investigation due to their opposing classification strategies.

Results: In almost all cases, the text-derived predictors outperformed the quantitative predictors, with an accuracy approaching 80%. In general, the resulting models had higher sensitivity than specificity, indicating that they may be of more use as an initial screening tool than as a definitive predictor.

Conclusion: While further investigation is required to refine the data extraction and the predictive models, there appears to be a signal present which allows for binary classification between survivors and non-survivors. Additionally, text-mining and natural language processing may be under-utilized tools for predicting mortality in a population of septic ICU patients.


2 Background

Modern sepsis is widely defined as a “dysregulated host response to infection” (Singer et al., 2016). It is commonly cited as a “leading cause of death and long-term disability in developed countries,” and recent international multicenter cohort studies have estimated the mortality rate of severe sepsis to range from 23% to as high as 49% (Abe et al., 2018). Those who do survive often suffer long-term physical and cognitive impairments (Iwashyna, Ely, Smith & Langa, 2010). It is one of the four most costly conditions in the United States, and in 2011 accounted for 5.2% of total healthcare expenditures with a price tag of $20 billion (Torio & Andrews, 2013). And it is a growing issue: while estimates of sepsis prevalence can vary widely based on the criteria used, annual increases in severe sepsis are consistently estimated to be as high as 13% nationwide (Gaieski, Edwards, Kallan, & Carr, 2013).

The heterogeneity of sepsis makes it a difficult condition to define (Gül, Arslantaş, Cinel, & Kumar, 2017). Early definitions dating back to Hippocrates “claimed that sepsis (σήψις) was the process by which flesh rots, swamps generate foul airs, and wounds fester” (Angus & van der Poll, 2013; Majno, 1991). Later, work by Louis Pasteur and others led to the notion of “blood poisoning” (Angus & van der Poll, 2013). This notion of sepsis as a systemic infection in which pathogenic microbes invade the bloodstream lasted until a 1991 meeting of the Critical Care Medicine Consensus Conference (Bone et al., 1992). Newer, more potent antibiotics led to the discovery that much sepsis-associated pathology could occur in patients who had been completely cleared of infection. Because of this discovery, the focus shifted from the microorganism to the host response to infection, which the conference referred to as the “Systemic Inflammatory Response Syndrome” or SIRS (Table 1).



Table 1: SIRS Criteria. From (Singer et al., 2016). In conjunction with a plausible source of infection, the systemic inflammatory response syndrome (SIRS) criteria formed the backbone of the sepsis definition until 2016.
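Table 1 is reproduced as an image in the original document. As a rough illustration of the rubric it describes, the SIRS tally (Bone et al., 1992) can be sketched as below; the function name and encodings (temperature in °C, WBC in 10³ cells/mm³) are my own and are not part of the analysis code that follows.

```r
# Illustrative sketch of the SIRS count: one point each for abnormal
# temperature, heart rate, respiratory rate / PaCO2, and WBC.
# Hypothetical helper, not part of the original analysis.
sirs_score <- function(temp_c, hr, rr, paco2, wbc, bands_pct = 0) {
  (temp_c > 38 | temp_c < 36) +          # temperature > 38 C or < 36 C
  (hr > 90) +                            # heart rate > 90 bpm
  (rr > 20 | paco2 < 32) +               # RR > 20/min or PaCO2 < 32 mmHg
  (wbc > 12 | wbc < 4 | bands_pct > 10)  # WBC in 10^3 cells/mm^3, or >10% bands
}

# Example: febrile, tachycardic patient with normal RR/PaCO2 and WBC
sirs_score(temp_c = 39.1, hr = 104, rr = 16, paco2 = 40, wbc = 9)  # 2
```

Because the comparisons are vectorized, the same sketch also scores whole columns of vitals at once.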


The 1991 consensus definition was revised by a task force in 2001 but left largely unchanged (Levy et al., 2003). This operational definition (Sepsis-II) was the paradigm used during the data collection period of the MIMIC-III data set, and as such is particularly relevant. The condition of sepsis was conceptualized as a three-tiered hierarchy where each tier represents an increased risk of mortality (Figure 1):

  • Sepsis: patient meets the SIRS criteria and has a plausible source of infection.
  • Severe sepsis: patient meets the requirements for sepsis and has evidence of organ dysfunction.
  • Septic shock: patient meets the requirements for severe sepsis and has “sepsis-induced hypotension persisting despite adequate fluid resuscitation” (Bone et al., 1992).

In 2016 the definition was revised again, this time drastically (Sepsis-III). This new paradigm defines sepsis as “life-threatening organ dysfunction caused by a dysregulated host response to infection” (Singer et al., 2016). The model adapts the previously established Sequential Organ Failure Assessment (SOFA) score to form an operational definition of sepsis (Table 2), and further characterizes septic shock as a “subset of sepsis in which particularly profound circulatory, cellular, and metabolic abnormalities are associated with a greater risk of mortality than with sepsis alone” (Singer et al., 2016). The task force also introduces a “quick SOFA” (qSOFA) score for rapid clinical identification (Table 3).



Table 2: SOFA Criteria. From (Singer et al., 2016). The Sequential [Sepsis-Related] Organ Failure Assessment (SOFA) score. In addition to a suspected source of infection, diagnosis under Sepsis-III requires a score of at least two, which is associated with an in-hospital mortality risk of approximately 10%.



Table 3: qSOFA Criteria. From (Singer et al., 2016). The “Quick SOFA” (qSOFA) score is a distilled version of the full organ dysfunction system, designed for quick evaluation by clinical staff at the point of care. A score of two or more qualifies the patient for further assessment.
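Table 3 is likewise reproduced as an image. The qSOFA tally is simple enough to sketch directly: per Singer et al. (2016), one point is scored for each of respiratory rate ≥ 22/min, systolic blood pressure ≤ 100 mmHg, and altered mentation. The function below is an illustrative sketch (names are mine, and altered mentation is encoded as GCS < 15), not code used in the analysis.

```r
# Illustrative qSOFA tally: a score of 2 or more flags the patient
# for further assessment. Hypothetical helper function.
qsofa_score <- function(rr, sbp, gcs) {
  (rr >= 22) +    # respiratory rate >= 22/min
  (sbp <= 100) +  # systolic blood pressure <= 100 mmHg
  (gcs < 15)      # altered mentation (Glasgow Coma Scale < 15)
}

qsofa_score(rr = 24, sbp = 95, gcs = 15)  # 2: meets the flag threshold
```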


Interestingly, one of the major focal points of this model is using the SOFA score as a predictor of mortality among septic patients. Even though the SOFA score was not used to evaluate patients during the period in which the MIMIC-III data set was collected, and thus not all of the information is readily available for all patients, this new-found emphasis on mortality prediction pushed me to further explore the issue. Existing models under both Sepsis-II and -III have been limited to what is “clinically feasible” - that is, not every test can be run on every patient. The MIMIC-III database, however, is greatly expanded: in addition to a great many physiologic parameters, it includes a wealth of free-text data stored as notes which may be underutilized. Using this expanded data set, I wish to create a model (or rather, models) which can differentiate between survivors and non-survivors and compare its predictive value to a gold standard of SIRS.



Figure 1: Definitions Associated with Sepsis-II and Sepsis-III. Adapted from (Bone et al., 1992) and (Singer et al., 2016).


3 Materials and Methods

3.1 MIMIC-III

The Medical Information Mart for Intensive Care (MIMIC-III) is a data set developed by the Massachusetts Institute of Technology (MIT) which contains de-identified healthcare data from more than 40,000 intensive care unit (ICU) patients over the years 2001-2012 (Johnson et al., 2016). It includes most of the information found in an electronic health record (EHR), including demographics, diagnosis codes, procedures, lab values, vital signs, admission and discharge information, physician notes, and, optionally, waveforms. It is available free of charge; however, accessing the database requires completion of an online course on research ethics.

A local copy of the data was stored as a PostgreSQL database. This was my first time interacting with any SQL system and it was a fun challenge. To build the database, I followed the tutorial listed on the PhysioNet website and used the tools available in the mimic-code GitHub repository (“Install MIMIC (Unix/Mac),” n.d.; Johnson, Stone, Celi, & Pollard, 2018).

Once the database was constructed, I used the RPostgreSQL library for data import and manipulation. Note that due to the large size of the data set, querying the database can be computationally intensive. In some of the larger queries, the results are cached as an .RData file and loaded in separately. This also occurs when processing the free-text document corpus, which is computationally intensive.

The first step is opening the database connection (stored in the variable con) using the functions dbConnect() and dbExecute(). This connection will remain open until it is closed using the function dbDisconnect(). Note that the invisible() wrapper around the dbExecute() function serves only to silence the output.

con <- dbConnect(dbDriver('PostgreSQL'),
                 dbname = 'mimic',
                 host = '127.0.0.1',
                 port = '5432',
                 user = 'postgres',
                 password = 'postgres')

invisible(dbExecute(con, paste("SET search_path TO ",
                               schema = 'mimiciii',
                               sep=" ")))

3.2 Cohort Selection and Preprocessing

3.2.1 Filtering

After acquiring the data, the first step was to extract septic patients. To do this, I use the International Classification of Diseases (ICD-9) codes associated with each patient’s admission. The MIMIC-III database contains a table D_ICD_DIAGNOSES which maps these codes to their interpretations.

d_icd_diagnosis <- dbGetQuery(con, "SELECT * 
                                    FROM D_ICD_DIAGNOSES")

kable(head(d_icd_diagnosis), caption="Sample of `D_ICD_DIAGNOSES` SQL table") %>%
  kable_styling(bootstrap_options='striped')
Sample of D_ICD_DIAGNOSES SQL table
row_id icd9_code short_title long_title
174 01166 TB pneumonia-oth test Tuberculous pneumonia [any form], tubercle bacilli not found by bacteriological or histological examination, but tuberculosis confirmed by other methods [inoculation of animals]
175 01170 TB pneumothorax-unspec Tuberculous pneumothorax, unspecified
176 01171 TB pneumothorax-no exam Tuberculous pneumothorax, bacteriological or histological examination not done
177 01172 TB pneumothorx-exam unkn Tuberculous pneumothorax, bacteriological or histological examination unknown (at present)
178 01173 TB pneumothorax-micro dx Tuberculous pneumothorax, tubercle bacilli found (in sputum) by microscopy
179 01174 TB pneumothorax-cult dx Tuberculous pneumothorax, tubercle bacilli not found (in sputum) by microscopy, but found by bacterial culture

From within these diagnosis identifiers, I search for the three diagnoses specified under the Bone et al. Sepsis-II criteria: Sepsis, Severe sepsis, and Septic shock.

sepsis_codes <- d_icd_diagnosis[d_icd_diagnosis$short_title %in% c('Sepsis', 'Severe sepsis', 'Septic shock'), ]

kable(sepsis_codes, caption='ICD9 codes of interest') %>%
  kable_styling(bootstrap_options='striped')
ICD9 codes of interest
row_id icd9_code short_title long_title
10305 11403 99591 Sepsis Sepsis
10306 11404 99592 Severe sepsis Severe sepsis
13143 12991 78552 Septic shock Septic shock

Under the MIMIC-III schema, encounters are identified both by a patient identifier (subject_id) and by a hospital admission identifier (hadm_id). In this format, each hospital admission is associated with one unique patient, but a given patient may have multiple admissions in which they were diagnosed with sepsis. Since the primary variable of interest is mortality, I am mostly concerned with the last admission during which a patient was diagnosed with one of the above three ICD-9 codes. For the purposes of this analysis, I assume that patients who are discharged by the hospital and do not return within the study period have survived the disease. One limitation of this selection process is that patients may have been discharged to an outside system such as another hospital or a long-term care facility, in which case our knowledge of the outcome is limited.

The MIMIC-III database contains a DIAGNOSES_ICD table which contains the subject_id, hadm_id, and ICD-9 codes for each encounter. First, a list of all admissions which are associated with the sepsis ICD-9 codes is generated. I then add a severity column to the sepsis_hadm dataframe where:

  • 1 = Sepsis
  • 2 = Severe Sepsis
  • 3 = Septic Shock

diagnosis_icd <- dbGetQuery(con, "SELECT *
                                  FROM DIAGNOSES_ICD")

sepsis_hadm <- diagnosis_icd[diagnosis_icd$icd9_code %in% sepsis_codes$icd9_code,]

sepsis_hadm$severity <- factor(sepsis_hadm$icd9_code,
                              levels = c(99591, 99592, 78552),
                              labels=c(1, 2, 3))

kable(head(sepsis_hadm), caption="Sample septic patients in `DIAGNOSES_ICD` SQL table") %>%
  kable_styling(bootstrap_options='striped')
Sample septic patients in DIAGNOSES_ICD SQL table
row_id subject_id hadm_id seq_num icd9_code severity
88 1547 117 164853 16 99592 2
145 1604 124 138376 6 99592 2
277 505 64 172056 3 99591 1
451 679 85 112077 18 99591 1
739 131 21 111970 2 78552 3
748 140 21 111970 11 99592 2

Due to the aggregate nature of the sepsis diagnoses, patients may have multiple sepsis-related ICD-9 codes in the database (e.g., both sepsis and septic shock). Therefore, I filter this list to include only the most severe diagnosis: e.g., if a patient has been diagnosed with both sepsis and septic shock, only the “septic shock” row is preserved, as it is assumed that all “septic shock” patients are also suffering from sepsis. Practically speaking, this is done by sorting the dataframe by severity and then removing hadm_id duplicates as they appear.

sepsis_hadm_sorted <- sepsis_hadm[order(sepsis_hadm$severity, decreasing=T),]
sepsis_hadm <- sepsis_hadm_sorted[!duplicated(sepsis_hadm_sorted$hadm_id),]
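The sort-then-deduplicate pattern can be sanity-checked on a toy dataframe (hypothetical values): a stay carrying both a sepsis row (severity 1) and a septic shock row (severity 3) should keep only the most severe row.

```r
# Toy check of the sort-then-deduplicate pattern (hypothetical values).
toy <- data.frame(hadm_id  = c(100, 100, 200),
                  severity = factor(c(1, 3, 2), levels = 1:3))

toy_sorted <- toy[order(toy$severity, decreasing = TRUE), ]
toy_dedup  <- toy_sorted[!duplicated(toy_sorted$hadm_id), ]

# toy_dedup keeps hadm_id 100 with severity 3 only, and 200 with severity 2
toy_dedup
```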

The next step is to remove duplicate patients so that only the last available encounter is preserved. To do this, the sepsis_hadm identifiers filtered above are sorted based on admission time, and the duplicate subject_id is removed. The result is a list of unique hadm_id identifiers which represent a patient’s last sepsis-associated encounter.

admissions <- dbGetQuery(con, sprintf("SELECT *
                                       FROM ADMISSIONS
                                       WHERE hadm_id IN (%s)",
                                       paste(as.character(sepsis_hadm[,'hadm_id']), collapse=", ")))

admissions_sorted <- admissions[order(admissions$admittime, decreasing=T),]
hadm_list <- admissions_sorted[!duplicated(admissions_sorted$subject_id), 'hadm_id']

The final admissions list contains 4689 unique hospital admission identifiers which can be used to extract data from each patient’s last sepsis-associated encounter.


3.2.2 Verification

That this process worked as intended can be verified using subject 353. He or she was admitted on three separate occasions, during one of which he or she was diagnosed with both severe sepsis and septic shock.

kable(sepsis_hadm_sorted[sepsis_hadm_sorted$subject_id == 353, ], caption='Example: Patient 353 before sorting') %>%
  kable_styling(bootstrap_options = 'striped')
Example: Patient 353 before sorting
row_id subject_id hadm_id seq_num icd9_code severity
4369 4145 353 112976 5 78552 3
4368 4144 353 112976 4 99592 2
4357 4133 353 108923 6 99591 1
4387 4163 353 131488 3 99591 1

Next, we see that after the additional filtering there is only one row listed for admission 112976 - septic shock (severity level 3). Please note that the times, while internally consistent within patients, have been randomized to protect confidentiality, which is why admission years can fall beyond the year 2100.

kable(admissions[admissions$subject_id == 353, 1:5], caption='Example: Patient 353 after removal of duplicate admissions') %>%
  kable_styling(bootstrap_options = 'striped')
Example: Patient 353 after removal of duplicate admissions
row_id subject_id hadm_id admittime dischtime
479 448 353 108923 2151-03-28 16:01:00 2151-04-13 16:10:00
694 449 353 112976 2151-06-23 22:18:00 2151-07-04 13:22:00
1727 450 353 131488 2151-10-01 20:42:00 2151-10-20 15:20:00

Lastly, we confirm that there is only the latest identifier left in the final hadm_list:

kable(cbind(admissions[admissions$subject_id == 353, 1:5],
            admissions[admissions$subject_id == 353, 'hadm_id'] %in% hadm_list),
            col.names=append(colnames(admissions)[1:5], 'in_hadm_list'),
            caption='Admission identifiers and times included in final admission list') %>%
  kable_styling(bootstrap_options = 'striped')
Admission identifiers and times included in final admission list
row_id subject_id hadm_id admittime dischtime in_hadm_list
479 448 353 108923 2151-03-28 16:01:00 2151-04-13 16:10:00 FALSE
694 449 353 112976 2151-06-23 22:18:00 2151-07-04 13:22:00 FALSE
1727 450 353 131488 2151-10-01 20:42:00 2151-10-20 15:20:00 TRUE

3.2.3 Cohort Demographics

3.2.3.1 Plotting functions

Before delving into the predictive modeling aspect, I wanted to provide some insight into the selected patient cohort via exploratory data analysis. To aid this, I first define several plotting functions which will make the following sections more convenient.

Pie Chart:

plotPieChart <- function(d, legend, title){
  fig <- ggplot(d, aes(x=factor(1), fill=legend)) +
    geom_bar(width=1) +
    coord_polar('y') +
    xlab('') + ylab('') + ggtitle(sprintf('Pie Chart of %s', title)) +
    scale_fill_discrete(name=title) +
    theme(axis.title.x=element_blank(), axis.title.y=element_blank(),
          axis.text.x=element_blank(), axis.text.y=element_blank(),
          axis.ticks.x=element_blank(), axis.ticks.y=element_blank(),
          panel.background=element_blank(), panel.border=element_blank(),
          panel.grid.major=element_blank(), panel.grid.minor=element_blank(),
          plot.background=element_blank())
}

Bar Chart:

plotBarChart <- function(d, legend, title){
  fig <- ggplot(d, aes(
      x=factor(legend, levels=names(sort(table(legend), decreasing=TRUE))),
      fill=factor(legend, levels=names(sort(table(legend), decreasing=TRUE))))) +
    scale_fill_discrete(name=title) +
    ggtitle(sprintf('Bar Chart of %s', title)) + xlab(title) +
    geom_bar()
}

Histogram:

plotHistogram <- function(d, xdata, title){
  fig <- ggplot(data=d, aes(x=xdata)) +
    geom_histogram() + 
    ggtitle(sprintf('Histogram of %s', title)) + xlab(title)
}

3.2.3.2 Mortality

First, and perhaps most pressingly, I wanted to examine the mortality rate in my cohort. To do this, I used the presence or absence of the deathtime attribute in the ADMISSIONS table.

admissions <- dbGetQuery(con, sprintf("SELECT *
                                       FROM ADMISSIONS
                                       WHERE HADM_ID IN (%s)",
                                       paste(as.character(hadm_list), collapse=", ")))

print(plotPieChart(admissions, !is.na(admissions$deathtime), 'Patient Mortality'))

Of the 4689 patients in the selected cohort, 3027 (65%) survived to discharge while the remaining 1662 (35%) did not. This falls within the range of estimates cited by Abe et al.’s 2018 study and review, which reported mortality rates ranging from 23-49%. This high mortality rate is concerning!


3.2.3.3 Length of Stay

Length of stay (LoS) was calculated for each patient as the difference between discharge and admission times. Note that for patients who did not survive, the dischtime attribute is equal to deathtime. The LoS is then visualized via histogram with summary statistics provided by the describe() function in the psych package. Additionally, LoS was calculated independently for patients who survived until discharge and those who did not.

admissions$length_of_stay <- difftime(admissions$dischtime, admissions$admittime, units='days')

ggarrange(
  plotHistogram(admissions,
                admissions$length_of_stay,
                'LoS: All Patients (days)'),
  plotHistogram(admissions[is.na(admissions$deathtime),],
                admissions$length_of_stay[is.na(admissions$deathtime)],
                'LoS: Discharged Patients (days)'),
  plotHistogram(admissions[!is.na(admissions$deathtime),],
                admissions$length_of_stay[!is.na(admissions$deathtime)],
                'LoS: Deceased Patients (days)'))

Each of the above distributions is unimodal and heavily right-skewed. Summary statistics are presented below.

tbl <- data.frame(All_Patients=t(describe(as.numeric(admissions$length_of_stay))),
                  Discharged_Patients=t(describe(as.numeric(admissions$length_of_stay[is.na(admissions$deathtime)]))),
                  Deceased_Patients=t(describe(as.numeric(admissions$length_of_stay[!is.na(admissions$deathtime)]))))

kable(tbl, caption='LoS: All Patients (days)', digits=2, col.names=c("All Patients", "Discharged Patients", "Deceased Patients")) %>%
  kable_styling(bootstrap_options = 'striped')
LoS: All Patients (days)
All Patients Discharged Patients Deceased Patients
vars 1.00 1.00 1.00
n 4689.00 3027.00 1662.00
mean 14.71 15.16 13.87
sd 16.28 15.25 17.99
median 9.79 10.24 8.40
trimmed 11.82 12.41 10.44
mad 8.28 7.70 9.67
min -0.32 0.16 -0.32
max 294.62 294.62 206.43
range 294.94 294.46 206.75
skew 4.18 4.59 3.70
kurtosis 36.24 49.18 22.39
se 0.24 0.28 0.44

3.2.3.4 Microbiology cultures

Even though sepsis is a disease characterized by the host’s response, rather than a physical etiologic agent, it may be interesting to examine the types of organisms that are found during sepsis. To extract this information, I use the MICROBIOLOGYEVENTS table which contains the results from microbiology blood cultures. Additionally, NA values are interpreted as negative test results as per the MIMIC-III documentation. The 15 most common pathogens are reported below, along with their count and relative frequency.

microbiology_events <- dbGetQuery(con, sprintf("SELECT *
                                         FROM MICROBIOLOGYEVENTS
                                         WHERE hadm_id IN (%s) AND spec_type_desc = 'BLOOD CULTURE'",
                                         paste(as.character(hadm_list), collapse=", ")))
microbiology_events$org_name[is.na(microbiology_events$org_name)] <- 'Negative'

tbl <- data.frame(name = sort(table(microbiology_events$org_name), decreasing=T),
                  count = sort(prop.table(table(microbiology_events$org_name)), decreasing=T) )

kable(tbl[1:15, c(1, 2, 4)], col.names=c("Organism", "Count", "Frequency"), caption="Most common microorganisms") %>%
  kable_styling(bootstrap_options = 'striped')
Most common microorganisms
Organism Count Frequency
Negative 29568 0.5825059
ESCHERICHIA COLI 4829 0.0951340
STAPHYLOCOCCUS, COAGULASE NEGATIVE 4105 0.0808708
STAPH AUREUS COAG + 3756 0.0739953
KLEBSIELLA PNEUMONIAE 2071 0.0407998
ENTEROCOCCUS FAECIUM 1053 0.0207447
PSEUDOMONAS AERUGINOSA 720 0.0141844
PROTEUS MIRABILIS 524 0.0103231
ENTEROBACTER CLOACAE 421 0.0082939
ENTEROCOCCUS FAECALIS 404 0.0079590
STREPTOCOCCUS PNEUMONIAE 319 0.0062845
SERRATIA MARCESCENS 279 0.0054965
KLEBSIELLA OXYTOCA 247 0.0048660
ENTEROBACTER CLOACAE COMPLEX 142 0.0027975
STAPHYLOCOCCUS EPIDERMIDIS 140 0.0027581

According to these statistics, of the 116 unique microbiology results found, the most common finding is a negative test result. This isn’t surprising, given that only a fraction of bacteria can be cultured in the first place and that sepsis can continue long after the initial etiologic pathogen has been eradicated.


3.2.3.5 Underlying diagnoses

While some patients present to the hospital with full-blown sepsis, conditions such as pneumonia or urinary tract infections can also lead to sepsis. By examining the initial chief complaint from the ADMISSIONS table, we can evaluate some of the most common conditions preceding sepsis.

tbl <- data.frame(name = sort(table(admissions$diagnosis), decreasing=T),
                  count = sort(prop.table(table(admissions$diagnosis)), decreasing=T) )

kable(tbl[1:15, c(1, 2, 4)], col.names=c("Initial Diagnosis", "Count", "Frequency"), caption="Most common initial diagnoses") %>%
  kable_styling(bootstrap_options = 'striped')
Most common initial diagnoses
Initial Diagnosis Count Frequency
SEPSIS 712 0.1518447
PNEUMONIA 369 0.0786948
FEVER 150 0.0319898
ABDOMINAL PAIN 127 0.0270847
HYPOTENSION 114 0.0243122
ALTERED MENTAL STATUS 87 0.0185541
UROSEPSIS 60 0.0127959
CHOLANGITIS 57 0.0121561
URINARY TRACT INFECTION;PYELONEPHRITIS 54 0.0115163
CONGESTIVE HEART FAILURE 51 0.0108765
CELLULITIS 44 0.0093837
ACUTE RENAL FAILURE 41 0.0087439
LIVER FAILURE 39 0.0083173
PANCREATITIS 33 0.0070377
GASTROINTESTINAL BLEED 30 0.0063980

3.2.3.6 Admission location

In addition to initial diagnosis, patients can enter the hospital system in a variety of ways. Among our cohort, by far the most common ingress point is through the emergency department. Elective procedures and urgent care also represented a fair number of patients. There was a small number of newborn patients in the cohort.

tbl <- data.frame(name = sort(table(admissions$admission_type), decreasing=T),
                  count = sort(prop.table(table(admissions$admission_type)), decreasing=T) )

kable(tbl[, c(1, 2, 4)], col.names=c("Admission Type", "Count", "Frequency"), caption="Most common admission types") %>%
  kable_styling(bootstrap_options = 'striped')
Most common admission types
Admission Type Count Frequency
EMERGENCY 4463 0.9518021
ELECTIVE 150 0.0319898
URGENT 72 0.0153551
NEWBORN 4 0.0008531

3.3 Numerical Data Preprocessing

After exploring the selected cohort, I begin extracting features from the database. This was somewhat difficult and very time consuming, but the result was a data set containing 12 numerical predictors collected at the beginning and end of the patient’s stay. First, I create a clean slate by removing all of the excess data involved in the exploratory data analysis section.

rm(list=setdiff(ls(), c('hadm_list', 'con', 'plotHistogram', 'plotPieChart', 'plotBarChart')))

I then pre-allocate a dataframe with the admission and subject ID numbers, as well as the admission time, discharge time, and a Boolean value representing whether or not the patient was deceased.

admissions <- dbGetQuery(con, sprintf("SELECT *
                                       FROM ADMISSIONS
                                       WHERE HADM_ID IN (%s)",
                                       paste(as.character(hadm_list), collapse=", ")))

data <- cbind(admissions[, c('subject_id', 'hadm_id', 'admittime', 'dischtime')], !is.na(admissions$deathtime))
colnames(data) <- c('subject_id', 'hadm_id', 'admittime', 'dischtime', 'death')

I also want to add some index of severity (sepsis vs. severe sepsis vs. septic shock), and I plot a pie chart showing the relative proportions of each level (1 = sepsis, 2 = severe sepsis, 3 = septic shock). As can be seen, almost half of the patients in the cohort were diagnosed with septic shock.

diagnosis_icd <- dbGetQuery(con, sprintf("SELECT *
                                          FROM DIAGNOSES_ICD
                                          WHERE HADM_ID IN (%s)",
                                          paste(as.character(hadm_list), collapse=", ")))

diagnosis_icd_sepsis <- diagnosis_icd[diagnosis_icd$icd9_code %in% c(99591, 99592, 78552),c('subject_id', 'hadm_id', 'icd9_code')]

diagnosis_icd_sepsis$severity <- factor(diagnosis_icd_sepsis$icd9_code,
                   levels = c(99591, 99592, 78552),
                   labels=c(1, 2, 3))

diagnosis_icd_sorted <- diagnosis_icd_sepsis[order(diagnosis_icd_sepsis$severity, decreasing=T),]
diagnosis_icd_sorted <- diagnosis_icd_sorted[!duplicated(diagnosis_icd_sorted$hadm_id),]

data <- merge(data, diagnosis_icd_sorted, by=c('hadm_id', 'subject_id'), all.x=T, all.y=F)

print(plotPieChart(data, data$severity, 'Severity'))

After preallocating the data table, I define a function to extract data from the LABEVENTS and CHARTEVENTS portions of the database. Then, I can search for each variable and append it to the data frame constructed above. Due to the large size of the data set, querying can be computationally intensive. Thus, I set up a rudimentary caching system which stores each query result as a more compact .RData file. Then, as subsequent changes are made to the document, these caches are loaded in rather than re-querying the database.

Additionally, many of the parameters are collected repeatedly during each patient’s stay. Because of the vast differences in LoS among patients (from 0 to almost 300 days) and the irregularity of the sampling times, I chose to set aside the longitudinal structure and instead focus on cross-sectional analytics. To this end, I extract the first data point upon admission for each parameter, and the last point before discharge or death. Models built using the initial parameters nearest admission would be the most clinically useful, while I expect that models built using data from near discharge or death will have better performance.

parse_data <- function(con, table, itemid, hadm_list, cache_fn, isCached, keepRange){

  ## Check if data is cached; if not, query the database and save the result
  if(isCached){
    load(cache_fn)  # restores `var_data` from the cache
  } else {
    var_data <- dbGetQuery(con, sprintf("SELECT SUBJECT_ID, HADM_ID, CHARTTIME, VALUENUM, VALUEUOM
                                         FROM %s
                                         WHERE HADM_ID IN (%s)
                                         AND ITEMID IN (%s)",
                                         table,
                                         paste(as.character(hadm_list), collapse=", "),
                                         paste(as.character(itemid), collapse=", ")))
    save(var_data, file=cache_fn)
  }

  # Remove NA and impossible values
  var_data <- na.omit(var_data)
  var_data <- var_data[(var_data$valuenum > keepRange[1]) & (var_data$valuenum < keepRange[2]),]

  # Get first reading
  var_sorted <- var_data[order(var_data$charttime, decreasing=F),]
  var_first <- var_sorted[match(unique(var_sorted$hadm_id), var_sorted$hadm_id),]

  # Get last reading
  var_sorted <- var_data[order(var_data$charttime, decreasing=T),]
  var_last <- var_sorted[match(unique(var_sorted$hadm_id), var_sorted$hadm_id),]

  return(cbind(var_first, var_last))
}
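The sort-then-match pattern used above (order by chart time, then `match()` on the first occurrence of each admission) can be checked on a toy data frame (the values below are invented for illustration):

```r
# match(unique(id), id) returns the row index of each id's FIRST occurrence,
# so sorting by time first yields the earliest (or latest) reading per id.
toy <- data.frame(hadm_id   = c(1, 2, 1, 2, 1),
                  charttime = c(3, 1, 1, 5, 2),
                  valuenum  = c(30, 10, 10, 50, 20))

sorted_asc <- toy[order(toy$charttime), ]
first <- sorted_asc[match(unique(sorted_asc$hadm_id), sorted_asc$hadm_id), ]

sorted_desc <- toy[order(toy$charttime, decreasing = TRUE), ]
last <- sorted_desc[match(unique(sorted_desc$hadm_id), sorted_desc$hadm_id), ]
```

Here `first` holds each admission's earliest reading and `last` its latest, which is exactly what parse_data() returns side by side.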

3.3.1 SIRS Scoring

Next, each of the SIRS parameters is isolated from the larger database. These are:
* Temperature
* Heart rate
* Respiratory rate
* Partial pressure of CO_2_ in arterial blood (PaCO_2_)
* White blood cell count (WBC)


3.3.1.1 Temperature

The first item in the SIRS score table is temperature. This, along with most vitals, is stored in the CHARTEVENTS table. However, entries into this table are coded by an ITEMID code. To parse these codes, I load in the D_ITEMS table, search for the requisite entries, and then use the codes to search CHARTEVENTS. Because the data was collected using two different systems (CareVue and MetaVision), there are multiple ITEMID codes.

d_items <- dbGetQuery(con, 'SELECT * 
                            FROM D_ITEMS')

d_labitems <- dbGetQuery(con, 'SELECT *
                               FROM D_LABITEMS')

kable(d_items[d_items$itemid %in% c(676,677,3654,3655,6643,224027,223762),], caption='Temperature item identifiers of interest') %>%
  kable_styling(bootstrap_options = 'striped')
Temperature item identifiers of interest

| row_id | itemid | label | abbreviation | dbsource | linksto | category | unitname | param_type | conceptid |
|--------|--------|-------|--------------|----------|---------|----------|----------|------------|-----------|
| 627 | 676 | Temperature C | NA | carevue | chartevents | NA | NA | NA | NA |
| 628 | 677 | Temperature C (calc) | NA | carevue | chartevents | NA | NA | NA | NA |
| 2818 | 3654 | Temp Rectal [F] | NA | carevue | chartevents | NA | NA | NA | NA |
| 2819 | 3655 | Temp Skin [C] | NA | carevue | chartevents | NA | NA | NA | NA |
| 4298 | 6643 | Temp Rectal | NA | carevue | chartevents | NA | NA | NA | NA |
| 13042 | 224027 | Skin Temperature | Skin Temp | metavision | chartevents | Skin - Assessment | NA | Text | NA |
| 12758 | 223762 | Temperature Celsius | Temperature C | metavision | chartevents | Routine Vital Signs | °C | Numeric | NA |

Next, the parse_data() function is used to extract temperature data from the data set using the ITEMID codes. Those temperatures with values in Fahrenheit are converted into Celsius. Then, these values are integrated into the data object. The distributions of temperature on admission and discharge/death are plotted, and descriptive statistics are presented below.

# Load data
temperature <- parse_data(con, "CHARTEVENTS", c(676,677, 223761, 223762), hadm_list, '../cache/temperature.RData', T, c(10, 130))

temperature_first <- temperature[,1:5]
temperature_last <- temperature[,6:10]

# Convert F --> C (the Fahrenheit unit string is stored as '?F' in the source data)
f_idx <- temperature_first$valueuom == '?F'
temperature_first$valuenum[f_idx] <- (as.numeric(temperature_first$valuenum[f_idx]) - 32)*5/9

f_idx <- temperature_last$valueuom == '?F'
temperature_last$valuenum[f_idx] <- (as.numeric(temperature_last$valuenum[f_idx]) - 32)*5/9

# Remove values outside range after conversion
temperature_first <- temperature_first[(temperature_first$valuenum > 20) & (temperature_first$valuenum < 50),]
temperature_last <- temperature_last[(temperature_last$valuenum > 20) & (temperature_last$valuenum < 50),]

# Add to data object
data <- merge(data, temperature_first[,!names(temperature_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, temperature_last[,!names(temperature_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'temperature_t1', 'temperature_1', 'temperature_t2', 'temperature_2')

# Plot histograms
data_melt <- melt(data[,c('temperature_1', 'temperature_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) + 
  geom_density(alpha=0.5) +
  scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
  ggtitle('Distribution of Temperature in Septic Patients') +
  ylab('Kernel Density Estimate') + xlab('Temperature (C)')
print(fig)

# Print summary statistics
kable(t(rbind(describe(data$temperature_1), describe(data$temperature_2))),
      caption='Descriptive statistics of temperature',
      col.names=c('Admission', 'Discharge/Death'),
      digits=2) %>%
  kable_styling(bootstrap_options = 'striped')
Descriptive statistics of temperature
Admission Discharge/Death
vars 1.00 1.00
n 4612.00 4607.00
mean 36.83 36.67
sd 1.13 0.84
median 36.78 36.67
trimmed 36.82 36.66
mad 0.99 0.74
min 26.67 30.00
max 40.89 41.00
range 14.22 11.00
skew -0.35 -0.08
kurtosis 3.43 2.82
se 0.02 0.01

After filtering and sorting, a full 4612 of the 4689 patients had a temperature recorded in the data set!


3.3.1.2 Heart Rate

The next item is heart rate. A similar process is used to glean this information.

# Load data
heartrate <- parse_data(con, "CHARTEVENTS", c(211, 220045), hadm_list, '../cache/heartrate.RData', T, c(20, 400))

heartrate_first <- heartrate[,1:5]
heartrate_last <- heartrate[,6:10]

# Add to data object
data <- merge(data, heartrate_first[,!names(heartrate_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, heartrate_last[,!names(heartrate_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'heartrate_t1', 'heartrate_1', 'heartrate_t2', 'heartrate_2')

# Plot histograms
data_melt <- melt(data[,c('heartrate_1', 'heartrate_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) + 
  geom_density(alpha=0.5) +
  scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
  ggtitle('Distribution of Heart Rate in Septic Patients') +
  ylab('Kernel Density Estimate') + xlab('Heart Rate (bpm)')
print(fig)

# Print summary statistics
kable(t(rbind(describe(data$heartrate_1), describe(data$heartrate_2))),
      caption='Descriptive statistics of heart rate',
      col.names=c('Admission', 'Discharge/Death'),
      digits=2) %>%
  kable_styling(bootstrap_options = 'striped')
Descriptive statistics of heart rate
Admission Discharge/Death
vars 1.00 1.00
n 4635.00 4635.00
mean 95.04 88.63
sd 21.28 19.84
median 94.00 87.00
trimmed 94.32 87.65
mad 22.24 19.27
min 21.00 21.00
max 180.00 200.00
range 159.00 179.00
skew 0.32 0.55
kurtosis -0.19 1.03
se 0.31 0.29

Heart rate was recorded for 4635 patients in the data set.


3.3.1.3 Respiratory Rate

The pattern is again repeated for respiratory rate.

# Load data
resprate <- parse_data(con, "CHARTEVENTS", c(618, 220210), hadm_list, '../cache/resprate.RData', T, c(0, 100))

resprate_first <- resprate[,1:5]
resprate_last <- resprate[,6:10]

# Add to data object
data <- merge(data, resprate_first[,!names(resprate_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, resprate_last[,!names(resprate_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'resprate_t1', 'resprate_1', 'resprate_t2', 'resprate_2')

# Plot histograms
data_melt <- melt(data[,c('resprate_1', 'resprate_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) + 
  geom_density(alpha=0.5) +
  scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
  ggtitle('Distribution of Respiratory Rate in Septic Patients') +
  ylab('Kernel Density Estimate') + xlab('Respiratory Rate (breaths/min)')
print(fig)

# Print summary statistics
kable(t(rbind(describe(data$resprate_1), describe(data$resprate_2))),
      caption='Descriptive statistics of respiratory rate',
      col.names=c('Admission', 'Discharge/Death'),
      digits=2) %>%
  kable_styling(bootstrap_options = 'striped')
Descriptive statistics of respiratory rate
Admission Discharge/Death
vars 1.00 1.00
n 4631.00 4631.00
mean 21.22 21.14
sd 6.43 6.56
median 20.00 21.00
trimmed 20.80 20.82
mad 5.93 5.93
min 2.00 2.00
max 65.00 99.00
range 63.00 97.00
skew 0.82 1.02
kurtosis 1.81 5.83
se 0.09 0.10

Respiratory rate was recorded for 4631 patients in the data set.


3.3.1.4 Partial Pressure of Arterial Carbon Dioxide (PaCO_2_)

An alternative to respiratory rate is the partial pressure of CO_2_ in the arterial blood (PaCO_2_).

# Load data
paco2 <- parse_data(con, "LABEVENTS", c(50818), hadm_list, '../cache/paco2.RData', T, c(1, 150))

paco2_first <- paco2[,1:5]
paco2_last <- paco2[,6:10]

# Add to data object
data <- merge(data, paco2_first[,!names(paco2_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, paco2_last[,!names(paco2_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'paco2_t1', 'paco2_1', 'paco2_t2', 'paco2_2')

# Plot histograms
data_melt <- melt(data[,c('paco2_1', 'paco2_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) + 
  geom_density(alpha=0.5) +
  scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
  ggtitle('Distribution of PaCO_2_ in Septic Patients') +
  ylab('Kernel Density Estimate') + xlab('PaCO_2_ (mmHg)')
print(fig)

# Print summary statistics
kable(t(rbind(describe(data$paco2_1), describe(data$paco2_2))),
      caption='Descriptive statistics of PaCO_2_',
      col.names=c('Admission', 'Discharge/Death'),
      digits=2) %>%
  kable_styling(bootstrap_options = 'striped')
Descriptive statistics of PaCO_2_
Admission Discharge/Death
vars 1.00 1.00
n 3856.00 3856.00
mean 41.20 41.55
sd 13.59 12.46
median 39.00 39.00
trimmed 39.67 40.23
mad 10.38 8.90
min 11.00 12.00
max 124.00 141.00
range 113.00 129.00
skew 1.52 1.79
kurtosis 4.05 6.85
se 0.22 0.20

PaCO_2_ was recorded for 3856 patients in the data set.


3.3.1.5 White Blood Cell Count

# Load data
wbccount <- parse_data(con, "LABEVENTS", c(51301), hadm_list, '../cache/wbccount.RData', T, c(0, 100))

wbccount_first <- wbccount[,1:5]
wbccount_last <- wbccount[,6:10]

# Add to data object

data <- merge(data, wbccount_first[,!names(wbccount_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, wbccount_last[,!names(wbccount_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'wbccount_t1', 'wbccount_1', 'wbccount_t2', 'wbccount_2')

# Plot histograms
data_melt <- melt(data[,c('wbccount_1', 'wbccount_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) + 
  geom_density(alpha=0.5) +
  scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
  ggtitle('Distribution of White Blood Cell Count in Septic Patients') +
  ylab('Kernel Density Estimate') + xlab('WBC Count (10^9^/L)')
print(fig)

# Print summary statistics
kable(t(rbind(describe(data$wbccount_1), describe(data$wbccount_2))),
      caption='Descriptive statistics of WBC count',
      col.names=c('Admission', 'Discharge/Death'),
      digits=2) %>%
  kable_styling(bootstrap_options = 'striped')
Descriptive statistics of WBC count
Admission Discharge/Death
vars 1.00 1.00
n 4683.00 4683.00
mean 14.02 11.97
sd 9.45 8.96
median 12.30 9.70
trimmed 12.97 10.52
mad 7.26 5.04
min 0.10 0.10
max 99.80 95.80
range 99.70 95.70
skew 2.23 2.98
kurtosis 11.04 14.53
se 0.14 0.13

After extracting each of the five SIRS variables, we are left with 3787 subjects with all five variables collected. Below, these are combined into SIRS scores.


3.3.1.6 SIRS Scoring

Data is scored iteratively based on the SIRS parameters listed above.

# preallocate SIRS scores
data$sirs_1 <- rep(0, nrow(data))
data$sirs_2 <- rep(0, nrow(data))

# iteratively score SIRS
for(i in which(complete.cases(data))){

  if(data[i, 'temperature_1'] < 36 | data[i, 'temperature_1'] > 38){data[i, 'sirs_1'] <- data[i, 'sirs_1']+1}
  if(data[i, 'temperature_2'] < 36 | data[i, 'temperature_2'] > 38){data[i, 'sirs_2'] <- data[i, 'sirs_2']+1}

  if(data[i, 'heartrate_1'] > 90){data[i, 'sirs_1'] <- data[i, 'sirs_1']+1}
  if(data[i, 'heartrate_2'] > 90){data[i, 'sirs_2'] <- data[i, 'sirs_2']+1}

  if(data[i, 'resprate_1'] > 20){data[i, 'sirs_1'] <- data[i, 'sirs_1']+1}
  if(data[i, 'resprate_2'] > 20){data[i, 'sirs_2'] <- data[i, 'sirs_2']+1}

  if(data[i, 'paco2_1'] < 32){data[i, 'sirs_1'] <- data[i, 'sirs_1']+1}
  if(data[i, 'paco2_2'] < 32){data[i, 'sirs_2'] <- data[i, 'sirs_2']+1}

  if(data[i, 'wbccount_1'] > 12 | data[i, 'wbccount_1'] < 4){data[i, 'sirs_1'] <- data[i, 'sirs_1']+1}
  if(data[i, 'wbccount_2'] > 12 | data[i, 'wbccount_2'] < 4){data[i, 'sirs_2'] <- data[i, 'sirs_2']+1}
}

# turn non-complete cases to NA
data[!complete.cases(data), c('sirs_1', 'sirs_2')] <- NA
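The row-by-row loop above can also be written without iteration, since logical vectors sum directly in R. A vectorized sketch on a toy data frame (the numbers are invented for illustration):

```r
# Vectorized SIRS scoring: each comparison yields a logical vector, and
# adding the logicals counts how many criteria each row meets.
toy <- data.frame(temperature_1 = c(39.0, 36.5),
                  heartrate_1   = c(120, 80),
                  resprate_1    = c(24, 16),
                  paco2_1       = c(30, 40),
                  wbccount_1    = c(15, 8))

toy$sirs_1 <- with(toy,
  (temperature_1 < 36 | temperature_1 > 38) +
  (heartrate_1 > 90) +
  (resprate_1 > 20) +
  (paco2_1 < 32) +
  (wbccount_1 > 12 | wbccount_1 < 4))
```

A further advantage of this form is that any NA in a row propagates to an NA score automatically, without the separate complete-case bookkeeping.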

# display a sample of the dataset so far
kable(head(data), caption='Sample of the extracted dataset so far') %>%
  kable_styling(bootstrap_options = 'striped')
Sample of the extracted dataset so far
hadm_id subject_id admittime dischtime death icd9_code severity temperature_t1 temperature_1 temperature_t2 temperature_2 heartrate_t1 heartrate_1 heartrate_t2 heartrate_2 resprate_t1 resprate_1 resprate_t2 resprate_2 paco2_t1 paco2_1 paco2_t2 paco2_2 wbccount_t1 wbccount_1 wbccount_t2 wbccount_2 sirs_1 sirs_2
100028 53456 2142-12-23 18:06:00 2142-12-30 15:45:00 FALSE 99591 1 2142-12-23 37.11111 2142-12-25 37.94444 2142-12-23 87 2142-12-25 73 2142-12-23 13 2142-12-25 18 2142-12-25 48 2142-12-25 48 2142-12-23 31.3 2142-12-30 11.5 1 0
100074 26885 2176-04-09 21:03:00 2176-04-12 14:30:00 TRUE 78552 3 2176-04-10 38.40000 2176-04-12 37.50000 2176-04-10 92 2176-04-12 106 2176-04-10 25 2176-04-12 22 2176-04-09 28 2176-04-12 39 2176-04-09 19.2 2176-04-12 48.4 5 3
100104 73946 2201-06-21 19:08:00 2201-07-03 16:15:00 FALSE 78552 3 2201-06-21 37.77778 2201-06-29 36.88889 2201-06-21 123 2201-06-29 114 2201-06-21 24 2201-06-29 30 2201-06-21 37 2201-06-29 41 2201-06-21 7.4 2201-07-03 18.8 2 3
100117 14297 2166-04-30 19:05:00 2166-05-09 15:56:00 FALSE 99592 2 2166-04-30 37.61111 2166-05-04 37.77778 2166-04-30 110 2166-05-04 127 2166-04-30 28 2166-05-04 25 2166-05-01 37 2166-05-01 37 2166-04-30 23.8 2166-05-09 10.9 3 2
100118 53749 2198-09-07 15:34:00 2198-09-11 18:24:00 FALSE 78552 3 2198-09-07 36.33333 2198-09-09 36.72222 2198-09-07 69 2198-09-09 64 2198-09-07 14 2198-09-09 18 NA NA NA NA 2198-09-07 19.1 2198-09-11 9.2 NA NA
100141 61215 2173-10-01 01:25:00 2173-10-20 12:15:00 FALSE 99591 1 2173-10-01 37.50000 2173-10-10 36.11111 2173-10-01 87 2173-10-10 83 2173-10-01 17 2173-10-10 17 2173-09-30 46 2173-10-07 54 2173-09-30 11.5 2173-10-12 10.2 0 0

3.3.2 Alternate quantitative predictors

Many of the parameters in the SOFA score were not regularly collected, nor were they straightforward to extract. Even so, I gathered two markers of organ dysfunction: bilirubin, an index of liver function, and creatinine, which reflects kidney function. Other potential predictors, such as the Glasgow Coma Score and mean arterial pressure, would have been valuable but were so fragmented within the data set that they could not be reliably isolated.


3.3.2.1 Bilirubin

# Load data
bilirubin <- parse_data(con, "LABEVENTS", c(50885), hadm_list, '../cache/bilirubin.RData', T, c(0, 100))

bilirubin_first <- bilirubin[,1:5]
bilirubin_last <- bilirubin[,6:10]

# Add to data object
data <- merge(data, bilirubin_first[,!names(bilirubin_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, bilirubin_last[,!names(bilirubin_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'bilirubin_t1', 'bilirubin_1', 'bilirubin_t2', 'bilirubin_2')

# Plot histograms
data_melt <- melt(data[,c('bilirubin_1', 'bilirubin_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) + 
  geom_density(alpha=0.5) +
  scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
  ggtitle('Distribution of Bilirubin Levels in Septic Patients') +
  ylab('Kernel Density Estimate') + xlab('Bilirubin (mg/dL)')
print(fig)

# Print summary statistics
kable(t(rbind(describe(data$bilirubin_1), describe(data$bilirubin_2))),
      caption='Descriptive statistics of bilirubin',
      col.names=c('Admission', 'Discharge/Death'),
      digits=2) %>%
  kable_styling(bootstrap_options = 'striped')
Descriptive statistics of bilirubin
Admission Discharge/Death
vars 1.00 1.00
n 4336.00 4336.00
mean 2.09 2.68
sd 4.52 6.53
median 0.70 0.60
trimmed 1.03 1.01
mad 0.59 0.44
min 0.10 0.10
max 60.50 66.00
range 60.40 65.90
skew 5.05 4.45
kurtosis 31.95 22.67
se 0.07 0.10

3.3.2.2 Creatinine

# Load data
creatinine <- parse_data(con, "LABEVENTS", c(50912), hadm_list, '../cache/creatinine.RData', T, c(0, 100))

creatinine_first <- creatinine[,1:5]
creatinine_last <- creatinine[,6:10]

# Add to data object
data <- merge(data, creatinine_first[,!names(creatinine_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, creatinine_last[,!names(creatinine_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'creatinine_t1', 'creatinine_1', 'creatinine_t2', 'creatinine_2')

# Plot histograms
data_melt <- melt(data[,c('creatinine_1', 'creatinine_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) + 
  geom_density(alpha=0.5) +
  scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
  ggtitle('Distribution of Creatinine Levels in Septic Patients') +
  ylab('Kernel Density Estimate') + xlab('Creatinine (mg/dL)')
print(fig)

# Print summary statistics
kable(t(rbind(describe(data$creatinine_1), describe(data$creatinine_2))),
      caption='Descriptive statistics of creatinine',
      col.names=c('Admission', 'Discharge/Death'),
      digits=2) %>%
  kable_styling(bootstrap_options = 'striped')
Descriptive statistics of creatinine
Admission Discharge/Death
vars 1.00 1.00
n 4682.00 4682.00
mean 1.98 1.71
sd 1.77 1.62
median 1.40 1.10
trimmed 1.62 1.38
mad 0.89 0.74
min 0.10 0.10
max 17.70 14.60
range 17.60 14.50
skew 2.83 2.47
kurtosis 11.15 7.91
se 0.03 0.02

3.3.2.3 Lactate

While not directly part of the SIRS or SOFA scoring rubrics, lactate is an important marker of metabolic dysfunction, especially in shock (both septic and otherwise).

# Load data
lactate <- parse_data(con, "LABEVENTS", c(50813), hadm_list, '../cache/lactate.RData', T, c(0, 100))

lactate_first <- lactate[,1:5]
lactate_last <- lactate[,6:10]

# Add to data object
data <- merge(data, lactate_first[,!names(lactate_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, lactate_last[,!names(lactate_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'lactate_t1', 'lactate_1', 'lactate_t2', 'lactate_2')

# Plot histograms
data_melt <- melt(data[,c('lactate_1', 'lactate_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) + 
  geom_density(alpha=0.5) +
  scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
  ggtitle('Distribution of Lactate Levels in Septic Patients') +
  ylab('Kernel Density Estimate') + xlab('Lactate (mmol/L)')
print(fig)

# Print summary statistics
kable(t(rbind(describe(data$lactate_1), describe(data$lactate_2))),
      caption='Descriptive statistics of lactate',
      col.names=c('Admission', 'Discharge/Death'),
      digits=2) %>%
  kable_styling(bootstrap_options = 'striped')
Descriptive statistics of lactate
Admission Discharge/Death
vars 1.00 1.00
n 4539.00 4539.00
mean 2.89 2.61
sd 2.30 2.99
median 2.20 1.60
trimmed 2.46 1.92
mad 1.33 0.89
min 0.30 0.07
max 24.20 28.80
range 23.90 28.73
skew 2.67 3.58
kurtosis 10.64 16.28
se 0.03 0.04

3.3.2.4 Potassium

potassium <- parse_data(con, "LABEVENTS", c(50971), hadm_list, '../cache/potassium.RData', T, c(0, 100))

potassium_first <- potassium[,1:5]
potassium_last <- potassium[,6:10]

# Add to data object
data <- merge(data, potassium_first[,!names(potassium_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, potassium_last[,!names(potassium_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'potassium_t1', 'potassium_1', 'potassium_t2', 'potassium_2')

# Plot histograms
data_melt <- melt(data[,c('potassium_1', 'potassium_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) + 
  geom_density(alpha=0.5) +
  scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
  ggtitle('Distribution of Potassium Level in Septic Patients') +
  ylab('Kernel Density Estimate') + xlab('Potassium (mEq/L)')
print(fig)

# Print summary statistics
kable(t(rbind(describe(data$potassium_1), describe(data$potassium_2))),
      caption='Descriptive statistics of potassium',
      col.names=c('Admission', 'Discharge/Death'),
      digits=2) %>%
  kable_styling(bootstrap_options = 'striped')
Descriptive statistics of potassium
Admission Discharge/Death
vars 1.00 1.00
n 4680.00 4680.00
mean 4.35 4.18
sd 0.93 0.68
median 4.20 4.10
trimmed 4.26 4.11
mad 0.74 0.59
min 1.80 2.00
max 9.10 9.10
range 7.30 7.10
skew 1.25 1.39
kurtosis 2.93 4.35
se 0.01 0.01

3.3.2.5 Calcium

calcium <- parse_data(con, "LABEVENTS", c(50893), hadm_list, '../cache/calcium.RData', T, c(0, 100))

calcium_first <- calcium[,1:5]
calcium_last <- calcium[,6:10]

# Add to data object
data <- merge(data, calcium_first[,!names(calcium_first) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
data <- merge(data, calcium_last[,!names(calcium_last) %in% 'valueuom'], by=c('hadm_id', 'subject_id'), all=T)
colnames(data) <- c(colnames(data)[1:(length(data)-4)], 'calcium_t1', 'calcium_1', 'calcium_t2', 'calcium_2')

# Plot histograms
data_melt <- melt(data[,c('calcium_1', 'calcium_2')], na.rm=TRUE)
fig <- ggplot(data=data_melt, aes(x=value, fill=variable)) + 
  geom_density(alpha=0.5) +
  scale_fill_discrete(name='Time', labels=c('Admission', 'Discharge/Death')) +
  ggtitle('Distribution of Calcium Level in Septic Patients') +
  ylab('Kernel Density Estimate') + xlab('Calcium (mg/dL)')
print(fig)

# Print summary statistics
kable(t(rbind(describe(data$calcium_1), describe(data$calcium_2))),
      caption='Descriptive statistics of calcium',
      col.names=c('Admission', 'Discharge/Death'),
      digits=2) %>%
  kable_styling(bootstrap_options = 'striped')
Descriptive statistics of calcium
Admission Discharge/Death
vars 1.00 1.00
n 4656.00 4656.00
mean 8.18 8.38
sd 1.04 0.91
median 8.20 8.40
trimmed 8.18 8.35
mad 0.89 0.74
min 2.90 3.80
max 14.60 19.90
range 11.70 16.10
skew 0.05 1.23
kurtosis 1.98 12.07
se 0.02 0.01

3.4 Free text preprocessing

The MIMIC-III data set also contains a wealth of free-text data. In light of this, I will practice some text mining (TM) and natural language processing (NLP) techniques to gather potentially relevant data in addition to the physiological data collected above.

To start with a clean slate, I again remove the intermediate objects that were generated while assembling the physiologic data (which is stored in the object data).

rm(list=setdiff(ls(), c('hadm_list', 'con', 'data', 'plotHistogram', 'plotPieChart', 'plotBarChart')))

Then, I read in all of the notes from the NOTEEVENTS table. Because this is such a large data set, the following code block is not evaluated.

notes <- dbGetQuery(con, sprintf("SELECT *
                                  FROM NOTEEVENTS
                                  WHERE HADM_ID IN (%s)",
                                  paste(as.character(hadm_list), collapse=", ")))

save(file='../cache/notes.RData', notes)

Instead, the cache file is read in.

load('../cache/notes.RData')

3.4.1 Pooled Corpus

Virtually every patient has a multitude of different notes in their chart, from radiology and ultrasonography studies to nursing notes and discharge instructions. My initial thought was to look at the admission and discharge notes separately, as with the physiologic data above; however, there did not appear to be many “admission” notes present. Instead, I opted to pool all of the nursing and physician notes for a given patient into a single document in order to maximize the amount of text fed into the TM/NLP model. I also included the discharge summary, since it ensures that every patient has at least one note.

pooled_notes <- data.frame(doc_id=rep(NA, length(hadm_list)),
                           text=rep(NA, length(hadm_list)))

for(i in 1:length(hadm_list)){
  pooled_notes[i, 'doc_id'] <- hadm_list[i]
  pooled_notes[i, 'text'] <- paste(notes[notes$hadm_id == hadm_list[i] &
                                         notes$category %in% 
                                           c('Nursing/other', 'Nursing', 'Physician', 'Discharge summary'), 'text'],
                                   collapse=" ")
}
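The pooling loop above can equivalently be expressed with aggregate(), which pastes all kept notes per admission in one pass. A sketch on a toy notes table (the rows below are invented; the real `notes` table has the same hadm_id, category, and text columns). One behavioral difference worth noting: aggregate() drops admissions with no matching notes, whereas the loop produces an empty string for them.

```r
# Vectorized pooling of notes per admission (toy data for illustration)
notes_toy <- data.frame(hadm_id  = c(100, 100, 101),
                        category = c('Nursing', 'Physician', 'Discharge summary'),
                        text     = c('alpha', 'beta', 'gamma'),
                        stringsAsFactors = FALSE)

keep <- notes_toy$category %in% c('Nursing/other', 'Nursing', 'Physician', 'Discharge summary')
pooled <- aggregate(text ~ hadm_id, data = notes_toy[keep, ],
                    FUN = paste, collapse = " ")
```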

We can visualize what the combined notes from a single patient look like. I display the first 500 characters. As can be seen, this text will require a significant amount of cleaning later on to remove the many numbers, punctuation marks, and white space.

print(substr( pooled_notes[1,'text'], 1, 500 ))
## [1] "Admission Date:  [**2209-7-31**]              Discharge Date:   [**2209-8-20**]\n\nDate of Birth:  [**2146-1-30**]             Sex:   M\n\nService: MEDICINE\n\nAllergies:\nAspirin\n\nAttending:[**First Name3 (LF) 2195**]\nChief Complaint:\n\"sepsis, pneumonia\"\n\n\nMajor Surgical or Invasive Procedure:\nnone\n\nHistory of Present Illness:\nThis is a 63 yo Cantonese-only speaking male with h/o type 2\ndiabetes, hypertension, medullary sponge kidney (with chronic\nrenal insufficiency) who now presents minimally respon"

The VCorpus() function creates a “volatile corpus,” which is kept in memory and is thus easier to create and manipulate when using large datasets.

corpus <- VCorpus(DataframeSource(pooled_notes))

3.4.2 Corpus cleaning

Next, we can convert all of the text within the corpus to lower-case. Note that due to the large size of the corpus document, the steps in this section were evaluated once, cached, and then the cache loaded at the end.

corpus <- tm_map(corpus, content_transformer(tolower))

When cleaning, we can also remove stop words - common words such as “and” and “the” which may skew any frequency-based analytics attempted later on. A list of common stop words:

print(stopwords('english'))

They can be removed from the corpus using the same tm_map() function used for the lower-case conversion above. Punctuation, numbers, and extra white space are also removed.

corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

As a last cleaning step, we can “stem” the corpus. Stemming refers to isolating the main root (or “stem”) of a word, ignoring common suffixes. For example, the word “analytics” might be considered equal to “analysis” depending on the algorithm used, since both share the same “analy” root. For the purposes of this assignment, I use Porter’s algorithm as implemented in the SnowballC package.
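To illustrate the idea only (this is not Porter’s algorithm, which SnowballC implements with many more rules), a naive suffix-stripper in base R:

```r
# Naive illustration of stemming: strip a few common English suffixes.
# Porter's algorithm, used via stemDocument() below, is far more sophisticated.
naive_stem <- function(words) sub("(ing|ed|ly|s)$", "", words)

naive_stem(c("admissions", "discharged", "screening"))
```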

corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, PlainTextDocument)
save(corpus, file='../cache/corpus.RData')

As a reminder, the above steps were evaluated once and then the results saved. Evaluation resumed starting with this next code chunk:

load('../cache/corpus.RData')

print(substr( corpus[[1]], 1, 500 ))
## [1] "admiss date discharg date date birth sex m servic medicin allergi aspirin attendingfirst name lf chief complaint sepsi pneumonia major surgic invas procedur none histori present ill yo cantones speak male ho type diabet hypertens medullari spong kidney chronic renal insuffici now present minim respons letharg home patient seen weekend famili unusu went check today found fetal posit bed call em blood glucos s patient co ill prior found minim respons ed initi vs left ij place access pt fever ed pt"

3.4.3 Document term matrix

Now that the documents within the corpus have been cleaned, I create a Document Term Matrix (DTM). A DTM is a \(n\)-by-\(w\) matrix with \(n\)-rows representing \(n\) documents and \(w\) columns representing unique words. The individual elements represent the frequency of each word within the given document. A DTM can be trimmed to isolate only those words that occur a given number of times in the Corpus, thus ignoring specific words that might only appear a couple times in the entire data set.
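The structure of a DTM can be reproduced by hand for a two-document toy corpus (the words below are chosen for illustration):

```r
# Building a tiny document-term matrix manually with base R:
# rows are documents, columns are unique terms, entries are term frequencies.
docs <- list(d1 = c("sepsis", "fever", "sepsis"),
             d2 = c("fever", "lactate"))
terms <- sort(unique(unlist(docs)))
dtm_toy <- t(sapply(docs, function(d) table(factor(d, levels = terms))))
```

DocumentTermMatrix() produces the same logical structure, but stored sparsely so that the mostly-zero entries cost no memory.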

dtm <- DocumentTermMatrix(corpus)
save(dtm, file='../cache/dtm.RData')

Additionally, since we cleared the metadata of the corpus when we converted it to plain-text, we can label it again using the hadm_id.

load('../cache/dtm.RData')
dtm$dimnames$Docs <- as.character(hadm_list)
print(dtm)
## <<DocumentTermMatrix (documents: 4689, terms: 172797)>>
## Non-/sparse entries: 4310549/805934584
## Sparsity           : 99%
## Maximal term length: 81
## Weighting          : term frequency (tf)

This summary of the DTM informs us that there are 172797 unique terms in the corpus. Additionally, the sparsity of the DTM tells us how many “empty” elements are in the array. In this case, such high sparsity means that a large number of words occur very infrequently. Terms with sparsity above a given cutoff can be removed using the function removeSparseTerms().

dtm_sparse <- removeSparseTerms(dtm, 0.9)
dtm_matrix <- data.frame(as.matrix(dtm_sparse))

By removing terms which have a sparsity > 0.9, I only retain those words that occur in > 10% of the documents. This effectively reduces our feature set from 172797 terms to 2081.
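The same filter can be written out on a plain matrix to make the threshold explicit. This is a sketch of removeSparseTerms()'s behavior, not its implementation (the tm function keeps a term when its sparsity is strictly below the threshold), using a toy matrix of my own:

```r
# Drop columns (terms) whose document frequency is too low.
# With sparse = 0.9, a term is kept only if it appears in > 10% of documents.
m <- matrix(0, nrow = 20, ncol = 2,
            dimnames = list(NULL, c("rare", "common")))
m[1, "rare"]    <- 1   # "rare" appears in 1 of 20 documents (5%)
m[1:15, "common"] <- 1 # "common" appears in 15 of 20 documents (75%)

sparse <- 0.9
doc_freq <- colSums(m > 0) / nrow(m)      # document frequency of each term
m_dense <- m[, doc_freq > (1 - sparse), drop = FALSE]
```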

After creating and simplifying the document term matrix, we can visualize the 100 most common words using a word cloud.

wordcloud(corpus, max.words=100, scale = c(3, 0.2), random.order=F, colors=brewer.pal(6, "Spectral"), rot.per=0.35)


3.5 Testing and Training Datasets

After both the physiological and text-derived predictors have been isolated, I split the data into training and testing data sets. Before splitting, I first separate the physiological predictors by time points (phy1_data and phy2_data). Then, I add the death column into the text-derived predictors. Finally, I pool the two datasets to create grp1_data and grp2_data.

# Clean slate
rm(list=setdiff(ls(), c('hadm_list', 'con', 'data', 'dtm_matrix', 'plotHistogram', 'plotPieChart', 'plotBarChart')))

# turn death into a factor
data$death = as.factor(data$death)

# Extract physiologic data collected near admission
phy1_data <- data[,c('hadm_id', 'death', 'severity', 'temperature_1', 'heartrate_1', 'resprate_1', 'paco2_1', 'wbccount_1', 'sirs_1', 'bilirubin_1', 'creatinine_1', 'lactate_1', 'potassium_1', 'calcium_1')]

# Extract physiologic data data collected near discharge
phy2_data <- data[,c('hadm_id', 'death', 'severity', 'temperature_2', 'heartrate_2', 'resprate_2', 'paco2_2', 'wbccount_2', 'sirs_2', 'bilirubin_2', 'creatinine_2', 'lactate_2', 'potassium_2', 'calcium_2')]

# Add hadm_id and death to dtm_matrix
dtm_matrix$hadm_id = row.names(dtm_matrix)
dtm_matrix <- merge(dtm_matrix[,!names(dtm_matrix) == 'death'], data[,c('hadm_id', 'death')], by='hadm_id', all.x=T)

# Merge admission/discharge physiologic data to the dtm matrix
grp1_data <- merge(phy1_data, dtm_matrix, by=c('hadm_id', 'death'), all=T)
grp2_data <- merge(phy2_data, dtm_matrix, by=c('hadm_id', 'death'), all=T)

Since the number of subjects included in the cohort is so large, I chose a 75/25% training/testing split. First, the list of admission identifiers (hadm_list) is subset using the sample() function. Then, the physiologic data (data) and the text-derived data (dtm_matrix) are subset based on the admission identifiers.

A brief summary of the different constructed datasets:

  • phy1: physiologic (quantitative) data near hospital admission
  • phy2: physiologic (quantitative) data near discharge or death
  • text: text-derived word frequency data from the DTM.
  • grp1: grouped physiologic (near admission) and text-derived data
  • grp2: grouped physiologic (near discharge) and text-derived data

subset_indices <- sample(length(hadm_list), 0.25*length(hadm_list))
hadm_test <- hadm_list[subset_indices]
hadm_train <- hadm_list[-subset_indices]

phy1_test <-  phy1_data[phy1_data$hadm_id %in% hadm_test,]
phy1_train <- phy1_data[phy1_data$hadm_id %in% hadm_train,]

phy2_test <-  phy2_data[phy2_data$hadm_id %in% hadm_test,]
phy2_train <- phy2_data[phy2_data$hadm_id %in% hadm_train,]

text_test  <- dtm_matrix[dtm_matrix$hadm_id %in% hadm_test,]
text_train <- dtm_matrix[dtm_matrix$hadm_id %in% hadm_train,]

grp1_test <-  grp1_data[grp1_data$hadm_id %in% hadm_test,]
grp1_train <- grp1_data[grp1_data$hadm_id %in% hadm_train,]

grp2_test <-  grp2_data[grp2_data$hadm_id %in% hadm_test,]
grp2_train <- grp2_data[grp2_data$hadm_id %in% hadm_train,]

4 Results

To recap, in the previous section I built the SQL database, isolated the target population of sepsis patients, combed the database to collect as many physiologic parameters from the SIRS & SOFA scoring system as was feasible, and used a variety of free-text notes to generate a Document Term Matrix containing word frequency in the physician and nursing notes. In this section I will attempt to use these variables to classify subjects by survival, thus framing the outcome as a binary classification model.

I chose to explore two types of classification models: recursive classification via decision tree and Naive Bayesian classification. These two methods were chosen due to their seemingly opposing nature: while decision trees select variables which allow the highest information gain in order to split (DSPA Lecture 8), the Naive Bayes model assumes equal importance of each variable but is often used as a “powerful text classifier” (DSPA Lecture 7).


4.1 Decision tree classification

Because of the large feature set derived from the free-text data, I chose to first evaluate the performance of decision tree models, as they may provide insight into which terms from the DTM are most useful in differentiating between subjects.

Decision tree models are also an attractive starting point, as the resulting model is a human-readable algorithm (tree) that is similar to many of the goal-directed clinical decision algorithms currently in place (for example, the adult advanced cardiovascular life support algorithm for cardiac arrest described by Neumar et al., 2010).

While decision tree models work through recursive partitioning - i.e. splitting aggregated data into smaller and smaller groups based on some optimization function - there are many different methods for doing so.

I chose to use the rpart package because it implements the underlying CART (Classification And Regression Tree) algorithm. This is a supervised learning algorithm which is limited to making binary splits (Therneau & Atkinson, 2018). This is in contrast to other packages such as party, which use the CHAID (Chi-square Automatic Interaction Detector) algorithm.
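The core CART step, choosing the binary split that most reduces node impurity, can be sketched in a few lines of base R. This is a simplified illustration (one numeric predictor, Gini impurity, splits at observed values rather than between them, and no recursion or pruning), not the full rpart machinery; the lactate values below are hypothetical:

```r
# Gini impurity of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Find the cutpoint 'ct' whose split x <= ct minimizes the
# size-weighted Gini impurity of the two child nodes
best_split <- function(x, y) {
  cuts <- head(sort(unique(x)), -1)   # can't split above the largest value
  score <- sapply(cuts, function(ct) {
    left  <- y[x <= ct]
    right <- y[x >  ct]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  cuts[which.min(score)]
}

# Hypothetical example: mortality separates cleanly around lactate ~ 4
lactate <- c(1.1, 1.8, 2.0, 2.3, 4.5, 5.0, 6.2, 7.9)
death   <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
best_split(lactate, death)   # 2.3: both child nodes are pure after this split
```

rpart repeats this search over every candidate predictor at every node, then prunes the resulting tree using a cross-validated complexity penalty.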


4.1.1 Physiologic predictors

The first step is to investigate the physiologic predictors. I remove hadm_id and severity from the model, then create two rpart decision trees.

First, I input the physiologic data collected nearest admission. Recall that this data represents the first available result during a patient’s stay. Models which can predict mortality at this stage would be the most useful; however, any signal that may be present is likely at its weakest, since much can change between the test results at this stage and those at the end of a patient’s stay.

The model is built using the rpart() function, and the tree is plotted. The constructed model is then applied to the testing dataset using the predict() function. Note that in the visualization of the tree, FALSE indicates that the patient survived to discharge and TRUE indicates that the patient is deceased.

phy1_rpart <- rpart('death~.', data=phy1_train[, -c(1, 3)])
fancyRpartPlot(phy1_rpart)

phy1_rpart_pred <- predict(phy1_rpart, phy1_test, type='class')
cm <- confusionMatrix(phy1_rpart_pred, phy1_test$death)
print(cm)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction FALSE TRUE
##      FALSE   694  320
##      TRUE     77   81
##
##                Accuracy : 0.6613
##                  95% CI : (0.6334, 0.6884)
##     No Information Rate : 0.6578
##     P-Value [Acc > NIR] : 0.4159
##
##                   Kappa : 0.1195
##  Mcnemar's Test P-Value : <2e-16
##
##             Sensitivity : 0.9001
##             Specificity : 0.2020
##          Pos Pred Value : 0.6844
##          Neg Pred Value : 0.5127
##              Prevalence : 0.6578
##          Detection Rate : 0.5922
##    Detection Prevalence : 0.8652
##       Balanced Accuracy : 0.5511
##
##        'Positive' Class : FALSE
## 

Based on this model, the accuracy is reported as 0.66 with a sensitivity of 0.90 and a specificity of 0.20. Considering the model is based solely on lactate and temperature, this outcome seems reasonable.
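To make these figures concrete, the headline statistics reported by caret can be recomputed directly from the four cell counts above (recall that caret is treating FALSE, i.e. survival, as the 'positive' class):

```r
# Cell counts from the confusion matrix above ('positive' class = FALSE = survived)
tp <- 694  # predicted survived, actually survived
fn <- 77   # predicted died,     actually survived
fp <- 320  # predicted survived, actually died
tn <- 81   # predicted died,     actually died
n  <- tp + fn + fp + tn

accuracy    <- (tp + tn) / n    # 0.6613
sensitivity <- tp / (tp + fn)   # 0.9001 -- survivors correctly identified
specificity <- tn / (tn + fp)   # 0.2020 -- deaths correctly identified
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```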

We can also create a model for those values recorded shortly before discharge or death.

phy2_rpart <- rpart('death~.', data=phy2_train[, -c(1, 3)])
fancyRpartPlot(phy2_rpart)

phy2_rpart_pred <- predict(phy2_rpart, phy2_test, type='class')
cm <- confusionMatrix(phy2_rpart_pred, phy2_test$death)
print(cm)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction FALSE TRUE
##      FALSE   664  118
##      TRUE    107  283
##
##                Accuracy : 0.808
##                  95% CI : (0.7843, 0.8302)
##     No Information Rate : 0.6578
##     P-Value [Acc > NIR] : <2e-16
##
##                   Kappa : 0.5707
##  Mcnemar's Test P-Value : 0.505
##
##             Sensitivity : 0.8612
##             Specificity : 0.7057
##          Pos Pred Value : 0.8491
##          Neg Pred Value : 0.7256
##              Prevalence : 0.6578
##          Detection Rate : 0.5666
##    Detection Prevalence : 0.6672
##       Balanced Accuracy : 0.7835
##
##        'Positive' Class : FALSE
## 

As might be expected, this model performs much better, with a reported accuracy of 0.81! It is considerably more complicated, using many more predictors across a number of splits. However, since it relies on values recorded shortly before discharge or death, this model would be of limited clinical usefulness.


4.1.2 Text-derived predictors

We can also try to construct an rpart model based solely on the features derived from the collected notes.

text_rpart <- rpart('death~.', data=text_train[, -c(1, 3)])
fancyRpartPlot(text_rpart)

text_rpart_pred <- predict(text_rpart, text_test, type='class')
cm <- confusionMatrix(text_rpart_pred, text_test$death)
print(cm)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction FALSE TRUE
##      FALSE   748   21
##      TRUE     23  380
##
##                Accuracy : 0.9625
##                  95% CI : (0.9499, 0.9726)
##     No Information Rate : 0.6578
##     P-Value [Acc > NIR] : <2e-16
##
##                   Kappa : 0.9167
##  Mcnemar's Test P-Value : 0.8802
##
##             Sensitivity : 0.9702
##             Specificity : 0.9476
##          Pos Pred Value : 0.9727
##          Neg Pred Value : 0.9429
##              Prevalence : 0.6578
##          Detection Rate : 0.6382
##    Detection Prevalence : 0.6561
##       Balanced Accuracy : 0.9589
##
##        'Positive' Class : FALSE
## 

At first glance, the text-based model appears to perform spectacularly, with an accuracy of 0.96! However, when looking at the resulting tree, we see that there are several confounders. For example, the presence of the term expir (likely stemmed from some form of expire) is associated with poor outcome, while discharg and followup are associated with survival. This clearly isn’t the intended outcome.

To remedy this issue, I repeated the DTM processing above without including the discharge notes. This will be the data used for the remainder of the assignment.

rm(list=c('text_test', 'text_train', 'grp1_test', 'grp1_train', 'grp2_test', 'grp2_train'))
load('../cache/test_train_nodischarge.RData')

text_rpart <- rpart('death~.', data=text_train[, -c(1, 3)])
fancyRpartPlot(text_rpart)

text_rpart_pred <- predict(text_rpart, text_test, type='class')
cm <- confusionMatrix(text_rpart_pred, text_test$death)
print(cm)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction FALSE TRUE
##      FALSE   730  197
##      TRUE     41  204
##
##                Accuracy : 0.7969
##                  95% CI : (0.7727, 0.8196)
##     No Information Rate : 0.6578
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.5025
##  Mcnemar's Test P-Value : < 2.2e-16
##
##             Sensitivity : 0.9468
##             Specificity : 0.5087
##          Pos Pred Value : 0.7875
##          Neg Pred Value : 0.8327
##              Prevalence : 0.6578
##          Detection Rate : 0.6229
##    Detection Prevalence : 0.7910
##       Balanced Accuracy : 0.7278
##
##        'Positive' Class : FALSE
## 

After removing the discharge notes, the model still performs fairly well, with an accuracy of 0.80 (sensitivity = 0.95, specificity = 0.51). Two of the branches that I found especially interesting were “vasopressin” and “chair”: having vasopressin (a drug often used in sepsis to treat hypotension) in the chart was associated with worse outcomes, while having “chair” in the note was associated with increased survival, presumably indicating that strictly bedridden patients do not fare as well.


4.1.3 Combined model

The combined model - physiological and text-based predictors - can also be entered into the rpart decision tree algorithm. First, using the physiologic data from admission (the text-based corpus remains the same):

grp1_rpart <- rpart('death~.', data=grp1_train[, -c(1, 3)])
fancyRpartPlot(grp1_rpart)

grp1_rpart_pred <- predict(grp1_rpart, grp1_test, type='class')
cm <- confusionMatrix(grp1_rpart_pred, grp1_test$death)
print(cm)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction FALSE TRUE
##      FALSE   705  167
##      TRUE     66  234
##
##                Accuracy : 0.8012
##                  95% CI : (0.7772, 0.8237)
##     No Information Rate : 0.6578
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.53
##  Mcnemar's Test P-Value : 5.707e-11
##
##             Sensitivity : 0.9144
##             Specificity : 0.5835
##          Pos Pred Value : 0.8085
##          Neg Pred Value : 0.7800
##              Prevalence : 0.6578
##          Detection Rate : 0.6015
##    Detection Prevalence : 0.7440
##       Balanced Accuracy : 0.7490
##
##        'Positive' Class : FALSE
## 

The accuracy of this combined model is 0.80, similar to the text-based model. However, several important physiologic predictors filter in, including lactate and potassium. Additionally, the sensitivity and specificity of this model are a bit more balanced, at 0.91 and 0.58 respectively.

Next, using the physiologic data from near discharge:

grp2_rpart <- rpart('death~.', data=grp2_train[, -c(1, 3)])
fancyRpartPlot(grp2_rpart)

grp2_rpart_pred <- predict(grp2_rpart, grp2_test, type='class')
cm <- confusionMatrix(grp2_rpart_pred, grp2_test$death)
print(cm)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction FALSE TRUE
##      FALSE   678   87
##      TRUE     93  314
##
##                Accuracy : 0.8464
##                  95% CI : (0.8245, 0.8666)
##     No Information Rate : 0.6578
##     P-Value [Acc > NIR] : <2e-16
##
##                   Kappa : 0.6601
##  Mcnemar's Test P-Value : 0.7094
##
##             Sensitivity : 0.8794
##             Specificity : 0.7830
##          Pos Pred Value : 0.8863
##          Neg Pred Value : 0.7715
##              Prevalence : 0.6578
##          Detection Rate : 0.5785
##    Detection Prevalence : 0.6527
##       Balanced Accuracy : 0.8312
##
##        'Positive' Class : FALSE
## 

This combined model has the highest accuracy of any model so far (besides the one for which discharge notes were included!), at 0.85. However, it is probably the least clinically-feasible model, since it incorporates all of the notes through the duration of the stay and the physiologic parameters shortly before discharge or death.


4.2 Naive Bayes

Naive Bayesian classification is a classification algorithm for categorical data. Unlike rpart decision trees, which recursively split the data to differentiate between defined classes, Naive Bayesian classification returns a probability of each class for each patient given the entire feature set.

According to Lecture 7, the Naive Bayes algorithm typically depends on the assumption that “all of the features are equally important and independent.” However, the model can still be useful in cases where the assumptions are not valid, especially when the number of features is large - hence why Naive Bayes is “frequently used for text classification.”
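The conditional-independence assumption can be made concrete with a two-word toy example in base R. The probabilities below are invented for illustration (they are not estimates from MIMIC-III); the class score is simply the prior multiplied by each per-feature likelihood:

```r
# P(class): roughly one third of the cohort died
prior <- c(died = 1/3, survived = 2/3)

# Hypothetical P(word present | class), assumed independent given the class
p_vasopressin <- c(died = 0.40, survived = 0.10)
p_chair       <- c(died = 0.05, survived = 0.30)

# Score a note containing 'vasopressin' but not 'chair':
# score(class) = P(class) * P(vasopressin | class) * P(no chair | class)
score     <- prior * p_vasopressin * (1 - p_chair)
posterior <- score / sum(score)   # normalize to probabilities
round(posterior, 3)   # died 0.731, survived 0.269
```

Even though the prior favors survival, the two word likelihoods flip the posterior towards death; naiveBayes() in e1071 performs the same multiplication across the entire feature set at once.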

As was the case with the rpart model, Naive Bayesian models are constructed with both of the physiologic datasets (admission and discharge/death), using the text-derived features alone, and the two combined datasets (admission + text, discharge/death + text).


4.2.1 Physiologic predictors

# Physiologic predictors after admission
phy1_nbayes <- naiveBayes(x=phy1_train[, -c(1, 2, 3)], y=phy1_train$death)

phy1_nbayes_pred <- predict(phy1_nbayes, phy1_test, type='class')
cm <- confusionMatrix(phy1_nbayes_pred, phy1_test$death)
print(cm)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction FALSE TRUE
##      FALSE   678  265
##      TRUE     93  136
##
##                Accuracy : 0.6945
##                  95% CI : (0.6673, 0.7208)
##     No Information Rate : 0.6578
##     P-Value [Acc > NIR] : 0.004179
##
##                   Kappa : 0.2436
##  Mcnemar's Test P-Value : < 2.2e-16
##
##             Sensitivity : 0.8794
##             Specificity : 0.3392
##          Pos Pred Value : 0.7190
##          Neg Pred Value : 0.5939
##              Prevalence : 0.6578
##          Detection Rate : 0.5785
##    Detection Prevalence : 0.8046
##       Balanced Accuracy : 0.6093
##
##        'Positive' Class : FALSE
## 

Similar to the previous rpart model of initial physiologic data, the accuracy of this model was 0.69. The low specificity (0.34) is particularly concerning. Part of the reason why the Naive Bayesian model did not achieve particularly good results is that there are not that many features (12) and they may not be equally important in determining the patient’s outcome.

# Physiologic predictors before discharge
phy2_nbayes <- naiveBayes(x=phy2_train[, -c(1, 2, 3)], y=phy2_train$death)

phy2_nbayes_pred <- predict(phy2_nbayes, phy2_test, type='class')
cm <- confusionMatrix(phy2_nbayes_pred, phy2_test$death)
print(cm)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction FALSE TRUE
##      FALSE   716  128
##      TRUE     55  273
##
##                Accuracy : 0.8439
##                  95% CI : (0.8218, 0.8642)
##     No Information Rate : 0.6578
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.6373
##  Mcnemar's Test P-Value : 1.024e-07
##
##             Sensitivity : 0.9287
##             Specificity : 0.6808
##          Pos Pred Value : 0.8483
##          Neg Pred Value : 0.8323
##              Prevalence : 0.6578
##          Detection Rate : 0.6109
##    Detection Prevalence : 0.7201
##       Balanced Accuracy : 0.8047
##
##        'Positive' Class : FALSE
## 

The accuracy of the model using late-stage physiologic data was, unsurprisingly, considerably better. It had an accuracy of 0.84 with a sensitivity and specificity of 0.93 and 0.68 respectively.


4.2.2 Text-derived predictors

Next, I use the Naive Bayes model to classify subjects based on the corpus of notes for each patient.

# Text-derived predictors
text_nbayes <- naiveBayes(x=text_train[, -c(1, 2, 3)], y=text_train$death)

text_nbayes_pred <- predict(text_nbayes, text_test, type='class')
cm <- confusionMatrix(text_nbayes_pred, text_test$death)
print(cm)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction FALSE TRUE
##      FALSE   701  329
##      TRUE     70   72
##
##                Accuracy : 0.6596
##                  95% CI : (0.6316, 0.6867)
##     No Information Rate : 0.6578
##     P-Value [Acc > NIR] : 0.4645
##
##                   Kappa : 0.105
##  Mcnemar's Test P-Value : <2e-16
##
##             Sensitivity : 0.9092
##             Specificity : 0.1796
##          Pos Pred Value : 0.6806
##          Neg Pred Value : 0.5070
##              Prevalence : 0.6578
##          Detection Rate : 0.5981
##    Detection Prevalence : 0.8788
##       Balanced Accuracy : 0.5444
##
##        'Positive' Class : FALSE
## 

This model did not perform particularly well. The majority of patients were classified as survivors regardless of their actual mortality status, indicated by a high sensitivity but low specificity.


4.2.3 Combined model

# Grouped: physiologic parameters after admission + text
grp1_nbayes <- naiveBayes(x=grp1_train[, -c(1, 2, 3)], y=grp1_train$death)

grp1_nbayes_pred <- predict(grp1_nbayes, grp1_test, type='class')
cm <- confusionMatrix(grp1_nbayes_pred, grp1_test$death)
print(cm)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction FALSE TRUE
##      FALSE   699  329
##      TRUE     72   72
##
##                Accuracy : 0.6578
##                  95% CI : (0.6299, 0.685)
##     No Information Rate : 0.6578
##     P-Value [Acc > NIR] : 0.5136
##
##                   Kappa : 0.1018
##  Mcnemar's Test P-Value : <2e-16
##
##             Sensitivity : 0.9066
##             Specificity : 0.1796
##          Pos Pred Value : 0.6800
##          Neg Pred Value : 0.5000
##              Prevalence : 0.6578
##          Detection Rate : 0.5964
##    Detection Prevalence : 0.8771
##       Balanced Accuracy : 0.5431
##
##        'Positive' Class : FALSE
## 
# Grouped: physiologic parameters before discharge + text
grp2_nbayes <- naiveBayes(x=grp2_train[, -c(1, 2, 3)], y=grp2_train$death)

grp2_nbayes_pred <- predict(grp2_nbayes, grp2_test, type='class')
cm <- confusionMatrix(grp2_nbayes_pred, grp2_test$death)
print(cm)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction FALSE TRUE
##      FALSE   697  319
##      TRUE     74   82
##
##                Accuracy : 0.6647
##                  95% CI : (0.6368, 0.6917)
##     No Information Rate : 0.6578
##     P-Value [Acc > NIR] : 0.323
##
##                   Kappa : 0.1271
##  Mcnemar's Test P-Value : <2e-16
##
##             Sensitivity : 0.9040
##             Specificity : 0.2045
##          Pos Pred Value : 0.6860
##          Neg Pred Value : 0.5256
##              Prevalence : 0.6578
##          Detection Rate : 0.5947
##    Detection Prevalence : 0.8669
##       Balanced Accuracy : 0.5543
##
##        'Positive' Class : FALSE
## 

Both of the grouped models performed approximately as well as the text-based model alone. This is likely due to the fact that all parameters are assumed to play an equal role, and the number of text-based predictors (1251) far exceeds the number of physiologic features (11).


5 Limitations

While the intent of this project was to investigate sepsis patients in the MIMIC-III database and use predictive analytic tools to predict patient mortality using physiologic and text-derived features, there were several limitations. The cohort selection process is one such limitation discussed previously, whereby patients who are discharged to long-term care or other outside facilities are “lost to follow-up” and are assumed to have survived their encounter. Additionally, due to the immense size and complex nature of the MIMIC dataset, only a small fraction of the physiologic data collected could be leveraged. With regard to text-mining, the data collected represent only one hospital over a finite number of years; the precise decision trees and classification models may not generalize to other hospital systems. Lastly, future work should focus on better utilizing the longitudinal nature of this data, as pooling data into “near admission” or “near discharge” bins and grouping all notes together into an aggregate document is sub-optimal. Despite these limitations, this project was able to demonstrate that there are important physiologic and text-derived predictors present in this dataset that merit further investigation.


6 Conclusion

Of the approximately 4700 sepsis encounters isolated in the MIMIC-III dataset, roughly one-third of these patients died while the remainder survived at least until discharge from the hospital. Predicting mortality is a major focus of new sepsis screening tests, and adequately assessing risk is an important part of early, directed treatment.

Using 11 commonly-cited sepsis predictors, in addition to the free-text data entered as physician and nursing notes, I have attempted to build models which can correctly classify those patients who are at highest risk. While much attention has been placed on physiologic parameters in the past, the text-based models using natural language processing outperformed the physiologic parameters in nearly every case. Specifically, a decision tree might be used to successfully differentiate between patients using a limited number of words and features, making it easier to deploy, use, and understand. Notably, many of these models have higher sensitivity than specificity, indicating that future iterations of this work may be better suited as a screening tool than as a specific predictor.


7 Acknowledgements

I would first like to thank Dr. Ivo Dinov of the University of Michigan Statistics Online Computational Resource (SOCR) and Michigan Institute for Data Science (MIDAS), whose excellent Data Analysis and Predictive Analytics course provided the foundational analytic tools for this project. I also owe much to three of my mentors, Dr. Hakam Tiba, Mr. Brendan McCracken and Dr. Kevin Ward of the Michigan Center for Integrative Research in Critical Care. Lastly, I would like to thank Dr. Ashwin Belle of Fifth-Eye Inc., who introduced me to the MIMIC-III dataset, and Dr. Robert Dickson of the Division of Pulmonary and Critical Care Medicine, who first pointed me in the direction of further exploring mortality prediction in sepsis.


8 References

  • Abe, T., Ogura, H., Shiraishi, A., Kushimoto, S., Saitoh, D., Fujishima, S., … JAAM FORECAST group. (2018). Characteristics, management, and in-hospital mortality among patients with severe sepsis in intensive care units in Japan: the FORECAST study. Critical Care (London, England), 22(1), 322. https://doi.org/10.1186/s13054-018-2186-7
  • Angus, D. C., & van der Poll, T. (2013). Severe Sepsis and Septic Shock. New England Journal of Medicine, 369(9), 840–851. https://doi.org/10.1056/NEJMra1208623
  • Bone, R. C., Balk, R. A., Cerra, F. B., Dellinger, R. P., Fein, A. M., Knaus, W. A., … Sibbald, W. J. (1992). Definitions for sepsis and organ failure and guidelines for the use of innovative therapies in sepsis. The ACCP/SCCM Consensus Conference Committee. American College of Chest Physicians/Society of Critical Care Medicine. Chest, 101(6), 1644–1655.
  • Gaieski, D. F., Edwards, J. M., Kallan, M. J., & Carr, B. G. (2013). Benchmarking the incidence and mortality of severe sepsis in the United States. Critical Care Medicine, 41(5), 1167–1174. https://doi.org/10.1097/CCM.0b013e31827c09f8
  • Gül, F., Arslantaş, M. K., Cinel, İ., & Kumar, A. (2017). Changing Definitions of Sepsis. Turkish Journal of Anaesthesiology and Reanimation, 45(3), 129–138. https://doi.org/10.5152/TJAR.2017.93753
  • Install MIMIC (Unix/Mac). (n.d.). Retrieved November 24, 2018, from https://mimic.physionet.org/tutorials/install-mimic-locally-ubuntu/
  • Iwashyna, T. J., Ely, E. W., Smith, D. M., & Langa, K. M. (2010). Long-term cognitive impairment and functional disability among survivors of severe sepsis. JAMA, 304(16), 1787–1794. https://doi.org/10.1001/jama.2010.1553
  • Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., … Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 160035. https://doi.org/10.1038/sdata.2016.35
  • Johnson, A. E. W., Stone, D. J., Celi, L. A., & Pollard, T. J. (2018). The MIMIC Code Repository: enabling reproducibility in critical care research. Journal of the American Medical Informatics Association, 25(1), 32–39. https://doi.org/10.1093/jamia/ocx084
  • Levy, M. M., Fink, M. P., Marshall, J. C., Abraham, E., Angus, D., Cook, D., … for the International Sepsis Definitions Conference. (2003). 2001 SCCM/ESICM/ACCP/ATS/SIS International Sepsis Definitions Conference. Intensive Care Medicine, 29(4), 530–538. https://doi.org/10.1007/s00134-003-1662-x
  • Majno, G. (1991). The ancient riddle of sigma eta psi iota sigma (sepsis). The Journal of Infectious Diseases, 163(5), 937–945.
  • Neumar RW, Otto CW, Link MS, Kronick SL, Shuster M, Callaway CW, Kudenchuk PJ, Ornato JP, McNally B, Silvers SM, Passman RS, White RD, Hess EP, Tang W, Davis D, Sinz E, Morrison LJ. (2010). Part 8: adult advanced cardiovascular life support: 2010 American Heart Association Guidelines for Cardiopulmonary Resuscitation and Emergency Cardiovascular Care. Circulation, 122(18 Suppl 3):S729-67. https://doi.org/10.1161/CIRCULATIONAHA.110.970988
  • Patil, N., Lathi, R., & Chitre, V. (2012). Comparison of C5.0 & CART classification algorithms using pruning technique. International Journal of Engineering Research, 1(4), 6.
  • Singer, M., Deutschman, C. S., Seymour, C. W., Shankar-Hari, M., Annane, D., Bauer, M., … Angus, D. C. (2016). The Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3). JAMA, 315(8), 801–810. https://doi.org/10.1001/jama.2016.0287
  • Therneau, T. M., & Atkinson, E. J. (2018). An Introduction to Recursive Partitioning Using the RPART Routines. Mayo Foundation.
  • Torio, C. M., & Andrews, R. M. (2006). National Inpatient Hospital Costs: The Most Expensive Conditions by Payer, 2011: Statistical Brief #160. In Healthcare Cost and Utilization Project (HCUP) Statistical Briefs. Rockville (MD): Agency for Healthcare Research and Quality (US). Retrieved from http://www.ncbi.nlm.nih.gov/books/NBK169005/
# Disconnect from database
invisible(dbDisconnect(con))