Biostat 203B Homework 2

Q1. PhysioNet credential
Q2. read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)
Q3. ICU stays
Q4. admission data
Q5. patient data
Q6. Lab results
Q7. Vitals from charted events
Q8. Putting things together

Display machine information for reproducibility:

sessionInfo()

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.0  magrittr_2.0.1  tools_3.6.0     htmltools_0.5.0
##  [5] yaml_2.2.1      stringi_1.5.3   rmarkdown_2.6   knitr_1.30     
##  [9] stringr_1.4.0   xfun_0.19       digest_0.6.27   rlang_0.4.10   
## [13] evaluate_0.14

knitr::opts_chunk$set(echo = TRUE, cache = TRUE, cache.lazy = FALSE)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## The following object is masked from 'package:purrr':
## 
##     transpose

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

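# Choose the MIMIC-IV data directory based on the operating system
os <- sessionInfo()$running
if (str_detect(os, "Linux")) {
  mimic_path <- "/usr/203b-data/mimic-iv"
} else if (str_detect(os, "macOS")) {
  mimic_path <- "/Users/huazhou/Documents/Box Sync/MIMIC/mimic-iv-0.4"
} else {
  # Fail early instead of leaving mimic_path undefined
  stop("Unrecognized operating system; set mimic_path manually.")
}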

Use tidyverse (ggplot2, dplyr) to explore the MIMIC-IV data introduced in homework 1.

# tree -s -L 2 /Users/huazhou/Documents/Box\ Sync/MIMIC/mimic-iv-0.4
system(str_c("tree -s -L 2 ", shQuote(mimic_path)), intern = TRUE)

##  [1] "/usr/203b-data/mimic-iv"                         
##  [2] "├── [         78]  core"                         
##  [3] "│   ├── [   17224843]  admissions.csv.gz"        
##  [4] "│   ├── [    2884996]  patients.csv.gz"          
##  [5] "│   └── [   51188147]  transfers.csv.gz"         
##  [6] "├── [       4096]  hosp"                         
##  [7] "│   ├── [     430049]  d_hcpcs.csv.gz"           
##  [8] "│   ├── [   26575586]  diagnoses_icd.csv.gz"     
##  [9] "│   ├── [     723633]  d_icd_diagnoses.csv.gz"   
## [10] "│   ├── [     564422]  d_icd_procedures.csv.gz"  
## [11] "│   ├── [      14845]  d_labitems.csv.gz"        
## [12] "│   ├── [   12913088]  drgcodes.csv.gz"          
## [13] "│   ├── [  518077567]  emar.csv.gz"              
## [14] "│   ├── [  479709397]  emar_detail.csv.gz"       
## [15] "│   ├── [    1415469]  hcpcsevents.csv.gz"       
## [16] "│   ├── [ 2093725833]  labevents.csv.gz"         
## [17] "│   ├── [   15896456]  microbiologyevents.csv.gz"
## [18] "│   ├── [  423170857]  pharmacy.csv.gz"          
## [19] "│   ├── [  501822286]  poe.csv.gz"               
## [20] "│   ├── [   23675550]  poe_detail.csv.gz"        
## [21] "│   ├── [  367321152]  prescriptions.csv.gz"     
## [22] "│   ├── [    4965027]  procedures_icd.csv.gz"    
## [23] "│   └── [    9579255]  services.csv.gz"          
## [24] "├── [        189]  icu"                          
## [25] "│   ├── [ 2264326210]  chartevents.csv.gz"       
## [26] "│   ├── [   40440772]  datetimeevents.csv.gz"    
## [27] "│   ├── [      56593]  d_items.csv.gz"           
## [28] "│   ├── [    2628845]  icustays.csv.gz"          
## [29] "│   ├── [  328835832]  inputevents.csv.gz"       
## [30] "│   ├── [   35300863]  outputevents.csv.gz"      
## [31] "│   └── [   19362097]  procedureevents.csv.gz"   
## [32] "├── [       2518]  LICENSE.txt"                  
## [33] "└── [       2459]  SHA256SUMS.txt"               
## [34] ""                                                
## [35] "3 directories, 29 files"

Q1. PhysioNet credential

By now, you should already be credentialed on PhysioNet. Please include a screenshot of your Data Use Agreement for MIMIC-IV (v0.4).

Q2. `read.csv` (base R) vs `read_csv` (tidyverse) vs `fread` (data.table)

R has several utilities for reading data files. Let us test the speed of reading a moderately sized compressed csv file, admissions.csv.gz, with three functions: read.csv in base R, read_csv in tidyverse, and fread in the popular data.table package. Is there any speed difference? (Hint: the R function system.time measures runtimes.)

In this homework, we stick to the tidyverse or data.table.
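
The timing comparison described above can be sketched on synthetic data. This is only an illustration: the file is a small made-up stand-in for admissions.csv.gz, the column names are hypothetical, and the timings will differ on the real file and on other machines.

```r
library(readr)
library(data.table)  # fread() needs the R.utils package to read .gz files

# Build a synthetic gzipped CSV (a toy stand-in, NOT the real admissions file)
set.seed(203)
n <- 1e5
toy <- data.frame(
  id    = seq_len(n),
  group = sample(c("a", "b", "c"), n, replace = TRUE),
  value = rnorm(n)
)
toy_file <- tempfile(fileext = ".csv.gz")
write.csv(toy, gzfile(toy_file), row.names = FALSE)

# Time each reader with system.time()
t_base  <- system.time(df_base  <- read.csv(toy_file))
t_readr <- system.time(df_readr <- read_csv(toy_file, col_types = cols()))
t_dt    <- system.time(df_dt    <- fread(toy_file))

# Elapsed wall-clock seconds, side by side
elapsed <- c(
  read.csv = t_base[["elapsed"]],
  read_csv = t_readr[["elapsed"]],
  fread    = t_dt[["elapsed"]]
)
print(elapsed)
```

A single run is noisy, so repeating each call a few times gives a fairer comparison.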

Q3. ICU stays

icustays.csv.gz (https://mimic-iv.mit.edu/docs/datasets/icu/icustays/) contains data about Intensive Care Unit (ICU) stays. Summarize the following variables using appropriate numerics or graphs:

how many unique stay_id?
how many unique subject_id?
length of ICU stay
first ICU unit
last ICU unit
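
As a rough sketch of the summaries listed above, the following uses a tiny invented tibble instead of the real file. The column names `first_careunit`, `last_careunit`, and `los` are assumptions about the file layout, and the unit names and values are made up.

```r
library(dplyr)

# Toy stand-in for icustays.csv.gz; all values are invented
icustays_toy <- tibble(
  subject_id     = c(1, 1, 2, 3),
  stay_id        = c(11, 12, 21, 31),
  first_careunit = c("Unit A", "Unit B", "Unit A", "Unit C"),
  last_careunit  = c("Unit A", "Unit A", "Unit A", "Unit C"),
  los            = c(1.5, 0.8, 3.2, 2.0)  # length of stay (assumed in days)
)

# Numeric summaries: unique stays, unique patients, typical length of stay
icu_summary <- icustays_toy %>%
  summarise(
    n_stays    = n_distinct(stay_id),
    n_patients = n_distinct(subject_id),
    median_los = median(los)
  )
print(icu_summary)

# Frequency table of the first ICU unit, most common first
first_unit_counts <- icustays_toy %>% count(first_careunit, sort = TRUE)
print(first_unit_counts)
```

The same `count()` call on `last_careunit` gives the last-unit table, and either table can be passed to a bar chart.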

Q4. `admission` data

Information about patients admitted to the hospital is available in admissions.csv.gz. See https://mimic-iv.mit.edu/docs/datasets/core/admissions/ for details of each field in this file. Summarize the following variables using appropriate graphs. Explain any patterns you observe.

admission year
admission month
admission month day
admission week day
admission hour (anything unusual?)
number of deaths in each year
admission type
number of admissions per patient
admission location
discharge location
insurance
language
marital status
ethnicity
death
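
The time-based variables in the list above can be derived from a single timestamp column with lubridate. This is a minimal sketch on invented timestamps; the column name `admittime` is an assumption about admissions.csv.gz.

```r
library(dplyr)
library(lubridate)

# Toy admissions with invented timestamps (assumed column name: admittime)
adm_toy <- tibble(
  subject_id = c(1, 1, 2),
  admittime  = ymd_hms(c(
    "2150-01-15 00:00:00",
    "2150-03-02 07:15:00",
    "2151-12-31 23:30:00"
  ))
) %>%
  mutate(
    adm_year  = year(admittime),
    adm_month = month(admittime, label = TRUE),
    adm_mday  = mday(admittime),
    adm_wday  = wday(admittime, label = TRUE),
    adm_hour  = hour(admittime)
  )
print(adm_toy)

# Admissions per hour of day, ready for a bar chart
hour_counts <- adm_toy %>% count(adm_hour)
print(hour_counts)
```

Each derived column can then be summarized with `count()` or plotted with `geom_bar()`.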

Note that one patient (uniquely identified by the subject_id) may be admitted to the hospital multiple times. When summarizing demographic information, it makes sense to summarize over unique patients.
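
To illustrate why the unit of analysis matters, the sketch below compares a per-admission tally with a per-patient tally on invented data. The insurance categories "A" and "B" are placeholders, and keeping the first row per patient is just one possible convention.

```r
library(dplyr)

# Toy admissions: patient 1 is admitted three times (values invented)
adm_toy <- tibble(
  subject_id = c(1, 1, 1, 2, 3),
  hadm_id    = 101:105,
  insurance  = c("A", "A", "A", "B", "A")
)

# Counting rows counts admissions, so frequent patients are over-represented
per_admission <- adm_toy %>% count(insurance)

# Keep one row per patient before summarizing demographic variables
per_patient <- adm_toy %>%
  distinct(subject_id, .keep_all = TRUE) %>%
  count(insurance)

# Number of admissions per patient
adm_per_patient <- adm_toy %>% count(subject_id, name = "n_adm")

print(per_admission)
print(per_patient)
print(adm_per_patient)
```

If a demographic field can change between admissions of the same patient, the choice of which row to keep should be stated explicitly.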

Q5. `patient` data

Explore patients.csv.gz (https://mimic-iv.mit.edu/docs/datasets/core/patients/) and summarize the following variables using appropriate numerics and graphs:

gender
anchor_age (explain pattern you see)
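
One possible way to produce both the numeric and graphical summaries is sketched below on a small invented tibble. The ages are made up and do not reflect the pattern in the real data.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for patients.csv.gz; all values are invented
patients_toy <- tibble(
  subject_id = 1:6,
  gender     = c("F", "M", "F", "F", "M", "M"),
  anchor_age = c(23, 25, 47, 63, 70, 88)
)

# Numeric summaries: counts and proportions by gender, five-number summary of age
gender_counts <- patients_toy %>%
  count(gender) %>%
  mutate(prop = n / sum(n))
age_summary <- summary(patients_toy$anchor_age)
print(gender_counts)
print(age_summary)

# Graphical summary: histogram of anchor_age, split by gender
p <- ggplot(patients_toy, aes(x = anchor_age, fill = gender)) +
  geom_histogram(binwidth = 10, boundary = 0) +
  labs(x = "anchor_age", y = "Number of patients")
print(p)
```

On the real file, a narrower bin width makes spikes at particular ages easier to spot.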

Q6. Lab results

labevents.csv.gz (https://mimic-iv.mit.edu/docs/datasets/hosp/labevents/) contains all laboratory measurements for patients.

We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), glucose (50931), magnesium (50960), calcium (50893), and lactate (50813). Find the itemids of these lab measurements from d_labitems.csv.gz and retrieve a subset of labevents.csv.gz only containing these items.
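
The subsetting step can be sketched as a filter on the itemids listed above. The toy table below is invented, `valuenum` is an assumed column name, and itemid 99999 is a placeholder for a lab item that is not of interest.

```r
library(dplyr)

# Itemids taken from the list above, named for readability
lab_itemids <- c(
  creatinine = 50912, potassium = 50971, sodium = 50983, chloride = 50902,
  bicarbonate = 50882, hematocrit = 51221, wbc = 51301, glucose = 50931,
  magnesium = 50960, calcium = 50893, lactate = 50813
)

# Toy stand-in for labevents.csv.gz; values are invented
labevents_toy <- tibble(
  subject_id = c(1, 1, 2, 2),
  itemid     = c(50912, 99999, 50971, 50813),
  valuenum   = c(1.1, 5.0, 4.2, 1.8)
)

# Keep only the labs of interest and attach a readable name
lab_subset <- labevents_toy %>%
  filter(itemid %in% lab_itemids) %>%
  mutate(lab = names(lab_itemids)[match(itemid, lab_itemids)])
print(lab_subset)
```

Because the real labevents.csv.gz is about 2 GB compressed, reading only the needed columns (for example with the `select` argument of `fread`) before filtering may keep memory use manageable.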

Q7. Vitals from charted events

We are interested in the vitals for ICU patients: heart rate, mean and systolic blood pressure (invasive and noninvasive measurements combined), body temperature, SpO2, and respiratory rate. Find the itemids of these vitals from d_items.csv.gz and retrieve a subset of chartevents.csv.gz only containing these items.

chartevents.csv.gz (https://mimic-iv.mit.edu/docs/datasets/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid.

d_items.csv.gz (https://mimic-iv.mit.edu/docs/datasets/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.
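
A minimal sketch of the dictionary lookup followed by the subset, assuming d_items.csv.gz has a `label` column that describes each itemid. The itemids, labels, and values below are all invented placeholders, not the real dictionary entries.

```r
library(dplyr)
library(stringr)

# Toy dictionary; itemids and labels are placeholders
d_items_toy <- tibble(
  itemid = 1:5,
  label  = c("Heart Rate", "Heart rate alarm - high", "Respiratory Rate",
             "Admission Weight (kg)", "Temperature Fahrenheit")
)

# Search labels case-insensitively; anchoring with ^ and $ excludes alarm settings
vital_pattern <- regex("^heart rate$|^respiratory rate$|temperature",
                       ignore_case = TRUE)
vital_items <- d_items_toy %>% filter(str_detect(label, vital_pattern))
print(vital_items)

# Toy stand-in for chartevents.csv.gz
chartevents_toy <- tibble(
  stay_id  = c(11, 11, 12),
  itemid   = c(1L, 4L, 3L),
  valuenum = c(80, 70, 18)
)

# Keep only charted rows whose itemid is among the selected vitals
vitals <- chartevents_toy %>% semi_join(vital_items, by = "itemid")
print(vitals)
```

It is worth inspecting the matched labels by hand, since a loose pattern can pick up related but unwanted items.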

Q8. Putting things together

Let us create a tibble for all ICU stays, where rows are

first ICU stay of each unique patient
adults (age at admission > 18)

and columns contain at least the following variables

all variables in icustays.csv.gz
all variables in admissions.csv.gz
all variables in patients.csv.gz
first lab measurements during ICU stay
first vitals measurement during ICU stay
an indicator variable whether the patient died within 30 days of hospital admission
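
The assembly steps above can be sketched on toy tables as follows. Several things here are assumptions rather than part of the assignment: the column names `intime`, `outtime`, `admittime`, `deathtime`, `anchor_year`, `charttime`, and `valuenum`; the age formula `anchor_age + year(admittime) - anchor_year`; and the use of `deathtime` to define 30-day mortality. All values are invented.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy tables standing in for icustays, admissions, patients, and lab events
icu <- tibble(
  subject_id = c(1, 1, 2, 3), hadm_id = c(101, 102, 201, 301),
  stay_id = c(11, 12, 21, 31),
  intime  = ymd_hms(c("2150-01-02 10:00:00", "2150-06-01 08:00:00",
                      "2151-03-05 12:00:00", "2152-07-01 09:00:00")),
  outtime = ymd_hms(c("2150-01-04 10:00:00", "2150-06-03 08:00:00",
                      "2151-03-08 12:00:00", "2152-07-02 09:00:00"))
)
adm <- tibble(
  subject_id = c(1, 1, 2, 3), hadm_id = c(101, 102, 201, 301),
  admittime = ymd_hms(c("2150-01-01 20:00:00", "2150-05-30 08:00:00",
                        "2151-03-05 01:00:00", "2152-06-30 09:00:00")),
  deathtime = ymd_hms(c(NA, NA, "2151-03-20 00:00:00", NA))
)
pat <- tibble(subject_id = c(1, 2, 3), anchor_age = c(60, 45, 17),
              anchor_year = c(2150, 2151, 2152))
labs <- tibble(
  subject_id = c(1, 1, 1, 2), itemid = c(50912, 50912, 50971, 50912),
  charttime = ymd_hms(c("2150-01-01 22:00:00", "2150-01-02 11:00:00",
                        "2150-01-02 12:00:00", "2151-03-05 13:00:00")),
  valuenum = c(0.9, 1.1, 4.2, 1.4)
)

# First ICU stay per patient, joined to admissions and patients, adults only
cohort <- icu %>%
  group_by(subject_id) %>%
  slice_min(intime, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  left_join(adm, by = c("subject_id", "hadm_id")) %>%
  left_join(pat, by = "subject_id") %>%
  mutate(
    age_at_adm = anchor_age + year(admittime) - anchor_year,
    died_30d   = !is.na(deathtime) & deathtime <= admittime + days(30)
  ) %>%
  filter(age_at_adm > 18)

# First measurement of each lab item during the ICU stay, one column per item
first_labs <- cohort %>%
  select(subject_id, stay_id, intime, outtime) %>%
  inner_join(labs, by = "subject_id") %>%
  filter(charttime >= intime, charttime <= outtime) %>%
  group_by(stay_id, itemid) %>%
  slice_min(charttime, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(stay_id, itemid, valuenum) %>%
  pivot_wider(names_from = itemid, values_from = valuenum, names_prefix = "lab_")

final <- cohort %>% left_join(first_labs, by = "stay_id")
print(final)
```

First vitals can be added the same way, joining the charted events by `stay_id` instead of `subject_id`.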

Due Feb 12 @ 11:59PM (originally Feb 5)