Display machine information for reproducibility:
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.6.0 magrittr_2.0.1 tools_3.6.0 htmltools_0.5.0
## [5] yaml_2.2.1 stringi_1.5.3 rmarkdown_2.6 knitr_1.30
## [9] stringr_1.4.0 xfun_0.19 digest_0.6.27 rlang_0.4.10
## [13] evaluate_0.14
knitr::opts_chunk$set(echo = TRUE, cache = TRUE, cache.lazy = FALSE)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
os <- sessionInfo()$running
if (str_detect(os, "Linux")) {
  mimic_path <- "/usr/203b-data/mimic-iv"
} else if (str_detect(os, "macOS")) {
  mimic_path <- "/Users/huazhou/Documents/Box Sync/MIMIC/mimic-iv-0.4"
}
Use tidyverse (ggplot2, dplyr) to explore the MIMIC-IV data introduced in homework 1.
# tree -s -L 2 /Users/huazhou/Documents/Box\ Sync/MIMIC/mimic-iv-0.4
system(str_c("tree -s -L 2 ", shQuote(mimic_path)), intern = TRUE)
## [1] "/usr/203b-data/mimic-iv"
## [2] "├── [ 78] core"
## [3] "│ ├── [ 17224843] admissions.csv.gz"
## [4] "│ ├── [ 2884996] patients.csv.gz"
## [5] "│ └── [ 51188147] transfers.csv.gz"
## [6] "├── [ 4096] hosp"
## [7] "│ ├── [ 430049] d_hcpcs.csv.gz"
## [8] "│ ├── [ 26575586] diagnoses_icd.csv.gz"
## [9] "│ ├── [ 723633] d_icd_diagnoses.csv.gz"
## [10] "│ ├── [ 564422] d_icd_procedures.csv.gz"
## [11] "│ ├── [ 14845] d_labitems.csv.gz"
## [12] "│ ├── [ 12913088] drgcodes.csv.gz"
## [13] "│ ├── [ 518077567] emar.csv.gz"
## [14] "│ ├── [ 479709397] emar_detail.csv.gz"
## [15] "│ ├── [ 1415469] hcpcsevents.csv.gz"
## [16] "│ ├── [ 2093725833] labevents.csv.gz"
## [17] "│ ├── [ 15896456] microbiologyevents.csv.gz"
## [18] "│ ├── [ 423170857] pharmacy.csv.gz"
## [19] "│ ├── [ 501822286] poe.csv.gz"
## [20] "│ ├── [ 23675550] poe_detail.csv.gz"
## [21] "│ ├── [ 367321152] prescriptions.csv.gz"
## [22] "│ ├── [ 4965027] procedures_icd.csv.gz"
## [23] "│ └── [ 9579255] services.csv.gz"
## [24] "├── [ 189] icu"
## [25] "│ ├── [ 2264326210] chartevents.csv.gz"
## [26] "│ ├── [ 40440772] datetimeevents.csv.gz"
## [27] "│ ├── [ 56593] d_items.csv.gz"
## [28] "│ ├── [ 2628845] icustays.csv.gz"
## [29] "│ ├── [ 328835832] inputevents.csv.gz"
## [30] "│ ├── [ 35300863] outputevents.csv.gz"
## [31] "│ └── [ 19362097] procedureevents.csv.gz"
## [32] "├── [ 2518] LICENSE.txt"
## [33] "└── [ 2459] SHA256SUMS.txt"
## [34] ""
## [35] "3 directories, 29 files"
By now, you should already be credentialed on PhysioNet. Please include a screenshot of your Data Use Agreement for MIMIC-IV (v0.4).
read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)
There are quite a few utilities in R for reading data files. Let us test the speed of reading a moderately sized compressed csv file, admissions.csv.gz, with three functions: read.csv in base R, read_csv in tidyverse, and fread in the popular data.table package. Is there any speed difference? (Hint: the R function system.time measures runtimes.)
In this homework, we stick to the tidyverse or data.table.
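One way to set up the comparison is sketched below. It is only an illustration: the table is made up and written to a temporary gzip file so the code runs without access to MIMIC-IV, and the names toy, toy_path, and elapsed are hypothetical. Note that fread reads .gz files directly only when the R.utils package is installed.

```r
library(readr)
library(data.table)

# Toy stand-in for admissions.csv.gz: made-up rows written to a temporary gzip file.
set.seed(203)
n <- 1e5
toy <- data.frame(
  subject_id = sample.int(5e4, n, replace = TRUE),
  hadm_id = seq_len(n),
  insurance = sample(c("Medicare", "Medicaid", "Other"), n, replace = TRUE)
)
toy_path <- tempfile(fileext = ".csv.gz")
gz <- gzfile(toy_path, "w")
write.csv(toy, gz, row.names = FALSE)
close(gz)

# Time each reader on the same file.
t_base <- system.time(df_base <- read.csv(toy_path))
t_readr <- system.time(df_readr <- read_csv(toy_path, col_types = cols()))
t_dt <- system.time(df_dt <- fread(toy_path))  # assumes R.utils is installed

# "elapsed" is the wall-clock time in seconds.
elapsed <- c(
  read.csv = t_base[["elapsed"]],
  read_csv = t_readr[["elapsed"]],
  fread = t_dt[["elapsed"]]
)
print(elapsed)
```

To run the comparison on the real data, replace toy_path with str_c(mimic_path, "/core/admissions.csv.gz"). Timings vary between runs and machines, so repeating each call a few times gives a steadier picture.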
icustays.csv.gz (https://mimic-iv.mit.edu/docs/datasets/icu/icustays/) contains data about Intensive Care Unit (ICU) stays. Summarize the following variables using appropriate numerics or graphs:
stay_id?
subject_id?
admission data
Information on the patients admitted to the hospital is available in admissions.csv.gz. See https://mimic-iv.mit.edu/docs/datasets/core/admissions/ for details of each field in this file. Summarize the following variables using appropriate graphs. Explain any patterns you observe.
Note that one patient (uniquely identified by subject_id) may be admitted to the hospital multiple times. When summarizing demographic information, it makes sense to summarize based on unique patients.
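The difference between admission-level and patient-level summaries can be sketched as follows. The rows are made up and only mimic the subject_id, hadm_id, admittime, and insurance columns of admissions.csv.gz; the object names are hypothetical.

```r
library(dplyr)
library(lubridate)

# Made-up rows that mimic a few admissions.csv.gz columns (not real MIMIC data).
adm <- tibble(
  subject_id = c(1, 1, 2, 3, 3, 3),
  hadm_id = 101:106,
  admittime = ymd_hms(c(
    "2130-01-05 08:15:00", "2131-03-20 23:40:00", "2150-07-11 00:00:00",
    "2119-12-31 14:05:00", "2120-02-02 02:30:00", "2120-09-09 19:45:00"
  )),
  insurance = c("Medicare", "Medicare", "Other", "Medicaid", "Medicaid", "Medicaid")
)

# Admission-level summary: every admission counts, so time components use all rows.
adm_time <- adm %>%
  mutate(
    adm_month = lubridate::month(admittime, label = TRUE),
    adm_hour = lubridate::hour(admittime)
  )

# Patient-level summary: keep one row per subject_id (here the earliest admission)
# before counting a demographic variable.
patients_unique <- adm %>%
  arrange(subject_id, admittime) %>%
  distinct(subject_id, .keep_all = TRUE)
insurance_counts <- count(patients_unique, insurance)

# Number of admissions per patient.
adm_per_patient <- count(adm, subject_id, name = "n_adm")
```

Keeping the earliest admission is one possible choice for the representative row; a variable that changes between admissions may call for a different rule.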
patient data
Explore patients.csv.gz (https://mimic-iv.mit.edu/docs/datasets/core/patients/) and summarize the following variables using appropriate numerics and graphs:
gender
anchor_age (explain pattern you see)
labevents.csv.gz (https://mimic-iv.mit.edu/docs/datasets/hosp/labevents/) contains all laboratory measurements for patients.
We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), glucose (50931), magnesium (50960), calcium (50893), and lactate (50813). Find the itemids of these lab measurements in d_labitems.csv.gz and retrieve a subset of labevents.csv.gz containing only these items.
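Because labevents.csv.gz is large (about 2 GB compressed in the listing above), it may help to filter while reading rather than after. The sketch below uses readr's read_csv_chunked on a small made-up file; the toy table, the chunk size, and the helper keep_labs are assumptions for illustration, not part of the assignment.

```r
library(readr)
library(dplyr)

# itemids quoted in the text above.
lab_itemids <- c(50912, 50971, 50983, 50902, 50882, 51221,
                 51301, 50931, 50960, 50893, 50813)

# Toy stand-in for labevents.csv.gz (made-up values, only four columns).
toy_labs <- tibble(
  subject_id = rep(1:4, each = 5),
  itemid = rep(c(50912, 50971, 51000, 50983, 52000), times = 4),
  charttime = "2130-01-01 00:00:00",
  valuenum = seq(1, 20)
)
toy_path <- tempfile(fileext = ".csv.gz")
write_csv(toy_labs, toy_path)

# Hypothetical helper: keep only the rows of a chunk whose itemid is of interest.
keep_labs <- function(chunk, pos) filter(chunk, itemid %in% lab_itemids)

# Read the file chunk by chunk so the full table never sits in memory at once;
# cols_only() also drops every column that is not listed.
labs_subset <- read_csv_chunked(
  toy_path,
  callback = DataFrameCallback$new(keep_labs),
  chunk_size = 6,
  col_types = cols_only(
    subject_id = col_double(),
    itemid = col_double(),
    charttime = col_datetime(),
    valuenum = col_double()
  )
)
```

On the real file a much larger chunk_size would be appropriate. data.table's fread with its select argument, followed by a filter on itemid, is an alternative when memory allows.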
We are interested in the vitals for ICU patients: heart rate, mean and systolic blood pressure (invasive and noninvasive measurements combined), body temperature, SpO2, and respiratory rate. Find the itemids of these vitals in d_items.csv.gz and retrieve a subset of chartevents.csv.gz containing only these items.
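Looking up itemids in a dictionary table can be sketched as a label search followed by a join. The dictionary rows below are placeholders (the itemids and labels are invented and do not come from d_items.csv.gz), so the search pattern would need adjusting against the real labels.

```r
library(dplyr)
library(stringr)

# Made-up dictionary rows; itemids and labels are placeholders, not real d_items entries.
toy_items <- tibble(
  itemid = c(1, 2, 3, 4, 5, 6),
  label = c(
    "Heart Rate", "Respiratory Rate", "Heart rate alarm - high",
    "Non Invasive Blood Pressure mean", "Arterial Blood Pressure mean",
    "Admission Weight"
  ),
  linksto = "chartevents"
)

# Search labels case-insensitively, then inspect the matches by eye
# before fixing the final list of ids.
pattern <- regex(
  "^heart rate$|blood pressure mean|^respiratory rate$",
  ignore_case = TRUE
)
vital_items <- toy_items %>%
  filter(linksto == "chartevents", str_detect(label, pattern))
vital_itemids <- vital_items$itemid

# Made-up chart rows; semi_join keeps only rows whose itemid is in vital_items.
toy_chart <- tibble(
  stay_id = c(10, 10, 11, 11),
  itemid = c(1, 6, 4, 3),
  valuenum = c(80, 70, 65, 120)
)
chart_subset <- semi_join(toy_chart, vital_items, by = "itemid")
```

Anchoring the pattern with ^ and $ avoids picking up related but unwanted items, such as the alarm-threshold row in this toy dictionary.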
chartevents.csv.gz (https://mimic-iv.mit.edu/docs/datasets/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid.
d_items.csv.gz (https://mimic-iv.mit.edu/docs/datasets/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.
Let us create a tibble for all ICU stays, where rows are
and columns contain at least the following variables
icustays.csv.gz
admissions.csv.gz
patients.csv.gz
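Combining the three files named above into one tibble could look like the following sketch. It uses made-up rows and keeps one row per ICU stay; any further restriction on which rows to keep is left out. The join keys subject_id and hadm_id are the identifiers shared by these tables, and the object names are hypothetical.

```r
library(dplyr)

# Made-up rows mimicking a few columns of each file (not real MIMIC data).
icu <- tibble(
  subject_id = c(1, 1, 2),
  hadm_id = c(101, 102, 103),
  stay_id = c(1001, 1002, 1003),
  intime = as.POSIXct(
    c("2130-01-05 10:00:00", "2131-03-21 01:00:00", "2150-07-11 02:00:00"),
    tz = "UTC"
  )
)
adm <- tibble(
  subject_id = c(1, 1, 2, 3),
  hadm_id = c(101, 102, 103, 104),
  admittime = as.POSIXct(
    c("2130-01-05 08:15:00", "2131-03-20 23:40:00",
      "2150-07-11 00:00:00", "2119-12-31 14:05:00"),
    tz = "UTC"
  )
)
pat <- tibble(
  subject_id = c(1, 2, 3),
  gender = c("F", "M", "F"),
  anchor_age = c(60, 45, 30)
)

# Start from ICU stays so each stay stays one row; left joins add
# admission-level and then patient-level columns.
icu_tbl <- icu %>%
  left_join(adm, by = c("subject_id", "hadm_id")) %>%
  left_join(pat, by = "subject_id")
```

Starting from the ICU table means admissions without an ICU stay (hadm_id 104 in this toy data) do not appear in the result.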