Display machine information for reproducibility:
sessionInfo()
No handwritten homework reports are accepted for this course. We work with Git and GitHub. Efficient and abundant use of Git, e.g., frequent and well-documented commits, is an important criterion for grading your homework.
Apply for the Student Developer Pack at GitHub using your UCLA email.
Create a private repository biostat-203b-2021-winter and add Hua-Zhou, Chris-German, and ElvisCuiHan as your collaborators with write permission.
The top-level directories of the repository should be hw1, hw2, … Maintain two branches, master and develop. The develop branch will be your main playground, the place where you develop solutions (code) to homework problems and write up reports. The master branch will be your presentation area. Submit your homework files (the R Markdown file Rmd, the html file converted from R Markdown, and all code and data sets needed to reproduce the results) in the master branch.
After each homework due date, the teaching assistant and instructor will check out your master branch for grading. Tag each of your homework submissions with tag names hw1, hw2, … Tagging time will be used as your submission time. That means if you tag your hw1 submission after the deadline, penalty points will be deducted for late submission.
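The branch-and-tag workflow described above can be sketched as follows. This is only an illustration run in a throwaway directory, not the required procedure: the file name hw1/hw1.Rmd, the commit messages, and the user name and email are placeholders, and the push to GitHub is left as a comment because it needs your own remote. An annotated tag (-a) is used here because it records the time the tag was created.

```shell
#!/usr/bin/env bash
set -eu

# Throwaway repository so the sketch does not touch any real work.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master      # make sure the first branch is named master
git config user.name "Student Example"       # placeholder identity
git config user.email "student@example.com"

# First commit on master: the hw1 directory with a placeholder report.
mkdir hw1
echo "# Homework 1" > hw1/hw1.Rmd
git add hw1
git commit -q -m "Start hw1 report"

# Day-to-day work happens on develop, with frequent commits.
git checkout -q -b develop
echo "Answer to Q1" >> hw1/hw1.Rmd
git commit -q -am "Draft answer to Q1"

# When ready to submit: bring the work into master and tag it.
git checkout -q master
git merge -q --no-ff -m "Merge hw1 solutions from develop" develop
git tag -a hw1 -m "Homework 1 submission"

# With a GitHub remote configured you would then publish, for example:
#   git push origin master develop hw1

git branch --list
git tag --list
```

Remember that a tag is not pushed automatically with a branch; it has to be pushed explicitly for the graders to see it.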
After this course, you can make this repository public and use it to demonstrate your skills on the job market.
This exercise (and later ones in this course) uses the MIMIC-IV data, a freely accessible critical care database developed by the MIT Lab for Computational Physiology. Follow the instructions at https://mimic-iv.mit.edu/docs/access/ to (1) complete the CITI Data or Specimens Only Research course and (2) obtain the PhysioNet credential for using the MIMIC-IV data. Display the verification links to your completion report and completion certificate here. (Hint: The CITI training takes a couple of hours and the PhysioNet credentialing takes a couple of days; do not leave it to the last minute.)
The /usr/203b-data/mimic-iv/ folder on the teaching server contains data sets from MIMIC-IV. Refer to https://mimic-iv.mit.edu/docs/datasets/ for details of the data files.
ls -l /usr/203b-data/mimic-iv
## total 12
## drwxr-xr-x. 2 huazhou huazhou 78 Jan 11 02:28 core
## drwxr-xr-x. 2 huazhou huazhou 4096 Jan 11 02:13 hosp
## drwxr-xr-x. 2 huazhou huazhou 189 Jan 11 02:25 icu
## -rw-r--r--. 1 huazhou huazhou 2518 Jan 11 02:13 LICENSE.txt
## -rw-r--r--. 1 huazhou huazhou 2459 Jan 11 02:13 SHA256SUMS.txt
Please do not put these data files into Git; they are big. Do not copy them into your directory. Do not decompress the gz data files. Doing so creates unnecessarily big files on storage and is not a big-data-friendly practice. Just read from the data folder /usr/203b-data/mimic-iv directly in the following exercises.
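As an illustration of reading a gz file in place, the sketch below streams a compressed file through a pipe so that nothing decompressed is ever written to disk. It uses a tiny made-up file (toy.csv.gz in a temporary directory) rather than the MIMIC-IV data, and assumes a Linux system where zcat is the gzip version (equivalent to gzip -dc).

```shell
#!/usr/bin/env bash
set -eu

# Toy stand-in for a compressed data file; the name and contents are made up.
workdir=$(mktemp -d)
printf 'id,value\n1,a\n2,b\n3,c\n' | gzip > "$workdir/toy.csv.gz"

# Peek at the first two lines. The decompressed bytes flow through the
# pipe only; no uncompressed copy is created.
zcat "$workdir/toy.csv.gz" | head -n 2

# Count the lines (header included) of every gz file in the folder.
for datafile in "$workdir"/*.gz
do
  nlines=$(zcat "$datafile" | wc -l)
  echo "$datafile: $nlines lines"
done
```

The same pattern should carry over to the course data by pointing the loop at the shared data folder instead of the temporary directory.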
Use Bash commands to answer the following questions.
Display the contents of the folders core, hosp, and icu. What do the Bash commands zcat, zless, zmore, and zgrep do?
What’s the output of the following Bash script?
for datafile in /usr/203b-data/mimic-iv/core/*.gz
do
  ls -l "$datafile"
done
Display the number of lines in each data file using a similar loop.
Display the first few lines of admissions.csv.gz. How many rows are in this data file? How many unique patients (identified by subject_id) are in this data file? What are the possible values taken by each of the variables admission_type, admission_location, insurance, language, marital_status, and ethnicity? Also report the count for each unique value of these variables. (Hint: combine Linux commands zcat, head/tail, awk, uniq, wc, and so on.)
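To show how the commands in the hint fit together, here is a minimal sketch on a made-up three-column file; the file name toy_admissions.csv.gz and its rows are invented for illustration and are not MIMIC-IV data. It assumes a simple CSV with no quoted fields containing commas, since awk -F, splits on every comma. Note that uniq only collapses adjacent duplicates, which is why the values are sorted first.

```shell
#!/usr/bin/env bash
set -eu

# Invented toy data: a header line plus four data rows.
workdir=$(mktemp -d)
toy="$workdir/toy_admissions.csv.gz"
printf '%s\n' \
  'subject_id,admission_type,insurance' \
  '1,EMERGENCY,Medicare' \
  '2,ELECTIVE,Other' \
  '1,EMERGENCY,Medicare' \
  '3,URGENT,Medicaid' | gzip > "$toy"

# Number of data rows: skip the header with tail, then count lines.
n_rows=$(zcat "$toy" | tail -n +2 | wc -l)

# Number of distinct values in column 1 (the patient identifier).
n_patients=$(zcat "$toy" | awk -F, 'NR > 1 {print $1}' | sort -u | wc -l)

echo "rows: $n_rows, unique patients: $n_patients"

# Count of each distinct value in column 2, most frequent first.
zcat "$toy" | awk -F, 'NR > 1 {print $2}' | sort | uniq -c | sort -nr
```

For a real file you would first inspect the header line to find which column number holds each variable.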
You and your friend have just finished reading Pride and Prejudice by Jane Austen. Among the four main characters in the book, Elizabeth, Jane, Lydia, and Darcy, your friend thinks that Darcy was the most mentioned. You, however, are certain it was Elizabeth. Obtain the full text of the novel from http://www.gutenberg.org/cache/epub/42671/pg42671.txt and save it to your local folder.
curl http://www.gutenberg.org/cache/epub/42671/pg42671.txt > pride_and_prejudice.txt
Do not put this text file pride_and_prejudice.txt in Git. Using a for loop, how would you tabulate the number of times each of the four characters is mentioned?
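One possible shape for such a loop is sketched below on a three-line made-up sample (sample.txt) rather than the novel itself. It relies on grep -o, which prints each match on its own line, so that wc -l counts occurrences and not just matching lines as grep -c would.

```shell
#!/usr/bin/env bash
set -eu

# Made-up sample text standing in for the novel.
workdir=$(mktemp -d)
sample="$workdir/sample.txt"
printf '%s\n' \
  'Elizabeth met Darcy. Elizabeth smiled.' \
  'Jane wrote to Elizabeth.' \
  'Lydia laughed; Darcy did not.' > "$sample"

# For each name, print one line per occurrence and count those lines.
for name in Elizabeth Jane Lydia Darcy
do
  count=$(grep -o "$name" "$sample" | wc -l)
  printf '%s\t%d\n' "$name" "$count"
done
```

This matches substrings, so a name embedded in a longer word would also be counted; adding the -w option to grep restricts matches to whole words if that matters for your answer.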
What’s the difference between the following two commands?
echo 'hello, world' > test1.txt
and
echo 'hello, world' >> test2.txt
Using your favorite text editor (e.g., vi), type the following and save the file as middle.sh:
#!/bin/sh
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
Using chmod, make the file executable by the owner, and run
./middle.sh pride_and_prejudice.txt 20 5
Explain the output. Explain the meaning of "$1", "$2", and "$3" in this shell script. Why do we need the first line of the shell script?
Try these commands in Bash and interpret the results: cal, cal 2021, cal 9 1752 (anything unusual?), date, hostname, arch, uname -a, uptime, who am i, who, w, id, last | head, echo {con,pre}{sent,fer}{s,ed}, time sleep 5, history | tail.