Commit 9e90e3f

edit the read-cases episode
1 parent 4b4aff0 commit 9e90e3f

1 file changed

+44
-39
lines changed

episodes/read-cases.Rmd

Lines changed: 44 additions & 39 deletions
Original file line numberDiff line numberDiff line change
@@ -9,8 +9,8 @@ editor_options:
99
:::::::::::::::::::::::::::::::::::::: questions
1010

1111
- Where do you usually store your outbreak data?
12-
- How many different data formats can you use for analysis?
13-
- Can you import data from databases and health APIs?
12+
- What data formats do you use for analysis?
13+
- Can you import data from databases and Health Information Systems (HIS) through their APIs?
1414
::::::::::::::::::::::::::::::::::::::::::::::::
1515

1616
::::::::::::::::::::::::::::::::::::: objectives
@@ -30,9 +30,9 @@ This episode requires you to be familiar with:
3030

3131
## Introduction
3232

33-
The initial step in outbreak analysis typically involves importing the target dataset into the `R` environment from either a local source (like a file on your computer) or external source (like a database). Outbreak data can be stored in diverse formats, relational database management systems (RDBMS), or health information systems (HIS), such as [REDCap](https://www.project-redcap.org/) and [DHIS2](https://dhis2.org/), which provide application program interfaces (APIs) to the database systems so verified users can easily add and access data entries. The latter option is particularly well-suited for collecting and storing large-scal institutional health data. This episode will elucidate the process of reading cases from these sources.
33+
The initial step in outbreak analysis typically involves importing the target dataset into the `R` environment from either a local source (like a file on your computer) or external source (like a database). Outbreak data can be stored in diverse formats, relational database management systems (RDBMS), or health information systems (HIS), such as [REDCap](https://www.project-redcap.org/) and [DHIS2](https://dhis2.org/), which provide application program interfaces (APIs) to the system's database so verified users can easily add and fetch data entries. The latter option is particularly well-suited for collecting and storing large-scale institutional health data. This episode will elucidate the process of reading cases from these sources.
3434

35-
Let's start by loading the package `{rio}` to read data and the package `{here}` to easily find a file path within your RStudio project. We'll use the pipe `%>%` to easily connect some of their functions, including functions from the data formatting package `{dplyr}`. We'll therefore call the tidyverse package, which includes both the pipe and `{dplyr}`:
35+
Let's start by loading the `{rio}` package to read data and the `{here}` package to easily find a file path within your RStudio project. We'll use the pipe `%>%` operator from the `{magrittr}` package to easily connect some of their functions, including functions from the data formatting package `{dplyr}`. We'll therefore call the `{tidyverse}` package, which includes both `{magrittr}` and `{dplyr}`.
3636

3737
```{r,eval=TRUE,message=FALSE,warning=FALSE}
3838
# Load packages
@@ -43,13 +43,17 @@ library(here) # for easy file referencing
4343

4444
::::::::::::::::::: checklist
4545

46-
### The double-colon
46+
### The double-colon (`::`) operator
4747

48-
The double-colon `::` in R lets you call a specific function from a package without loading the entire package into the current environment.
48+
The `::` in R lets you access functions or objects from a specific package without attaching the entire package to the search path. It offers several important
49+
advantages, including the following:
4950

50-
For example, `dplyr::filter(data, condition)` uses `filter()` from the `{dplyr}` package, without having to use `library(dplyr)` at the start of a script.
51+
* Telling explicitly which package a function comes from, reducing ambiguity and potential conflicts when several packages have functions with the same name.
52+
* Letting you call a function from a package without attaching the whole package
52+
with `library()`.
5154

52-
This help us remember package functions and avoid namespace conflicts (i.e. when two different packages include functions with the same name, so R does not know which to use).
55+
For example, the command `dplyr::filter(data, condition)` means we are calling
56+
the `filter()` function from the `{dplyr}` package.
5357

5458
:::::::::::::::::::
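
As a minimal sketch of the `::` operator, the chunk below calls a function from `{tools}`, a package that ships with R but is not attached by default. The object names `file_extension` and `tools_attached` are made up for this illustration.

```r
# Call file_ext() from {tools} without running library(tools)
file_extension <- tools::file_ext("ebola_cases_2.csv")
file_extension

# Check whether {tools} has been attached to the search path
tools_attached <- "package:tools" %in% search()
tools_attached
```

In a fresh R session, `tools_attached` should be `FALSE`, because `::` loads the package namespace without attaching it.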
5559

@@ -66,15 +70,15 @@ This help us remember package functions and avoid namespace conflicts (i.e. when
6670

6771
## Reading from files
6872

69-
Several packages are available for importing outbreak data stored in individual files into `R`. These include [rio](https://gesistsa.github.io/rio/), [readr](https://readr.tidyverse.org/) from the `tidyverse`, [io](https://bitbucket.org/djhshih/io/src/master/), [ImportExport](https://cran.r-project.org/web/packages/ImportExport/index.html), and [data.table](https://rdatatable.gitlab.io/data.table/). Together, these packages offer methods to read single or multiple files in a wide range of formats.
73+
Several packages are available for importing outbreak data stored in individual files into `R`. These include [{rio}](https://gesistsa.github.io/rio/), [{readr}](https://readr.tidyverse.org/) from the `{tidyverse}`, [{io}](https://bitbucket.org/djhshih/io/src/master/), [{ImportExport}](https://cran.r-project.org/web/packages/ImportExport/index.html), [{data.table}](https://rdatatable.gitlab.io/data.table/), and similar functions from the {base} R package. Together, these packages offer methods to read single or multiple files in a wide range of formats.
7074

71-
The below example shows how to import a `csv` file into `R` environment using `{rio}` package. We use the `{here}` package to tell R to look for the file in the `data/` folder of your project, and `as_tibble()` to convert into a tidier format for subsequent analysis in R.
75+
The example below shows how to import a `csv` file into the `R` environment using the `{rio}` package. We use the `{here}` package to tell R to look for the file in the `data/` folder of your project, and `as_tibble()` to convert it into a tidier format for subsequent analysis in R.
7276

7377
```{r,eval=FALSE,echo=TRUE}
7478
# read data
75-
# e.g., the path to our file is data/raw-data/ebola_cases_2.csv then:
79+
# e.g., if the path to our file is "data/raw-data/ebola_cases_2.csv" then:
7680
ebola_confirmed <- rio::import(
77-
here::here("data", "ebola_cases_2.csv")
81+
here::here("data", "raw-data", "ebola_cases_2.csv")
7882
) %>%
7983
dplyr::as_tibble() # for a simple data frame output
8084
@@ -84,7 +88,6 @@ ebola_confirmed
8488

8589

8690
```{r,eval=TRUE, echo=FALSE, message=FALSE}
87-
# internal for DBI::dbWriteTable()
8891
# read data
8992
ebola_confirmed <- rio::import(
9093
file.path("data", "ebola_cases_2.csv")
@@ -95,18 +98,19 @@ ebola_confirmed <- rio::import(
9598
ebola_confirmed
9699
```
97100

98-
Similarly, you can import files of other formats such as `tsv`, `xlsx`, ... etc.
101+
Similarly, you can import files in other formats, such as `tsv` and `xlsx`,
101+
using the same function.
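
As a rough sketch of this point, the chunk below writes a small made-up data frame to a `tsv` file in a temporary folder and reads it back with the same `rio::import()` call used for `csv` files. It assumes `{rio}` is installed; `toy_cases` and `toy_cases.tsv` are hypothetical names, not part of the episode data.

```r
# A made-up data frame, used only for this illustration
toy_cases <- data.frame(
  date = c("2014-05-18", "2014-05-20"),
  confirm = c(1L, 2L)
)

# Write it as a tab-separated file in a temporary folder
tsv_path <- file.path(tempdir(), "toy_cases.tsv")
rio::export(toy_cases, tsv_path)

# rio::import() picks the reader from the file extension
toy_cases_tsv <- rio::import(tsv_path)
toy_cases_tsv
```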
99103

100104
:::::::::::::::::::: checklist
101105

102106
### Why should we use the {here} package?
103107

104-
The `{here}` package is designed to simplify file referencing in R projects by providing a reliable way to construct file paths relative to the project root. The main reason to use it is **Cross-Environment Compatibility**.
108+
The `{here}` package is designed to simplify file referencing in R projects by providing a reliable way to construct file paths relative to the project root. The main reason to use it is to ensure **Cross-Environment Compatibility**.
105109

106110
It works across different operating systems (Windows, Mac, Linux) without needing to adjust file paths.
107111

108112
- On Windows, paths are written using backslashes ( `\` ) as the separator between folder names: `"data\raw-data\file.csv"`
109-
- On Unix based operating system such as macOS or Linux the forward slash ( `/` ) is used as the path separator: `"data/raw-data/file.csv"`
113+
- On Unix-based operating systems such as macOS or Linux, the forward slash ( `/` ) is used as the separator between folder names: `"data/raw-data/file.csv"`
110114

111115
The `{here}` package is ideal for adding one more layer of reproducibility to your work. If you are interested in reproducibility, we invite you to [read this tutorial to increase the openness, sustainability, and reproducibility of your epidemic analysis with R](https://epiverse-trace.github.io/research-compendium/).
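
A small sketch of the idea, assuming `{here}` is installed: both calls below build the path from folder names, so no separator is typed by hand. The file `file.csv` is hypothetical and does not need to exist for the path to be constructed.

```r
# Base R: join folder names into a relative path
relative_path <- file.path("data", "raw-data", "file.csv")
relative_path

# {here}: the same components, prefixed with the project root
# (the root depends on where your project lives on your machine)
project_path <- here::here("data", "raw-data", "file.csv")
project_path
```

The difference is that `file.path()` returns a path relative to the current working directory, while `here::here()` anchors it to the project root, whichever folder the script is run from.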
112116

@@ -122,14 +126,14 @@ Can you read data from a compressed file in `R`? Download this [zip file](https:
122126
::::::::::::::::: hint
123127

124128
You can check the [full list of supported file formats](https://gesistsa.github.io/rio/#supported-file-formats)
125-
in the `{rio}` package on the package website. To expand {rio} to the full range of support for import and export formats run:
129+
on the `{rio}` package website. To extend `{rio}` to formats it does not yet support on your machine, run:
126130

127131

128132
```{r, eval=FALSE}
129133
rio::install_formats()
130134
```
131135

132-
You can use this template to read the file:
136+
You can use this example to read the file:
133137

134138
`rio::import(here::here("some", "where", "downto", "path", "file_name.zip"))`
135139

@@ -147,29 +151,29 @@ rio::import(here::here("data", "Marburg.zip"))
147151

148152
## Reading from databases
149153

150-
The [DBI](https://dbi.r-dbi.org/) package serves as a versatile interface for interacting with database management systems (DBMS) across different back-ends or servers. It offers a uniform method for accessing and retrieving data from various database systems.
154+
The [{DBI}](https://dbi.r-dbi.org/) package serves as a versatile interface for interacting with relational database management systems (RDBMS) across different back-ends or servers. It offers a uniform method for accessing and retrieving data from various database systems.
151155

152156
::::::::::::: discussion
153157

154-
### When to read directly from a database?
158+
### What are the advantages of reading data directly from a database?
155159

156-
We can use database interface packages to optimize memory usage. If we process the database with "queries" (e.g., select, filter, summarise) before extraction, we can reduce the memory load in our RStudio session. Conversely, conducting all data manipulation outside the database management system by loading the full dataset into R can use up much more computer memory (i.e. RAM) than is feasible on a local machine, which can lead RStudio to slow down or even freeze.
160+
We can use database interface packages to optimize the amount of memory used during our R session. If we query the database with data filtration requests (e.g., select, filter, summarise) before extraction, we can reduce the memory load in our RStudio session. Conversely, conducting all data manipulation outside the database management system by loading the full dataset into R can use up much more computer memory (i.e. RAM) than is feasible on a local machine, which can lead RStudio to slow down or even freeze.
157161

158-
External relational database management systems (RDBMS) also have the advantage that multiple users can access, store and analyse parts of the dataset simultaneously, without having to transfer individual files, which would make it very difficult to track which version is up-to-date.
162+
Relational database management systems (RDBMS) also have the advantage that multiple users can access, store and analyse parts of the dataset simultaneously, without having to transfer individual files, which would make it very difficult to track which version is up-to-date.
159163

160164
:::::::::::::
161165

162-
The following code chunk demonstrates in four steps how to create a temporary SQLite database in memory, store the `ebola_confirmed` as a table on it, and subsequently read it:
166+
The following code chunk demonstrates in four steps how to create an SQLite database in memory, store the `ebola_confirmed` data frame as a table in it, and subsequently read it.
163167

164-
### 1. Connect with a database
168+
### 1. Connect to a database
165169

166-
First, we establish a connection to an SQLite database created on our machine and stored in its local memory with `DBI::dbConnect()`.
170+
We first need to establish a connection to an SQLite database created on our machine and stored in its local memory with `DBI::dbConnect()`.
167171

168172
```{r,warning=FALSE,message=FALSE}
169173
library(DBI)
170174
library(RSQLite)
171175
172-
# Create a temporary SQLite database in memory
176+
# Create a SQLite database in memory
173177
db_connection <- DBI::dbConnect(
174178
drv = RSQLite::SQLite(),
175179
dbname = ":memory:"
@@ -180,7 +184,7 @@ db_connection <- DBI::dbConnect(
180184

181185
A real-life connection to an external SQLite database would look like this:
182186

183-
```r
187+
```{r}
184188
# in real-life
185189
db_connection <- DBI::dbConnect(
186190
RSQLite::SQLite(),
@@ -192,12 +196,12 @@ db_connection <- DBI::dbConnect(
192196

193197
:::::::::::::::::
194198

195-
### 2. Write a local data frame as a table in a database
199+
### 2. Create a table in the database from a data frame
196200

197-
Then, we can write the `ebola_confirmed` into a table named `cases` within the database using the `DBI::dbWriteTable()` function.
201+
After establishing the connection with the database, we can write the `ebola_confirmed` data frame to a table named `cases` within the database using the `DBI::dbWriteTable()` function.
198202

199203
```{r,warning=FALSE,message=FALSE}
200-
# Store the 'ebola_confirmed' dataframe as a table named 'cases'
204+
# Store the 'ebola_confirmed' data frame as a table named 'cases'
201205
# in the SQLite database
202206
DBI::dbWriteTable(
203207
conn = db_connection,
@@ -206,7 +210,7 @@ DBI::dbWriteTable(
206210
)
207211
```
208212

209-
In a database framework, you can have more than one table. Each table can belong to a specific `entity` (e.g., patients, care units, jobs). All tables will be related by a common ID or `primary key`.
213+
In a relational database framework, you can have more than one table. Each table can belong to a specific `entity` (e.g., patients, care units, jobs). All tables will be related by a common ID or `primary key`.
210214

211215
### 3. Read data from a table in a database
212216

@@ -220,17 +224,17 @@ In a database framework, you can have more than one table. Each table can belong
220224
<!-- ) -->
221225
<!-- ``` -->
222226

223-
Subsequently, we reads the data from the `cases` table using `dplyr::tbl()`.
227+
We can read the data from the `cases` table using `dplyr::tbl()`.
224228

225229
```{r}
226230
# Read one table from the database
227231
mytable_db <- dplyr::tbl(src = db_connection, "cases")
228232
```
229233

230-
If we apply `{dplyr}` verbs to this database SQLite table, these verbs will be translated to SQL queries.
234+
If we apply `{dplyr}` verbs to this SQLite database table, these verbs will be translated into SQL queries.
231235

232236
```{r}
233-
# Show the SQL queries translated
237+
# Show the translated SQL queries
234238
mytable_db %>%
235239
dplyr::filter(confirm > 50) %>%
236240
dplyr::arrange(desc(confirm)) %>%
@@ -249,7 +253,7 @@ extracted_data <- mytable_db %>%
249253
dplyr::collect()
250254
```
251255

252-
The `extracted_data` object represents the extracted, ideally after specifying queries that reduces its size.
256+
The `extracted_data` object represents the extracted data, ideally after applying queries that reduce its size.
253257

254258
```{r,warning=FALSE,message=FALSE}
255259
# View the extracted_data
@@ -258,11 +262,11 @@ extracted_data
258262

259263
:::::::::::::::::::::: callout
260264

261-
### Run SQL queries in R using dbplyr
265+
### Run SQL queries in R using {dbplyr}
262266

263-
Practice how to make relational database SQL queries using multiple `{dplyr}` verbs like `dplyr::left_join()` among tables before pulling down data to your local session with `dplyr::collect()`!
267+
Practice how to make relational database SQL queries using multiple `{dplyr}` verbs like `dplyr::left_join()` among tables before pulling out data to your local session with `dplyr::collect()`!
264268

265-
You can also review the `{dbplyr}` R package. But for a step-by-step tutorial about SQL, we recommend you this [tutorial about data management with SQL for Ecologist](https://datacarpentry.org/sql-ecology-lesson/). You will find close to `{dplyr}`!
269+
You can also review the `{dbplyr}` R package. For a step-by-step tutorial about SQL, we recommend this [tutorial about data management with SQL for ecologists](https://datacarpentry.org/sql-ecology-lesson/).
266270

267271
::::::::::::::::::::::
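
One way to practice this is sketched below, assuming `{DBI}`, `{RSQLite}`, `{dplyr}`, and `{dbplyr}` are installed. The tables `patients` and `care_units`, their columns, and the connection name `toy_connection` are made up for illustration and are not part of the episode data.

```r
library(DBI)
library(dplyr)

# Two made-up tables that share the key column `unit_id`
patients <- data.frame(
  patient_id = 1:3,
  unit_id = c(10L, 10L, 20L)
)
care_units <- data.frame(
  unit_id = c(10L, 20L),
  unit_name = c("Unit A", "Unit B")
)

# Store both tables in a separate in-memory SQLite database
toy_connection <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
DBI::dbWriteTable(toy_connection, "patients", patients)
DBI::dbWriteTable(toy_connection, "care_units", care_units)

# Build the join lazily: nothing is pulled into R yet
joined_query <- dplyr::tbl(toy_connection, "patients") %>%
  dplyr::left_join(
    dplyr::tbl(toy_connection, "care_units"),
    by = "unit_id"
  )

# Inspect the SQL that the join is translated into
dplyr::show_query(joined_query)

# Pull the joined result into the local session, then disconnect
joined_data <- dplyr::collect(joined_query)
DBI::dbDisconnect(toy_connection)

joined_data
```

Because the join runs inside the database, only the joined result is transferred to the R session when `dplyr::collect()` is called.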
268272

@@ -278,10 +282,11 @@ DBI::dbDisconnect(conn = db_connection)
278282

279283
## Reading from HIS APIs
280284

281-
Health related data are also increasingly stored in specialized HIS APIs like **Fingertips**, **GoData**, **REDCap**, and **DHIS2**. In such case one can resort to [readepi](https://epiverse-trace.github.io/readepi/) package, which enables reading data from HIS-APIs.
285+
Health-related data are also increasingly stored in specialized HIS APIs like **Fingertips**, **GoData**, **REDCap**, and **DHIS2**. In such cases, one can use the [{readepi}](https://epiverse-trace.github.io/readepi/) package, which enables reading data from HIS APIs.
282286
-[TBC]
283287

284288
::::::::::::::::::::::::::::::::::::: keypoints
285289
- Use `{rio}, {io}, {readr}` and `{ImportExport}` to read data from individual files.
290+
- Use `{DBI}`, `{dplyr}`, and `{dbplyr}` to import data from RDBMS.
286291
- Use `{readepi}` to read data from HIS APIs and RDBMS.
287292
::::::::::::::::::::::::::::::::::::::::::::::::
