Commit d749ffe

committed
update the describe-cases episode
1 parent 9e90e3f commit d749ffe

1 file changed

+45
-63
lines changed

episodes/describe-cases.Rmd

Lines changed: 45 additions & 63 deletions
@@ -8,29 +8,26 @@ exercises: 10
88

99
- How to aggregate and summarise case data?
1010
- How to visualize aggregated data?
11-
- What is distribution of cases in time, place, gender, age?
11+
- What is the distribution of cases across time, space, gender, and age?
1212

1313
::::::::::::::::::::::::::::::::::::::::::::::::
1414

1515
::::::::::::::::::::::::::::::::::::: objectives
1616

1717
- Simulate synthetic outbreak data
18-
- Convert indivdual linelist data to incidence over time
18+
- Convert linelist data into incidence over time
1919
- Create epidemic curves from incidence data
2020
::::::::::::::::::::::::::::::::::::::::::::::::
2121

2222
## Introduction
2323

24-
In an analytic pipeline, exploratory data analysis (EDA) is an important step before formal modelling. EDA helps
25-
determine relationships between variables and summarize their main characteristics, often by means of data visualization.
24+
In an analytic pipeline, exploratory data analysis (EDA) is an important step before formal modelling. EDA helps determine relationships between variables and summarize their main characteristics, often by means of data visualization.
2625

2726
This episode focuses on EDA of outbreak data using R packages.
2827
A key aspect of EDA in epidemic analysis is 'person, place and time'. It is useful to identify how observed events - such as confirmed cases, hospitalizations, deaths, and recoveries - change over time, and how these vary across different locations and demographic factors, including gender, age, and more.
2928

30-
Let's start by loading the package `{incidence2}` to aggregate linelist data according to specific characteristics, and visualize the resulting epidemic curves (epicurves) that plot the number of new events (i.e. incidence) over time.
31-
We'll use `{simulist}` to simulate some outbreak data to analyse, and `{tracetheme}` for figure formatting.
32-
We'll use the pipe `%>%` to connect some of their functions, including others from the packages `{dplyr}` and
33-
`{ggplot2}`, so let's also call to the tidyverse package:
29+
Let's start by loading the `{incidence2}` package to aggregate the linelist data according to specific characteristics, and visualize the resulting epidemic curves (epicurves) that plot the number of new events (i.e. case incidence over time).
30+
We'll use the `{simulist}` package to simulate the outbreak data to analyse, and `{tracetheme}` for figure formatting. We'll use the pipe operator (`%>%`) to connect some of their functions, including others from the `{dplyr}` and `{ggplot2}` packages, so let's also load the `{tidyverse}` package.
3431

3532
```{r,eval=TRUE,message=FALSE,warning=FALSE}
3633
# Load packages
@@ -40,23 +37,10 @@ library(tracetheme) # For formatting figures
4037
library(tidyverse) # For {dplyr} and {ggplot2} functions and the pipe %>%
4138
```
4239

43-
::::::::::::::::::: checklist
44-
45-
### The double-colon
46-
47-
The double-colon `::` in R lets you call a specific function from a package without loading the entire package into the current environment.
48-
49-
For example, `dplyr::filter(data, condition)` uses `filter()` from the `{dplyr}` package.
50-
This help us remember package functions and avoid namespace conflicts.
51-
52-
:::::::::::::::::::
53-
5440

5541
## Synthetic outbreak data
5642

57-
To illustrate the process of conducting EDA on outbreak data, we will generate a line list
58-
for a hypothetical disease outbreak utilizing the `{simulist}` package. `{simulist}` generates simulation data for outbreak according to a given configuration.
59-
Its minimal configuration can generate a linelist, as shown in the below code chunk:
43+
To illustrate the process of conducting EDA on outbreak data, we will generate a linelist for a hypothetical disease outbreak using the `{simulist}` package. `{simulist}` generates simulated outbreak data according to a given configuration. Its minimal configuration can generate a linelist, as shown in the code chunk below:
6044

6145
```{r}
6246
# Simulate linelist data for an outbreak with size between 1000 and 1500
@@ -68,28 +52,23 @@ sim_data <- simulist::sim_linelist(outbreak_size = c(1000, 1500)) %>%
6852
sim_data
6953
```
7054

71-
This linelist dataset has entries on individual-level simulated events during the outbreak.
55+
This linelist dataset has simulated entries on individual-level events during an outbreak.
7256

7357
::::::::::::::::::: spoiler
7458

7559
## Additional Resources on Outbreak Data
7660

77-
The above is the default configuration of `{simulist}`, so includes a number of assumptions about the transmissibility and severity of the pathogen. If you want to know more about `sim_linelist()` and other functionalities
78-
check the [documentation website](https://epiverse-trace.github.io/simulist/).
61+
The above is the default configuration of `{simulist}`. It includes a number of assumptions about the transmissibility and severity of the pathogen. If you want to know more about the `simulist::sim_linelist()` function and other functionality, check the [documentation website](https://epiverse-trace.github.io/simulist/).
7962

80-
You can also find data sets from real emergencies from the past at the [`{outbreaks}` R package](https://www.reconverse.org/outbreaks/).
63+
You can also find data sets from past real outbreaks within the [`{outbreaks}`](https://www.reconverse.org/outbreaks/) R package.
8164

8265
:::::::::::::::::::
8366

8467

8568

86-
## Aggregating
69+
## Aggregating the data
8770

88-
Often we want to analyse and visualise the number of events that occur on a particular day or week, rather than focusing on individual cases. This requires grouping linelist
89-
data into incidence data. The [incidence2]((https://www.reconverse.org/incidence2/articles/incidence2.html){.external target="_blank"})
90-
package offers a useful function called `incidence2::incidence()` for grouping case data, usually based around dated events
91-
and/or other characteristics. The code chunk provided below demonstrates the creation of an `<incidence2>` class object from the
92-
simulated Ebola `linelist` data based on the date of onset.
71+
Often we want to analyse and visualise the number of events that occur on a particular day or week, rather than focusing on individual cases. This requires grouping the linelist data into incidence data. The [{incidence2}](https://www.reconverse.org/incidence2/articles/incidence2.html){.external target="_blank"} package offers a useful function called `incidence2::incidence()` for grouping case data, usually based around dated events and/or other characteristics. The code chunk provided below demonstrates the creation of an `<incidence2>` class object from the simulated Ebola `linelist` data based on the date of onset.
9372

9473
```{r}
9574
# Create an incidence object by aggregating case data based on the date of onset
@@ -102,8 +81,8 @@ daily_incidence <- incidence2::incidence(
10281
# View the incidence data
10382
daily_incidence
10483
```
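
To make the idea of aggregation concrete, here is a minimal base R sketch that counts cases per onset date. The `toy_linelist` data frame and its values are hypothetical and used only for illustration; this is not how `{incidence2}` is implemented, only the counting step it automates.

```r
# Hypothetical toy linelist: one row per case (not the simulated data above)
toy_linelist <- data.frame(
  date_onset = as.Date(c(
    "2023-01-01", "2023-01-01", "2023-01-02",
    "2023-01-04", "2023-01-04", "2023-01-04"
  )),
  sex = c("f", "m", "f", "m", "m", "f")
)

# Count the number of rows (cases) per onset date: this is the daily incidence
toy_daily <- as.data.frame(table(date_onset = toy_linelist$date_onset))
names(toy_daily) <- c("date_onset", "count")

# table() turns the dates into factor labels, so convert them back to Date
toy_daily$date_onset <- as.Date(as.character(toy_daily$date_onset))

toy_daily
```

Note that dates with no cases (here, 2023-01-03) do not appear in the result at all, which is one reason a dedicated incidence class is convenient.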
105-
With the `{incidence2}` package, you can specify the desired interval (e.g. day, week) and categorize cases by one or
106-
more factors. Below is a code snippet demonstrating weekly cases grouped by the date of onset, sex, and type of case.
84+
85+
With the `{incidence2}` package, you can specify the desired interval (e.g. day, week) and categorize cases by one or more factors. Below is a code snippet demonstrating weekly cases grouped by the date of onset, sex, and type of case.
10786

10887
```{r}
10988
# Group incidence data by week, accounting for sex and case type
@@ -119,15 +98,15 @@ weekly_incidence
11998
```
12099

121100
::::::::::::::::::::::::::::::::::::: callout
122-
## Dates Completion
123-
When cases are grouped by different factors, it's possible that the events involving these groups may have different date ranges in the
124-
resulting `incidence2` object. The `incidence2` package provides a function called `complete_dates()` to ensure that an
125-
incidence object has the same range of dates for each group. By default, missing counts for a particular group will be filled with 0 for that date.
101+
102+
## Dates Completion
103+
104+
When cases are grouped by different factors, it's possible that the events involving these groups may have different date ranges in the resulting `incidence2` object. The `{incidence2}` package provides a function called `incidence2::complete_dates()` to ensure that an incidence object has the same range of dates for each group. By default, missing counts for a particular group will be filled with 0 for that date.
126105

127-
This functionality is also available as an argument within `incidence2::incidence()` adding `complete_dates = TRUE`.
106+
This functionality is also available within the `incidence2::incidence()` function by setting the `complete_dates` argument to `TRUE`.
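
As a rough illustration of what date completion means, the base R sketch below fills the gaps in a small grouped count table with zeros. The `toy_counts` table and all column names are hypothetical, and the approach (a full date-by-group grid joined to the observed counts) is an assumption chosen for clarity rather than a description of the `{incidence2}` internals.

```r
# Hypothetical grouped daily counts with gaps (not output from {incidence2})
toy_counts <- data.frame(
  date_index = as.Date(c("2023-01-01", "2023-01-03", "2023-01-02")),
  sex = c("f", "f", "m"),
  count = c(2L, 1L, 4L)
)

# Build every date-by-group combination over the full observed date range
all_dates <- seq(min(toy_counts$date_index), max(toy_counts$date_index), by = "day")
full_grid <- expand.grid(
  date_index = all_dates,
  sex = unique(toy_counts$sex),
  stringsAsFactors = FALSE
)

# Left-join the observed counts onto the grid, then fill the gaps with 0
toy_complete <- merge(full_grid, toy_counts, all.x = TRUE)
toy_complete$count[is.na(toy_complete$count)] <- 0L

toy_complete
```

After this step every group covers the same three dates, so the groups can be compared or plotted on a common time axis.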
128107

129108
```{r}
130-
# Create an incidence object grouped by sex, aggregating daily
109+
# Create a daily incidence object grouped by sex
131110
daily_incidence_2 <- incidence2::incidence(
132111
sim_data,
133112
date_index = "date_onset",
@@ -154,16 +133,15 @@ daily_incidence_2_complete <- incidence2::complete_dates(
154133
::::::::::::::::::::::::::::::::::::: challenge
155134

156135
## Challenge 1: Can you do it?
157-
- **Task**: Aggregate `sim_data` linelist based on admission date and case outcome in __biweekly__
158-
intervals, and save the results in an object called `biweekly_incidence`.
136+
137+
- **Task**: Calculate the __biweekly__ incidence of cases from the `sim_data` linelist based on their admission date and outcome. Save the result in an object called `biweekly_incidence`.
159138

160139
::::::::::::::::::::::::::::::::::::::::::::::::
161140

162141
## Visualization
163142

164-
The `incidence2` object can be visualized using the `plot()` function from the base R package.
165-
The resulting graph is referred to as an epidemic curve, or epi-curve for short. The following code
166-
snippets generate epi-curves for the `daily_incidence` and `weekly_incidence` incidence objects mentioned above.
143+
`incidence2` objects can be visualized using the `plot()` function from base R.
144+
The resulting graph is referred to as an epidemic curve, or epi-curve for short. The following code snippets generate epi-curves for the `daily_incidence` and `weekly_incidence` incidence objects mentioned above.
167145

168146
```{r}
169147
# Plot daily incidence data
@@ -172,7 +150,8 @@ base::plot(daily_incidence) +
172150
x = "Time (in days)", # x-axis label
173151
y = "Daily cases" # y-axis label
174152
) +
175-
tracetheme::theme_trace() # Apply the custom trace theme
153+
theme_bw()
154+
# tracetheme::theme_trace() # Apply the custom trace theme
176155
```
177156

178157

@@ -183,33 +162,35 @@ base::plot(weekly_incidence) +
183162
x = "Time (in weeks)", # x-axis label
184163
y = "weekly cases" # y-axis label
185164
) +
186-
tracetheme::theme_trace() # Apply the custom trace theme
165+
theme_bw()
166+
# tracetheme::theme_trace() # Apply the custom trace theme
187167
```
188168

189169
:::::::::::::::::::::::: callout
190170

191171
#### Easy aesthetics
192172

193-
We invite you to skim the `{incidence2}` package ["Get started" vignette](https://www.reconverse.org/incidence2/articles/incidence2.html). Find how you can use arguments within `plot()` to provide aesthetics to your incidence2 class objects.
173+
We invite you to take a look at the `{incidence2}` [package vignette](https://www.reconverse.org/incidence2/articles/incidence2.html). Find out how you can use the arguments within the `plot()` function to provide aesthetics to your incidence2 class objects.
194174

195175
```{r}
196176
base::plot(weekly_incidence, fill = "sex")
197177
```
198178

199-
Some of them include `show_cases = TRUE`, `angle = 45`, and `n_breaks = 5`. Feel free to give them a try.
179+
Some of them include `show_cases = TRUE`, `angle = 45`, and `n_breaks = 5`. Try them and see how they affect the resulting plot.
200180

201181
::::::::::::::::::::::::
202182

203183
::::::::::::::::::::::::::::::::::::: challenge
204184

205185
## Challenge 2: Can you do it?
206-
- **Task**: Visualize `biweekly_incidence` object.
186+
187+
- **Task**: Visualize the `biweekly_incidence` object.
207188

208189
::::::::::::::::::::::::::::::::::::::::::::::::
209190

210191
## Curve of cumulative cases
211192

212-
The cumulative number of cases can be calculated using the `cumulate()` function from an `incidence2` object and visualized, as in the example below.
193+
The cumulative number of cases can be calculated using the `incidence2::cumulate()` function on an `incidence2` object and then visualized, as in the example below.
213194

214195
```{r}
215196
# Calculate cumulative incidence
@@ -221,7 +202,8 @@ base::plot(cum_df) +
221202
x = "Time (in days)", # x-axis label
222203
y = "weekly cases" # y-axis label
223204
) +
224-
tracetheme::theme_trace() # Apply the custom trace theme
205+
theme_bw()
206+
# tracetheme::theme_trace() # Apply the custom trace theme
225207
```
226208

227209
Note that this function preserves grouping, i.e., if the `incidence2` object contains groups, it will accumulate the cases accordingly.
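
The idea of accumulating within groups can be sketched in base R with `cumsum()` applied per group. The `toy_incidence` table below is hypothetical, and this sketch only mirrors the concept; it is not the `{incidence2}` implementation.

```r
# Hypothetical grouped daily counts (not output from {incidence2})
toy_incidence <- data.frame(
  date_index = as.Date(c(
    "2023-01-01", "2023-01-02", "2023-01-03",
    "2023-01-01", "2023-01-02", "2023-01-03"
  )),
  sex = c("f", "f", "f", "m", "m", "m"),
  count = c(1, 3, 2, 0, 4, 1)
)

# Sort by group and date so the running total follows time order
toy_incidence <- toy_incidence[order(toy_incidence$sex, toy_incidence$date_index), ]

# Running total of cases, computed separately within each group
toy_incidence$cumulative <- ave(toy_incidence$count, toy_incidence$sex, FUN = cumsum)

toy_incidence
```

Each group's final cumulative value equals that group's total case count, which is a quick sanity check on the result.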
@@ -230,14 +212,13 @@ Note that this function preserves grouping, i.e., if the `incidence2` object con
230212
::::::::::::::::::::::::::::::::::::: challenge
231213

232214
## Challenge 3: Can you do it?
233-
- **Task**: Visulaize the cumulatie cases from `biweekly_incidence` object.
215+
- **Task**: Visualize the cumulative cases from the `biweekly_incidence` object.
234216

235217
::::::::::::::::::::::::::::::::::::::::::::::::
236218

237-
## Peak estimation
219+
## Peak time estimation
238220

239-
You can estimate the peak -- the time with the highest number of recorded cases-- using the `estimate_peak()` function from the {incidence2} package.
240-
This function employs a bootstrapping method to determine the peak time (i.e. by resampling dates with replacement, resulting in a distribution of estimated peak times).
221+
You can estimate the peak -- the time with the highest number of recorded cases -- using the `incidence2::estimate_peak()` function. This function uses a bootstrapping method to determine the peak time (i.e. by resampling dates with replacement, resulting in a distribution of estimated peak times).
241222

242223
```{r}
243224
# Estimate the peak of the daily incidence data
@@ -252,21 +233,21 @@ peak <- incidence2::estimate_peak(
252233
# Display the estimated peak
253234
print(peak)
254235
```
255-
This example demonstrates how to estimate the peak time using the `estimate_peak()` function at $95%$
256-
confidence interval and using 100 bootstrap samples.
236+
237+
This example demonstrates how to estimate the peak time using the `incidence2::estimate_peak()` function with a $95\%$ confidence interval and 100 bootstrap samples.
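
To give an intuition for the bootstrap idea described above, here is a simplified base R sketch. The `onset` vector and the `peak_date()` helper are hypothetical, and the percentile interval used here is an assumption made for illustration; the actual `{incidence2}` implementation may differ in its details.

```r
set.seed(1) # Make the resampling reproducible

# Hypothetical onset dates, one entry per case
onset <- as.Date("2023-01-01") + c(0, 1, 1, 2, 2, 2, 2, 3, 3, 4)

# Hypothetical helper: the date with the most cases (earliest date wins ties)
peak_date <- function(dates) {
  counts <- table(dates)
  as.Date(names(counts)[which.max(counts)])
}

# Peak in the observed data
observed_peak <- peak_date(onset)

# Resample the onset dates with replacement and recompute the peak each time
boot_peaks <- replicate(
  100,
  peak_date(sample(onset, replace = TRUE)),
  simplify = FALSE
)
boot_peaks <- do.call(c, boot_peaks)

# 95% percentile interval of the bootstrapped peak dates
ci <- as.Date(
  quantile(as.numeric(boot_peaks), probs = c(0.025, 0.975), type = 1),
  origin = "1970-01-01"
)

observed_peak
ci
```

A wide interval suggests the peak time is poorly determined by the data, for example when the epidemic curve is flat or has several similar maxima.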
257238

258239
::::::::::::::::::::::::::::::::::::: challenge
259240

260241
## Challenge 4: Can you do it?
261-
- **Task**: Estimate the peak time from `biweekly_incidence` object.
242+
- **Task**: Estimate the peak time from the `biweekly_incidence` object.
262243

263244
::::::::::::::::::::::::::::::::::::::::::::::::
264245

265246

266247
## Visualization with ggplot2
267248

268249

269-
`{incidence2}` produces basic plots for epicurves, but additional work is required to create well-annotated graphs. However, using the `{ggplot2}` package, you can generate more sophisticated and epicurves with more flexibility in annotation.
250+
`{incidence2}` produces basic plots for epicurves, but additional work is required to create well-annotated graphs. However, using the `{ggplot2}` package, you can generate more sophisticated epicurves, with more flexibility in annotation.
270251
`{ggplot2}` is a comprehensive package with many functionalities. However, we will focus on three key elements for producing epicurves: histogram plots, scaling date axes and their labels, and general plot theme annotation.
271252
The example below demonstrates how to configure these three elements for a simple `{incidence2}` object.
272253

@@ -316,7 +297,7 @@ ggplot2::ggplot(data = daily_incidence) +
316297
Use the `group` option in the mapping function to visualize an epicurve with different groups. If there is more than one grouping factor, use the `facet_wrap()` option, as demonstrated in the example below:
317298

318299
```{r}
319-
# Plot daily incidence by sex with facets
300+
# Plot daily incidence faceted by sex
320301
ggplot2::ggplot(data = daily_incidence_2) +
321302
geom_histogram(
322303
mapping = aes(
@@ -357,14 +338,15 @@ ggplot2::ggplot(data = daily_incidence_2) +
357338
::::::::::::::::::::::::::::::::::::: challenge
358339

359340
## Challenge 5: Can you do it?
360-
- **Task**: Produce an annotated figure for biweekly_incidence using `{ggplot2}` package.
341+
342+
- **Task**: Produce an annotated figure for the `biweekly_incidence` object using the `{ggplot2}` package.
361343

362344
::::::::::::::::::::::::::::::::::::::::::::::::
363345

364346
::::::::::::::::::::::::::::::::::::: keypoints
365347

366348
- Use the `{simulist}` package to generate synthetic outbreak data
367-
- Use `{incidence2}` package to aggregate case data based on a date event, and produce epidemic curves.
349+
- Use the `{incidence2}` package to aggregate case data based on a date event and other variables, and to produce epidemic curves.
368350
- Use the `{ggplot2}` package to produce better annotated epicurves.
369351

370352
::::::::::::::::::::::::::::::::::::::::::::::::
