episodes/describe-cases.Rmd
Lines changed: 45 additions & 63 deletions
@@ -8,29 +8,26 @@ exercises: 10
8
8
9
9
- How to aggregate and summarise case data?
10
10
- How to visualize aggregated data?
11
-
- What is distribution of cases in time, place, gender, age?
11
+
- What is the distribution of cases across time, space, gender, and age?
12
12
13
13
::::::::::::::::::::::::::::::::::::::::::::::::
14
14
15
15
::::::::::::::::::::::::::::::::::::: objectives
16
16
17
17
- Simulate synthetic outbreak data
18
-
- Convert indivdual linelist data to incidence over time
18
+
- Convert linelist data into incidence over time
19
19
- Create epidemic curves from incidence data
20
20
::::::::::::::::::::::::::::::::::::::::::::::::
21
21
22
22
## Introduction
23
23
24
-
In an analytic pipeline, exploratory data analysis (EDA) is an important step before formal modelling. EDA helps
25
-
determine relationships between variables and summarize their main characteristics, often by means of data visualization.
24
+
In an analytic pipeline, exploratory data analysis (EDA) is an important step before formal modelling. EDA helps determine relationships between variables and summarize their main characteristics, often by means of data visualization.
26
25
27
26
This episode focuses on EDA of outbreak data using R packages.
28
27
A key aspect of EDA in epidemic analysis is 'person, place and time'. It is useful to identify how observed events - such as confirmed cases, hospitalizations, deaths, and recoveries - change over time, and how these vary across different locations and demographic factors, including gender, age, and more.
29
28
30
-
Let's start by loading the package `{incidence2}` to aggregate linelist data according to specific characteristics, and visualize the resulting epidemic curves (epicurves) that plot the number of new events (i.e. incidence) over time.
31
-
We'll use `{simulist}` to simulate some outbreak data to analyse, and `{tracetheme}` for figure formatting.
32
-
We'll use the pipe `%>%` to connect some of their functions, including others from the packages `{dplyr}` and
33
-
`{ggplot2}`, so let's also call to the tidyverse package:
29
+
Let's start by loading the `{incidence2}` package to aggregate the linelist data according to specific characteristics, and visualize the resulting epidemic curves (epicurves) that plot the number of new events (i.e. case incidence over time).
30
+
We'll use the `{simulist}` package to simulate the outbreak data to analyse, and `{tracetheme}` for figure formatting. We'll use the pipe operator (`%>%`) to connect some of their functions, including others from the `{dplyr}` and `{ggplot2}` packages, so let's also load the `{tidyverse}` package.
34
31
35
32
```{r,eval=TRUE,message=FALSE,warning=FALSE}
36
33
# Load packages
@@ -40,23 +37,10 @@ library(tracetheme) # For formatting figures
40
37
library(tidyverse) # For {dplyr} and {ggplot2} functions and the pipe %>%
41
38
```
42
39
43
-
::::::::::::::::::: checklist
44
-
45
-
### The double-colon
46
-
47
-
The double-colon `::` in R lets you call a specific function from a package without loading the entire package into the current environment.
48
-
49
-
For example, `dplyr::filter(data, condition)` uses `filter()` from the `{dplyr}` package.
50
-
This help us remember package functions and avoid namespace conflicts.
51
-
52
-
:::::::::::::::::::
53
-
54
40
55
41
## Synthetic outbreak data
56
42
57
-
To illustrate the process of conducting EDA on outbreak data, we will generate a line list
58
-
for a hypothetical disease outbreak utilizing the `{simulist}` package. `{simulist}` generates simulation data for outbreak according to a given configuration.
59
-
Its minimal configuration can generate a linelist, as shown in the below code chunk:
43
+
To illustrate the process of conducting EDA on outbreak data, we will generate a linelist for a hypothetical disease outbreak using the `{simulist}` package. `{simulist}` generates simulated data for an outbreak according to a given configuration. Its minimal configuration can generate a linelist, as shown in the code chunk below:
60
44
61
45
```{r}
62
46
# Simulate linelist data for an outbreak with size between 1000 and 1500
This linelist dataset has entries on individual-level simulated events during the outbreak.
55
+
This linelist dataset has simulated entries on individual-level events during an outbreak.
72
56
73
57
::::::::::::::::::: spoiler
74
58
75
59
## Additional Resources on Outbreak Data
76
60
77
-
The above is the default configuration of `{simulist}`, so includes a number of assumptions about the transmissibility and severity of the pathogen. If you want to know more about `sim_linelist()` and other functionalities
78
-
check the [documentation website](https://epiverse-trace.github.io/simulist/).
61
+
The above is the default configuration of `{simulist}`. It includes a number of assumptions about the transmissibility and severity of the pathogen. If you want to know more about the `simulist::sim_linelist()` function and other functionalities, check the [documentation website](https://epiverse-trace.github.io/simulist/).
79
62
80
-
You can also find data sets from real emergencies from the past at the [`{outbreaks}` R package](https://www.reconverse.org/outbreaks/).
63
+
You can also find data sets from past real outbreaks within the [`{outbreaks}`](https://www.reconverse.org/outbreaks/) R package.
81
64
82
65
:::::::::::::::::::
83
66
84
67
85
68
86
-
## Aggregating
69
+
## Aggregating the data
87
70
88
-
Often we want to analyse and visualise the number of events that occur on a particular day or week, rather than focusing on individual cases. This requires grouping linelist
89
-
data into incidence data. The [incidence2]((https://www.reconverse.org/incidence2/articles/incidence2.html){.external target="_blank"})
90
-
package offers a useful function called `incidence2::incidence()` for grouping case data, usually based around dated events
91
-
and/or other characteristics. The code chunk provided below demonstrates the creation of an `<incidence2>` class object from the
92
-
simulated Ebola `linelist` data based on the date of onset.
71
+
Often we want to analyse and visualise the number of events that occur on a particular day or week, rather than focusing on individual cases. This requires grouping the linelist data into incidence data. The [{incidence2}](https://www.reconverse.org/incidence2/articles/incidence2.html){.external target="_blank"} package offers a useful function called `incidence2::incidence()` for grouping case data, usually based around dated events and/or other characteristics. The code chunk provided below demonstrates the creation of an `<incidence2>` class object from the simulated Ebola `linelist` data based on the date of onset.
93
72
94
73
```{r}
95
74
# Create an incidence object by aggregating case data based on the date of onset
With the `{incidence2}` package, you can specify the desired interval (e.g. day, week) and categorize cases by one or
106
-
more factors. Below is a code snippet demonstrating weekly cases grouped by the date of onset, sex, and type of case.
84
+
85
+
With the `{incidence2}` package, you can specify the desired interval (e.g. day, week) and categorize cases by one or more factors. Below is a code snippet demonstrating weekly cases grouped by the date of onset, sex, and type of case.
107
86
108
87
```{r}
109
88
# Group incidence data by week, accounting for sex and case type
@@ -119,15 +98,15 @@ weekly_incidence
119
98
```
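To make the idea of weekly aggregation by group concrete, here is a minimal base R sketch of the same kind of operation. It is not how `{incidence2}` is implemented; the `toy_linelist` data frame and its column names are hypothetical and exist only for illustration.

```r
# Hypothetical toy linelist: one row per case (not the simulated data from this episode)
toy_linelist <- data.frame(
  date_onset = as.Date(c(
    "2023-01-02", "2023-01-03", "2023-01-05",
    "2023-01-09", "2023-01-10", "2023-01-17"
  )),
  sex = c("f", "m", "f", "f", "m", "m")
)

# Label each case with the Monday that starts its week
# (format "%u" gives the ISO weekday number, where Monday = 1)
weekday <- as.integer(format(toy_linelist$date_onset, "%u"))
toy_linelist$week_start <- toy_linelist$date_onset - (weekday - 1)

# Count cases per week and sex
toy_weekly <- aggregate(
  count ~ week_start + sex,
  data = transform(toy_linelist, count = 1L),
  FUN = sum
)
toy_weekly
```

Each row of `toy_weekly` is one week-by-sex combination with its case count, which is the same long format that an incidence object stores.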
120
99
121
100
::::::::::::::::::::::::::::::::::::: callout
122
-
## Dates Completion
123
-
When cases are grouped by different factors, it's possible that the events involving these groups may have different date ranges in the
124
-
resulting `incidence2` object. The `incidence2` package provides a function called `complete_dates()` to ensure that an
125
-
incidence object has the same range of dates for each group. By default, missing counts for a particular group will be filled with 0 for that date.
101
+
102
+
## Dates Completion
103
+
104
+
When cases are grouped by different factors, it's possible that the events involving these groups may have different date ranges in the resulting `incidence2` object. The `{incidence2}` package provides a function called `incidence2::complete_dates()` to ensure that an incidence object has the same range of dates for each group. By default, missing counts for a particular group will be filled with 0 for that date.
126
105
127
-
This functionality is also available as an argument within `incidence2::incidence()`adding `complete_dates = TRUE`.
106
+
This functionality is also available within the `incidence2::incidence()` function by setting the `complete_dates` argument to `TRUE`.
128
107
129
108
```{r}
130
-
# Create an incidence object grouped by sex, aggregating daily
-**Task**: Aggregate `sim_data` linelist based on admission date and case outcome in __biweekly__
158
-
intervals, and save the results in an object called `biweekly_incidence`.
136
+
137
+
- **Task**: Calculate the __biweekly__ incidence of cases from the `sim_data` linelist based on their admission date and outcome. Save the result in an object called `biweekly_incidence`.
159
138
160
139
::::::::::::::::::::::::::::::::::::::::::::::::
161
140
162
141
## Visualization
163
142
164
-
The `incidence2` object can be visualized using the `plot()` function from the base R package.
165
-
The resulting graph is referred to as an epidemic curve, or epi-curve for short. The following code
166
-
snippets generate epi-curves for the `daily_incidence` and `weekly_incidence` incidence objects mentioned above.
143
+
The `incidence2` objects can be visualized using the `plot()` function from the base R package.
144
+
The resulting graph is referred to as an epidemic curve, or epi-curve for short. The following code snippets generate epi-curves for the `daily_incidence` and `weekly_incidence` incidence objects mentioned above.
167
145
168
146
```{r}
169
147
# Plot daily incidence data
@@ -172,7 +150,8 @@ base::plot(daily_incidence) +
172
150
x = "Time (in days)", # x-axis label
173
151
y = "Daily cases" # y-axis label
174
152
) +
175
-
tracetheme::theme_trace() # Apply the custom trace theme
153
+
theme_bw()
154
+
# tracetheme::theme_trace() # Apply the custom trace theme
tracetheme::theme_trace() # Apply the custom trace theme
165
+
theme_bw()
166
+
# tracetheme::theme_trace() # Apply the custom trace theme
187
167
```
188
168
189
169
:::::::::::::::::::::::: callout
190
170
191
171
#### Easy aesthetics
192
172
193
-
We invite you to skim the `{incidence2}` package ["Get started" vignette](https://www.reconverse.org/incidence2/articles/incidence2.html). Find how you can use arguments within `plot()` to provide aesthetics to your incidence2 class objects.
173
+
We invite you to take a look at the `{incidence2}` [package vignette](https://www.reconverse.org/incidence2/articles/incidence2.html). Find out how you can use the arguments within the `plot()` function to provide aesthetics to your incidence2 class objects.
194
174
195
175
```{r}
196
176
base::plot(weekly_incidence, fill = "sex")
197
177
```
198
178
199
-
Some of them include `show_cases = TRUE`, `angle = 45`, and `n_breaks = 5`. Feel free to give them a try.
179
+
Some of them include `show_cases = TRUE`, `angle = 45`, and `n_breaks = 5`. Try them and see how they affect the resulting plot.
200
180
201
181
::::::::::::::::::::::::
202
182
203
183
::::::::::::::::::::::::::::::::::::: challenge
204
184
205
185
## Challenge 2: Can you do it?
206
-
-**Task**: Visualize `biweekly_incidence` object.
186
+
187
+
- **Task**: Visualize the `biweekly_incidence` object.
207
188
208
189
::::::::::::::::::::::::::::::::::::::::::::::::
209
190
210
191
## Curve of cumulative cases
211
192
212
-
The cumulative number of cases can be calculated using the `cumulate()` function from an `incidence2` object and visualized, as in the example below.
193
+
The cumulative number of cases can be calculated using the `incidence2::cumulate()` function on an `incidence2` object and then visualized, as in the example below.
213
194
214
195
```{r}
215
196
# Calculate cumulative incidence
@@ -221,7 +202,8 @@ base::plot(cum_df) +
221
202
x = "Time (in days)", # x-axis label
222
203
y = "weekly cases" # y-axis label
223
204
) +
224
-
tracetheme::theme_trace() # Apply the custom trace theme
205
+
theme_bw()
206
+
# tracetheme::theme_trace() # Apply the custom trace theme
225
207
```
226
208
227
209
Note that this function preserves grouping, i.e., if the `incidence2` object contains groups, it will accumulate the cases accordingly.
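As a rough illustration of what accumulating within groups means, the base R sketch below computes a running total separately for each group. It is a conceptual stand-in rather than the `{incidence2}` implementation, and the `toy_counts` data frame is hypothetical.

```r
# Hypothetical weekly counts for two groups (illustrative values only)
toy_counts <- data.frame(
  week  = rep(1:3, times = 2),
  sex   = rep(c("f", "m"), each = 3),
  count = c(2, 1, 4, 1, 3, 0)
)

# Accumulate counts separately within each group;
# ave() applies cumsum() per level of `sex` and keeps the row order
toy_counts$cumulative <- ave(toy_counts$count, toy_counts$sex, FUN = cumsum)
toy_counts
```

The running total restarts for each level of `sex`, so the cumulative curve of one group never includes cases from the other.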
@@ -230,14 +212,13 @@ Note that this function preserves grouping, i.e., if the `incidence2` object con
230
212
::::::::::::::::::::::::::::::::::::: challenge
231
213
232
214
## Challenge 3: Can you do it?
233
-
-**Task**: Visulaize the cumulatie cases from `biweekly_incidence` object.
215
+
- **Task**: Visualize the cumulative cases from the `biweekly_incidence` object.
234
216
235
217
::::::::::::::::::::::::::::::::::::::::::::::::
236
218
237
-
## Peak estimation
219
+
## Peak time estimation
238
220
239
-
You can estimate the peak -- the time with the highest number of recorded cases-- using the `estimate_peak()` function from the {incidence2} package.
240
-
This function employs a bootstrapping method to determine the peak time (i.e. by resampling dates with replacement, resulting in a distribution of estimated peak times).
221
+
You can estimate the peak -- the time with the highest number of recorded cases -- using the `incidence2::estimate_peak()` function. This function uses a bootstrapping method to determine the peak time (i.e. by resampling dates with replacement, resulting in a distribution of estimated peak times).
This example demonstrates how to estimate the peak time using the `estimate_peak()` function at $95%$
256
-
confidence interval and using 100 bootstrap samples.
236
+
237
+
This example demonstrates how to estimate the peak time using the `incidence2::estimate_peak()` function with a $95\%$ confidence interval and 100 bootstrap samples.
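To show the idea behind resampling dates with replacement, here is a minimal base R sketch of a bootstrap peak estimate. It is only a conceptual illustration under simple assumptions (a percentile interval, ties resolved to the earliest date) and should not be read as the internals of `{incidence2}`. The `toy_onsets` dates and the `peak_date()` helper are hypothetical.

```r
set.seed(1) # Make the resampling reproducible

# Hypothetical onset dates, one per case (illustrative values only)
toy_onsets <- as.Date("2023-01-01") +
  c(0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6)

# Hypothetical helper: return the date with the most cases
# (the earliest such date if there is a tie)
peak_date <- function(dates) {
  counts <- table(dates)
  as.Date(names(counts)[which.max(counts)])
}

# Resample the case dates with replacement 100 times and record
# the peak date of each resampled dataset (as days since 1970-01-01)
boot_peaks <- vapply(
  seq_len(100),
  function(i) as.numeric(peak_date(sample(toy_onsets, replace = TRUE))),
  numeric(1)
)

# Peak in the observed data
observed_peak <- peak_date(toy_onsets)

# Percentile interval covering the central 95% of the bootstrap peaks
peak_interval <- as.Date(
  quantile(boot_peaks, probs = c(0.025, 0.975), type = 1, names = FALSE),
  origin = "1970-01-01"
)

observed_peak
peak_interval
```

A wide interval suggests the data do not pin down the peak time well, for example when several neighbouring days have similar case counts.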
257
238
258
239
::::::::::::::::::::::::::::::::::::: challenge
259
240
260
241
## Challenge 4: Can you do it?
261
-
-**Task**: Estimate the peak time from `biweekly_incidence` object.
242
+
- **Task**: Estimate the peak time from the `biweekly_incidence` object.
262
243
263
244
::::::::::::::::::::::::::::::::::::::::::::::::
264
245
265
246
266
247
## Visualization with ggplot2
267
248
268
249
269
-
`{incidence2}` produces basic plots for epicurves, but additional work is required to create well-annotated graphs. However, using the `{ggplot2}` package, you can generate more sophisticated and epicurves with more flexibility in annotation.
250
+
`{incidence2}` produces basic plots for epicurves, but additional work is required to create well-annotated graphs. However, using the `{ggplot2}` package, you can generate more sophisticated epicurves, with more flexibility in annotation.
270
251
`{ggplot2}` is a comprehensive package with many functionalities. However, we will focus on three key elements for producing epicurves: histogram plots, scaling date axes and their labels, and general plot theme annotation.
271
252
The example below demonstrates how to configure these three elements for a simple `{incidence2}` object.
Use the `group` option in the mapping function to visualize an epicurve with different groups. If there is more than one grouping factor, use the `facet_wrap()` option, as demonstrated in the example below:
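A minimal sketch of this pattern, assuming a small hypothetical data frame (`toy_incidence`) rather than the episode's `{incidence2}` objects, and using `geom_col()` for the bars as one possible choice:

```r
library(ggplot2)

# Hypothetical weekly counts by sex and outcome (illustrative values only)
toy_incidence <- data.frame(
  week_start = rep(as.Date("2023-01-02") + 7 * 0:3, times = 4),
  sex        = rep(c("f", "m"), each = 4, times = 2),
  outcome    = rep(c("died", "recovered"), each = 8),
  count      = c(1, 3, 2, 0, 2, 4, 1, 1, 3, 6, 5, 2, 2, 7, 4, 3)
)

toy_epicurve <- ggplot(
  data = toy_incidence,
  mapping = aes(x = week_start, y = count, fill = sex, group = sex)
) +
  geom_col(position = "dodge") + # One bar per week and sex
  facet_wrap(~outcome) +         # One panel per outcome
  scale_x_date(date_breaks = "1 week", date_labels = "%d %b") +
  labs(x = "Week of onset", y = "Weekly cases", fill = "Sex") +
  theme_bw()

toy_epicurve
```

Here `group` and `fill` separate the bars by the first factor, while `facet_wrap()` splits the second factor into panels that share the same date axis.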