update the describe-cases episode

Karim-Mane · Karim-Mane · commit d749ffe35371 · 2025-06-26T14:11:31.000Z
diff --git a/episodes/describe-cases.Rmd b/episodes/describe-cases.Rmd
@@ -8,29 +8,26 @@ exercises: 10
 
 - How to aggregate and summarise case data? 
 - How to visualize aggregated data?
-- What is distribution of cases in time, place, gender, age?
+- What is the distribution of cases across time, space, gender, and age?
 
 ::::::::::::::::::::::::::::::::::::::::::::::::
 
 ::::::::::::::::::::::::::::::::::::: objectives
 
 - Simulate synthetic outbreak data
-- Convert indivdual linelist data to incidence over time
+- Convert linelist data into incidence over time
 - Create epidemic curves from incidence data
 ::::::::::::::::::::::::::::::::::::::::::::::::
 
 ## Introduction
 
-In an analytic pipeline, exploratory data analysis (EDA) is an important step before formal modelling. EDA helps 
-determine relationships between variables and summarize their main characteristics, often by means of data visualization. 
+In an analytic pipeline, exploratory data analysis (EDA) is an important step before formal modelling. EDA helps determine relationships between variables and summarize their main characteristics, often by means of data visualization. 
 
 This episode focuses on EDA of outbreak data using R packages. 
 A key aspect of EDA in epidemic analysis is 'person, place and time'. It is useful to identify how observed events - such as confirmed cases, hospitalizations, deaths, and recoveries - change over time, and how these vary across different locations and demographic factors, including gender, age, and more.
 
-Let's start by loading the package `{incidence2}` to aggregate linelist data according to specific characteristics, and visualize the resulting epidemic curves (epicurves) that plot the number of new events (i.e. incidence) over time.
- We'll use `{simulist}` to simulate some outbreak data to analyse,  and `{tracetheme}` for figure formatting.
- We'll use the pipe `%>%` to connect some of their functions, including others from the packages `{dplyr}` and 
- `{ggplot2}`, so let's also call to the tidyverse package:
+Let's start by loading the `{incidence2}` package to aggregate the linelist data according to specific characteristics, and visualize the resulting epidemic curves (epicurves) that plot the number of new events (i.e. case incidence over time). 
+We'll use the `{simulist}` package to simulate the outbreak data to analyse, and `{tracetheme}` for figure formatting. We'll use the pipe operator (`%>%`) to connect some of their functions, including others from the `{dplyr}` and `{ggplot2}` packages, so let's also load the `{tidyverse}` package.
 
 ```{r,eval=TRUE,message=FALSE,warning=FALSE}
 # Load packages
@@ -40,23 +37,10 @@ library(tracetheme) # For formatting figures
 library(tidyverse) # For {dplyr} and {ggplot2} functions and the pipe %>%
 ```
 
-::::::::::::::::::: checklist
-
-### The double-colon
-
-The double-colon `::` in R lets you call a specific function from a package without loading the entire package into the current environment. 
-
-For example, `dplyr::filter(data, condition)` uses `filter()` from the `{dplyr}` package.
-This help us remember package functions and avoid namespace conflicts.
-
-:::::::::::::::::::
-
  
 ## Synthetic outbreak data
 
-To illustrate the process of conducting EDA on outbreak data, we will generate a line list 
-for a hypothetical disease outbreak utilizing the `{simulist}` package. `{simulist}` generates simulation data for outbreak according to a given configuration. 
-Its minimal configuration can generate a linelist, as shown in the below code chunk:
+To illustrate the process of conducting EDA on outbreak data, we will generate a linelist for a hypothetical disease outbreak using the `{simulist}` package. `{simulist}` generates simulated data for an outbreak according to a given configuration. Its minimal configuration can generate a linelist, as shown in the code chunk below:
 
 ```{r}
 # Simulate linelist data for an outbreak with size between 1000 and 1500
@@ -68,28 +52,23 @@ sim_data <- simulist::sim_linelist(outbreak_size = c(1000, 1500)) %>%
 sim_data
 ```
 
-This linelist dataset has entries on individual-level simulated events during the outbreak.
+This linelist dataset has simulated entries on individual-level events during an outbreak.
 
 ::::::::::::::::::: spoiler
 
 ## Additional Resources on Outbreak Data
 
-The above is the default configuration of `{simulist}`, so includes a number of assumptions about the transmissibility and severity of the pathogen. If you want to know more about `sim_linelist()` and other functionalities
-check the [documentation website](https://epiverse-trace.github.io/simulist/).
+The above is the default configuration of `{simulist}`. It includes a number of assumptions about the transmissibility and severity of the pathogen. If you want to know more about the `simulist::sim_linelist()` function and other functionalities, check the [documentation website](https://epiverse-trace.github.io/simulist/).
 
-You can also find data sets from real emergencies from the past at the [`{outbreaks}` R package](https://www.reconverse.org/outbreaks/).
+You can also find data sets from past real outbreaks within the [`{outbreaks}`](https://www.reconverse.org/outbreaks/) R package.
 
 :::::::::::::::::::
 
 
 
-## Aggregating
+## Aggregating the data
 
-Often we want to analyse and visualise the number of events that occur on a particular day or week, rather than focusing on individual cases. This requires grouping linelist 
-data into incidence data. The [incidence2]((https://www.reconverse.org/incidence2/articles/incidence2.html){.external target="_blank"}) 
-package offers a useful function called `incidence2::incidence()` for grouping case data, usually based around dated events 
-and/or other characteristics. The code chunk provided below demonstrates the creation of an `<incidence2>` class object from the 
-simulated  Ebola `linelist` data based on the date of onset.
+Often we want to analyse and visualise the number of events that occur on a particular day or week, rather than focusing on individual cases. This requires grouping the linelist data into incidence data. The [{incidence2}](https://www.reconverse.org/incidence2/articles/incidence2.html){.external target="_blank"} package offers a useful function called `incidence2::incidence()` for grouping case data, usually based around dated events and/or other characteristics. The code chunk provided below demonstrates the creation of an `<incidence2>` class object from the simulated Ebola `linelist` data based on the date of onset.
 
 ```{r}
 # Create an incidence object by aggregating case data based on the date of onset
@@ -102,8 +81,8 @@ daily_incidence <- incidence2::incidence(
 # View the incidence data
 daily_incidence
 ```
-With the `{incidence2}` package, you can specify the desired interval (e.g. day, week) and categorize cases by one or 
-more factors. Below is a code snippet demonstrating weekly cases grouped by the date of onset, sex, and type of case.
+
+With the `{incidence2}` package, you can specify the desired interval (e.g. day, week) and categorize cases by one or more factors. Below is a code snippet demonstrating weekly cases grouped by the date of onset, sex, and type of case.
 
 ```{r}
 # Group incidence data by week, accounting for sex and case type
@@ -119,15 +98,15 @@ weekly_incidence
 ```
 
 ::::::::::::::::::::::::::::::::::::: callout
-## Dates Completion  
-When cases are grouped by different factors, it's possible that the events involving these groups may have different date ranges in the 
-resulting `incidence2` object. The `incidence2` package provides a function called `complete_dates()` to ensure that an
- incidence object has the same range of dates for each group. By default, missing counts for a particular group will be filled with 0 for that date.
+
+## Dates Completion 
+
+When cases are grouped by different factors, it's possible that the events involving these groups may have different date ranges in the resulting `incidence2` object. The `{incidence2}` package provides a function called `incidence2::complete_dates()` to ensure that an incidence object has the same range of dates for each group. By default, missing counts for a particular group will be filled with 0 for that date.
  
-This functionality is also available as an argument within `incidence2::incidence()` adding `complete_dates = TRUE`.
+This functionality is also available within the `incidence2::incidence()` function by setting the `complete_dates` argument to `TRUE`.
 
 ```{r}
-# Create an incidence object grouped by sex, aggregating daily
+# Create a daily incidence object grouped by sex
 daily_incidence_2 <- incidence2::incidence(
   sim_data,
   date_index = "date_onset",
@@ -154,16 +133,15 @@ daily_incidence_2_complete <- incidence2::complete_dates(
 ::::::::::::::::::::::::::::::::::::: challenge 
 
 ## Challenge 1: Can you do it?
- - **Task**: Aggregate `sim_data` linelist based on admission date and case outcome in __biweekly__
-  intervals, and save the results in an object called `biweekly_incidence`.
+
+ - **Task**: Calculate the __biweekly__ incidence of cases from the `sim_data` linelist based on their admission date and outcome. Save the result in an object called `biweekly_incidence`.
 
 ::::::::::::::::::::::::::::::::::::::::::::::::
 
 ## Visualization
 
-The `incidence2` object can be visualized using the `plot()` function from the base R package. 
-The resulting graph is referred to as an epidemic curve, or epi-curve for short. The following code 
-snippets generate epi-curves for the `daily_incidence` and `weekly_incidence` incidence objects mentioned above.
+The `incidence2` objects can be visualized using the `plot()` function from the base R package. 
+The resulting graph is referred to as an epidemic curve, or epi-curve for short. The following code snippets generate epi-curves for the `daily_incidence` and `weekly_incidence` incidence objects mentioned above.
 
 ```{r}
 # Plot daily incidence data
@@ -172,7 +150,8 @@ base::plot(daily_incidence) +
     x = "Time (in days)", # x-axis label
     y = "Dialy cases" # y-axis label
   ) +
-  tracetheme::theme_trace() # Apply the custom trace theme
+  theme_bw()
+  # tracetheme::theme_trace() # Apply the custom trace theme
 ``` 
 
 
@@ -183,33 +162,35 @@ base::plot(weekly_incidence) +
     x = "Time (in weeks)", # x-axis label
     y = "weekly cases" # y-axis label
   ) +
-  tracetheme::theme_trace() # Apply the custom trace theme
+  theme_bw()
+  # tracetheme::theme_trace() # Apply the custom trace theme
 ``` 
 
 :::::::::::::::::::::::: callout
 
 #### Easy aesthetics
 
-We invite you to skim the `{incidence2}` package ["Get started" vignette](https://www.reconverse.org/incidence2/articles/incidence2.html). Find how you can use arguments within `plot()` to provide aesthetics to your incidence2 class objects.
+We invite you to take a look at the `{incidence2}` [package vignette](https://www.reconverse.org/incidence2/articles/incidence2.html). Find out how you can use the arguments within the `plot()` function to customize the aesthetics of your `incidence2` class objects.
 
 ```{r}
 base::plot(weekly_incidence, fill = "sex")
 ```
 
-Some of them include `show_cases = TRUE`, `angle = 45`, and `n_breaks = 5`. Feel free to give them a try.
+Some of them include `show_cases = TRUE`, `angle = 45`, and `n_breaks = 5`. Try them and see how they affect the resulting plot.
 
 ::::::::::::::::::::::::
 
 ::::::::::::::::::::::::::::::::::::: challenge 
 
 ## Challenge 2: Can you do it?
- - **Task**: Visualize `biweekly_incidence` object.
+
+ - **Task**: Visualize the `biweekly_incidence` object.
 
 ::::::::::::::::::::::::::::::::::::::::::::::::
 
 ## Curve of cumulative cases
 
-The cumulative number of cases can be calculated using the `cumulate()` function from an `incidence2` object and visualized, as in the example below.
+The cumulative number of cases can be calculated by applying the `incidence2::cumulate()` function to an `incidence2` object, and then visualized, as in the example below.
 
 ```{r}
 # Calculate cumulative incidence
@@ -221,7 +202,8 @@ base::plot(cum_df) +
     x = "Time (in days)", # x-axis label
     y = "weekly cases" # y-axis label
   ) +
-  tracetheme::theme_trace() # Apply the custom trace theme
+  theme_bw()
+  # tracetheme::theme_trace() # Apply the custom trace theme
 ```
 
 Note that this function preserves grouping, i.e., if the `incidence2` object contains groups, it will accumulate the cases accordingly.
@@ -230,14 +212,13 @@ Note that this function preserves grouping, i.e., if the `incidence2` object con
 ::::::::::::::::::::::::::::::::::::: challenge 
 
 ## Challenge 3: Can you do it?
- - **Task**: Visulaize the cumulatie cases from `biweekly_incidence` object.
+ - **Task**: Visualize the cumulative cases from the `biweekly_incidence` object.
 
 ::::::::::::::::::::::::::::::::::::::::::::::::
 
-##  Peak estimation
+##  Peak time estimation
 
-You can estimate the peak -- the time with the highest number of recorded cases-- using the `estimate_peak()` function from the {incidence2} package. 
-This function employs a bootstrapping method to determine the peak time (i.e. by resampling dates with replacement, resulting in a distribution of estimated peak times).
+You can estimate the peak -- the time with the highest number of recorded cases -- using the `incidence2::estimate_peak()` function. This function uses a bootstrapping method to determine the peak time (i.e. by resampling dates with replacement, resulting in a distribution of estimated peak times).
 
 ```{r}
 # Estimate the peak of the daily incidence data
@@ -252,21 +233,21 @@ peak <- incidence2::estimate_peak(
 # Display the estimated peak
 print(peak)
 ```
-This example demonstrates how to estimate the peak time using the `estimate_peak()` function at $95%$ 
-confidence interval and using 100 bootstrap samples. 
+
+This example demonstrates how to estimate the peak time using the `incidence2::estimate_peak()` function with a $95\%$ confidence interval and 100 bootstrap samples.
 
 ::::::::::::::::::::::::::::::::::::: challenge 
 
 ## Challenge 4: Can you do it?
- - **Task**: Estimate the peak time from `biweekly_incidence` object.
+ - **Task**: Estimate the peak time from the `biweekly_incidence` object.
 
 ::::::::::::::::::::::::::::::::::::::::::::::::
 
 
 ## Visualization with ggplot2
 
 
-`{incidence2}` produces basic plots for epicurves, but additional work is required to create well-annotated graphs. However, using the `{ggplot2}` package, you can generate more sophisticated and epicurves with more flexibility in annotation.
+`{incidence2}` produces basic plots for epicurves, but additional work is required to create well-annotated graphs. However, using the `{ggplot2}` package, you can generate more sophisticated epicurves, with more flexibility in annotation.
 `{ggplot2}` is a comprehensive package with many functionalities. However, we will focus on three key elements for producing epicurves: histogram plots, scaling date axes and their labels, and general plot theme annotation.
 The example below demonstrates how to configure these three elements for a simple `{incidence2}` object.
 
@@ -316,7 +297,7 @@ ggplot2::ggplot(data = daily_incidence) +
 Use the `group` option in the mapping function to visualize an epicurve with different groups. If there is more than one grouping factor, use the `facet_wrap()` option, as demonstrated in the example below:
 
 ```{r}
-# Plot daily incidence by sex with facets
+# Plot daily incidence faceted by sex  
 ggplot2::ggplot(data = daily_incidence_2) +
   geom_histogram(
     mapping = aes(
@@ -357,14 +338,15 @@ ggplot2::ggplot(data = daily_incidence_2) +
 ::::::::::::::::::::::::::::::::::::: challenge 
 
 ## Challenge 5: Can you do it?
- - **Task**: Produce an annotated figure for biweekly_incidence using `{ggplot2}` package.
+
+ - **Task**: Produce an annotated figure for the `biweekly_incidence` object using the `{ggplot2}` package.
 
 ::::::::::::::::::::::::::::::::::::::::::::::::
 
 ::::::::::::::::::::::::::::::::::::: keypoints 
 
 - Use `{simulist}` package to generate synthetic outbreak data
-- Use `{incidence2}` package to aggregate case data based on a date event, and produce epidemic curves. 
+- Use `{incidence2}` package to aggregate case data based on a date event and other variables, and to produce epidemic curves.
 - Use `{ggplot2}` package to produce better annotated epicurves. 
 
 ::::::::::::::::::::::::::::::::::::::::::::::::