<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  fig.width = 6,
  fig.height = 4,
  fig.cap = "",
  dpi = 60
)
abc_path <- system.file("examples/basics_abc.rds", package = "demografr")
if (!file.exists(abc_path))
  stop("Run vignette-01-basics first to generate data", call. = FALSE)
library(demografr)
library(slendr)
init_env(quiet = TRUE)
```
# _demografr_: Simulation-based inference for population genetics
<!-- badges: start -->
[](https://cran.r-project.org/package=slendr) [](https://cran.r-project.org/package=slendr) [](https://github.com/bodkan/demografr/actions) [](http://beta.mybinder.org/v2/gh/bodkan/demografr/main?urlpath=rstudio) [](https://app.codecov.io/github/bodkan/demografr?branch=main)
<!-- badges: end -->
⚠️⚠️⚠️
**Please note that until a peer-reviewed publication describing the
_demografr_ package is available, the software should be regarded as
experimental and potentially unstable.**
⚠️️⚠️⚠️
---
**You can now read our preprint on [bioRxiv](https://www.biorxiv.org/content/10.64898/2025.12.18.694482v1)!**
All comments and feedback are welcome!
---
The goal of _demografr_ is to simplify and streamline the development of
simulation-based inference pipelines in population genetics and evolutionary
biology, such as [Approximate Bayesian Computation](https://en.wikipedia.org/wiki/Approximate_Bayesian_computation)
(ABC) or parameter grid inferences, and make them more reproducible.
_demografr_ also aims to make these inferences orders of magnitude faster and
more efficient by leveraging the [tree sequence](https://tskit.dev/learn/) as
an internal data structure and computation engine.
Unlike traditional ABC and other simulation-based approaches, which generally
involve custom-built pipelines and scripts for population genetic simulation
and computation of summary statistics, _demografr_ makes it possible to perform
simulation, computation of summary statistics, and the inference itself
within a single reproducible R script. By eliminating
the need to write custom simulation code and scripting for integration of
various population genetic tools for computing summary statistics, it lowers
the barrier for new users and facilitates reproducibility for everyone
regardless of their level of experience by eliminating many common sources of
bugs.
### How does _demografr_ help with ABC?
_demografr_ streamlines every step of a typical ABC pipeline by leveraging the [_slendr_](https://github.com/bodkan/slendr/) framework as a building block
for simulation and data analysis, making it possible to write complete
simulation-based workflows entirely in R. Specifically:
1. _slendr_'s intuitive, interactive
[interface for defining population genetic models](https://bodkan.net/slendr/articles/vignette-04-nonspatial-models.html)
makes it easy to encode even complex demographic models with only a bare minimum
of R knowledge needed.
2. _demografr_ makes it possible to encode prior distributions of parameters
using a familiar R interface resembling standard probabilistic statements, and
provides an automated function which simulates ABC replicates drawing
parameters from priors in a trivial, one-step manner. Automated routines for
exploring model parameter grids in settings other than ABC are also provided.
3. Because _slendr_ embraces the [tree sequence](https://tskit.dev/learn/) as
its default internal data structure, most population genetic statistics can be
computed directly on such tree sequences using R functions which are part of
_slendr_'s statistical library. A tree sequence is never saved to disk and no
conversion between file formats is required, which significantly speeds up
every workflow. That said, using custom simulation scripts as well as computation
of summary statistics in third-party software is also possible (while still
leveraging the consistent _demografr_ workflow approach and most of its standard
simulation and inference functions).
4. _demografr_ facilitates tight integration with the powerful R package
[_abc_](https://cran.r-project.org/package=abc) by automatically feeding it
simulation data for inference and diagnostics. At the moment, the _abc_
package represents a core inference engine of _demografr_. Other inference
engines and statistical approaches will be developed in the future.
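
For readers new to ABC, the following is a minimal base-R sketch of the general rejection-ABC idea that the steps above automate: draw parameters from a prior, simulate data, summarise it, and keep the draws whose summaries are closest to the observed one. It is a toy example (inferring the mean of a normal distribution) and is not how _demografr_ or _abc_ are implemented; all variable names and values are made up for illustration.

```r
set.seed(1)

# toy "observed" data, simulated from a known value so the result can be checked
true_mu <- 5
observed_data <- rnorm(50, mean = true_mu, sd = 1)
observed_stat <- mean(observed_data)  # the summary statistic

n_sims <- 10000  # number of simulation replicates
tol <- 0.01      # fraction of replicates to accept

# 1. draw one parameter value per replicate from a uniform prior
prior_mu <- runif(n_sims, min = 0, max = 10)

# 2. simulate data for each draw and compute the same summary statistic
sim_stat <- vapply(
  prior_mu,
  function(mu) mean(rnorm(50, mean = mu, sd = 1)),
  numeric(1)
)

# 3. distance between simulated and observed summary statistics
distance <- abs(sim_stat - observed_stat)

# 4. keep the parameter draws from the closest `tol` fraction of replicates;
#    these approximate samples from the posterior distribution
accepted <- prior_mu[distance <= quantile(distance, probs = tol)]

posterior_mean <- mean(accepted)
posterior_mean
```

In a population genetic setting, step 2 is the expensive part (a full simulation plus several summary statistics per replicate), which is why the list above puts so much emphasis on avoiding file conversions and on computing statistics directly on tree sequences.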
## Installation
Once the _demografr_ package appears on CRAN, you will be able to install
the latest released version with the following command:
``` r
install.packages("demografr")
```
However, especially early after an initial release, it might be worth keeping
an eye on the [changelog](https://bodkan.net/demografr/news/) for
bugfixes and improvements, which you can always obtain by installing the development
version of _demografr_ with:
``` r
devtools::install_github("bodkan/demografr")
```
Note that this requires the R package _devtools_, which you can install
by running `install.packages("devtools")`.
<!-- ### Testing the R package in an online RStudio session -->
<!-- You can open an RStudio session and test examples from the [vignettes](https://bodkan.net/demografr/articles/) directly in your web -->
<!-- browser by clicking this button (no installation is needed!): -->
<!-- [](http://beta.mybinder.org/v2/gh/bodkan/demografr/main?urlpath=rstudio) -->
<!-- **In case the RStudio instance appears to be starting very slowly, please be -->
<!-- patient (Binder is a freely available service with limited computational -->
<!-- resources).** If Binder crashes, try reloading the web page, which will restart -->
<!-- the cloud session. -->
<!-- Once you get a browser-based RStudio session, you can navigate to the -->
<!-- `vignettes/` directory and test the examples on your own! -->
## An example ABC pipeline
**Note:** A much more detailed explanation of this toy example can be found
in the [following vignette](https://bodkan.net/demografr/articles/vignette-01-basics.html)
or in the _demografr_ manuscript available on bioRxiv. Please take the code
below as a bare minimum demonstration of some basic functionality of the package.
Imagine that we sequenced genomes of individuals from populations "A", "B",
"C", and "D".
Let's also assume that we know that the populations are phylogenetically
related in the following way, with an indicated gene-flow event at a certain
time in the past, but we don't know anything else (i.e., we have no idea about
the $N_e$ of the populations, their split times, or the proportion of gene flow):
```{r ape_tree, echo=FALSE, fig.width=5, fig.height=3.5}
orig_par <- par(no.readonly = TRUE)
par(mar = c(0, 0, 0, 0))
tree <- ape::read.tree(text="(A,(B,(C,D)));")
plot(tree)
arrows(2.5, 2, 2.5, 3, col="blue")
par(orig_par)
```
After sequencing the genomes of individuals from these populations, we computed
the nucleotide diversity in each of them, their pairwise genetic
divergence, and an $f_4$ statistic, and observed the following values
(which we saved in three standard R data frames). In a traditional ABC setting,
this computation and table formatting would typically be done in third-party
software such as PLINK or ADMIXTOOLS.
1. Nucleotide diversity in each population:
```{r}
observed_diversity <- read.table(system.file("examples/basics_diversity.tsv", package = "demografr"), header = TRUE)
observed_diversity
```
2. Pairwise divergence between populations X and Y:
```{r}
observed_divergence <- read.table(system.file("examples/basics_divergence.tsv", package = "demografr"), header = TRUE)
observed_divergence
```
3. Value of the following $f_4$-statistic:
```{r}
observed_f4 <- read.table(system.file("examples/basics_f4.tsv", package = "demografr"), header = TRUE)
observed_f4
```
**Note that the value of each statistic is stored in the last column of
each data frame, with additional columns providing relevant "metadata". Keeping
consistent naming throughout an inference pipeline is extremely important for
_demografr_, and goes a long way towards reproducibility and towards avoiding
sneaky bugs!** For this reason, _demografr_ often complains loudly if something
looks inconsistent. Better safe than sorry!
### A complete ABC pipeline in a single R script
This is how we would use _demografr_ to estimate the $N_e$ and split times of all
populations, as well as the rate of the indicated gene-flow event, with
Approximate Bayesian Computation in a single R script. Again, a detailed
description of each component can be found throughout our vignettes, and the
motivation for various design decisions is detailed in the _demografr_
manuscript.
```{r, eval=FALSE}
library(demografr)
library(slendr)
# running setup_env() first might be necessary to set up slendr's internal
# simulation environment
init_env()
# set up parallelization across all CPUs on the current machine
library(future)
plan(multisession, workers = availableCores())
#--------------------------------------------------------------------------------
# bind data frames with empirical summary statistics into a named list
observed <- list(
  diversity  = observed_diversity,
  divergence = observed_divergence,
  f4         = observed_f4
)
#--------------------------------------------------------------------------------
# define a model generating function using the slendr interface
# (each of the function parameters corresponds to a parameter we want to infer)
model <- function(Ne_A, Ne_B, Ne_C, Ne_D, T_AB, T_BC, T_CD, gf_BC) {
  A <- population("A", time = 1,    N = Ne_A)
  B <- population("B", time = T_AB, N = Ne_B, parent = A)
  C <- population("C", time = T_BC, N = Ne_C, parent = B)
  D <- population("D", time = T_CD, N = Ne_D, parent = C)
  gf <- gene_flow(from = B, to = C, start = 9000, end = 9301, rate = gf_BC)
  model <- compile_model(
    populations = list(A, B, C, D), gene_flow = gf,
    generation_time = 1, simulation_length = 10000,
    direction = "forward", serialize = FALSE
  )
  samples <- schedule_sampling(
    model, times = 10000,
    list(A, 25), list(B, 25), list(C, 25), list(D, 25),
    strict = TRUE
  )
  # when a specific sampling schedule is to be used, both model and samples
  # must be returned by the function
  return(list(model, samples))
}
#--------------------------------------------------------------------------------
# setup priors for model parameters
priors <- list(
  Ne_A  ~ runif(1000, 3000),
  Ne_B  ~ runif(100,  1500),
  Ne_C  ~ runif(5000, 10000),
  Ne_D  ~ runif(2000, 7000),
  T_AB  ~ runif(1,    4000),
  T_BC  ~ runif(3000, 9000),
  T_CD  ~ runif(5000, 10000),
  gf_BC ~ runif(0, 0.3)
)
#--------------------------------------------------------------------------------
# define summary functions to be computed on simulated data (must be of the
# same format as the summary statistics computed on empirical data)
compute_diversity <- function(ts) {
  samples <- ts_names(ts, split = "pop")
  ts_diversity(ts, sample_sets = samples)
}
compute_divergence <- function(ts) {
  samples <- ts_names(ts, split = "pop")
  ts_divergence(ts, sample_sets = samples)
}
compute_f4 <- function(ts) {
  samples <- ts_names(ts, split = "pop")
  A <- samples["A"]; B <- samples["B"]
  C <- samples["C"]; D <- samples["D"]
  ts_f4(ts, A, B, C, D)
}
# the summary functions must be also bound to an R list named in the same
# way as the empirical summary statistics
functions <- list(
  diversity  = compute_diversity,
  divergence = compute_divergence,
  f4         = compute_f4
)
#--------------------------------------------------------------------------------
# validate the individual ABC components for correctness and consistency
validate_abc(model, priors, functions, observed,
             sequence_length = 1e6, recombination_rate = 1e-8)
#--------------------------------------------------------------------------------
# run ABC simulations
data <- simulate_abc(
  model, priors, functions, observed, iterations = 10000,
  sequence_length = 50e6, recombination_rate = 1e-8, mutation_rate = 1e-8
)
#--------------------------------------------------------------------------------
# infer posterior distributions of parameters using the abc R package
abc <- run_abc(data, engine = "abc", tol = 0.01, method = "neuralnet")
```
```{r, echo=FALSE, eval=file.exists(abc_path)}
abc <- readRDS(abc_path)
```
## Examining posterior distributions of parameters
After running this R script, we end up with an object called `abc` (see the
last line of the script). This
object contains the complete information about the results of our inference.
In particular, it carries the posterior samples for our parameters of interest
($N_e$ of the populations, their split times, and the rate of gene flow).
What can we do with this, to get an idea about the most likely parameters of
the assumed evolutionary history of our populations of interest?
For instance, we can get a summary table of all parameter posteriors with the
function `extract_summary()`:
```{r, warning=FALSE}
extract_summary(abc)
```
We can also specify a subset of model parameters to select, or provide a
regular expression for this subsetting:
```{r, warning=FALSE}
extract_summary(abc, param = "gf_BC")
```
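
To illustrate what regular-expression subsetting of parameters means in practice, here is a small base-R sketch that summarises a table of posterior samples and selects columns by pattern. The data frame `posterior`, the helper `summarise_posterior()`, and the chosen summary columns are hypothetical, and the values are random numbers rather than inference results; the actual output of `extract_summary()` may be organised differently.

```r
set.seed(42)

# hypothetical posterior samples, one column per parameter
# (random numbers for illustration only, not real inference results)
posterior <- data.frame(
  Ne_A  = runif(100, 1000, 3000),
  Ne_B  = runif(100, 100, 1500),
  gf_BC = runif(100, 0, 0.3)
)

# summarise each selected parameter; `param` is treated as a regular expression
# matched against the column names
summarise_posterior <- function(samples, param = NULL) {
  if (!is.null(param))
    samples <- samples[, grepl(param, names(samples)), drop = FALSE]
  sapply(samples, function(x) c(
    mean = mean(x),
    median = median(x),
    quantile(x, probs = c(0.025, 0.975))
  ))
}

# "Ne" matches both Ne_A and Ne_B but not gf_BC
ne_summary <- summarise_posterior(posterior, param = "Ne")
ne_summary
```

Because parameter names in the pipeline share prefixes (`Ne_`, `T_`, `gf_`), a short pattern such as `"Ne"` or `"T"` is enough to pick out a whole group of related parameters.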
Of course, we can also visualize the posterior distributions. Rather than
plotting many different distributions at once, let's first check out the
posterior distributions of inferred $N_e$ values:
```{r, posterior_Ne, warning=FALSE, fig.width=8, fig.height=5}
plot_posterior(abc, param = "Ne")
```
Similarly, we can take a look at the inferred posteriors of the split times:
```{r, posterior_Tsplit, warning=FALSE, fig.width=8, fig.height=5}
plot_posterior(abc, param = "T")
```
And, finally, the rate of gene flow:
```{r, posterior_gf, warning=FALSE, fig.width=8, fig.height=5}
plot_posterior(abc, param = "gf") + ggplot2::coord_cartesian(xlim = c(0, 1))
```
Additionally, we have the full diagnostic functionality of the
[_abc_](https://cran.r-project.org/package=abc) R package at our disposal:
```{r, diagnostic_Ne, fig.width=10, fig.height=8}
plot(abc, param = "Ne_C")
```
Many diagnostic and model selection functions implemented by _abc_ are also
supported by _demografr_. For more information, see
[this vignette](https://bodkan.net/demografr/articles/vignette-06-diagnostics.html).
## Additional functionality
_demografr_ also provides a couple of functions designed to make
development (and troubleshooting) a little easier.
For instance, assuming we have `priors` set up as above, we can visualize the
prior distribution(s) like this:
```{r, echo=FALSE}
priors <- list(
Ne_A ~ runif(1000, 3000),
Ne_B ~ runif(100, 1500),
Ne_C ~ runif(5000, 10000),
Ne_D ~ runif(2000, 7000),
T_AB ~ runif(1, 4000),
T_BC ~ runif(3000, 9000),
T_CD ~ runif(5000, 10000),
gf_BC ~ runif(0, 0.3)
)
```
```{r, prior_Ne}
plot_prior(priors, "Ne")
```
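
The formula-style priors are ordinary R formula objects, which is what makes them easy to inspect and sample from. As a rough illustration only, the base-R sketch below shows one way such a formula could be turned into a random draw. The helper `sample_prior()` is hypothetical and is not _demografr_'s internal code; it assumes the right-hand side is a call to a random-number function whose first argument is `n` (as with `runif()`, `rnorm()`, or `rexp()`).

```r
# hypothetical helper: draw one value from a prior written as `name ~ rfun(args)`
sample_prior <- function(f) {
  stopifnot(inherits(f, "formula"), length(f) == 3)
  param <- as.character(f[[2]])            # left-hand side: parameter name
  rhs <- as.list(f[[3]])                   # right-hand side: function and arguments
  fun <- match.fun(as.character(rhs[[1]]))
  args <- lapply(rhs[-1], eval, envir = environment(f))
  value <- do.call(fun, c(list(n = 1), args))
  setNames(value, param)
}

set.seed(123)
example_priors <- list(
  Ne_A  ~ runif(1000, 3000),
  T_AB  ~ runif(1, 4000),
  gf_BC ~ runif(0, 0.3)
)

# one random draw per parameter, returned as a named numeric vector
draws <- unlist(lapply(example_priors, sample_prior))
draws
```

Each draw corresponds to one set of arguments that could be passed to a model function like the one defined earlier on this page.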
To make developing complete pipelines more efficient, _demografr_ also provides
means to test and evaluate their individual components even further. For instance,
the function `simulate_model()` simulates data from a single simulation run:
```{r}
#| echo: false
model <- function(Ne_A, Ne_B, Ne_C, Ne_D, T_AB, T_BC, T_CD, gf_BC) {
  A <- population("A", time = 1,    N = Ne_A)
  B <- population("B", time = T_AB, N = Ne_B, parent = A)
  C <- population("C", time = T_BC, N = Ne_C, parent = B)
  D <- population("D", time = T_CD, N = Ne_D, parent = C)
  gf <- gene_flow(from = B, to = C, start = 9000, end = 9301, rate = gf_BC)
  model <- compile_model(
    populations = list(A, B, C, D), gene_flow = gf,
    generation_time = 1, simulation_length = 10000,
    direction = "forward"
  )
  samples <- schedule_sampling(
    model, times = 10000,
    list(A, 25), list(B, 25), list(C, 25), list(D, 25),
    strict = TRUE
  )
  return(list(model, samples))
}
priors <- list(
  Ne_A ~ runif(1000, 3000), Ne_B ~ runif(100, 1500), Ne_C ~ runif(5000, 10000), Ne_D ~ runif(2000, 7000),
  T_AB ~ runif(1, 4000), T_BC ~ runif(3000, 9000), T_CD ~ runif(5000, 10000),
  gf_BC ~ runif(0, 0.3)
)
compute_diversity <- function(ts) { samples <- ts_names(ts, split = "pop"); ts_diversity(ts, sample_sets = samples) }
compute_divergence <- function(ts) { samples <- ts_names(ts, split = "pop"); ts_divergence(ts, sample_sets = samples) }
compute_f4 <- function(ts) {
  samples <- ts_names(ts, split = "pop")
  A <- samples["A"]; B <- samples["B"]; C <- samples["C"]; D <- samples["D"]
  ts_f4(ts, A, B, C, D)
}
functions <- list(diversity = compute_diversity, divergence = compute_divergence, f4 = compute_f4)
```
```{r}
ts <- simulate_model(model, priors, sequence_length = 1e6, recombination_rate = 1e-8, mutation_rate = 1e-8)
ts
```
With this single simulated data instance `ts` (in this case, a tree-sequence
object; the above-mentioned function
`simulate_abc()` would produce thousands or even millions of these, which makes
troubleshooting on a full run challenging and slow),
we can apply the individual summary statistic functions using another helper
function, `summarise_data()`:
```{r}
summarise_data(ts, functions)
```
By comparing the format of this result to the observed data (given in the
list `observed` earlier on this page), we can make sure
that the simulated and observed summary statistics are mutually compatible
and can thus be compared directly in later steps of the ABC workflow.
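
The kind of compatibility check described above can be sketched in base R as follows. This is only an illustration of the idea of applying a named list of summary functions and comparing the result against a matching named list of observed tables. The toy data, the functions `compute_mean()` and `compute_size()`, and the helper `same_format()` are all hypothetical and are not part of _demografr_.

```r
# hypothetical stand-in for one simulated data set (made-up values)
sim_data <- list(A = c(0.1, 0.2, 0.3), B = c(0.4, 0.6))

# toy summary functions, each returning a data frame with the value of the
# statistic in the last column and "metadata" in the preceding column
compute_mean <- function(x)
  data.frame(set = names(x), mean = vapply(x, mean, numeric(1)), row.names = NULL)
compute_size <- function(x)
  data.frame(set = names(x), size = lengths(x), row.names = NULL)

functions <- list(mean = compute_mean, size = compute_size)

# apply every summary function to the simulated data, keeping the list names
simulated <- lapply(functions, function(f) f(sim_data))

# hypothetical observed statistics, named in the same way (made-up values)
observed <- list(
  mean = data.frame(set = c("A", "B"), mean = c(0.25, 0.45)),
  size = data.frame(set = c("A", "B"), size = c(3L, 2L))
)

# TRUE only if both lists have the same names and every pair of tables has
# the same column names and the same number of rows
same_format <- function(sim, obs) {
  identical(names(sim), names(obs)) &&
    all(mapply(
      function(s, o) identical(names(s), names(o)) && nrow(s) == nrow(o),
      sim, obs
    ))
}

same_format(simulated, observed)
```

Running such a check on a single simulation is much cheaper than discovering a naming mismatch after a full batch of ABC replicates has finished.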
**See [the reference](https://bodkan.net/demografr/reference/) for a complete
overview of the functionality of this package.**