Read through of stats docs

apragsdale · apragsdale · commit 547bdc8524b1 · 2026-03-22T17:54:28.000-05:00
diff --git a/docs/stats.md b/docs/stats.md
@@ -694,49 +694,47 @@ and boolean expressions (e.g., {math}`(x > 0)`) are interpreted as 0/1.
 
 The {meth}`~TreeSequence.ld_matrix` method provides an interface to
 a collection of two-locus statistics with predefined summary functions (see
-{ref}`sec_stats_two_locus_summary_functions`) and `site` and `branch`
-{ref}`modes <sec_stats_mode>`. The LD matrix method differs from other
+{ref}`sec_stats_two_locus_summary_functions`).
+The LD matrix method differs from other
 statistics methods in that it provides a unified API with an argument to
 specify different two-locus summaries of the data. It otherwise behaves
 similarly to most other functions with respect to `sample_sets` and `indexes`.
 
-Two-locus statistics can be computed using two modes, either `site` or
-`branch`, and these should be interpreted in the same way as these modes in the
-single-site statistics. That is, the `site` mode computes LD over observed
-mutations at pairs of sites, while the `branch` model computes expected
-LD conditioned on pairs of trees.
+Two-locus statistics can be computed using two {ref}`modes <sec_stats_mode>`,
+either `site` or `branch`, and these should be interpreted in the same way as
+these modes in the single-site statistics. That is, the `site` mode computes LD
+over observed alleles at pairs of sites, while the `branch` mode computes
+expected LD conditioned on pairs of trees.
 
 (sec_stats_two_locus_site)=
 
 #### Site mode
 
-The `site` mode computes two-locus statistics summarized over alleles between
+The `"site"` mode computes two-locus statistics summarized over alleles between
 all pairs of specified sites. The default behavior, leaving `sites`
-unspecified, will compute a matrix for all pairs of sites, with
-one row and column for each site in the tree sequence (i.e., an n x n
-matrix where n is the number of sites in the tree sequence). We can also
+unspecified, will compute a matrix for all pairs of sites, with one row and
+column for each site in the tree sequence (i.e., an {math}`n \times n` matrix
+where {math}`n` is the number of sites in the tree sequence). We can also
 restrict the output to a subset of sites, either by specifying a single vector
-for both rows and columns or a pair of vectors for the row sites and column
-sites separately.
+of site indexes for both rows and columns or a pair of vectors for the row
+sites and column sites separately.
 
 The following computes a matrix of the {math}`r^2` measure of linkage
 disequilibrium (LD) computed pairwise between the first 4 sites in the tree
-sequence among all samples. In our computations, row sites are used as the row
-("left-hand") loci and column sites are used as the column ("right-hand")
-locus, and with a single list of sites specified, we obtain a symmetric square
-matrix. 
+sequence among all samples. The `sites` argument must be a list of lists, and
+with a single list of sites specified, we obtain a symmetric square matrix. 
 
 ```{code-cell} ipython3
 ld = ts.ld_matrix(sites=[[0, 1, 2, 3]])
 print(ld)
 ```
 
 The following demonstrates how we can specify the row and column sites
-independently of each other. We're specifying 3 columns and 2 rows, which
+independently of each other. We're specifying 2 rows and 3 columns, which
 computes a subset of the matrix shown above.
 
 ```{code-cell} ipython3
-ld = ts.ld_matrix(sites=[[0, 1], [1, 2, 3]])
+ld = ts.ld_matrix(sites=[[1, 2], [1, 2, 3]])
 print(ld)
 ```
 
@@ -772,21 +770,21 @@ Out of all of the available summary functions, only {math}`r^2` uses
 `hap_weighted` normalisation, with the remainder using uniform weighting
 (`total_weighted`).
 
-Within this framework, statistics may be either
-polarised or unpolarised. For statistics that are polarised, we compute
-statistic values for pairs of derived alleles. (For this purpose, the "derived" alleles
-at a site are all alleles except that stored as the ``ancestral_state`` for the site.)
-Unpolarised statistics compute
-statistics over all pairs of alleles, derived and ancestral. In either case,
-the result is averaged over these values, using a weighting
-scheme described below. The option for polarisation is not exposed to the user,
-and we list which statistics are polarised below.
+Within this framework, statistics may be either polarised or unpolarised. For
+statistics that are polarised, we compute statistic values for pairs of derived
+alleles. (For this purpose, the "derived" alleles at a site are all alleles
+except that stored as the ``ancestral_state`` for the site.) Unpolarised
+statistics compute statistics over all pairs of alleles, derived and ancestral.
+In either case, the result is averaged over these values, using one of the
+weighting schemes (described below for each statistic). The option for
+polarisation is not exposed to the user, and we list which statistics are
+polarised below.
 
 (sec_stats_two_locus_branch)=
 
 #### Branch mode
 
-The `branch` mode computes expected two-locus statistics between pairs of
+The `"branch"` mode computes expected two-locus statistics between pairs of
 trees, conditioned on the marginal topologies and branch lengths of those
 trees. The trees for which we compute statistics are specified by positions,
 and for a pair of positions we consider all possible haplotypes that could be
@@ -795,12 +793,12 @@ generated by a single mutation occurring on each of the two trees.
 For two trees, one with {math}`n` branches and the other with {math}`m`
 branches, there are {math}`nm` possible pairs of branches that may carry the
 pair of mutations. For each pair, we compute the two-locus statistic, and then
-sum these values weighted by the product of the two branch lengths. Given the
-two mutations occur, this accounts for the relative probability that the two
-mutations fall on any pair of branches.
+sum these values weighted by the product of the two branch lengths. Given that
+the two mutations occur, this accounts for the relative probability that the
+two mutations fall on any pair of branches.
 
 In other words, imagine we place two mutations uniformly, one on each tree, and
-then compute the statistic; the branch mode computes the expected value of the
+then compute the statistic. The branch mode computes the expected value of the
 statistic over this process, multiplied by the product of the total branch
 lengths of each tree. This weighting accounts for mutational opportunity, so that
 the sum of the branch-mode statistic over all positions in a genomic region,
@@ -828,7 +826,20 @@ ld = ts.ld_matrix(
 print(ld)
 ```
 
-Again, we can specify the row and column trees separately.
+We note that these values are quite large: as described above, the statistic is
+scaled by the product of the total branch lengths of each pair of trees. To
+compute the expected {math}`r^2` value for a pair of mutations that each land
+uniformly on the pair of trees, we can divide by the product of the total
+branch lengths:
+
+```{code-cell} ipython3
+total_branch_lengths = [tree.total_branch_length for tree in ts.trees()]
+prod_branch_lengths = np.outer(total_branch_lengths, total_branch_lengths)
+print(ld / prod_branch_lengths[0:4, 0:4])
+```
+
+As with the `"site"` mode above, we can specify the row and column trees
+separately.
 
 ```{code-cell} ipython3
 breakpoints = ts.breakpoints(as_array=True)
@@ -859,15 +870,15 @@ are selected in the same way (with the `stat` argument), and these are limited
 to a handful of statistics (see
 {ref}`sec_stats_two_locus_summary_functions_two_way`). The dimension-dropping
 rules for the result follow the rest of the tskit stats API in that a single
-list or tuple will produce a single two-dimensional matrix, while list of these
-will produce a three-dimensional array, with the first dimension of length
-equal to the length of the list.
+list or tuple will produce a single two-dimensional matrix, while a list of
+these will produce a three-dimensional array, with the first dimension of
+length equal to the length of the list.
 
 For example, to compute the {math}`r^2` LD matrix over a subset of samples in
 the tree sequence (such as sample nodes 0 through 7), we would specify the
 samples as follows:
 
-```
+```{code-cell} ipython3
 ts = msprime.sim_ancestry(
     20,
     population_size=10000,
@@ -876,7 +887,7 @@ ts = msprime.sim_ancestry(
     random_seed=12)
 ts = msprime.sim_mutations(ts, rate=2e-8, random_seed=12)
 
-ld = ts.ld_matrix(mode=“site”, sample_sets=range(8))
+ld = ts.ld_matrix(mode="site", sample_sets=range(8))
 print(ld)
 ```
 
@@ -896,27 +907,27 @@ ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) # -> 3
 #### Why are there `nan` values in the LD matrix?
 
 For some statistics, it is possible to observe `nan` entries in the LD matrix,
-which can be surprising or numerically impact downstream analyses. A `nan`
+which can be surprising and may numerically impact downstream analyses. A `nan`
 entry occurs if the denominator of a ratio statistic (including {math}`r` and
 {math}`r^2`) is zero, indicating that one or both of the alleles in the pair is
-fixed or absent in the sample set under consideration. This can happen for
+fixed or absent in the given sample set(s). This can happen for
 a number of reasons:
 
-- The mutation models allows for reversible mutations, so a back mutation at
-  a site has resulted in a single allele despite multiple mutations in the
+- Some mutation models allow for reversible mutations, so a back mutation at
+  a site can result in a single allele despite multiple mutations in the
   history of the sample.
 - LD is computed for a subsample of individuals, and some sites are not
   variable among the sample nodes in the subsample.
 - A mutation exists above the root of the local tree, so that all samples carry
   the mutation, and one or more sites are not variable.
 
-The `branch` mode will also return `nan` values if there are branches in either
-tree on which a mutation would not result in a polymorphism within a sample
-set.
+The `branch` mode will also return `nan` values for ratio statistics if there
+are branches in either tree on which a mutation would not result in
+a polymorphism within a sample set.
 
 (sec_stats_two_locus_sample_one_way_stats)=
 
-##### One-way Statistics
+#### One-way Statistics
 
 One-way statistics are summaries of two loci in a single sample set, using
 a triple of haplotype counts {math}`\{n_{AB}, n_{Ab}, n_{aB}\}` and the size of
@@ -925,7 +936,7 @@ notation represent alternate alleles.
 
 (sec_stats_two_locus_sample_two_way_stats)=
 
-##### Two-way Statistics
+#### Two-way Statistics
 
 Two-way statistics are summaries of haplotype counts between two sample sets,
 which operate on the three haplotype counts (as in one-way stats, above)
@@ -937,7 +948,8 @@ covariance of {math}`D` between sample sets.
 
 Only a subset of our summary functions are two-way statistics (see
 {ref}`sec_two_locus_summary_functions_two_way`). Note that the unbiased two-way
-statistics expect non-overlapping sample sets, and we do not make any
+statistics expect non-overlapping sample sets (see [Ragsdale and Gravel
+(2020)](https://doi.org/10.1093/molbev/msz265)), and we do not make any
 assertions about the sample sets and assume that `i` and `j` represent disjoint
 sets of samples (see also the note in {meth}`~TreeSequence.divergence`).
 
@@ -958,7 +970,7 @@ and {math}`n_B = n_{AB} + n_{aB}`, with frequencies {math}`p` found by dividing
 by {math}`n`.
 
 `D`
-: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = p_{AB}p_{ab} - p_{Ab}p_{aB}`
+: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = p_{AB}p_{ab} - p_{Ab}p_{aB} \, (=p_{AB} - p_A p_B)`
 
   This statistic is polarised, as the unpolarised result, which averages over
   allele labelings, is zero. Uses the `total` weighting method.
@@ -1015,7 +1027,7 @@ Two-way statistics are indexed by sample sets {math}`i, j` and compute values
 using haplotype counts within pairs of sample sets.
 
 `D2`
-: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = D_i * D_j`,
+: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = D_i D_j`,
 
   where {math}`D_i` denotes {math}`D` computed within sample set {math}`i`,
   and {math}`D` is defined above. Unpolarised, `total` weighted.
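
Review note on the `D` hunk: the parenthetical identity added to the `D` definition, {math}`p_{AB}p_{ab} - p_{Ab}p_{aB} = p_{AB} - p_A p_B`, can be checked numerically from haplotype counts. The following is a minimal sketch in plain Python, not the tskit implementation; the function names `d_from_counts` and `d_from_marginals` are hypothetical and exist only for this check. It assumes, as in the docs, that {math}`n_{ab} = n - n_{AB} - n_{Ab} - n_{aB}`, {math}`n_A = n_{AB} + n_{Ab}` and {math}`n_B = n_{AB} + n_{aB}`.

```python
def d_from_counts(n_AB, n_Ab, n_aB, n):
    """D as the difference of haplotype-frequency products (hypothetical helper)."""
    n_ab = n - n_AB - n_Ab - n_aB  # the fourth haplotype count is implied by n
    p_AB, p_Ab, p_aB, p_ab = (c / n for c in (n_AB, n_Ab, n_aB, n_ab))
    return p_AB * p_ab - p_Ab * p_aB


def d_from_marginals(n_AB, n_Ab, n_aB, n):
    """D as p_AB minus the product of marginal allele frequencies (hypothetical helper)."""
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n  # marginal frequency of allele A at the first locus
    p_B = (n_AB + n_aB) / n  # marginal frequency of allele B at the second locus
    return p_AB - p_A * p_B


# Compare the two forms on a few arbitrary count configurations.
for counts in [(3, 2, 1, 10), (0, 5, 5, 10), (4, 0, 0, 8), (1, 1, 1, 4)]:
    a = d_from_counts(*counts)
    b = d_from_marginals(*counts)
    print(counts, a, b)
```

The two forms agree up to floating-point rounding for any valid counts, which supports stating them side by side in the definition.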