Skip to content

Commit 547bdc8

Browse files
committed
Read through of stats docs
1 parent 536f144 commit 547bdc8

File tree

1 file changed

+63
-51
lines changed

1 file changed

+63
-51
lines changed

docs/stats.md

Lines changed: 63 additions & 51 deletions
Original file line numberDiff line numberDiff line change
@@ -694,49 +694,47 @@ and boolean expressions (e.g., {math}`(x > 0)`) are interpreted as 0/1.
694694

695695
The {meth}`~TreeSequence.ld_matrix` method provides an interface to
696696
a collection of two-locus statistics with predefined summary functions (see
697-
{ref}`sec_stats_two_locus_summary_functions`) and `site` and `branch`
698-
{ref}`modes <sec_stats_mode>`. The LD matrix method differs from other
697+
{ref}`sec_stats_two_locus_summary_functions`).
698+
The LD matrix method differs from other
699699
statistics methods in that it provides a unified API with an argument to
700700
specify different two-locus summaries of the data. It otherwise behaves
701701
similarly to most other functions with respect to `sample_sets` and `indexes`.
702702

703-
Two-locus statistics can be computed using two modes, either `site` or
704-
`branch`, and these should be interpreted in the same way as these modes in the
705-
single-site statistics. That is, the `site` mode computes LD over observed
706-
mutations at pairs of sites, while the `branch` model computes expected
707-
LD conditioned on pairs of trees.
703+
Two-locus statistics can be computed using two {ref}`modes <sec_stats_mode>`,
704+
either `site` or `branch`, and these should be interpreted in the same way as
705+
these modes in the single-site statistics. That is, the `site` mode computes LD
706+
over observed alleles at pairs of sites, while the `branch` model computes
707+
expected LD conditioned on pairs of trees.
708708

709709
(sec_stats_two_locus_site)=
710710

711711
#### Site mode
712712

713-
The `site` mode computes two-locus statistics summarized over alleles between
713+
The `"site"` mode computes two-locus statistics summarized over alleles between
714714
all pairs of specified sites. The default behavior, leaving `sites`
715-
unspecified, will compute a matrix for all pairs of sites, with
716-
one row and column for each site in the tree sequence (i.e., an n x n
717-
matrix where n is the number of sites in the tree sequence). We can also
715+
unspecified, will compute a matrix for all pairs of sites, with one row and
716+
column for each site in the tree sequence (i.e., an {math}`n \times n` matrix
717+
where {math}`n` is the number of sites in the tree sequence). We can also
718718
restrict the output to a subset of sites, either by specifying a single vector
719-
for both rows and columns or a pair of vectors for the row sites and column
720-
sites separately.
719+
of site indexes for both rows and columns or a pair of vectors for the row
720+
sites and column sites separately.
721721

722722
The following computes a matrix of the {math}`r^2` measure of linkage
723723
disequilibrium (LD) computed pairwise between the first 4 sites in the tree
724-
sequence among all samples. In our computations, row sites are used as the row
725-
("left-hand") loci and column sites are used as the column ("right-hand")
726-
locus, and with a single list of sites specified, we obtain a symmetric square
727-
matrix.
724+
sequence among all samples. The `sites` must be given as a list of lists, and
725+
with a single list of sites specified, we obtain a symmetric square matrix.
728726

729727
```{code-cell} ipython3
730728
ld = ts.ld_matrix(sites=[[0, 1, 2, 3]])
731729
print(ld)
732730
```
733731

734732
The following demonstrates how we can specify the row and column sites
735-
independently of each other. We're specifying 3 columns and 2 rows, which
733+
independently of each other. We're specifying 2 rows and 3 columns, which
736734
computes a subset of the matrix shown above.
737735

738736
```{code-cell} ipython3
739-
ld = ts.ld_matrix(sites=[[0, 1], [1, 2, 3]])
737+
ld = ts.ld_matrix(sites=[[1, 2], [1, 2, 3]])
740738
print(ld)
741739
```
742740

@@ -772,21 +770,21 @@ Out of all of the available summary functions, only {math}`r^2` uses
772770
`hap_weighted` normalisation, with the remainder using uniform weighting
773771
(`total_weighted`).
774772

775-
Within this framework, statistics may be either
776-
polarised or unpolarised. For statistics that are polarised, we compute
777-
statistic values for pairs of derived alleles. (For this purpose, the "derived" alleles
778-
at a site are all alleles except that stored as the ``ancestral_state`` for the site.)
779-
Unpolarised statistics compute
780-
statistics over all pairs of alleles, derived and ancestral. In either case,
781-
the result is averaged over these values, using a weighting
782-
scheme described below. The option for polarisation is not exposed to the user,
783-
and we list which statistics are polarised below.
773+
Within this framework, statistics may be either polarised or unpolarised. For
774+
statistics that are polarised, we compute statistic values for pairs of derived
775+
alleles. (For this purpose, the "derived" alleles at a site are all alleles
776+
except that stored as the ``ancestral_state`` for the site.) Unpolarised
777+
statistics compute statistics over all pairs of alleles, derived and ancestral.
778+
In either case, the result is averaged over these values, using one of the
779+
weighting scheme (described below for each statistics). The option for
780+
polarisation is not exposed to the user, and we list which statistics are
781+
polarised below.
784782

785783
(sec_stats_two_locus_branch)=
786784

787785
#### Branch mode
788786

789-
The `branch` mode computes expected two-locus statistics between pairs of
787+
The `"branch"` mode computes expected two-locus statistics between pairs of
790788
trees, conditioned on the marginal topologies and branch lengths of those
791789
trees. The trees for which we compute statistics are specified by positions,
792790
and for a pair of positions we consider all possible haplotypes that could be
@@ -795,12 +793,12 @@ generated by a single mutation occurring on each of the two trees.
795793
For two trees, one with {math}`n` branches and the other with {math}`m`
796794
branches, there are {math}`nm` possible pairs of branches that may carry the
797795
pair of mutations. For each pair, we compute the two-locus statistic, and then
798-
sum these values weighted by the product of the two branch lengths. Given the
799-
two mutations occur, this accounts for the relative probability that the two
800-
mutations fall on any pair of branches.
796+
sum these values weighted by the product of the two branch lengths. Given that
797+
the two mutations occur, this accounts for the relative probability that the
798+
two mutations fall on any pair of branches.
801799

802800
In other words, imagine we place two mutations uniformly, one on each tree, and
803-
then compute the statistic; the branch mode computes the expected value of the
801+
then compute the statistic. The branch mode computes the expected value of the
804802
statistic over this process, multiplied by the product of the total branch
805803
lengths of each tree. This weighting accounts for mutational opportunity, so that
806804
the sum of the branch-mode statistic over all positions in a genomic region,
@@ -828,7 +826,20 @@ ld = ts.ld_matrix(
828826
print(ld)
829827
```
830828

831-
Again, we can specify the row and column trees separately.
829+
We note that these values are quite large: as described above, the statistic is
830+
scaled by the product of the total branch lengths of each pair of trees. To
831+
compute the expected {math}`r^2` value for a pair of mutations that each land
832+
uniformly on the pair of trees, we can divide by the product of the total
833+
branch lengths:
834+
835+
```{code-cell} ipython3
836+
total_branch_lengths = [tree.total_branch_length for tree in ts.trees()]
837+
prod_branch_lengths = np.outer(total_branch_lengths, total_branch_lengths)
838+
print(ld / prod_branch_lengths[0:4, 0:4])
839+
```
840+
841+
As with the `"site"` mode above, we can specify the row and column trees
842+
separately.
832843

833844
```{code-cell} ipython3
834845
breakpoints = ts.breakpoints(as_array=True)
@@ -859,15 +870,15 @@ are selected in the same way (with the `stat` argument), and these are limited
859870
to a handful of statistics (see
860871
{ref}`sec_stats_two_locus_summary_functions_two_way`). The dimension-dropping
861872
rules for the result follow the rest of the tskit stats API in that a single
862-
list or tuple will produce a single two-dimensional matrix, while list of these
863-
will produce a three-dimensional array, with the first dimension of length
864-
equal to the length of the list.
873+
list or tuple will produce a single two-dimensional matrix, while a list of
874+
these will produce a three-dimensional array, with the first dimension of
875+
length equal to the length of the list.
865876

866877
For example, to compute the {math}`r^2` LD matrix over a subset of samples in
867878
the tree sequence (such as sample nodes 0 through 7), we would specify the
868879
samples as follows:
869880

870-
```
881+
```{code-cell} ipython3
871882
ts = msprime.sim_ancestry(
872883
20,
873884
population_size=10000,
@@ -876,7 +887,7 @@ ts = msprime.sim_ancestry(
876887
random_seed=12)
877888
ts = msprime.sim_mutations(ts, rate=2e-8, random_seed=12)
878889
879-
ld = ts.ld_matrix(mode=site, sample_sets=range(8))
890+
ld = ts.ld_matrix(mode="site", sample_sets=range(8))
880891
print(ld)
881892
```
882893

@@ -896,27 +907,27 @@ ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) # -> 3
896907
#### Why are there `nan` values in the LD matrix?
897908

898909
For some statistics, it is possible to observe `nan` entries in the LD matrix,
899-
which can be surprising or numerically impact downstream analyses. A `nan`
910+
which can be surprising and may numerically impact downstream analyses. A `nan`
900911
entry occurs if the denominator of a ratio statistic (including {math}`r` and
901912
{math}`r^2`) is zero, indicating that one or both of the alleles in the pair is
902-
fixed or absent in the sample set under consideration. This can happen for
913+
fixed or absent in the given sample set(s). This can happen for
903914
a number of reasons:
904915

905-
- The mutation models allows for reversible mutations, so a back mutation at
906-
a site has resulted in a single allele despite multiple mutations in the
916+
- Some mutation models allow for reversible mutations, so a back mutation at
917+
a site can result in a single allele despite multiple mutations in the
907918
history of the sample.
908919
- LD is computed for a subsample of individuals, and some sites are not
909920
variable among the sample nodes in the subsample.
910921
- A mutation exists above the root of the local tree, so that all samples carry
911922
the mutation, and one or more sites are not variable.
912923

913-
The `branch` mode will also return `nan` values if there are branches in either
914-
tree on which a mutation would not result in a polymorphism within a sample
915-
set.
924+
The `branch` mode will also return `nan` values for ratio statistics if there
925+
are branches in either tree on which a mutation would not result in
926+
a polymorphism within a sample set.
916927

917928
(sec_stats_two_locus_sample_one_way_stats)=
918929

919-
##### One-way Statistics
930+
#### One-way Statistics
920931

921932
One-way statistics are summaries of two loci in a single sample set, using
922933
a triple of haplotype counts {math}`\{n_{AB}, n_{Ab}, n_{aB}\}` and the size of
@@ -925,7 +936,7 @@ notation represent alternate alleles.
925936

926937
(sec_stats_two_locus_sample_two_way_stats)=
927938

928-
##### Two-way Statistics
939+
#### Two-way Statistics
929940

930941
Two-way statistics are summaries of haplotype counts between two sample sets,
931942
which operate on the three haplotype counts (as in one-way stats, above)
@@ -937,7 +948,8 @@ covariance of {math}`D` between sample sets.
937948

938949
Only a subset of our summary functions are two-way statistics (see
939950
{ref}`sec_two_locus_summary_functions_two_way`). Note that the unbiased two-way
940-
statistics expect non-overlapping sample sets, and we do not make any
951+
statistics expect non-overlapping sample sets (see [Ragsdale and Gravel
952+
(2020)](https://doi.org/10.1093/molbev/msz265)), and we do not make any
941953
assertions about the sample sets and assume that `i` and `j` represent disjoint
942954
sets of samples (see also the note in {meth}`~TreeSequence.divergence`).
943955

@@ -958,7 +970,7 @@ and {math}`n_B = n_{AB} + n_{aB}`, with frequencies {math}`p` found by dividing
958970
by {math}`n`.
959971

960972
`D`
961-
: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = p_{AB}p_{ab} - p_{Ab}p_{aB}`
973+
: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = p_{AB}p_{ab} - p_{Ab}p_{aB} \, (=p_{AB} - p_A p_B)`
962974

963975
This statistic is polarised, as the unpolarised result, which averages over
964976
allele labelings, is zero. Uses the `total` weighting method.
@@ -1015,7 +1027,7 @@ Two-way statistics are indexed by sample sets {math}`i, j` and compute values
10151027
using haplotype counts within pairs of sample sets.
10161028

10171029
`D2`
1018-
: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = D_i * D_j`,
1030+
: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = D_i D_j`,
10191031

10201032
where {math}`D_i` denotes {math}`D` computed within sample set {math}`i`,
10211033
and {math}`D` is defined above. Unpolarised, `total` weighted.

0 commit comments

Comments
 (0)