@@ -694,49 +694,47 @@ and boolean expressions (e.g., {math}`(x > 0)`) are interpreted as 0/1.
694694
695695The {meth}` ~TreeSequence.ld_matrix ` method provides an interface to
696696a collection of two-locus statistics with predefined summary functions (see
697- {ref}` sec_stats_two_locus_summary_functions ` ) and ` site ` and ` branch `
698- {ref} ` modes <sec_stats_mode> ` . The LD matrix method differs from other
697+ {ref}` sec_stats_two_locus_summary_functions ` ).
698+ The LD matrix method differs from other
699699statistics methods in that it provides a unified API with an argument to
700700specify different two-locus summaries of the data. It otherwise behaves
701701similarly to most other functions with respect to ` sample_sets ` and ` indexes ` .
702702
703- Two-locus statistics can be computed using two modes, either ` site ` or
704- ` branch ` , and these should be interpreted in the same way as these modes in the
705- single-site statistics. That is, the ` site ` mode computes LD over observed
706- mutations at pairs of sites, while the ` branch ` model computes expected
707- LD conditioned on pairs of trees.
703+ Two-locus statistics can be computed using two {ref} ` modes <sec_stats_mode> ` ,
704+ either ` site ` or ` branch ` , and these should be interpreted in the same way as
705+ these modes in the single-site statistics. That is, the ` site ` mode computes LD
706+ over observed alleles at pairs of sites, while the ` branch ` model computes
707+ expected LD conditioned on pairs of trees.
708708
709709(sec_stats_two_locus_site)=
710710
711711#### Site mode
712712
713- The ` site ` mode computes two-locus statistics summarized over alleles between
713+ The ` " site" ` mode computes two-locus statistics summarized over alleles between
714714all pairs of specified sites. The default behavior, leaving ` sites `
715- unspecified, will compute a matrix for all pairs of sites, with
716- one row and column for each site in the tree sequence (i.e., an n x n
717- matrix where n is the number of sites in the tree sequence). We can also
715+ unspecified, will compute a matrix for all pairs of sites, with one row and
716+ column for each site in the tree sequence (i.e., an {math} ` n \times n ` matrix
717+ where {math} ` n ` is the number of sites in the tree sequence). We can also
718718restrict the output to a subset of sites, either by specifying a single vector
719- for both rows and columns or a pair of vectors for the row sites and column
720- sites separately.
719+ of site indexes for both rows and columns or a pair of vectors for the row
720+ sites and column sites separately.
721721
722722The following computes a matrix of the {math}` r^2 ` measure of linkage
723723disequilibrium (LD) computed pairwise between the first 4 sites in the tree
724- sequence among all samples. In our computations, row sites are used as the row
725- ("left-hand") loci and column sites are used as the column ("right-hand")
726- locus, and with a single list of sites specified, we obtain a symmetric square
727- matrix.
724+ sequence among all samples. The ` sites ` must be given as a list of lists, and
725+ with a single list of sites specified, we obtain a symmetric square matrix.
728726
729727``` {code-cell} ipython3
730728ld = ts.ld_matrix(sites=[[0, 1, 2, 3]])
731729print(ld)
732730```
733731
734732The following demonstrates how we can specify the row and column sites
735- independently of each other. We're specifying 3 columns and 2 rows , which
733+ independently of each other. We're specifying 2 rows and 3 columns , which
736734computes a subset of the matrix shown above.
737735
738736``` {code-cell} ipython3
739- ld = ts.ld_matrix(sites=[[0, 1 ], [1, 2, 3]])
737+ ld = ts.ld_matrix(sites=[[1, 2 ], [1, 2, 3]])
740738print(ld)
741739```
742740
@@ -772,21 +770,21 @@ Out of all of the available summary functions, only {math}`r^2` uses
772770` hap_weighted ` normalisation, with the remainder using uniform weighting
773771(` total_weighted ` ).
774772
775- Within this framework, statistics may be either
776- polarised or unpolarised. For statistics that are polarised, we compute
777- statistic values for pairs of derived alleles. (For this purpose, the "derived" alleles
778- at a site are all alleles except that stored as the `` ancestral_state `` for the site.)
779- Unpolarised statistics compute
780- statistics over all pairs of alleles, derived and ancestral. In either case,
781- the result is averaged over these values, using a weighting
782- scheme described below. The option for polarisation is not exposed to the user,
783- and we list which statistics are polarised below.
773+ Within this framework, statistics may be either polarised or unpolarised. For
774+ statistics that are polarised, we compute statistic values for pairs of derived
775+ alleles. (For this purpose, the "derived" alleles at a site are all alleles
776+ except that stored as the `` ancestral_state `` for the site.) Unpolarised
777+ statistics compute statistics over all pairs of alleles, derived and ancestral.
778+ In either case, the result is averaged over these values, using one of the
779+ weighting scheme (described below for each statistics). The option for
780+ polarisation is not exposed to the user, and we list which statistics are
781+ polarised below.
784782
785783(sec_stats_two_locus_branch)=
786784
787785#### Branch mode
788786
789- The ` branch ` mode computes expected two-locus statistics between pairs of
787+ The ` " branch" ` mode computes expected two-locus statistics between pairs of
790788trees, conditioned on the marginal topologies and branch lengths of those
791789trees. The trees for which we compute statistics are specified by positions,
792790and for a pair of positions we consider all possible haplotypes that could be
@@ -795,12 +793,12 @@ generated by a single mutation occurring on each of the two trees.
795793For two trees, one with {math}` n ` branches and the other with {math}` m `
796794branches, there are {math}` nm ` possible pairs of branches that may carry the
797795pair of mutations. For each pair, we compute the two-locus statistic, and then
798- sum these values weighted by the product of the two branch lengths. Given the
799- two mutations occur, this accounts for the relative probability that the two
800- mutations fall on any pair of branches.
796+ sum these values weighted by the product of the two branch lengths. Given that
797+ the two mutations occur, this accounts for the relative probability that the
798+ two mutations fall on any pair of branches.
801799
802800In other words, imagine we place two mutations uniformly, one on each tree, and
803- then compute the statistic; the branch mode computes the expected value of the
801+ then compute the statistic. The branch mode computes the expected value of the
804802statistic over this process, multiplied by the product of the total branch
805803lengths of each tree. This weighting accounts for mutational opportunity, so that
806804the sum of the branch-mode statistic over all positions in a genomic region,
@@ -828,7 +826,20 @@ ld = ts.ld_matrix(
828826print(ld)
829827```
830828
831- Again, we can specify the row and column trees separately.
829+ We note that these values are quite large: as described above, the statistic is
830+ scaled by the product of the total branch lengths of each pair of trees. To
831+ compute the expected {math}` r^2 ` value for a pair of mutations that each land
832+ uniformly on the pair of trees, we can divide by the product of the total
833+ branch lengths:
834+
835+ ``` {code-cell} ipython3
836+ total_branch_lengths = [tree.total_branch_length for tree in ts.trees()]
837+ prod_branch_lengths = np.outer(total_branch_lengths, total_branch_lengths)
838+ print(ld / prod_branch_lengths[0:4, 0:4])
839+ ```
840+
841+ As with the ` "site" ` mode above, we can specify the row and column trees
842+ separately.
832843
833844``` {code-cell} ipython3
834845breakpoints = ts.breakpoints(as_array=True)
@@ -859,15 +870,15 @@ are selected in the same way (with the `stat` argument), and these are limited
859870to a handful of statistics (see
860871{ref}` sec_stats_two_locus_summary_functions_two_way ` ). The dimension-dropping
861872rules for the result follow the rest of the tskit stats API in that a single
862- list or tuple will produce a single two-dimensional matrix, while list of these
863- will produce a three-dimensional array, with the first dimension of length
864- equal to the length of the list.
873+ list or tuple will produce a single two-dimensional matrix, while a list of
874+ these will produce a three-dimensional array, with the first dimension of
875+ length equal to the length of the list.
865876
866877For example, to compute the {math}` r^2 ` LD matrix over a subset of samples in
867878the tree sequence (such as sample nodes 0 through 7), we would specify the
868879samples as follows:
869880
870- ```
881+ ``` {code-cell} ipython3
871882ts = msprime.sim_ancestry(
872883 20,
873884 population_size=10000,
@@ -876,7 +887,7 @@ ts = msprime.sim_ancestry(
876887 random_seed=12)
877888ts = msprime.sim_mutations(ts, rate=2e-8, random_seed=12)
878889
879- ld = ts.ld_matrix(mode=“ site” , sample_sets=range(8))
890+ ld = ts.ld_matrix(mode=" site" , sample_sets=range(8))
880891print(ld)
881892```
882893
@@ -896,27 +907,27 @@ ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) # -> 3
896907#### Why are there ` nan ` values in the LD matrix?
897908
898909For some statistics, it is possible to observe ` nan ` entries in the LD matrix,
899- which can be surprising or numerically impact downstream analyses. A ` nan `
910+ which can be surprising and may numerically impact downstream analyses. A ` nan `
900911entry occurs if the denominator of a ratio statistic (including {math}` r ` and
901912{math}` r^2 ` ) is zero, indicating that one or both of the alleles in the pair is
902- fixed or absent in the sample set under consideration . This can happen for
913+ fixed or absent in the given sample set(s) . This can happen for
903914a number of reasons:
904915
905- - The mutation models allows for reversible mutations, so a back mutation at
906- a site has resulted in a single allele despite multiple mutations in the
916+ - Some mutation models allow for reversible mutations, so a back mutation at
917+ a site can result in a single allele despite multiple mutations in the
907918 history of the sample.
908919- LD is computed for a subsample of individuals, and some sites are not
909920 variable among the sample nodes in the subsample.
910921- A mutation exists above the root of the local tree, so that all samples carry
911922 the mutation, and one or more sites are not variable.
912923
913- The ` branch ` mode will also return ` nan ` values if there are branches in either
914- tree on which a mutation would not result in a polymorphism within a sample
915- set.
924+ The ` branch ` mode will also return ` nan ` values for ratio statistics if there
925+ are branches in either tree on which a mutation would not result in
926+ a polymorphism within a sample set.
916927
917928(sec_stats_two_locus_sample_one_way_stats)=
918929
919- ##### One-way Statistics
930+ #### One-way Statistics
920931
921932One-way statistics are summaries of two loci in a single sample set, using
922933a triple of haplotype counts {math}` \{n_{AB}, n_{Ab}, n_{aB}\} ` and the size of
@@ -925,7 +936,7 @@ notation represent alternate alleles.
925936
926937(sec_stats_two_locus_sample_two_way_stats)=
927938
928- ##### Two-way Statistics
939+ #### Two-way Statistics
929940
930941Two-way statistics are summaries of haplotype counts between two sample sets,
931942which operate on the three haplotype counts (as in one-way stats, above)
@@ -937,7 +948,8 @@ covariance of {math}`D` between sample sets.
937948
938949Only a subset of our summary functions are two-way statistics (see
939950{ref}` sec_two_locus_summary_functions_two_way ` ). Note that the unbiased two-way
940- statistics expect non-overlapping sample sets, and we do not make any
951+ statistics expect non-overlapping sample sets (see [ Ragsdale and Gravel
952+ (2020)] ( https://doi.org/10.1093/molbev/msz265 ) ), and we do not make any
941953assertions about the sample sets and assume that ` i ` and ` j ` represent disjoint
942954sets of samples (see also the note in {meth}` ~TreeSequence.divergence ` ).
943955
@@ -958,7 +970,7 @@ and {math}`n_B = n_{AB} + n_{aB}`, with frequencies {math}`p` found by dividing
958970by {math}` n ` .
959971
960972` D `
961- : {math}` f(n_{AB}, n_{Ab}, n_{aB}, n) = p_{AB}p_{ab} - p_{Ab}p_{aB} `
973+ : {math}` f(n_{AB}, n_{Ab}, n_{aB}, n) = p_{AB}p_{ab} - p_{Ab}p_{aB} \, (=p_{AB} - p_A p_B) `
962974
963975 This statistic is polarised, as the unpolarised result, which averages over
964976 allele labelings, is zero. Uses the ` total ` weighting method.
@@ -1015,7 +1027,7 @@ Two-way statistics are indexed by sample sets {math}`i, j` and compute values
10151027using haplotype counts within pairs of sample sets.
10161028
10171029` D2 `
1018- : {math}` f(n_{AB}, n_{Ab}, n_{aB}, n) = D_i * D_j ` ,
1030+ : {math}` f(n_{AB}, n_{Ab}, n_{aB}, n) = D_i D_j ` ,
10191031
10201032 where {math}` D_i ` denotes {math}` D ` computed within sample set {math}` i ` ,
10211033 and {math}` D ` is defined above. Unpolarised, ` total ` weighted.
0 commit comments