Skip to content

Documentation for LD matrix methods#3416

Merged
petrelharp merged 1 commit intotskit-dev:mainfrom
apragsdale:ld_matrix_docs
Mar 27, 2026
Merged

Documentation for LD matrix methods#3416
petrelharp merged 1 commit intotskit-dev:mainfrom
apragsdale:ld_matrix_docs

Conversation

@apragsdale
Copy link
Copy Markdown
Contributor

See #3353.

This largely pulls material from an existing, open PR to complete minimal documentation for the LD matrix methods currently available in tskit. I've made edits for clarity from the original PR, and removed some material that is possibly confusing or more information than needed for a user.

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.92%. Comparing base (afaf3b9) to head (e60b287).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3416   +/-   ##
=======================================
  Coverage   91.92%   91.92%           
=======================================
  Files          37       37           
  Lines       32153    32153           
  Branches     5143     5143           
=======================================
  Hits        29556    29556           
  Misses       2264     2264           
  Partials      333      333           
Flag Coverage Δ
C 82.70% <ø> (ø)
c-python 77.34% <ø> (ø)
python-tests 96.40% <ø> (ø)
python-tests-no-jit 33.22% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
Python API 98.69% <ø> (ø)
Python C interface 91.23% <ø> (ø)
C library 88.86% <ø> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had a quick read through and generally looks great. I've spotted a few typos and left a few take-it-or-leave-it comments.

I haven't thought through the details at all through, and I think it needs a careful review from @petrelharp for that.

@apragsdale
Copy link
Copy Markdown
Contributor Author

Thanks so much for the quick review, @jeromekelleher. I agree with your suggestions and will fix things up accordingly.

@lkirk
Copy link
Copy Markdown
Contributor

lkirk commented Mar 9, 2026

I think this will also need a changelog entry, since adding documentation will announce that this API is now public. Does that happen here or in a later part of the release pipeline?

@petrelharp
Copy link
Copy Markdown
Contributor

That could happen here!

@jeromekelleher
Copy link
Copy Markdown
Member

We can add to the changelog here or not, whichever is easier. We do need a review from you though @petrelharp as I'm not on top of the stats details.

@jeromekelleher jeromekelleher added this to the 1.1.0 milestone Mar 10, 2026
@jeromekelleher
Copy link
Copy Markdown
Member

I've created a milestone for version 1.1, which will basically be this new LD API plus the new metadata codec. Would be good to get it shipped in the next week or so if possible.

Copy link
Copy Markdown
Contributor

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See suggested edits (and feel free to say I'm wrong on any).

@petrelharp
Copy link
Copy Markdown
Contributor

petrelharp commented Mar 14, 2026

This is great, thanks! Remaining things:

  • This should explain more precisely how branch mode is computed and what its interpretation is as an expected value.
  • How's it work with multiallelic sites? What exactly are the weighting schemes?
  • How's polarization work?

Background:

@petrelharp
Copy link
Copy Markdown
Contributor

Okay, just checking the details of branch mode:

Okay, so: for a given pair of branches (on the same or different trees), the sample counts wAb, waB, wAB are the same for any pair of mutation falling on those branches; so the expected contribution to average LD determined by the summary funciton f per unit sequence length of this pair of branches is f(w_{Ab}, w_{aB}, w_{AB}, n) multiplied by the lengths of the two branches.

So, for computation, I think we should say very clearly somewhere that branch mode works by summing over all pairs of branches the value of f( ) multiplied by the lengths of the two branches.

@petrelharp
Copy link
Copy Markdown
Contributor

As for interpretation (ie in what sense is it the expected value):

First off, I'm confused about a dumb silly thing. If I simulate a short tree sequence and compute ld_matrix (with r2 and site mode) I get a bunch of nans. Why? (This is confusing because the denominator for r2 is p_A * p_B * (1 - p_A) * (1 - p_B); and msprime only generates mutations with 0 < p < 1.)

The nan thing makes me think I'm missing some basic and important thing. But if my understanding above on how it is computed is correct here are some equivalent options for what branch mode means:

  1. Put down two Poisson processes of mutations on each tree, and compute the stat between each pair of mutations; add those up across pairs of mutations; then branch mode is the expected value of this sum.
  2. Pick two mutations uniformly, one on each tree; compute the stat; then branch mode is the expected stat multiplied by the total branch lengths of each tree.
  3. Compute the average stat across all mutations using a large number of infinite-sites mutations at rate mu per bp. The result (for large mu) is mu multiplied by the sum across pairs of trees of the branch stat for those trees multiplied by the spans of the two trees.

@petrelharp
Copy link
Copy Markdown
Contributor

For the multiallelic question, the computations happen here and here; based on where they're called from I'm assuming that the first function is just a quicker special-purpose version of the second one for both biallelic sites.

(ps I'm assuming you know this all, but I forget everything and it's nice to go look at the details)

So, it is summing over all pairs of alleles but then normalised.

@petrelharp
Copy link
Copy Markdown
Contributor

I think we just need a few more paragraphs, documenting exactly what's being calculated, as above. Can you take a stab at this @apragsdale?

@apragsdale
Copy link
Copy Markdown
Contributor Author

I think we just need a few more paragraphs, documenting exactly what's being calculated, as above. Can you take a stab at this @apragsdale?

Hi @petrelharp - yes, of course! I'll tackle these today and this weekend.

Thank you for such detailed comments. With some earlier rounds of revision on these docs, I had gone back and forth with including too much vs too little detail, and probably ended up falling on the too-little side of it.

I also agree the nan calculations are confusing, and I remember @lkirk and I having discussions about that last year. I think we had a good explanation/work-around for it. I'll dig up those notes, because it will be important to document as well. I'm also returning to this from many months away from the code, so it will be good to remind myself of all of it as well, and make sure it's documented properly.

@petrelharp
Copy link
Copy Markdown
Contributor

I hope it's easy to ressurrect some of those earlier versions, then?

I also agree the nan calculations are confusing,

I'm not sure if they are? I imagine it's just "0/0 = nan", but then I'm confused why that happens?

@apragsdale
Copy link
Copy Markdown
Contributor Author

apragsdale commented Mar 15, 2026

First off, I'm confused about a dumb silly thing. If I simulate a short tree sequence and compute ld_matrix (with r2 and site mode) I get a bunch of nans. Why? (This is confusing because the denominator for r2 is p_A * p_B * (1 - p_A) * (1 - p_B); and msprime only generates mutations with 0 < p < 1.)

I'm not sure if they are? I imagine it's just "0/0 = nan", but then I'm confused why that happens?

Peter- does this occur in site mode, as you say above? We shouldn't get nans if msprime is only generating mutations with 0 < p < 1, and if we compute r^2 over all samples. However, if we specify a subsample as our sample set, the method computes LD at all sites still, and for some of those sites a mutation may not be found in the subsamples. Then p could equal 0 or 1. In this case, you'd get nans for any pair of mutations including one or two of such sites.

But I want to make sure that we are thinking about the same scenario that generates a nan.

In [23]: ts = msprime.sim_ancestry(10, population_size=1e4, recombination_rate=1e-8, sequence_length=1e4, random_seed=1)

In [24]: ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)

In [28]: ts.ld_matrix()
Out[28]: 
array([[1.        , 0.00277008, 0.00277008, 0.00277008],
       [0.00277008, 1.        , 0.00277008, 0.00277008],
       [0.00277008, 0.00277008, 1.        , 0.00277008],
       [0.00277008, 0.00277008, 0.00277008, 1.        ]])

In [36]: ts.ld_matrix(sample_sets=range(12))
Out[36]: 
array([[1.        ,        nan,        nan, 0.00826446],
       [       nan,        nan,        nan,        nan],
       [       nan,        nan,        nan,        nan],
       [0.00826446,        nan,        nan, 1.        ]])

Edit: In any case, this behavior should be documented.

@petrelharp
Copy link
Copy Markdown
Contributor

Here's what I did:

>>> ts = msprime.sim_ancestry(3, sequence_length=10, recombination_rate=0.1, random_seed=123)
>>> mts = msprime.sim_mutations(ts, rate=0.1, random_seed=456)
>>> mts.ld_matrix()
array([[1.   , 0.44 ,   nan, 0.44 , 0.44 , 0.125, 0.44 ],
       [0.44 , 1.   ,   nan, 1.   , 1.   , 0.1  , 1.   ],
       [  nan,   nan,   nan,   nan,   nan,   nan,   nan],
       [0.44 , 1.   ,   nan, 1.   , 1.   , 0.1  , 1.   ],
       [0.44 , 1.   ,   nan, 1.   , 1.   , 0.1  , 1.   ],
       [0.125, 0.1  ,   nan, 0.1  , 0.1  , 1.   , 0.1  ],
       [0.44 , 1.   ,   nan, 1.   , 1.   , 0.1  , 1.   ]])
>>> mts.num_sites
7
>>> mts.genotype_matrix()
array([[1, 0, 0, 0, 2, 0],
       [0, 1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 0],
       [0, 1, 1, 1, 1, 1],
       [0, 1, 1, 1, 1, 1],
       [0, 0, 1, 0, 1, 0],
       [1, 0, 0, 0, 0, 0]], dtype=int32)
>>> mts.genotype_matrix().sum(axis=1)
array([3, 5, 0, 5, 5, 2, 1])

Oh! I see: "msprime does not simulate sites where mutations that aren't polymorphic" but it does simulate situations where backmutations can create monomorphic sites despite there being mutations:

>>> mts.site(2)
Site(id=2, position=3.0, ancestral_state='A', mutations=[
 Mutation(id=3, site=2, node=0, derived_state='T', parent=-1, metadata=b'', time=1.549923340261233, edge=21, inherited_state='A'), 
 Mutation(id=4, site=2, node=0, derived_state='A', parent=3, metadata=b'', time=1.0122678066613313, edge=21, inherited_state='T')], metadata=b'')

Okay, that's all cleared up. I ran into this so quick because I had an unrealistic mutation rate. However, this would be a good thing to mention in the section on multiple alleles.

@apragsdale
Copy link
Copy Markdown
Contributor Author

Ah, yes - that makes sense. I think I'll add a note about that as well then. Do you think the example that I've added to the docs about subsampling and nan values is overkill, or helpful?

@petrelharp
Copy link
Copy Markdown
Contributor

Ah, yes - that makes sense. I think I'll add a note about that as well then. Do you think the example that I've added to the docs about subsampling and nan values is overkill, or helpful?

You mean this bit? I don't think this level of detail is overkill - any careful detail we add now will save us hours in the future tracking down some confused user's queries - but I do have some comments on that section:

  • Perhaps separate this out into a sub(sub)^n-section on missing values? It could go at the end? That would help with concerns about this being too much detail.
  • I am hesitant to suggest "just simplify to make your problems go away" since you introduce a BUNCH of annoying issues by simplifying: for instance, the columns of the result correspond to the sites in the simplified ts not the original, so if you're not careful matching them up to where on the genome you're looking at is annoying. I would tend to just say "hey this is one reason you get nans" and not recommend the simplification fix.
  • This is a different reason for getting nans than what I encountered; probably the reason I got nans should also be mentioned? Like "you'll get nans if any sites are monomorphic in the sample. here are two reasons you'd get this from msprime" (but recall that monomorphic sites are totally normal in real data).

@lkirk lkirk mentioned this pull request Mar 17, 2026
3 tasks
@petrelharp
Copy link
Copy Markdown
Contributor

Okay! I had another careful pass through - apologies for the great many comments, but I think this stuff is neat and I want to make sure we write down exactly what's being computed. (And, there's a bunch of nitpicks.)

@apragsdale
Copy link
Copy Markdown
Contributor Author

Thanks @petrelharp - I think I've addressed all/most of your comments here!

@petrelharp petrelharp mentioned this pull request Mar 22, 2026
Copy link
Copy Markdown
Contributor

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - just some minor suggestions/comments (and one that I'm postponing for a different discussion). But: have you built the docs locally? I see some messed up formatting.

@apragsdale
Copy link
Copy Markdown
Contributor Author

A close reading also found a few inconsistencies in the docstring. I hope the small fixes I made improve some of the clarity.

I've also rebased here. Let me know if I should also go ahead and squash, if that's needed.

Copy link
Copy Markdown
Contributor

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ready to merge with these edits (and possibly your edits to my edits). Could you deal with these and then squash down the commits, @apragsdale?

@petrelharp petrelharp added this pull request to the merge queue Mar 27, 2026
Merged via the queue into tskit-dev:main with commit 2f26dc6 Mar 27, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants