Documentation for LD matrix methods#3416
Conversation
271bcaa to
e2f50d6
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3416 +/- ##
=======================================
Coverage 91.92% 91.92%
=======================================
Files 37 37
Lines 32153 32153
Branches 5143 5143
=======================================
Hits 29556 29556
Misses 2264 2264
Partials 333 333
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
jeromekelleher
left a comment
There was a problem hiding this comment.
I've had a quick read through and generally looks great. I've spotted a few typos and left a few take-it-or-leave-it comments.
I haven't thought through the details at all through, and I think it needs a careful review from @petrelharp for that.
|
Thanks so much for the quick review, @jeromekelleher. I agree with your suggestions and will fix things up accordingly. |
e2f50d6 to
7e2b62a
Compare
|
I think this will also need a changelog entry, since adding documentation will announce that this API is now public. Does that happen here or in a later part of the release pipeline? |
|
That could happen here! |
|
We can add to the changelog here or not, whichever is easier. We do need a review from you though @petrelharp as I'm not on top of the stats details. |
|
I've created a milestone for version 1.1, which will basically be this new LD API plus the new metadata codec. Would be good to get it shipped in the next week or so if possible. |
7e2b62a to
bae4065
Compare
petrelharp
left a comment
There was a problem hiding this comment.
See suggested edits (and feel free to say I'm wrong on any).
|
This is great, thanks! Remaining things:
Background:
|
|
Okay, just checking the details of branch mode:
Okay, so: for a given pair of branches (on the same or different trees), the sample counts So, for computation, I think we should say very clearly somewhere that branch mode works by summing over all pairs of branches the value of f( ) multiplied by the lengths of the two branches. |
|
As for interpretation (ie in what sense is it the expected value): First off, I'm confused about a dumb silly thing. If I simulate a short tree sequence and compute The
|
|
For the multiallelic question, the computations happen here and here; based on where they're called from I'm assuming that the first function is just a quicker special-purpose version of the second one for both biallelic sites. (ps I'm assuming you know this all, but I forget everything and it's nice to go look at the details) So, it is summing over all pairs of alleles but then normalised. |
|
I think we just need a few more paragraphs, documenting exactly what's being calculated, as above. Can you take a stab at this @apragsdale? |
Hi @petrelharp - yes, of course! I'll tackle these today and this weekend. Thank you for such detailed comments. With some earlier rounds of revision on these docs, I had gone back and forth with including too much vs too little detail, and probably ended up falling on the too-little side of it. I also agree the nan calculations are confusing, and I remember @lkirk and I having discussions about that last year. I think we had a good explanation/work-around for it. I'll dig up those notes, because it will be important to document as well. I'm also returning to this from many months away from the code, so it will be good to remind myself of all of it as well, and make sure it's documented properly. |
|
I hope it's easy to ressurrect some of those earlier versions, then?
I'm not sure if they are? I imagine it's just "0/0 = nan", but then I'm confused why that happens? |
Peter- does this occur in site mode, as you say above? We shouldn't get nans if msprime is only generating mutations with 0 < p < 1, and if we compute r^2 over all samples. However, if we specify a subsample as our sample set, the method computes LD at all sites still, and for some of those sites a mutation may not be found in the subsamples. Then p could equal 0 or 1. In this case, you'd get nans for any pair of mutations including one or two of such sites. But I want to make sure that we are thinking about the same scenario that generates a nan. Edit: In any case, this behavior should be documented. |
|
Here's what I did: Oh! I see: "msprime does not simulate sites where mutations that aren't polymorphic" but it does simulate situations where backmutations can create monomorphic sites despite there being mutations: Okay, that's all cleared up. I ran into this so quick because I had an unrealistic mutation rate. However, this would be a good thing to mention in the section on multiple alleles. |
|
Ah, yes - that makes sense. I think I'll add a note about that as well then. Do you think the example that I've added to the docs about subsampling and nan values is overkill, or helpful? |
You mean this bit? I don't think this level of detail is overkill - any careful detail we add now will save us hours in the future tracking down some confused user's queries - but I do have some comments on that section:
|
|
Okay! I had another careful pass through - apologies for the great many comments, but I think this stuff is neat and I want to make sure we write down exactly what's being computed. (And, there's a bunch of nitpicks.) |
|
Thanks @petrelharp - I think I've addressed all/most of your comments here! |
petrelharp
left a comment
There was a problem hiding this comment.
Looks good - just some minor suggestions/comments (and one that I'm postponing for a different discussion). But: have you built the docs locally? I see some messed up formatting.
71e5fd5 to
536f144
Compare
|
A close reading also found a few inconsistencies in the docstring. I hope the small fixes I made improve some of the clarity. I've also rebased here. Let me know if I should also go ahead and squash, if that's needed. |
ba98ec2 to
2e07676
Compare
petrelharp
left a comment
There was a problem hiding this comment.
I think this is ready to merge with these edits (and possibly your edits to my edits). Could you deal with these and then squash down the commits, @apragsdale?
2b70dd3 to
e60b287
Compare
See #3353.
This largely pulls material from an existing, open PR to complete minimal documentation for the LD matrix methods currently available in tskit. I've made edits for clarity from the original PR, and removed some material that is possibly confusing or more information than needed for a user.