Skip to content

Commit 104c7ba

Browse files
benjefferypetrelharphyanwong
authored
Improve IBD documentation and IdentitySegments docs (#3330)
* Improve IBD documentation and IdentitySegments docs * edits --------- Co-authored-by: peter <petrel.harp@gmail.com> Co-authored-by: Yan Wong <yan.wong@bdi.ox.ac.uk>
1 parent c70f591 commit 104c7ba

File tree

3 files changed

+200
-26
lines changed

3 files changed

+200
-26
lines changed

docs/ibd.md

Lines changed: 187 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -20,16 +20,12 @@ kernelspec:
2020
# Identity by descent
2121

2222
The {meth}`.TreeSequence.ibd_segments` method allows us to compute
23-
segments of identity by descent.
23+
segments of identity by descent along a tree sequence.
2424

2525
:::{note}
2626
This documentation page is preliminary
2727
:::
2828

29-
:::{todo}
30-
Relate the concept of identity by descent to the MRCA spans in the tree sequence.
31-
:::
32-
3329
## Examples
3430

3531
Let's take a simple tree sequence to illustrate the {meth}`.TreeSequence.ibd_segments`
@@ -77,7 +73,26 @@ coordinate ``[left, right)`` and ancestral node ``a`` iff the most
7773
recent common ancestor of the segment ``[left, right)`` in nodes ``u``
7874
and ``v`` is ``a``, and the segment has been inherited along the same
7975
genealogical path (ie. it has not been broken by recombination). The
80-
segments returned are the longest possible ones.
76+
definition of a "genealogical path" used here is
77+
the sequence of edges, rather than nodes.
78+
So, for instance, if ``u`` inherits a segment ``[x, z)`` from ``a``,
79+
but that inheritance is represented by two edges,
80+
one spanning ``[x, y)`` and the other spanning ``[y, z)``,
81+
then this represents two genealogical paths,
82+
and any IBD segments would be split at ``y``.
83+
In other words, the method assumes that the end
84+
of an edge represents a recombination,
85+
an assumption that may not reflect how the tree sequence
86+
is used -- see below for more discussion.
87+
88+
This definition is purely genealogical: it depends only on the tree
89+
sequence topology and node times, and does not inspect allelic
90+
states or mutations. In particular, if we compute the MRCA of ``(u, v)``
91+
in each tree along the sequence, then (up to the additional refinement
92+
by genealogical path) the IBD segments are those
93+
that share the same ancestor and paths to that
94+
ancestor. Intervals in which ``u`` and ``v`` lie in different roots
95+
have no MRCA and therefore do not contribute IBD segments.
8196

8297
Consider the IBD segments that we get from our example tree sequence:
8398

@@ -109,7 +124,11 @@ By default this class only stores the high-level summaries of the
109124
IBD segments discovered. As we can see in this example, we have a
110125
total of six segments and
111126
the total span (i.e., the sum lengths of the genomic intervals spanned
112-
by IBD segments) is 30.
127+
by IBD segments) is 30. In this default mode the object does not
128+
store information about individual sample pairs, and methods that
129+
inspect per-pair information (such as indexing with ``[(a, b)]`` or
130+
iterating over the mapping) will raise an
131+
``IdentityPairsNotStoredError``.
113132

114133
If required, we can get more detailed information about particular
115134
segment pairs and the actual segments using the ``store_pairs``
@@ -148,8 +167,12 @@ segments = ts.ibd_segments(store_pairs=True, store_segments=True)
148167
segments[(0, 1)]
149168
```
150169

151-
The {class}`.IdentitySegmentList` behaves like a Python list,
152-
where each element is an instance of {class}`.IdentitySegment`.
170+
When ``store_segments=True``, the {class}`.IdentitySegmentList` behaves
171+
like a Python list, where each element is an instance of
172+
{class}`.IdentitySegment`. When only ``store_pairs=True`` is specified,
173+
the number of segments and their total span are still available, but
174+
attempting to iterate over the list or access the per-segment arrays
175+
will raise an ``IdentitySegmentsNotStoredError``.
153176

154177
:::{warning}
155178
The order of segments in an {class}`.IdentitySegmentList`
@@ -168,19 +191,24 @@ By default we get the IBD segments between all pairs of
168191
{ref}`sample<sec_data_model_definitions_sample>` nodes.
169192

170193
#### IBD within a sample set
194+
171195
We can reduce this to pairs within a specific set using the
172196
``within`` argument:
173197

174-
175-
```{eval-rst}
176-
.. todo:: More detail and better examples here.
177-
```
178-
179198
```{code-cell}
180199
segments = ts.ibd_segments(within=[0, 2], store_pairs=True)
181200
print(list(segments.keys()))
182201
```
183202

203+
Here we have restricted attention to the samples with node IDs 0 and 2,
204+
so only the pair ``(0, 2)`` appears in the result. In general:
205+
206+
- ``within`` should be a one-dimensional array-like of node IDs
207+
(typically sample nodes). All unordered pairs from this set are
208+
considered.
209+
- If ``within`` is omitted (the default), all nodes flagged as samples
210+
in the node table are used.
211+
184212
#### IBD between sample sets
185213

186214
We can also compute IBD **between** sample sets:
@@ -190,6 +218,17 @@ segments = ts.ibd_segments(between=[[0,1], [2]], store_pairs=True)
190218
print(list(segments.keys()))
191219
```
192220

221+
In this example we have two sample sets, ``[0, 1]`` and ``[2]``, so the
222+
identity segments are computed only for pairs in which one sample lies
223+
in the first set and the other lies in the second. More generally:
224+
225+
- ``between`` should be a list of non-overlapping lists of node IDs.
226+
- All pairs ``(u, v)`` are considered such that ``u`` and ``v`` belong
227+
to different sample sets.
228+
229+
The ``within`` and ``between`` arguments are mutually exclusive: passing
230+
both at the same time raises a :class:`ValueError`.
231+
193232
:::{seealso}
194233
See the {meth}`.TreeSequence.ibd_segments` documentation for
195234
more details.
@@ -200,6 +239,138 @@ more details.
200239
The ``max_time`` and ``min_span`` arguments allow us to constrain the
201240
segments that we consider.
202241

203-
```{eval-rst}
204-
.. todo:: Add examples for these arguments.
242+
The ``max_time`` argument specifies an upper bound on the time of the
243+
common ancestor node: only IBD segments whose MRCA node has a time
244+
no greater than ``max_time`` are returned.
245+
246+
The ``min_span`` argument filters by genomic length: only segments with
247+
span strictly greater than ``min_span`` are included.
248+
249+
For example, working with ``ts2`` as the following tree sequence:
250+
251+
```{code-cell}
252+
:tags: [hide-input]
253+
254+
import io
255+
256+
nodes = io.StringIO(
257+
"""\
258+
id is_sample time
259+
0 1 0
260+
1 1 0
261+
2 0 1
262+
3 0 3
263+
"""
264+
)
265+
edges = io.StringIO(
266+
"""\
267+
left right parent child
268+
0 4 2 0,1
269+
4 10 3 0,1
270+
"""
271+
)
272+
ts2 = tskit.load_text(nodes=nodes, edges=edges, strict=False)
273+
SVG(ts2.draw_svg())
274+
```
275+
276+
There are two segments:
277+
```{code-cell}
278+
segments = ts2.ibd_segments(store_segments=True)
279+
print("all segments:", list(segments.values())[0])
280+
```
281+
... but only the left-hand one is more recent than 2 time units ago:
282+
```{code-cell}
283+
segments_recent = ts2.ibd_segments(max_time=2, store_segments=True)
284+
print("max_time=1.2:", list(segments_recent.values())[0])
285+
```
286+
... and only the right-hand one is longer than 5 units.
287+
```{code-cell}
288+
289+
segments_long = ts2.ibd_segments(min_span=5, store_segments=True)
290+
print("min_span=0.5:", list(segments_long.values())[0])
291+
```
292+
293+
So: the full result contains two IBD segments for the single sample
294+
pair, one inherited via ancestor 2 over ``[0, 4)`` and one via
295+
ancestor 3 over ``[4, 10)``. The ``max_time`` constraint removes the
296+
segment inherited from the older ancestor (time 3), while the
297+
``min_span`` constraint keeps only the longer of the two segments.
298+
299+
### More on the "pathwise" definition of IBD segments
300+
301+
We said above that the definition of IBD used by
302+
{meth}`.TreeSequence.ibd_segments` says that a given segment
303+
must be inherited from the MRCA along a single genealogical path,
304+
and that "genealogical paths" are defined *edgewise*.
305+
This can lead to surprising consequences.
306+
307+
Returning to our example above:
308+
```{code-cell}
309+
:tags: [hide-input]
310+
311+
SVG(ts.draw_svg())
312+
```
313+
there are two IBD segments between ``1`` and ``2``:
314+
```{code-cell}
315+
segments = ts.ibd_segments(within=[1, 2], store_pairs=True)
316+
for pair, value in segments.items():
317+
print(pair, "::", value)
318+
```
319+
This might be surprising, because the MRCA of ``1`` and ``2``
320+
is node ``4`` over the entire sequence.
321+
In fact, some definitions of IBD segments
322+
would have this as a single segment,
323+
because the MRCA does not change,
324+
even if there are distinct genealogical paths.
325+
326+
The reason this is split into two segments
327+
is because the path from ``4`` to ``2`` changes:
328+
on the left-hand segment ``[0, 2)``, the node ``2``
329+
inherits from node ``4``
330+
via node ``3``, while on the right-hand segment ``[2, 10)``
331+
it inherits from node ``4`` directly.
332+
The tree sequence doesn't say directly whether node ``2``
333+
also inherits from node ``3`` on the right-hand segment,
334+
so whether or not this should be one IBD segment or two
335+
depends on our interpretation
336+
of what's stored in the tree sequence.
337+
As discussed in
338+
[Fritze et al](https://doi.org/10.1093/genetics/iyaf198),
339+
most tree sequence simulators (at time of writing)
340+
will produce this tree sequence even if node ``2``
341+
does in fact inherit from ``3`` over the entire sequence.
342+
Using {meth}`.TreeSequence.extend_haplotypes` will
343+
"put the unary nodes back":
344+
```{code-cell}
345+
ets = ts.extend_haplotypes()
346+
SVG(ets.draw_svg())
347+
```
348+
and once this is done, there is only a single IBD segment:
349+
```{code-cell}
350+
segments = ets.ibd_segments(within=[1, 2], store_pairs=True)
351+
for pair, value in segments.items():
352+
print(pair, "::", value)
205353
```
354+
So, extending haplotypes may produce IBD segments
355+
more in line with theory, if the desired definition if IBD
356+
is the "pathwise" definition.
357+
However, this will also probably introduce erroneous
358+
portions of IBD segments,
359+
so caution is needed.
360+
Another approach would be to merge adjacent segments of IBD
361+
that have the same MRCA.
362+
363+
Summarizing this section --
364+
there is a confusing array of possible definitions
365+
of what it means to be "an IBD segment";
366+
and these may be extracted from a tree sequence
367+
in subtly different ways.
368+
How much of a problem is this?
369+
The answer depends on the precise situation,
370+
but it seems likely that in practice,
371+
differences due to definition are small
372+
relative to errors due to tree sequence inference.
373+
Indeed, empirical haplotype-matching methods
374+
for identifying IBD segments can differ substantially
375+
depending on the values of various hyperparameters.
376+
More work is needed to develop a complete picture.

python/tskit/tables.py

Lines changed: 11 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -2737,8 +2737,8 @@ def assert_equals(self, other, *, ignore_timestamps=False):
27372737
@dataclasses.dataclass(eq=True, order=True)
27382738
class IdentitySegment:
27392739
"""
2740-
A single segment of identity spanning a genomic interval for a
2741-
a specific ancestor node.
2740+
A single segment of identity by descent spanning a genomic interval
2741+
for a specific ancestor node.
27422742
"""
27432743

27442744
left: float
@@ -2758,7 +2758,7 @@ def span(self) -> float:
27582758

27592759
class IdentitySegmentList(collections.abc.Iterable, collections.abc.Sized):
27602760
"""
2761-
A summary of identity segments for some pair of samples in a
2761+
A summary of identity-by-descent segments for some pair of samples in a
27622762
:class:`.IdentitySegments` result. If the ``store_segments`` argument
27632763
has been specified to :meth:`.TreeSequence.ibd_segments`, this class
27642764
can be treated as a sequence of :class:`.IdentitySegment` objects.
@@ -2769,7 +2769,9 @@ class IdentitySegmentList(collections.abc.Iterable, collections.abc.Sized):
27692769
27702770
If ``store_segments`` is False, only the overall summary values
27712771
such as :attr:`.IdentitySegmentList.total_span` and ``len()`` are
2772-
available.
2772+
available. Attempting to iterate over the list or access per-segment
2773+
arrays (``left``, ``right``, or ``node``) in this case will raise an
2774+
``IdentitySegmentsNotStoredError``.
27732775
27742776
.. warning:: The order of segments within an IdentitySegmentList is
27752777
arbitrary and may change in the future
@@ -2833,7 +2835,7 @@ def node(self):
28332835
class IdentitySegments(collections.abc.Mapping):
28342836
"""
28352837
A class summarising and optionally storing the segments of identity
2836-
by state returned by :meth:`.TreeSequence.ibd_segments`. See the
2838+
by descent returned by :meth:`.TreeSequence.ibd_segments`. See the
28372839
:ref:`sec_identity` for more information and examples.
28382840
28392841
Along with the documented methods and attributes, the class supports
@@ -2845,9 +2847,10 @@ class IdentitySegments(collections.abc.Mapping):
28452847
for a given instance of this class are determined by the
28462848
``store_pairs`` and ``store_segments`` arguments provided to
28472849
:meth:`.TreeSequence.ibd_segments`. For example, attempting
2848-
to access per-sample pair information if ``store_pairs``
2849-
is False will result in a (hopefully informative) error being
2850-
raised.
2850+
to access per-sample pair information (such as indexing with
2851+
``[(a, b)]``, iterating over the mapping, or accessing
2852+
:attr:`.IdentitySegments.pairs`) if ``store_pairs`` is False will
2853+
result in an ``IdentityPairsNotStoredError`` being raised.
28512854
28522855
.. warning:: This class should not be instantiated directly.
28532856
"""

python/tskit/trees.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -10597,7 +10597,7 @@ def ibd_segments(
1059710597
represented by the tree sequence.
1059810598
1059910599
:param list within: A list of node IDs defining set of nodes that
10600-
we finding IBD segments for. If not specified, this defaults to
10600+
we find IBD segments for. If not specified, this defaults to
1060110601
all samples in the tree sequence.
1060210602
:param list[list] between: A list of lists of sample node IDs. Given
1060310603
two sample sets A and B, only IBD segments will be returned such
@@ -10612,7 +10612,7 @@ def ibd_segments(
1061210612
segment) is greater than this value will be included. (Default=0)
1061310613
:param bool store_pairs: If True store information separately for each
1061410614
pair of samples ``(a, b)`` that are found to be IBD. Otherwise
10615-
store summary information about all sample apirs. (Default=False)
10615+
store summary information about all sample pairs. (Default=False)
1061610616
:param bool store_segments: If True store each IBD segment
1061710617
``(left, right, c)`` and associate it with the corresponding
1061810618
sample pair ``(a, b)``. If True, implies ``store_pairs``.

0 commit comments

Comments
 (0)