Improve IBD documentation and IdentitySegments docs (#3330)

benjeffery · petrelharp · hyanwong · web-flow · commit 104c7ba2765f · 2026-01-06T09:57:25.000Z
* Improve IBD documentation and IdentitySegments docs

* edits

---------

Co-authored-by: peter &lt;petrel.harp@gmail.com&gt;
Co-authored-by: Yan Wong &lt;yan.wong@bdi.ox.ac.uk&gt;
diff --git a/docs/ibd.md b/docs/ibd.md
@@ -20,16 +20,12 @@ kernelspec:
 # Identity by descent
 
 The {meth}`.TreeSequence.ibd_segments` method allows us to compute
-segments of identity by descent.
+segments of identity by descent along a tree sequence.
 
 :::{note}
 This documentation page is preliminary
 :::
 
-:::{todo}
-Relate the concept of identity by descent to the MRCA spans in the tree sequence.
-:::
-
 ## Examples
 
 Let's take a simple tree sequence to illustrate the {meth}`.TreeSequence.ibd_segments`
@@ -77,7 +73,26 @@ coordinate ``[left, right)`` and ancestral node ``a`` iff the most
 recent common ancestor of the segment ``[left, right)`` in nodes ``u``
 and ``v`` is ``a``, and the segment has been inherited along the same
 genealogical path (ie. it has not been broken by recombination). The
-segments returned are the longest possible ones.
+definition of a "genealogical path" used here is
+the sequence of edges, rather than nodes.
+So, for instance, if ``u`` inherits a segment ``[x, z)`` from ``a``,
+but that inheritance is represented by two edges,
+one spanning ``[x, y)`` and the other spanning ``[y, z)``,
+then this represents two genealogical paths,
+and any IBD segments would be split at ``y``.
+In other words, the method assumes that the end
+of an edge represents a recombination,
+an assumption that may not reflect how the tree sequence
+is used -- see below for more discussion.
+
+This definition is purely genealogical: it depends only on the tree
+sequence topology and node times, and does not inspect allelic
+states or mutations. In particular, if we compute the MRCA of ``(u, v)``
+in each tree along the sequence, then (up to the additional refinement
+by genealogical path) the IBD segments are those
+that share the same ancestor and paths to that
+ancestor. Intervals in which ``u`` and ``v`` lie in different roots
+have no MRCA and therefore do not contribute IBD segments.
 
 Consider the IBD segments that we get from our example tree sequence:
 
@@ -109,7 +124,11 @@ By default this class only stores the high-level summaries of the
 IBD segments discovered. As we can see in this example, we have a
 total of six segments and
 the total span (i.e., the sum lengths of the genomic intervals spanned
-by IBD segments) is 30.
+by IBD segments) is 30. In this default mode the object does not
+store information about individual sample pairs, and methods that
+inspect per-pair information (such as indexing with ``[(a, b)]`` or
+iterating over the mapping) will raise an
+``IdentityPairsNotStoredError``.
 
 If required, we can get more detailed information about particular
 segment pairs and the actual segments using the ``store_pairs``
@@ -148,8 +167,12 @@ segments = ts.ibd_segments(store_pairs=True, store_segments=True)
 segments[(0, 1)]
 ```
 
-The {class}`.IdentitySegmentList` behaves like a Python list,
-where each element is an instance of {class}`.IdentitySegment`.
+When ``store_segments=True``, the {class}`.IdentitySegmentList` behaves
+like a Python list, where each element is an instance of
+{class}`.IdentitySegment`. When only ``store_pairs=True`` is specified,
+the number of segments and their total span are still available, but
+attempting to iterate over the list or access the per-segment arrays
+will raise an ``IdentitySegmentsNotStoredError``.
 
 :::{warning}
 The order of segments in an {class}`.IdentitySegmentList`
@@ -168,19 +191,24 @@ By default we get the IBD segments between all pairs of
 {ref}`sample<sec_data_model_definitions_sample>` nodes.
 
 #### IBD within a sample set
+
 We can reduce this to pairs within a specific set using the
 ``within`` argument:
 
-
-```{eval-rst}
-.. todo:: More detail and better examples here.
-```
-
 ```{code-cell}
 segments = ts.ibd_segments(within=[0, 2], store_pairs=True)
 print(list(segments.keys()))
 ```
 
+Here we have restricted attention to the samples with node IDs 0 and 2,
+so only the pair ``(0, 2)`` appears in the result. In general:
+
+- ``within`` should be a one-dimensional array-like of node IDs
+  (typically sample nodes). All unordered pairs from this set are
+  considered.
+- If ``within`` is omitted (the default), all nodes flagged as samples
+  in the node table are used.
+
 #### IBD between sample sets
 
 We can also compute IBD **between** sample sets:
@@ -190,6 +218,17 @@ segments = ts.ibd_segments(between=[[0,1], [2]], store_pairs=True)
 print(list(segments.keys()))
 ```
 
+In this example we have two sample sets, ``[0, 1]`` and ``[2]``, so the
+identity segments are computed only for pairs in which one sample lies
+in the first set and the other lies in the second. More generally:
+
+- ``between`` should be a list of non-overlapping lists of node IDs.
+- All pairs ``(u, v)`` are considered such that ``u`` and ``v`` belong
+  to different sample sets.
+
+The ``within`` and ``between`` arguments are mutually exclusive: passing
+both at the same time raises a :class:`ValueError`.
+
 :::{seealso}
 See the {meth}`.TreeSequence.ibd_segments` documentation for
 more details.
@@ -200,6 +239,138 @@ more details.
 The ``max_time`` and ``min_span`` arguments allow us to constrain the
 segments that we consider.
 
-```{eval-rst}
-.. todo:: Add examples for these arguments.
+The ``max_time`` argument specifies an upper bound on the time of the
+common ancestor node: only IBD segments whose MRCA node has a time
+no greater than ``max_time`` are returned.
+
+The ``min_span`` argument filters by genomic length: only segments with
+span strictly greater than ``min_span`` are included.
+
+For example, working with ``ts2`` as the following tree sequence:
+
+```{code-cell}
+:tags: [hide-input]
+
+import io
+
+nodes = io.StringIO(
+    """\
+    id      is_sample   time
+    0       1           0
+    1       1           0
+    2       0           1
+    3       0           3
+    """
+)
+edges = io.StringIO(
+    """\
+    left    right   parent  child
+    0      4     2       0,1
+    4     10     3       0,1
+    """
+)
+ts2 = tskit.load_text(nodes=nodes, edges=edges, strict=False)
+SVG(ts2.draw_svg())
+```
+
+There are two segments:
+```{code-cell}
+segments = ts2.ibd_segments(store_segments=True)
+print("all segments:", list(segments.values())[0])
+```
+... but only the left-hand one is more recent than 2 time units ago:
+```{code-cell}
+segments_recent = ts2.ibd_segments(max_time=2, store_segments=True)
+print("max_time=1.2:", list(segments_recent.values())[0])
+```
+... and only the right-hand one is longer than 5 units.
+```{code-cell}
+
+segments_long = ts2.ibd_segments(min_span=5, store_segments=True)
+print("min_span=0.5:", list(segments_long.values())[0])
+```
+
+So: the full result contains two IBD segments for the single sample
+pair, one inherited via ancestor 2 over ``[0, 4)`` and one via
+ancestor 3 over ``[4, 10)``. The ``max_time`` constraint removes the
+segment inherited from the older ancestor (time 3), while the
+``min_span`` constraint keeps only the longer of the two segments.
+
+### More on the "pathwise" definition of IBD segments
+
+We said above that the definition of IBD used by
+{meth}`.TreeSequence.ibd_segments` says that a given segment
+must be inherited from the MRCA along a single genealogical path,
+and that "genealogical paths" are defined *edgewise*.
+This can lead to surprising consequences.
+
+Returning to our example above:
+```{code-cell}
+:tags: [hide-input]
+
+SVG(ts.draw_svg())
+```
+there are two IBD segments between ``1`` and ``2``:
+```{code-cell}
+segments = ts.ibd_segments(within=[1, 2], store_pairs=True)
+for pair, value in segments.items():
+    print(pair, "::", value)
+```
+This might be surprising, because the MRCA of ``1`` and ``2``
+is node ``4`` over the entire sequence.
+In fact, some definitions of IBD segments
+would have this as a single segment,
+because the MRCA does not change,
+even if there are distinct genealogical paths.
+
+The reason this is split into two segments
+is because the path from ``4`` to ``2`` changes:
+on the left-hand segment ``[0, 2)``, the node ``2``
+inherits from node ``4``
+via node ``3``, while on the right-hand segment ``[2, 10)``
+it inherits from node ``4`` directly.
+The tree sequence doesn't say directly whether node ``2``
+also inherits from node ``3`` on the right-hand segment,
+so whether or not this should be one IBD segment or two
+depends on our interpretation
+of what's stored in the tree sequence.
+As discussed in 
+[Fritze et al](https://doi.org/10.1093/genetics/iyaf198),
+most tree sequence simulators (at time of writing)
+will produce this tree sequence even if node ``2``
+does in fact inherit from ``3`` over the entire sequence.
+Using {meth}`.TreeSequence.extend_haplotypes` will
+"put the unary nodes back":
+```{code-cell}
+ets = ts.extend_haplotypes()
+SVG(ets.draw_svg())
+```
+and once this is done, there is only a single IBD segment:
+```{code-cell}
+segments = ets.ibd_segments(within=[1, 2], store_pairs=True)
+for pair, value in segments.items():
+    print(pair, "::", value)
 ```
+So, extending haplotypes may produce IBD segments
+more in line with theory, if the desired definition if IBD
+is the "pathwise" definition.
+However, this will also probably introduce erroneous
+portions of IBD segments,
+so caution is needed.
+Another approach would be to merge adjacent segments of IBD
+that have the same MRCA.
+
+Summarizing this section --
+there is a confusing array of possible definitions
+of what it means to be "an IBD segment";
+and these may be extracted from a tree sequence
+in subtly different ways.
+How much of a problem is this?
+The answer depends on the precise situation,
+but it seems likely that in practice,
+differences due to definition are small
+relative to errors due to tree sequence inference.
+Indeed, empirical haplotype-matching methods
+for identifying IBD segments can differ substantially
+depending on the values of various hyperparameters.
+More work is needed to develop a complete picture.
diff --git a/python/tskit/tables.py b/python/tskit/tables.py
@@ -2737,8 +2737,8 @@ def assert_equals(self, other, *, ignore_timestamps=False):
 @dataclasses.dataclass(eq=True, order=True)
 class IdentitySegment:
     """
-    A single segment of identity spanning a genomic interval for a
-    a specific ancestor node.
+    A single segment of identity by descent spanning a genomic interval
+    for a specific ancestor node.
     """
 
     left: float
@@ -2758,7 +2758,7 @@ def span(self) -> float:
 
 class IdentitySegmentList(collections.abc.Iterable, collections.abc.Sized):
     """
-    A summary of identity segments for some pair of samples in a
+    A summary of identity-by-descent segments for some pair of samples in a
     :class:`.IdentitySegments` result. If the ``store_segments`` argument
     has been specified to :meth:`.TreeSequence.ibd_segments`, this class
     can be treated as a sequence of :class:`.IdentitySegment` objects.
@@ -2769,7 +2769,9 @@ class IdentitySegmentList(collections.abc.Iterable, collections.abc.Sized):
 
     If ``store_segments`` is False, only the overall summary values
     such as :attr:`.IdentitySegmentList.total_span` and ``len()`` are
-    available.
+    available. Attempting to iterate over the list or access per-segment
+    arrays (``left``, ``right``, or ``node``) in this case will raise an
+    ``IdentitySegmentsNotStoredError``.
 
     .. warning:: The order of segments within an IdentitySegmentList is
         arbitrary and may change in the future
@@ -2833,7 +2835,7 @@ def node(self):
 class IdentitySegments(collections.abc.Mapping):
     """
     A class summarising and optionally storing the segments of identity
-    by state returned by :meth:`.TreeSequence.ibd_segments`. See the
+    by descent returned by :meth:`.TreeSequence.ibd_segments`. See the
     :ref:`sec_identity` for more information and examples.
 
     Along with the documented methods and attributes, the class supports
@@ -2845,9 +2847,10 @@ class IdentitySegments(collections.abc.Mapping):
        for a given instance of this class are determined by the
        ``store_pairs`` and ``store_segments`` arguments provided to
        :meth:`.TreeSequence.ibd_segments`. For example, attempting
-       to access per-sample pair information if ``store_pairs``
-       is False will result in a (hopefully informative) error being
-       raised.
+       to access per-sample pair information (such as indexing with
+       ``[(a, b)]``, iterating over the mapping, or accessing
+       :attr:`.IdentitySegments.pairs`) if ``store_pairs`` is False will
+       result in an ``IdentityPairsNotStoredError`` being raised.
 
     .. warning:: This class should not be instantiated directly.
     """
diff --git a/python/tskit/trees.py b/python/tskit/trees.py
@@ -10597,7 +10597,7 @@ def ibd_segments(
         represented by the tree sequence.
 
         :param list within: A list of node IDs defining set of nodes that
-            we finding IBD segments for. If not specified, this defaults to
+            we find IBD segments for. If not specified, this defaults to
             all samples in the tree sequence.
         :param list[list] between: A list of lists of sample node IDs. Given
             two sample sets A and B, only IBD segments will be returned such
@@ -10612,7 +10612,7 @@ def ibd_segments(
             segment) is greater than this value will be included. (Default=0)
         :param bool store_pairs: If True store information separately for each
             pair of samples ``(a, b)`` that are found to be IBD. Otherwise
-            store summary information about all sample apirs. (Default=False)
+            store summary information about all sample pairs. (Default=False)
         :param bool store_segments: If True store each IBD segment
             ``(left, right, c)`` and associate it with the corresponding
             sample pair ``(a, b)``. If True, implies ``store_pairs``.