@@ -20,16 +20,12 @@ kernelspec:
2020# Identity by descent
2121
2222The {meth}` .TreeSequence.ibd_segments ` method allows us to compute
23- segments of identity by descent.
23+ segments of identity by descent along a tree sequence .
2424
2525:::{note}
2626This documentation page is preliminary
2727:::
2828
29- :::{todo}
30- Relate the concept of identity by descent to the MRCA spans in the tree sequence.
31- :::
32-
3329## Examples
3430
3531Let's take a simple tree sequence to illustrate the {meth}` .TreeSequence.ibd_segments `
@@ -77,7 +73,26 @@ coordinate ``[left, right)`` and ancestral node ``a`` iff the most
7773recent common ancestor of the segment `` [left, right) `` in nodes `` u ``
7874and `` v `` is `` a `` , and the segment has been inherited along the same
7975genealogical path (ie. it has not been broken by recombination). The
80- segments returned are the longest possible ones.
76+ definition of a "genealogical path" used here is
77+ the sequence of edges, rather than nodes.
78+ So, for instance, if `` u `` inherits a segment `` [x, z) `` from `` a `` ,
79+ but that inheritance is represented by two edges,
80+ one spanning `` [x, y) `` and the other spanning `` [y, z) `` ,
81+ then this represents two genealogical paths,
82+ and any IBD segments would be split at `` y `` .
83+ In other words, the method assumes that the end
84+ of an edge represents a recombination,
85+ an assumption that may not reflect how the tree sequence
86+ is used -- see below for more discussion.
87+
88+ This definition is purely genealogical: it depends only on the tree
89+ sequence topology and node times, and does not inspect allelic
90+ states or mutations. In particular, if we compute the MRCA of `` (u, v) ``
91+ in each tree along the sequence, then (up to the additional refinement
92+ by genealogical path) the IBD segments are those
93+ that share the same ancestor and paths to that
94+ ancestor. Intervals in which `` u `` and `` v `` lie in different roots
95+ have no MRCA and therefore do not contribute IBD segments.
8196
8297Consider the IBD segments that we get from our example tree sequence:
8398
@@ -109,7 +124,11 @@ By default this class only stores the high-level summaries of the
109124IBD segments discovered. As we can see in this example, we have a
110125total of six segments and
111126the total span (i.e., the sum lengths of the genomic intervals spanned
112- by IBD segments) is 30.
127+ by IBD segments) is 30. In this default mode the object does not
128+ store information about individual sample pairs, and methods that
129+ inspect per-pair information (such as indexing with `` [(a, b)] `` or
130+ iterating over the mapping) will raise an
131+ `` IdentityPairsNotStoredError `` .
113132
114133If required, we can get more detailed information about particular
115134segment pairs and the actual segments using the `` store_pairs ``
@@ -148,8 +167,12 @@ segments = ts.ibd_segments(store_pairs=True, store_segments=True)
148167segments[(0, 1)]
149168```
150169
151- The {class}` .IdentitySegmentList ` behaves like a Python list,
152- where each element is an instance of {class}` .IdentitySegment ` .
170+ When `` store_segments=True `` , the {class}` .IdentitySegmentList ` behaves
171+ like a Python list, where each element is an instance of
172+ {class}` .IdentitySegment ` . When only `` store_pairs=True `` is specified,
173+ the number of segments and their total span are still available, but
174+ attempting to iterate over the list or access the per-segment arrays
175+ will raise an `` IdentitySegmentsNotStoredError `` .
153176
154177:::{warning}
155178The order of segments in an {class}` .IdentitySegmentList `
@@ -168,19 +191,24 @@ By default we get the IBD segments between all pairs of
168191{ref}` sample<sec_data_model_definitions_sample> ` nodes.
169192
170193#### IBD within a sample set
194+
171195We can reduce this to pairs within a specific set using the
172196`` within `` argument:
173197
174-
175- ``` {eval-rst}
176- .. todo:: More detail and better examples here.
177- ```
178-
179198``` {code-cell}
180199segments = ts.ibd_segments(within=[0, 2], store_pairs=True)
181200print(list(segments.keys()))
182201```
183202
203+ Here we have restricted attention to the samples with node IDs 0 and 2,
204+ so only the pair `` (0, 2) `` appears in the result. In general:
205+
206+ - `` within `` should be a one-dimensional array-like of node IDs
207+ (typically sample nodes). All unordered pairs from this set are
208+ considered.
209+ - If `` within `` is omitted (the default), all nodes flagged as samples
210+ in the node table are used.
211+
184212#### IBD between sample sets
185213
186214We can also compute IBD ** between** sample sets:
@@ -190,6 +218,17 @@ segments = ts.ibd_segments(between=[[0,1], [2]], store_pairs=True)
190218print(list(segments.keys()))
191219```
192220
221+ In this example we have two sample sets, `` [0, 1] `` and `` [2] `` , so the
222+ identity segments are computed only for pairs in which one sample lies
223+ in the first set and the other lies in the second. More generally:
224+
225+ - `` between `` should be a list of non-overlapping lists of node IDs.
226+ - All pairs `` (u, v) `` are considered such that `` u `` and `` v `` belong
227+ to different sample sets.
228+
229+ The `` within `` and `` between `` arguments are mutually exclusive: passing
230+ both at the same time raises a :class:` ValueError ` .
231+
193232:::{seealso}
194233See the {meth}` .TreeSequence.ibd_segments ` documentation for
195234more details.
@@ -200,6 +239,138 @@ more details.
200239The `` max_time `` and `` min_span `` arguments allow us to constrain the
201240segments that we consider.
202241
203- ``` {eval-rst}
204- .. todo:: Add examples for these arguments.
242+ The `` max_time `` argument specifies an upper bound on the time of the
243+ common ancestor node: only IBD segments whose MRCA node has a time
244+ no greater than `` max_time `` are returned.
245+
246+ The `` min_span `` argument filters by genomic length: only segments with
247+ span strictly greater than `` min_span `` are included.
248+
249+ For example, working with `` ts2 `` as the following tree sequence:
250+
251+ ``` {code-cell}
252+ :tags: [hide-input]
253+
254+ import io
255+
256+ nodes = io.StringIO(
257+ """\
258+ id is_sample time
259+ 0 1 0
260+ 1 1 0
261+ 2 0 1
262+ 3 0 3
263+ """
264+ )
265+ edges = io.StringIO(
266+ """\
267+ left right parent child
268+ 0 4 2 0,1
269+ 4 10 3 0,1
270+ """
271+ )
272+ ts2 = tskit.load_text(nodes=nodes, edges=edges, strict=False)
273+ SVG(ts2.draw_svg())
274+ ```
275+
276+ There are two segments:
277+ ``` {code-cell}
278+ segments = ts2.ibd_segments(store_segments=True)
279+ print("all segments:", list(segments.values())[0])
280+ ```
281+ ... but only the left-hand one is more recent than 2 time units ago:
282+ ``` {code-cell}
283+ segments_recent = ts2.ibd_segments(max_time=2, store_segments=True)
284+ print("max_time=1.2:", list(segments_recent.values())[0])
285+ ```
286+ ... and only the right-hand one is longer than 5 units.
287+ ``` {code-cell}
288+
289+ segments_long = ts2.ibd_segments(min_span=5, store_segments=True)
290+ print("min_span=0.5:", list(segments_long.values())[0])
291+ ```
292+
293+ So: the full result contains two IBD segments for the single sample
294+ pair, one inherited via ancestor 2 over `` [0, 4) `` and one via
295+ ancestor 3 over `` [4, 10) `` . The `` max_time `` constraint removes the
296+ segment inherited from the older ancestor (time 3), while the
297+ `` min_span `` constraint keeps only the longer of the two segments.
298+
299+ ### More on the "pathwise" definition of IBD segments
300+
301+ We said above that the definition of IBD used by
302+ {meth}` .TreeSequence.ibd_segments ` says that a given segment
303+ must be inherited from the MRCA along a single genealogical path,
304+ and that "genealogical paths" are defined * edgewise* .
305+ This can lead to surprising consequences.
306+
307+ Returning to our example above:
308+ ``` {code-cell}
309+ :tags: [hide-input]
310+
311+ SVG(ts.draw_svg())
312+ ```
313+ there are two IBD segments between `` 1 `` and `` 2 `` :
314+ ``` {code-cell}
315+ segments = ts.ibd_segments(within=[1, 2], store_pairs=True)
316+ for pair, value in segments.items():
317+ print(pair, "::", value)
318+ ```
319+ This might be surprising, because the MRCA of `` 1 `` and `` 2 ``
320+ is node `` 4 `` over the entire sequence.
321+ In fact, some definitions of IBD segments
322+ would have this as a single segment,
323+ because the MRCA does not change,
324+ even if there are distinct genealogical paths.
325+
326+ The reason this is split into two segments
327+ is because the path from `` 4 `` to `` 2 `` changes:
328+ on the left-hand segment `` [0, 2) `` , the node `` 2 ``
329+ inherits from node `` 4 ``
330+ via node `` 3 `` , while on the right-hand segment `` [2, 10) ``
331+ it inherits from node `` 4 `` directly.
332+ The tree sequence doesn't say directly whether node `` 2 ``
333+ also inherits from node `` 3 `` on the right-hand segment,
334+ so whether or not this should be one IBD segment or two
335+ depends on our interpretation
336+ of what's stored in the tree sequence.
337+ As discussed in
338+ [ Fritze et al] ( https://doi.org/10.1093/genetics/iyaf198 ) ,
339+ most tree sequence simulators (at time of writing)
340+ will produce this tree sequence even if node `` 2 ``
341+ does in fact inherit from `` 3 `` over the entire sequence.
342+ Using {meth}` .TreeSequence.extend_haplotypes ` will
343+ "put the unary nodes back":
344+ ``` {code-cell}
345+ ets = ts.extend_haplotypes()
346+ SVG(ets.draw_svg())
347+ ```
348+ and once this is done, there is only a single IBD segment:
349+ ``` {code-cell}
350+ segments = ets.ibd_segments(within=[1, 2], store_pairs=True)
351+ for pair, value in segments.items():
352+ print(pair, "::", value)
205353```
354+ So, extending haplotypes may produce IBD segments
355+ more in line with theory, if the desired definition if IBD
356+ is the "pathwise" definition.
357+ However, this will also probably introduce erroneous
358+ portions of IBD segments,
359+ so caution is needed.
360+ Another approach would be to merge adjacent segments of IBD
361+ that have the same MRCA.
362+
363+ Summarizing this section --
364+ there is a confusing array of possible definitions
365+ of what it means to be "an IBD segment";
366+ and these may be extracted from a tree sequence
367+ in subtly different ways.
368+ How much of a problem is this?
369+ The answer depends on the precise situation,
370+ but it seems likely that in practice,
371+ differences due to definition are small
372+ relative to errors due to tree sequence inference.
373+ Indeed, empirical haplotype-matching methods
374+ for identifying IBD segments can differ substantially
375+ depending on the values of various hyperparameters.
376+ More work is needed to develop a complete picture.
0 commit comments