-
Notifications
You must be signed in to change notification settings - Fork 36
Description
Hi Open2C team,
I'd like to use pairtools phase and I have some issues running it. I'd like to use it in a first time to reanalyse snHiC data from Collombet et al.,2020.
following this https://pairtools.readthedocs.io/en/latest/examples/pairtools_phase_walkthrough.html
1. Test data set:
Here are my final phase stats (with the small test, i.e. fastq downloaded with --minSpotId 0 --maxSpotId 1000000):
# cat phased.XA.dedup.stats | head -8
total 1429506
total_unmapped 203628
total_single_sided_mapped 278449
total_mapped 947429
total_dups 60249
total_nodups 887180
cis 152576
trans 734604
# cat phased.XA.phase0.stats | head -8
total 3704
total_unmapped 0
total_single_sided_mapped 0
total_mapped 3704
total_dups 257
total_nodups 3447
cis 3304
trans 143
# cat phased.XA.phase1.stats | head -8
total 3102
total_unmapped 0
total_single_sided_mapped 0
total_mapped 3102
total_dups 279
total_nodups 2823
cis 2389
trans 434
# cat phased.XA.unphased.stats | head -8
total 1203742
total_unmapped 0
total_single_sided_mapped 264503
total_mapped 939239
total_dups 59712
total_nodups 879527
cis 125709
trans 753818
# cat phased.XA.trans-phase.stats | head -8
total 1384
total_unmapped 0
total_single_sided_mapped 0
total_mapped 1384
total_dups 1
total_nodups 1383
cis 985
trans 398
I don't get why I don't have the same stats than in your tutorial, but the ratio of cis in phase0 / cis total is about the same range, I have here 2.5% (3304/(125709+985+2389+3304)) of cis in [hase0, in your example you have 1% (535/(52294+177+394+535)).
When I run the same example with the full fastq files (stll with SRR8811373), here are my final stats:
# cat phased.XA.dedup.stats | head -35
total 6923823
total_unmapped 953932
total_single_sided_mapped 1401139
total_mapped 4568752
total_dups 816275
total_nodups 3752477
cis 3045355
trans 707122
# cat phased.XA.phase0.stats | head -20
total 17866
total_unmapped 0
total_single_sided_mapped 0
total_mapped 17866
total_dups 4639
total_nodups 13227
cis 12685
trans 542
# cat phased.XA.phase1.stats | head -20
total 14771
total_unmapped 0
total_single_sided_mapped 0
total_mapped 14771
total_dups 4201
total_nodups 10570
cis 9202
trans 1368
# cat phased.XA.trans-phase.stats | head -20
total 6848
total_unmapped 0
total_single_sided_mapped 0
total_mapped 6848
total_dups 27
total_nodups 6821
cis 4867
trans 1954
# cat phased.XA.unphased.stats | head -20
total 5857878
total_unmapped 0
total_single_sided_mapped 1328611
total_mapped 4529267
total_dups 807408
total_nodups 3721859
cis 2529823
trans 1192036
Here the ratio of cis in phase0 / cis total is 0.4% (12685/(2529823+4867+12685+9202))
Do you know why the ratio of contact mapped on phase0 haplotype drops when I use the full dataset? do you have the same? I was expecting this ratio to be the same independently of using a full dataset or a sample.
2. Comparison with processed data:
So I compared with the number or ratio of pairs that they report in the original dataset (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE129029)
head GSM3691125_2CSE_70.allValidPairs.txt
NB501547:21:HCVWNAFXX:4:11612:3393:4075 chr1 3000757 + chr3 14672639 - 242 HIC_chr1_2 HIC_chr3_31231 34 12 0-0
NB501547:21:HCVWNAFXX:3:11402:2702:9569 chr1 3003299 - chr1 3266239 + 119 HIC_chr1_7 HIC_chr1_786 42 42 2-0
NB501547:21:HCVWNAFXX:3:11612:25249:9112 chr1 3004023 + chr1 10608406 - 261 HIC_chr1_13 HIC_chr1_20557 38 42 0-2
NB501547:21:HCVWNAFXX:3:21510:8573:6102 chr1 3004278 - chr4 20056554 - 182 HIC_chr1_15 HIC_chr4_43236 42 42 0-0
NB501547:21:HCVWNAFXX:2:21109:18612:1489 chr1 3004451 - chr1 3008304 + 175 HIC_chr1_16 HIC_chr1_24 42 42 0-0
NB501547:21:HCVWNAFXX:1:21307:17607:14306 chr1 3005602 - chr1 4002173 - 149 HIC_chr1_18 HIC_chr1_2746 42 42 0-0
NB501547:21:HCVWNAFXX:3:11507:26752:1374 chr1 3011527 - chr1 9745573 + 148 HIC_chr1_35 HIC_chr1_18503 42 42 0-0
NB501547:21:HCVWNAFXX:4:11602:20611:17311 chr1 3013889 - chr3 14672639 - 399 HIC_chr1_40 HIC_chr3_31231 42 30 1-0
# total number of pairs
wc -l GSM3691125_2CSE_70.allValidPairs.txt
1209391 GSM3691125_2CSE_70.allValidPairs.txt
# number of phased pairs (with 1 being BL6, 2 CAST, 0 unassigned)
grep '0-0' GSM3691125_2CSE_70.allValidPairs.txt | wc -l
456762
grep '1-1' GSM3691125_2CSE_70.allValidPairs.txt | wc -l
82362
grep '2-2' GSM3691125_2CSE_70.allValidPairs.txt | wc -l
97820
grep '1-2' GSM3691125_2CSE_70.allValidPairs.txt | wc -l
2773
grep '2-1' GSM3691125_2CSE_70.allValidPairs.txt | wc -l
2356
grep '0-1' GSM3691125_2CSE_70.allValidPairs.txt | wc -l
126764
grep '1-0' GSM3691125_2CSE_70.allValidPairs.txt | wc -l
127739
grep '2-0' GSM3691125_2CSE_70.allValidPairs.txt | wc -l
156166
grep '0-2' GSM3691125_2CSE_70.allValidPairs.txt | wc -l
155665
Then here the ratio of 1-1 over total (= phase0 in pairtools phase) is 6% (82362/1209391)
I guess numbers of aligned reads, detected pairs and cis-contact will vary depending on quality threshold applied and algorithm used, but I don't get why it is so different between HiCPro output and pairtools phase on the full fastq file.
Additional questions:
Can I combine pairtools phase with pairtools dedub by restriction fragment?
Can I combine pairtools phase with filterByCov?
Something like
pairtools parse -> pairtools phase -> pairtools sort -> pairtools dedup amplification duplicates -> pairtools restrict -> pairtools dedup by restriction fragment -> pairtools filterbycov -> pairtools select by phase
Thanks you for your help!
David
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1198-dirty
pairtools, version 1.0.2