-
Notifications
You must be signed in to change notification settings - Fork 9
Open
Description
I'm assuming we'd be using 2 bits per bp. There's a couple of common encodings, but I want to suggest
A -> 00
C -> 01
T -> 10
G -> 11
There's a couple of benefits:
- These are the 2nd and 3rd bits of the ASCII encoding of the corresponding base pairs. Conversion from byte strings would be easy.
- Complement by using XOR ...0101010
On a slightly unrelated note, I've worked on some sequence manipulation stuff that use SIMD (eg., here, here for a library that was abandoned). Many of these ideas could be applicable here as well. I'm assuming that we want scalar ops only here because SIMD registers are probably too wide (128 or 256 bits) for handling kmers that are relatively short.
shenwei356
Metadata
Metadata
Assignees
Labels
No labels