Encoding Type #3

@Daniel-Liu-c0deb0t

I'm assuming we'd be using 2 bits per base. There are a couple of common encodings, but I want to suggest:

A -> 00
C -> 01
T -> 10
G -> 11

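There are a couple of benefits:

  1. These are the 2nd and 3rd least-significant bits of the ASCII encoding of the corresponding base (upper or lower case), so conversion from byte strings is just a shift and a mask.
  2. Complement is a single XOR with ...0101010 (0b10 repeated once per base), since A (00) pairs with T (10) and C (01) pairs with G (11).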
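
To make both points concrete, here is a minimal Python sketch of the proposed encoding. It is only an illustration, not the library's API: the function names (`encode_base`, `pack_kmer`, `complement`, `unpack_kmer`) and the choice to put the first base in the highest bits are my own assumptions, and there is no validation of non-ACGT bytes.

```python
def encode_base(b: int) -> int:
    """Map the ASCII byte of A/C/T/G (either case) to its 2-bit code.

    Takes bits 1 and 2 (counting from the least-significant bit as 0):
    A -> 00, C -> 01, T -> 10, G -> 11.
    Assumption: the input is a valid base; other bytes (e.g. N) are not
    rejected and silently map to some 2-bit value.
    """
    return (b >> 1) & 0b11


def pack_kmer(seq: bytes) -> int:
    """Pack a k-mer into an integer, 2 bits per base, first base highest."""
    packed = 0
    for b in seq:
        packed = (packed << 2) | encode_base(b)
    return packed


def complement(packed: int, k: int) -> int:
    """Complement every base by XOR with 0b10 repeated k times."""
    mask = sum(0b10 << (2 * i) for i in range(k))
    return packed ^ mask


def unpack_kmer(packed: int, k: int) -> bytes:
    """Decode a packed k-mer back to upper-case ASCII (for checking only)."""
    return bytes(b"ACTG"[(packed >> (2 * (k - 1 - i))) & 0b11] for i in range(k))


kmer = pack_kmer(b"ACTG")                    # 0b00011011
print(unpack_kmer(complement(kmer, 4), 4))   # b'TGAC'
```

Note that this is the plain complement, not the reverse complement; the base order is unchanged.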

On a slightly unrelated note, I've worked on some sequence manipulation stuff that uses SIMD (e.g., here, here for a library that was abandoned). Many of these ideas could be applicable here as well. I'm assuming that we want scalar ops only here, because SIMD registers (128 or 256 bits) are probably too wide for handling k-mers that are relatively short.
