Add a filter for unfiltered 10x MEX data #1511
707a3aa to
7dde827
Compare
gemma-core/src/main/resources/ubic/gemma/core/loader/expression/singleCell/filter-10x-mex.py
00cd4e3 to
d7b967a
Compare
|
The transformations should be prototype beans so that we do not have to inject all the variables manually in the CLI. |
|
produces the following error:
I'm not sure if there's a way around this. I tried modifying the h5 file according to this issue, but it didn't fix the error. In the meantime, we can add:
I know there is some debate about how many genes/counts must be detected in a cell to keep it. Let me know what you think. Alternatively, we can run emptyDrops on the count matrices. |
|
I don't see why we wouldn't be able to produce a Cell Ranger-compatible H5 file from a MEX output. I can investigate that tomorrow. I can add another filter for running a simple |
|
The h5 I generated matches the documented Cell Ranger format |
|
It may actually be a missing attribute. Do you have access to one of their files? I think h5dump -A will show you attributes. |
|
the only example I could find is a it's associated with |
c840a24 to
231d26e
Compare
|
It looks like the Cell Ranger reanalyze tool expects filtered data, so it won't be suitable for this task. @rachadele has looked around and found https://github.com/MarioniLab/DropletUtils/ which provides the Cell Ranger EmptyDrops-based filter in R, but lacks the OrdMag step that estimates the number of cells. That led her to MarioniLab/DropletUtils#119, but that feature will likely never make it into DropletUtils due to the maintenance burden of keeping up with Cell Ranger. The author of the PR also mentioned https://github.com/COMBINE-lab/QCatch which has a Python reimplementation of that filter in https://github.com/COMBINE-lab/QCatch/tree/main/src/qcatch/find_retained_cells. |
|
I was able to get ordMag (
the test MEX dir output: The error stems from here: not sure why emptyDrops is failing. I tried changing the chemistry to I tried to populate the required fields with what I think is correct for a single MEX directory (1 gel group, mm10 genome, no probe barcodes). Also, I had to change |
|
In the interim, I think we should just apply something like: data = scanpy.read_10x_mtx(input_file, prefix=prefix) This will require 500 genes expressed and 500 total UMIs per cell. It should get rid of most, if not all, of the low-quality barcodes in unfiltered MEX data. |
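To make the interim threshold concrete, here is a minimal sketch of the rule itself (at least 500 genes detected and at least 500 total UMIs per barcode). It is written against plain `(gene, barcode, count)` triplets rather than an AnnData object so that it has no dependencies; the function name `passing_barcodes` and its parameters are hypothetical and not part of this PR.

```python
from collections import defaultdict


def passing_barcodes(triplets, n_barcodes, min_genes=500, min_counts=500):
    """Return the 0-based barcode indices that pass both thresholds.

    triplets: iterable of (gene_index, barcode_index, count), i.e. the
    non-zero entries of the count matrix. Each (gene, barcode) pair is
    assumed to appear at most once.
    """
    genes_per_cell = defaultdict(int)   # number of genes with count > 0
    counts_per_cell = defaultdict(int)  # total UMIs
    for _gene, cell, count in triplets:
        if count > 0:
            genes_per_cell[cell] += 1
            counts_per_cell[cell] += count
    return [
        cell
        for cell in range(n_barcodes)
        if genes_per_cell.get(cell, 0) >= min_genes
        and counts_per_cell.get(cell, 0) >= min_counts
    ]
```

With scanpy, the same rule should correspond to two successive `scanpy.pp.filter_cells` calls, one with `min_genes=500` and one with `min_counts=500`, since that function accepts only one threshold per call.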
|
okay I got it to work with a real example. see the pkl can then be loaded and used to filter an anndata object with |
|
Is it possible to write back MEX? It would simplify the loading, otherwise I'd need to add a mechanism for applying the AnnData loader. |
...e/src/main/resources/ubic/gemma/core/loader/expression/singleCell/transform/requirements.txt
gemma-core/src/main/java/ubic/gemma/model/expression/experiment/ExpressionExperiment.java
.../ubic/gemma/core/loader/expression/singleCell/AbstractMexSingleCellDataLoaderConfigurer.java
gemma-cli/src/main/java/ubic/gemma/cli/completion/CompletionUtils.java
...ain/java/ubic/gemma/core/loader/expression/singleCell/MexSingleCellDataLoaderConfigurer.java
|
The MatrixMarket files produced by Cell Ranger include software metadata that can be used to detect MEX from Cell Ranger:
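As a rough illustration of that detection, the sketch below scans the comment header of a `matrix.mtx` (optionally gzipped) for a JSON metadata line. It assumes the metadata sits in a comment starting with `%metadata_json:` and carries a `software_version` field beginning with `cellranger`; both names are assumptions to verify against real Cell Ranger output, and the helper names are hypothetical.

```python
import gzip
import json

# Assumed comment prefix for the JSON metadata line in the MTX header.
METADATA_PREFIX = "%metadata_json:"


def read_mtx_metadata(path):
    """Return the parsed JSON metadata from the MTX header, or None."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.startswith("%"):
                break  # end of the comment header, no metadata found
            if line.startswith(METADATA_PREFIX):
                try:
                    return json.loads(line[len(METADATA_PREFIX):])
                except json.JSONDecodeError:
                    return None
    return None


def is_cell_ranger_mex(path):
    """Guess whether the MTX file was written by Cell Ranger."""
    meta = read_mtx_metadata(path)
    if not isinstance(meta, dict):
        return False
    # "software_version" is an assumed field name.
    return str(meta.get("software_version", "")).startswith("cellranger")
```

Only the header is read, so the check stays cheap even for large unfiltered matrices.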
|
| return SingleCellDataType.ANNDATA; | ||
| } else { | ||
| throw new UnsupportedOperationException( "Detecting data type for " + transformation.getClass().getName() + " is not supported." ); | ||
| throw new UnsupportedOperationException( "Detecting data type for output of " + transformation.getClass().getName() + " is not supported." ); |
This can be added immediately in the patch release.
d70cf69 to
a477224
Compare
| * TODO: this will need more work and tests | ||
| */ | ||
| @Nullable | ||
| public static String detect10xChemistry( GeoSample sample ) { |
@rachadele it would be nice if you could review this section for detecting the chemistry based on product numbers from 10x.
| /** | ||
| * Chemistry used for single-cell sequencing. | ||
| * <p> | ||
| * This affects the 10x MEX data filter. |
Specify here the implication of passing null.
| method=FilterMethod.ORDMAG_NONAMBIENT, | ||
| recovered_cells=None, | ||
| cell_barcodes=None, | ||
| force_cells=None, |
@rachadele according to the type annotations, cell_barcodes and force_cells must be set. Is there anything we can do about it?
|
|
||
| np.random.seed(42) | ||
|
|
||
| GENES_TSV = "genes.tsv" |
We write legacy files as features.tsv.gz, but we don't rewrite them to have the Gene Expression-filled column. Thus, this detection is unnecessary.
I'll fix #1269 and rewrite the files to follow the v3 convention.
Add detection for unfiltered 10x data
The MTX file contains JSON metadata that can be used to detect whether Cell Ranger generated the file. In addition, we can safely assume that the presence of unused barcodes indicates an unfiltered dataset. If that does not hold in the real world, we could also use a threshold on the unused fraction.
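The unused-barcode heuristic can be sketched as follows. This is an illustration only, not the code in this commit: the function names and the `max_unused_fraction` parameter are hypothetical, and entries are assumed to be `(row, column, value)` triplets with 1-based column indices as in the MTX coordinate format.

```python
def unused_barcode_fraction(n_barcodes, entries):
    """Fraction of barcodes (columns) that have no non-zero entry."""
    if n_barcodes == 0:
        return 0.0
    used = {col for _row, col, value in entries if value != 0}
    return (n_barcodes - len(used)) / n_barcodes


def looks_unfiltered(n_barcodes, entries, max_unused_fraction=0.0):
    """Flag a dataset as unfiltered when too many barcodes are unused.

    The default of 0.0 treats any unused barcode as a sign of unfiltered
    data; a higher value gives the threshold-based variant.
    """
    return unused_barcode_fraction(n_barcodes, entries) > max_unused_fraction
```

For example, `looks_unfiltered(4, [(1, 1, 2), (2, 1, 1), (1, 3, 4)])` flags the matrix because two of the four barcodes never appear.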
ef9a057 to
ee9b565
Compare