Add a filter for unfiltered 10x MEX data by arteymix · Pull Request #1511 · PavlidisLab/Gemma

arteymix · 2025-09-12T18:45:59Z

detect unfiltered 10x datasets
add CLI flags to perform or not the filtering with auto detection by default
add a transformation script for filtering MEX data

gemma-core/src/main/resources/ubic/gemma/core/loader/expression/singleCell/filter-10x-mex.py

arteymix · 2025-09-13T17:12:13Z

The transformations should be prototype beans so that we do not have to inject all the variables manually in the CLI.

rachadele · 2025-09-15T18:34:38Z

python filter-10x-mex.py /space/grp/rschwartz/rschwartz/cell_annotation_cortex.nf/results/mus_musculus_subsample_ref_500_2025-09-15_11-28-12/mex/GSE225554/1011636_GSM7050276 GSE225554/1011636_GSM7050276

produces the following error:

[error] Filtered matrices file was generated with an older version of cellranger that is incompatible with reanalyze. Please run cellranger count again to generate a new matrix.

I'm not sure if there's a way around this. I tried modifying the h5 file according to this issue, but it didn't fix the error.

In the meantime, we can add:

sc.pp.fiter_cells(min_counts=X, min_genes=Y)

I know there is some debate bout how many genes/counts must be detected in cells to keep them. Let me know what you think.

Alternatively, we can run emptyDrops on the count matrices.

arteymix · 2025-09-15T19:42:42Z

I don't see why we wouldn't be able to produce a Cell Ranger-compatible H5 file from a MEX output. I can investigate that tomorrow.

I can add another filter for running a simple sc.pp.filter_cells. That should be dealt with in a separate issue though.

rachadele · 2025-09-15T21:04:59Z

the h5 I generated matches the format matches the documented Cellranger format

HDF5 "GSE225554/1011636_GSM7050276/raw_feature_bc_matrix.h5" {
FILE_CONTENTS {
 group      /
 group      /matrix
 dataset    /matrix/barcodes
 dataset    /matrix/data
 group      /matrix/features
 dataset    /matrix/features/feature_type
 dataset    /matrix/features/genome
 dataset    /matrix/features/id
 dataset    /matrix/features/name
 dataset    /matrix/indices
 dataset    /matrix/indptr
 dataset    /matrix/shape
 }
}

arteymix · 2025-09-15T21:36:53Z

It may actually be a missing attribute. Do you have access to one of their files? I think h5dump -A will show you attributes.

rachadele · 2025-09-16T02:08:11Z

the only example I could find is a filtered_peak_bc_matrix.h5 from the 10X ATAC-seq protocol, which is slightly different:

h5dump -n /space/grp/rschwartz/rschwartz/gemma-development/GSM7050281_CTL1__filtered_peak_bc_matrix.h5_
HDF5 "/space/grp/rschwartz/rschwartz/gemma-development/GSM7050281_CTL1_filtered_peak_bc_matrix.h5" {
FILE_CONTENTS {
 group      /
 group      /matrix
 dataset    /matrix/barcodes
 dataset    /matrix/data
 group      /matrix/features
 dataset    /matrix/features/_all_tag_keys
 dataset    /matrix/features/derivation
 dataset    /matrix/features/feature_type
 dataset    /matrix/features/genome
 dataset    /matrix/features/id
 dataset    /matrix/features/name
 dataset    /matrix/indices
 dataset    /matrix/indptr
 dataset    /matrix/shape
 }
}

it's associated with GSE225554. You can look with h5dump -A, but it won't be exactly the same as the gene expression h5 which is what we want.

arteymix · 2025-09-18T16:45:36Z

It looks like Cell Ranger renanalyze tool expects filtered data, so it won't be suitable for this task.

@rachadele has looked around and found https://github.com/MarioniLab/DropletUtils/ which provide the Cell Ranger EmptyDrops-based filter in R, but lacks the OrderMag step that estimates the number of cells. That led her to MarioniLab/DropletUtils#119, but that feature will likely never make it in DropletUtils due to the maintenance burden of keeping up with Cell Ranger.

The author of the PR also mentioned https://github.com/COMBINE-lab/QCatch which has a Python reimplementation of that filter in https://github.com/COMBINE-lab/QCatch/tree/main/src/qcatch/find_retained_cells.

rachadele · 2025-09-18T23:29:35Z

I was able to get ordMag (call_initial_cells) to work, but the emptyDrops step (call_additional_cells) fails. to reproduce, run:

source /space/opt/cellranger/sourceme.bash
python /space/grp/rschwartz/rschwartz/gemma-development/filter-10x-mex.py

the test MEX dir /space/grp/rschwartz/rschwartz/gemma-development/GSM7022367 is hardcoded into the script, along with its genome tag "mm10", feature type, and chemistry.

output:

Found recovered_cells = 850 with loss = 0.0016573829925897502 # this is from ordMag step
Range empty barcodes: 45000 - 90000
Max background UMIs: 0
Failed at attempt to call non-ambient barcodes in GEM well 1

The error stems from here:

https://github.com/10XGenomics/cellranger/blob/6ebad209b8354353b4a9ee3eed1cb248d102af88/lib/python/cellranger/cell_calling_helpers.py#L1118

not sure why emptyDrops is failing. I tried changing the chemistry to SC3Pv2 (see https://github.com/10XGenomics/cellranger/blob/6ebad209b8354353b4a9ee3eed1cb248d102af88/lib/python/cellranger/chemistry_defs.json), but it didn't help.

I tried to populate the required fields with what I think is correct for a single MEX directory (1 gel group, mm10 genome, no probe barcodes).

Also, I had to change mtx_to_matrix_converter.py functions to include the "genome" tag in the CountMatrix class.

rachadele · 2025-09-18T23:43:22Z

in the interim, think we should just apply something like:

data = scanpy.read_10x_mtx(input_file, prefix=prefix)
sc.pp.filter_cells(data, min_genes=500, min_counts=500)

this will require 500 genes expressed and 500 total UMIs per cell. it should get rid of most if not all of the low-quality barcodes in unfiltered MEX data.

rachadele · 2025-09-19T01:24:34Z

okay I got it to work with a real example. see /space/grp/rschwartz/rschwartz/gemma-development/filter-10x-mex.py for filtering barcodes and writing to a .pkl file,

the pkl can then be loaded and used to filter an anndata object with /space/grp/rschwartz/rschwartz/gemma-development/filter-adata.py (in another environment that's less restricive than the cellranger anaconda env)

arteymix · 2025-09-19T03:40:21Z

Is it possible to write back MEX? It would simplify the loading, otherwise I'd need to add a mechanism for applying the AnnData loader.

...e/src/main/resources/ubic/gemma/core/loader/expression/singleCell/transform/requirements.txt

gemma-core/src/main/java/ubic/gemma/model/expression/experiment/ExpressionExperiment.java

.../ubic/gemma/core/loader/expression/singleCell/AbstractMexSingleCellDataLoaderConfigurer.java

gemma-cli/src/main/java/ubic/gemma/cli/completion/CompletionUtils.java

...ain/java/ubic/gemma/core/loader/expression/singleCell/MexSingleCellDataLoaderConfigurer.java

arteymix · 2025-09-19T20:57:34Z

The MatrixMarket format from Cell Ranger include software metadata that can be used to detect MEX from Cell Ranger:

%%MatrixMarket matrix coordinate integer general
%metadata_json: {"software_version": "Cell Ranger cellranger-7.1.0", "format_version": 2}

arteymix · 2025-09-20T04:50:35Z

gemma-cli/src/main/java/ubic/gemma/apps/SingleCellDataTransformCli.java

            return SingleCellDataType.ANNDATA;
        } else {
-            throw new UnsupportedOperationException( "Detecting data type for " + transformation.getClass().getName() + " is not supported." );
+            throw new UnsupportedOperationException( "Detecting data type for output of " + transformation.getClass().getName() + " is not supported." );


This can be added immediately in the patch release.

arteymix · 2025-09-21T05:09:01Z

...core/src/main/java/ubic/gemma/core/loader/expression/geo/singleCell/TenXCellRangerUtils.java

+     * TODO: this will need more work and tests
+     */
+    @Nullable
+    public static String detect10xChemistry( GeoSample sample ) {


@rachadele it would be nice if you could review this section for detecting the chemistry based on product numbers from 10x.

arteymix · 2025-09-21T05:10:54Z

...rc/main/java/ubic/gemma/core/loader/expression/singleCell/MexSingleCellDataLoaderConfig.java

+    /**
+     * Chemistry used for single-cell sequencing.
+     * <p>
+     * This affects the 10x MEX data filter.


Specify here the implication of passing null.

arteymix · 2025-09-21T05:13:11Z

.../src/main/resources/ubic/gemma/core/loader/expression/singleCell/transform/filter-10x-mex.py

+        method=FilterMethod.ORDMAG_NONAMBIENT, 
+        recovered_cells=None,
+        cell_barcodes=None,
+        force_cells=None,


@rachadele according to the type annotations, cell_bracodes and force_cells must be set. Is there anything we can do about it?

arteymix · 2025-09-21T05:15:43Z

.../src/main/resources/ubic/gemma/core/loader/expression/singleCell/transform/filter-10x-mex.py

+
+np.random.seed(42)
+
+GENES_TSV = "genes.tsv"


We write legacy file with features.tsv.gz, but we don't rewrite them to have the Gene Expression-filled column. Thus, this detection is unnecessary.

I'll fix #1269 and rewrite the files to follow the v3 convention.

Add detection for unfiltered 10x data The MTX file contains JSON metadata that can be used to detect if Cell Ranger was used to generate the file. In addition, we can safely assume that the presence of unused barcodes is an indicator of an unfiltered dataset. If that does not apply in real world, we could also use a threshold on the unused fraction.

arteymix force-pushed the feature-10x-filter branch from 707a3aa to 7dde827 Compare September 12, 2025 18:48

arteymix commented Sep 12, 2025

View reviewed changes

gemma-core/src/main/resources/ubic/gemma/core/loader/expression/singleCell/filter-10x-mex.py Outdated Show resolved Hide resolved

arteymix assigned arteymix and rachadele Sep 12, 2025

arteymix added the single cell Issues related to single-cell data support label Sep 12, 2025

arteymix force-pushed the feature-10x-filter branch 4 times, most recently from 00cd4e3 to d7b967a Compare September 12, 2025 21:11

arteymix force-pushed the feature-10x-filter branch from c840a24 to 231d26e Compare September 17, 2025 18:32

arteymix commented Sep 19, 2025

View reviewed changes

...ain/java/ubic/gemma/core/loader/expression/singleCell/MexSingleCellDataLoaderConfigurer.java Outdated Show resolved Hide resolved

arteymix mentioned this pull request Sep 19, 2025

Support 10X Cell Ranger HDF5 format #1097

Open

arteymix commented Sep 20, 2025

View reviewed changes

arteymix force-pushed the feature-10x-filter branch from d70cf69 to a477224 Compare September 20, 2025 05:51

arteymix marked this pull request as ready for review September 21, 2025 05:07

arteymix commented Sep 21, 2025

View reviewed changes

arteymix and others added 5 commits September 22, 2025 09:12

update 10x mex filtering to perform OrdMag and EmptyDrops cell calling

4c929ca

make chemistry optional

60e53d5

Cleanup imports in filter-10x-mex.py script

3a4f35b

Add Rachel to the contributors

ee9b565

arteymix force-pushed the feature-10x-filter branch from ef9a057 to ee9b565 Compare September 22, 2025 16:31

arteymix merged commit 1b77a74 into hotfix-1.32.4 Sep 22, 2025


		np.random.seed(42)

		GENES_TSV = "genes.tsv"

Conversation

arteymix commented Sep 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

arteymix commented Sep 13, 2025

Uh oh!

rachadele commented Sep 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

arteymix commented Sep 15, 2025

Uh oh!

rachadele commented Sep 15, 2025

Uh oh!

arteymix commented Sep 15, 2025

Uh oh!

rachadele commented Sep 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

arteymix commented Sep 18, 2025

Uh oh!

rachadele commented Sep 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rachadele commented Sep 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rachadele commented Sep 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

arteymix commented Sep 19, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

arteymix commented Sep 19, 2025

Uh oh!

arteymix Sep 20, 2025

Choose a reason for hiding this comment

Uh oh!

arteymix Sep 21, 2025

Choose a reason for hiding this comment

Uh oh!

arteymix Sep 21, 2025

Choose a reason for hiding this comment

Uh oh!

arteymix Sep 21, 2025

Choose a reason for hiding this comment

Uh oh!

arteymix Sep 21, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

arteymix commented Sep 12, 2025 •

edited

Loading

rachadele commented Sep 15, 2025 •

edited

Loading

rachadele commented Sep 16, 2025 •

edited

Loading

rachadele commented Sep 18, 2025 •

edited

Loading

rachadele commented Sep 18, 2025 •

edited

Loading

rachadele commented Sep 19, 2025 •

edited

Loading