Skip to content

Add a filter for unfiltered 10x MEX data#1511

Merged
arteymix merged 5 commits intohotfix-1.32.4from
feature-10x-filter
Sep 22, 2025
Merged

Add a filter for unfiltered 10x MEX data#1511
arteymix merged 5 commits intohotfix-1.32.4from
feature-10x-filter

Conversation

@arteymix
Copy link
Copy Markdown
Member

@arteymix arteymix commented Sep 12, 2025

  • detect unfiltered 10x datasets
  • add CLI flags to perform or not the filtering with auto detection by default
  • add a transformation script for filtering MEX data

@arteymix arteymix added the single cell Issues related to single-cell data support label Sep 12, 2025
@arteymix arteymix force-pushed the feature-10x-filter branch 4 times, most recently from 00cd4e3 to d7b967a Compare September 12, 2025 21:11
@arteymix
Copy link
Copy Markdown
Member Author

The transformations should be prototype beans so that we do not have to inject all the variables manually in the CLI.

@rachadele
Copy link
Copy Markdown
Contributor

rachadele commented Sep 15, 2025

python filter-10x-mex.py /space/grp/rschwartz/rschwartz/cell_annotation_cortex.nf/results/mus_musculus_subsample_ref_500_2025-09-15_11-28-12/mex/GSE225554/1011636_GSM7050276 GSE225554/1011636_GSM7050276

produces the following error:

[error] Filtered matrices file was generated with an older version of cellranger that is incompatible with reanalyze. Please run cellranger count again to generate a new matrix.

I'm not sure if there's a way around this. I tried modifying the h5 file according to this issue, but it didn't fix the error.

In the meantime, we can add:

sc.pp.fiter_cells(min_counts=X, min_genes=Y)

I know there is some debate bout how many genes/counts must be detected in cells to keep them. Let me know what you think.

Alternatively, we can run emptyDrops on the count matrices.

@arteymix
Copy link
Copy Markdown
Member Author

I don't see why we wouldn't be able to produce a Cell Ranger-compatible H5 file from a MEX output. I can investigate that tomorrow.

I can add another filter for running a simple sc.pp.filter_cells. That should be dealt with in a separate issue though.

@rachadele
Copy link
Copy Markdown
Contributor

the h5 I generated matches the format matches the documented Cellranger format

HDF5 "GSE225554/1011636_GSM7050276/raw_feature_bc_matrix.h5" {
FILE_CONTENTS {
 group      /
 group      /matrix
 dataset    /matrix/barcodes
 dataset    /matrix/data
 group      /matrix/features
 dataset    /matrix/features/feature_type
 dataset    /matrix/features/genome
 dataset    /matrix/features/id
 dataset    /matrix/features/name
 dataset    /matrix/indices
 dataset    /matrix/indptr
 dataset    /matrix/shape
 }
}

@arteymix
Copy link
Copy Markdown
Member Author

It may actually be a missing attribute. Do you have access to one of their files? I think h5dump -A will show you attributes.

@rachadele
Copy link
Copy Markdown
Contributor

rachadele commented Sep 16, 2025

the only example I could find is a filtered_peak_bc_matrix.h5 from the 10X ATAC-seq protocol, which is slightly different:

h5dump -n /space/grp/rschwartz/rschwartz/gemma-development/GSM7050281_CTL1__filtered_peak_bc_matrix.h5_
HDF5 "/space/grp/rschwartz/rschwartz/gemma-development/GSM7050281_CTL1_filtered_peak_bc_matrix.h5" {
FILE_CONTENTS {
 group      /
 group      /matrix
 dataset    /matrix/barcodes
 dataset    /matrix/data
 group      /matrix/features
 dataset    /matrix/features/_all_tag_keys
 dataset    /matrix/features/derivation
 dataset    /matrix/features/feature_type
 dataset    /matrix/features/genome
 dataset    /matrix/features/id
 dataset    /matrix/features/name
 dataset    /matrix/indices
 dataset    /matrix/indptr
 dataset    /matrix/shape
 }
}

it's associated with GSE225554. You can look with h5dump -A, but it won't be exactly the same as the gene expression h5 which is what we want.

@arteymix
Copy link
Copy Markdown
Member Author

It looks like Cell Ranger renanalyze tool expects filtered data, so it won't be suitable for this task.

@rachadele has looked around and found https://github.com/MarioniLab/DropletUtils/ which provide the Cell Ranger EmptyDrops-based filter in R, but lacks the OrderMag step that estimates the number of cells. That led her to MarioniLab/DropletUtils#119, but that feature will likely never make it in DropletUtils due to the maintenance burden of keeping up with Cell Ranger.

The author of the PR also mentioned https://github.com/COMBINE-lab/QCatch which has a Python reimplementation of that filter in https://github.com/COMBINE-lab/QCatch/tree/main/src/qcatch/find_retained_cells.

@rachadele
Copy link
Copy Markdown
Contributor

rachadele commented Sep 18, 2025

I was able to get ordMag (call_initial_cells) to work, but the emptyDrops step (call_additional_cells) fails. to reproduce, run:

source /space/opt/cellranger/sourceme.bash
python /space/grp/rschwartz/rschwartz/gemma-development/filter-10x-mex.py

the test MEX dir /space/grp/rschwartz/rschwartz/gemma-development/GSM7022367 is hardcoded into the script, along with its genome tag "mm10", feature type, and chemistry.

output:

Found recovered_cells = 850 with loss = 0.0016573829925897502 # this is from ordMag step
Range empty barcodes: 45000 - 90000
Max background UMIs: 0
Failed at attempt to call non-ambient barcodes in GEM well 1

The error stems from here:

https://github.com/10XGenomics/cellranger/blob/6ebad209b8354353b4a9ee3eed1cb248d102af88/lib/python/cellranger/cell_calling_helpers.py#L1118

not sure why emptyDrops is failing. I tried changing the chemistry to SC3Pv2 (see https://github.com/10XGenomics/cellranger/blob/6ebad209b8354353b4a9ee3eed1cb248d102af88/lib/python/cellranger/chemistry_defs.json), but it didn't help.

I tried to populate the required fields with what I think is correct for a single MEX directory (1 gel group, mm10 genome, no probe barcodes).

Also, I had to change mtx_to_matrix_converter.py functions to include the "genome" tag in the CountMatrix class.

@rachadele
Copy link
Copy Markdown
Contributor

rachadele commented Sep 18, 2025

in the interim, think we should just apply something like:

data = scanpy.read_10x_mtx(input_file, prefix=prefix)
sc.pp.filter_cells(data, min_genes=500, min_counts=500)

this will require 500 genes expressed and 500 total UMIs per cell. it should get rid of most if not all of the low-quality barcodes in unfiltered MEX data.

@rachadele
Copy link
Copy Markdown
Contributor

rachadele commented Sep 19, 2025

okay I got it to work with a real example. see /space/grp/rschwartz/rschwartz/gemma-development/filter-10x-mex.py for filtering barcodes and writing to a .pkl file,

the pkl can then be loaded and used to filter an anndata object with /space/grp/rschwartz/rschwartz/gemma-development/filter-adata.py (in another environment that's less restricive than the cellranger anaconda env)

@arteymix
Copy link
Copy Markdown
Member Author

Is it possible to write back MEX? It would simplify the loading, otherwise I'd need to add a mechanism for applying the AnnData loader.

@arteymix
Copy link
Copy Markdown
Member Author

The MatrixMarket format from Cell Ranger include software metadata that can be used to detect MEX from Cell Ranger:

%%MatrixMarket matrix coordinate integer general
%metadata_json: {"software_version": "Cell Ranger cellranger-7.1.0", "format_version": 2}

return SingleCellDataType.ANNDATA;
} else {
throw new UnsupportedOperationException( "Detecting data type for " + transformation.getClass().getName() + " is not supported." );
throw new UnsupportedOperationException( "Detecting data type for output of " + transformation.getClass().getName() + " is not supported." );
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be added immediately in the patch release.

@arteymix arteymix marked this pull request as ready for review September 21, 2025 05:07
* TODO: this will need more work and tests
*/
@Nullable
public static String detect10xChemistry( GeoSample sample ) {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rachadele it would be nice if you could review this section for detecting the chemistry based on product numbers from 10x.

/**
* Chemistry used for single-cell sequencing.
* <p>
* This affects the 10x MEX data filter.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specify here the implication of passing null.

method=FilterMethod.ORDMAG_NONAMBIENT,
recovered_cells=None,
cell_barcodes=None,
force_cells=None,
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rachadele according to the type annotations, cell_bracodes and force_cells must be set. Is there anything we can do about it?


np.random.seed(42)

GENES_TSV = "genes.tsv"
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We write legacy file with features.tsv.gz, but we don't rewrite them to have the Gene Expression-filled column. Thus, this detection is unnecessary.

I'll fix #1269 and rewrite the files to follow the v3 convention.

arteymix and others added 5 commits September 22, 2025 09:12
Add detection for unfiltered 10x data

The MTX file contains JSON metadata that can be used to detect if Cell
Ranger was used to generate the file.

In addition, we can safely assume that the presence of unused barcodes
is an indicator of an unfiltered dataset. If that does not apply in
real world, we could also use a threshold on the unused fraction.
@arteymix arteymix merged commit 1b77a74 into hotfix-1.32.4 Sep 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

single cell Issues related to single-cell data support

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants