[DRAFT] Pugh lab main [just to compare] by curlup · Pull Request #17 · dfci/matchengine-V2

curlup · 2023-06-13T14:32:06Z

No description provided.

curlup · 2023-06-13T14:59:48Z

matchengine/internals/engine.py

        else:
            return {clinical_id: clinical_data['SAMPLE_ID'] for clinical_id, clinical_data in
-                    self._clinical_data.items() if clinical_data['VITAL_STATUS'] == 'alive'}
+                    self._clinical_data.items() if (clinical_data['VITAL_STATUS'] is not None and clinical_data['VITAL_STATUS'].lower() == 'alive')}


Good catch, we should have that.

The new ME version moves this logic around a bit. It now excludes users whose VITAL_STATUS is "deceased," rather than including only users whose VITAL_STATUS is "alive" (i.e. everyone is treated as alive by default).
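
To make the difference between the two policies concrete, here is a minimal sketch. It is not the actual ME code: the function name `eligible_samples` and the `ignore_deceased` flag are hypothetical, and only the `VITAL_STATUS` and `SAMPLE_ID` keys come from the diff above.

```python
def eligible_samples(clinical_data, ignore_deceased=True):
    """Map clinical_id -> SAMPLE_ID for records that pass the vital-status filter.

    Hypothetical helper for illustration only.
    ignore_deceased=True  -> keep everyone except explicit "deceased"
    ignore_deceased=False -> keep only explicit "alive"
    """
    result = {}
    for clinical_id, record in clinical_data.items():
        status = record.get('VITAL_STATUS')
        # Normalize so that None and mixed-case values are handled safely.
        normalized = status.strip().lower() if isinstance(status, str) else None
        if ignore_deceased:
            keep = normalized != 'deceased'
        else:
            keep = normalized == 'alive'
        if keep:
            result[clinical_id] = record['SAMPLE_ID']
    return result
```

Under the "exclude deceased" policy a record with a missing VITAL_STATUS is kept, while under the "alive only" policy it is dropped.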

curlup · 2023-06-13T16:11:53Z

matchengine/internals/utilities/query.py

+                    # recompile the query to be case insensitive
+                    # convert the $in into a list of $or conditions so we can use $regex inside a $in
+                    # mongo has a limitation that cannot use $regex within a $in
+                    # using regex
+                    if "ONCOTREE_PRIMARY_DIAGNOSIS_NAME" in query_part.query:
+                        if "$in" in query_part.query['ONCOTREE_PRIMARY_DIAGNOSIS_NAME']:
+                            new_conditions = [
+                                {'ONCOTREE_PRIMARY_DIAGNOSIS_NAME': {'$regex': f'^{old_query}$', '$options': 'i'}} for
+                                old_query in query_part.query['ONCOTREE_PRIMARY_DIAGNOSIS_NAME']['$in']]
+                            del query_part.query['ONCOTREE_PRIMARY_DIAGNOSIS_NAME']  # Remove old query from query_part
+                            query_part.query['$or'] = new_conditions  # Add new conditions to query_part
+                        else:
+                            org_query = query_part.query['ONCOTREE_PRIMARY_DIAGNOSIS_NAME'];
+                            ignore_case_query = {'$regex': f'^{org_query}$', '$options': 'i'}
+                            query_part.query['ONCOTREE_PRIMARY_DIAGNOSIS_NAME'] = ignore_case_query
+
+                    # Exclude documents where 'ONCOTREE_PRIMARY_DIAGNOSIS_NAME' is 'NA'
+                    new_query = {
+                        '$and': [
+                            {join_field: {'$in': list(need_new)}},
+                            query_part.query,
+                            {'ONCOTREE_PRIMARY_DIAGNOSIS_NAME': {'$ne': 'NA'}}
+                        ]
+                    }


@jasonhansel we should discuss
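
Two points that may be worth covering in that discussion: the diff interpolates the trial value into the pattern without escaping it, and it assigns `query_part.query['$or']` directly, which would replace a top-level `$or` if the query part already had one. The sketch below shows one way to address both, assuming the query is a plain dict. The helper names `case_insensitive_exact` and `rewrite_diagnosis_query` are hypothetical and not part of matchengine.

```python
import re

FIELD = 'ONCOTREE_PRIMARY_DIAGNOSIS_NAME'


def case_insensitive_exact(value):
    """Build a MongoDB condition matching `value` exactly, ignoring case."""
    # re.escape keeps characters such as "(" or "+" from being read as regex syntax.
    return {'$regex': f'^{re.escape(value)}$', '$options': 'i'}


def rewrite_diagnosis_query(query):
    """Return a copy of `query` with the diagnosis condition made case-insensitive.

    Hypothetical helper: the input dict is not modified.
    """
    if FIELD not in query:
        return dict(query)
    rewritten = {key: val for key, val in query.items() if key != FIELD}
    condition = query[FIELD]
    if isinstance(condition, dict) and '$in' in condition:
        alternatives = [{FIELD: case_insensitive_exact(v)} for v in condition['$in']]
        # Nest the alternatives under $and so an existing top-level $or survives.
        rewritten['$and'] = list(query.get('$and', [])) + [{'$or': alternatives}]
    elif isinstance(condition, str):
        rewritten[FIELD] = case_insensitive_exact(condition)
    else:
        # Leave any other operator expression untouched.
        rewritten[FIELD] = condition
    return rewritten
```

This only changes how the query is built; it does not address the cost of regex lookups.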

curlup · 2023-06-13T16:16:13Z

matchengine/plugins/DFCITrialMatchDocumentCreator.py


    # add mutation
-    if true_protein is not None:
+    if true_protein is not None and true_protein:


Good catch. @jasonhansel do we already have that fixed or 😬

We should be able to incorporate this change without issues.
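
For reference, a small sketch of what the extra condition changes. The sample values are made up for illustration, and `true_protein` is assumed here to be either `None` or a string.

```python
def passes_old_check(true_protein):
    # Original condition: only None is rejected, so '' gets through.
    return true_protein is not None


def passes_new_check(true_protein):
    # Condition from the diff: rejects None and any falsy value such as ''.
    return bool(true_protein is not None and true_protein)


def passes_truthiness_check(true_protein):
    # Equivalent shorter form: None is itself falsy.
    return bool(true_protein)
```

For these inputs the new check and the plain truthiness check agree, so `if true_protein:` would behave the same way.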

curlup · 2023-06-13T16:16:41Z

matchengine/ref/oncotree_mapping.json

    "Renal Angiomyolipoma",
-    "Large Cell Neuroendocrine Carcinoma"
+    "Large Cell Neuroendocrine Carcinoma",
+    "Breast Invasive Carcinoma, NOS"


duplicate with the one above?

jasonhansel · 2023-07-17T15:36:18Z

matchengine/ref/oncotree_mapping.json

    "Invasive Breast Carcinoma",
    "Phyllodes Tumor of the Breast",
    "Breast Invasive Carcinosarcoma, NOS",
+    "Breast Invasive Carcinoma, NOS",


This file is generated programmatically from the OncoTree data. We should check with PMATCH to see why they want something other than what OncoTree provides; the config JSON file allows you to specify a separate path within this folder to use for this mapping.

jasonhansel · 2023-07-17T15:36:34Z

matchengine/plugins/DFCITrialMatchDocumentCreator.py

        new_trial_match.update({'cancer_type_match': get_cancer_type_match(trial_match)})
+        # Add in additional fields we need for frontend
+        if ('arm_description' in trial_match.match_clause_data.match_clause_additional_attributes):
+            new_trial_match.update({'arm_description': trial_match.match_clause_data.match_clause_additional_attributes['arm_description']})


We should be able to incorporate this change without issues.

jasonhansel · 2023-07-17T15:37:11Z

matchengine/plugins/DFCIQueryTransformers.py

+        elif trial_value.upper() == 'FALSE':
+            return QueryTransformerResult({sample_key: 'Negative'}, False)
+        else:
+            return QueryTransformerResult({sample_key: trial_value}, False)


We should be able to incorporate this change without issues.

jasonhansel · 2023-07-17T15:37:19Z

matchengine/main.py

        if run_args.csv_output:
-            me.create_output_csv()
+            from matchengine.internals.utilities.output import create_output_csv
+            create_output_csv(me)


The latest ME version fixes this.

jasonhansel · 2023-07-17T15:38:03Z

matchengine/main.py

    subp_p.add_argument('-t', dest='trial', default=None, help=param_trials_help)
    subp_p.add_argument('-c', dest='clinical', default=None, help=param_clinical_help)
-    subp_p.add_argument('-g', dest='extended_attributes', default=None, help=param_genomic_help)
+    subp_p.add_argument('-g', dest='genomic', default=None, help=param_genomic_help)


While we can fix this, we should discourage use of this "loading" functionality for anything other than trials, in favor of having users load data into MongoDB directly, e.g. through an ETL process of some sort.

jasonhansel · 2023-07-17T15:39:01Z

matchengine/internals/utilities/utilities.py

+    # if isinstance(identifier, ObjectId) or identifier is None:
+    #     pass
+    # else:
+    #     sort_array.append(int(identifier.replace("-", "")))


The new version of ME will move this into TrialMatchDocumentCreator where it can be modified more easily.

jasonhansel · 2023-07-17T15:39:26Z

matchengine/internals/utilities/query.py

+                    if matchengine.report_all_clinical_reasons or \
+                            keys.issubset(matchengine.match_criteria_transform.valid_clinical_reasons):
+                        should_add_reason = True
+                if should_add_reason:


The new version of ME moves this code into TrialMatchDocumentCreator where it can be modified more easily (though it seems like the change here may just be a refactor).

jasonhansel · 2023-07-17T15:40:51Z

matchengine/internals/utilities/query.py


                if need_new:
-                    new_query = {'$and': [{join_field: {'$in': list(need_new)}}, query_part.query]}
+                    # recompile the query to be case insensitive


This is something we don't want to incorporate. There may be ways of doing this through the query transformers, but it will never be performant because of the cost of regex lookups. Ideally this would be fixed at the data ingestion layer, for example by lowercasing all inputs and then searching for lowercase cancer types, or by making cancer types match OncoTree.
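
A minimal sketch of the ingestion-layer approach, under stated assumptions: the normalized field name `ONCOTREE_PRIMARY_DIAGNOSIS_NAME_NORMALIZED` and all three helper names are hypothetical, and nothing here is existing matchengine code. The idea is to store a lowercased copy of the diagnosis once at load time, so that matching can use a plain `$in` on that field instead of a regex.

```python
SOURCE_FIELD = 'ONCOTREE_PRIMARY_DIAGNOSIS_NAME'
# Hypothetical field name for the normalized copy.
NORMALIZED_FIELD = 'ONCOTREE_PRIMARY_DIAGNOSIS_NAME_NORMALIZED'


def normalize_diagnosis(value):
    """Collapse whitespace and fold case; non-string input becomes None."""
    if not isinstance(value, str):
        return None
    return ' '.join(value.split()).casefold()


def add_normalized_field(clinical_doc):
    """Return a copy of a clinical document with the normalized field added."""
    doc = dict(clinical_doc)
    doc[NORMALIZED_FIELD] = normalize_diagnosis(doc.get(SOURCE_FIELD))
    return doc


def normalized_in_query(trial_values):
    """Build an exact-match $in query against the normalized field."""
    normalized = {normalize_diagnosis(v) for v in trial_values}
    normalized.discard(None)
    return {NORMALIZED_FIELD: {'$in': sorted(normalized)}}
```

Because the comparison is an exact match on a stored value, the normalized field can be indexed like any other field.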

jasonhansel · 2023-07-17T15:41:28Z

matchengine/internals/load.py

        else:
            raw_file_data = file_handle.read()
-            if filetype == 'yaml':
+            if filetype == 'yml':


This is just a bugfix that we can incorporate.
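
When incorporating it, one option is to accept both spellings rather than swapping one for the other. The sketch below assumes the file type is derived from the file extension, which this diff does not show; `detect_filetype` is a hypothetical helper and not a function in `load.py`.

```python
import os

# Both extensions are in common use for YAML files.
YAML_EXTENSIONS = {'yml', 'yaml'}


def detect_filetype(path):
    """Return 'yaml', 'json', or None based on the file extension (hypothetical)."""
    extension = os.path.splitext(path)[1].lstrip('.').lower()
    if extension in YAML_EXTENSIONS:
        return 'yaml'
    if extension == 'json':
        return 'json'
    return None
```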

jasonhansel · 2023-07-17T15:42:17Z

matchengine/config/dfci_config.json

  "match_trial_link_id": "protocol_no",
  "trial_status_key": {
-    "key_name": null,
+    "key_name": "summary",


Most of the changes to this file can be incorporated without issues. The one exception is the "trial_status_key," which determines how we decide if trials are open or closed; that may be something we need to keep separate for PMATCH.

mickey-ng added 11 commits October 18, 2021 17:20

fix loading genomic data flag, yml format, and matching clinical vital status

1007557

Add extra fields to match and fields to resulting match document

64bebc5

Add extra solid tumor mapping, make oncotree diagnosis query case insensitive

d121066

add mutation effect to match document

763737f

add molecular function mapping, fix bug with oncotree diagnosis case insensitive match

718627d

all true false mapping, and query to get all genomic results

52f57c3

fix valid clinical reason subset bug without using pop

9e62654

fix case insensitive oncotree diagnosis match with regex to strict match

4994641

rewrite case insensitive match to use or instead of in to pair with regex

ec98770

make oncotree primary diagnosis query ignore NA in data

fa8c8e8

CTM-217 remove unknown from inactivating mapping

a376538

curlup commented Jun 13, 2023

artonio added 4 commits July 4, 2023 13:17

dockerize

526e169

add functions to load trial from var

2d5d5af

add functions to load clinical and genomic via api

a881e01

added map_clinical_to_genomic

fd14bea

jasonhansel reviewed Jul 17, 2023

mickey-ng and others added 14 commits July 20, 2023 17:53

add value match for wildtype when recording result

e6e7611

Merge pull request #1 from pughlab/genomic_match_all

ece0b8e

add value match for wildtype when recording result

fix to pass study id from clinical to results

042df3a

CTM-289: fix structural variation matching without report date

cb3f772

Merge pull request #2 from pughlab/genomic_match_all

a19aae9

CTM-289: fix structural variation matching without report date

CTM-293 add short title to result db

b616ade

CTM-303 update oncotree definition

99f5f19

test submodule commit

11a84d3

test upload

887cded

CTM-284 age expression query

a2141f6

test update

510d1b3

fix age expression to use correct field in clinical data

7ded6f2

fix age loading as string instead of int

860240a

fix age matching query transform

71ea8dd

4 participants