
Updating GWAS Catalog data ingestion #21

Open
DSuveges wants to merge 12 commits into dev from ds_v2d_updates

Conversation


@DSuveges DSuveges commented Jun 28, 2022

This is a prototype of the revised ingestion of GWAS Catalog data. At this stage it reads the GWAS Catalog association data and selects and parses the relevant columns. Some of the logic from the original v2d pipeline is carried over, but not all of it.

TODOs:

  • Parse GWAS Catalog associations.
  • Apply p-value and other filters.
  • Parse SNP identifiers.
  • Parse p-values.
  • Join with variant annotation.
  • Check allele concordance.
  • Deduplicate variant data.
  • Apply clustering etc.
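One of the TODOs above, parsing p-values, usually means splitting the GWAS Catalog's scientific-notation strings (e.g. "5E-8") into a mantissa and an exponent, since extremely significant p-values underflow a double. A minimal pure-Python sketch; the function name and exact splitting rule are my assumptions, not code from this PR:

```python
import re

def parse_pvalue(pval: str) -> tuple:
    """Split a p-value string such as '5E-8' into (mantissa, exponent).

    Keeping the two parts separate avoids floating-point underflow for
    extremely significant associations. This is a hypothetical helper
    illustrating the TODO item, not the PR's implementation.
    """
    match = re.fullmatch(r'([0-9.]+)\s*[Ee]\s*(-?[0-9]+)', pval.strip())
    if match is None:
        raise ValueError(f'Unparseable p-value: {pval!r}')
    return float(match.group(1)), int(match.group(2))
```

In a Spark job the same split could be done with two `regexp_extract` calls producing separate mantissa and exponent columns.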

! Warning: as this PR is a prototype, all parameters are hardcoded. The script runs stand-alone on Dataproc as a PySpark job as follows:

gcloud dataproc jobs submit pyspark \
    --cluster=ds-single \
    --project=open-targets-eu-dev \
    --region=europe-west1 \
    ingest_GWAS_Catalog.py

@DSuveges DSuveges requested a review from bruno-ariano June 28, 2022 08:52

# Processing variant related columns:
# - Sorting out current rsID field: <- why do we need this? rs identifiers should always come from the GnomAD dataset.
Contributor

I agree that once you have chromosome, position, ref and alt, you can get the rsID from GnomAD. In that case, can you remove lines 247 and 248?

Contributor Author

So this is complicated... if we join the tables by chromosome and position only, there might be multiple variants with different alleles at the same locus. In those cases the rsID could help disambiguate the mapping. This needs to be resolved.
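The disambiguation described above could look roughly like this. A sketch under my own assumptions about the record layout; `resolve_variant` and its field names are hypothetical, not code from this PR:

```python
def resolve_variant(chrom, pos, rsid, gnomad_variants):
    """Pick the GnomAD variant a GWAS Catalog association refers to.

    Joining on (chrom, pos) alone can return several variants with
    different alleles; when that happens, the GWAS Catalog rsID is
    used as a tie-breaker. Returns None when the mapping remains
    ambiguous. All names here are illustrative assumptions.
    """
    candidates = [v for v in gnomad_variants
                  if v['chrom'] == chrom and v['pos'] == pos]
    if len(candidates) == 1:
        return candidates[0]
    matched = [v for v in candidates if v.get('rsid') == rsid]
    return matched[0] if len(matched) == 1 else None
```

In the Spark pipeline this would correspond to a join on (chrom, pos) followed by a filtered fallback join that also matches the rsID.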

.withColumn('sample_size', F.regexp_extract(F.regexp_replace(F.col('samples'), ',', ''), r'[0-9]+', 0).cast(T.IntegerType()))

# Extracting number of cases:
.withColumn('n_cases', F.when(F.col('samples').contains('cases'), F.col('sample_size')).otherwise(F.lit(0)))
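For reference, the two Spark expressions above can be mirrored in plain Python. This sketch (function names are mine, not the PR's) shows the intended behaviour: strip thousands separators, extract the first run of digits, and count it as cases only when the free-text field mentions "cases":

```python
import re

def extract_sample_size(samples: str) -> int:
    """Plain-Python mirror of the 'sample_size' Spark expression:
    remove thousands separators, then take the first run of digits.
    Returns 0 when no number is found (a hypothetical fallback)."""
    cleaned = samples.replace(',', '')
    match = re.search(r'[0-9]+', cleaned)
    return int(match.group(0)) if match else 0

def n_cases(samples: str) -> int:
    """Mirror of the 'n_cases' column: the parsed count is treated as
    cases only when the free-text field contains 'cases'."""
    return extract_sample_size(samples) if 'cases' in samples else 0
```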
Contributor
If you set this to 0 you might also need to update the lines attached below.

Basically, if n_cases is not null, it computes the effect as log(beta), assuming the reported value is an odds ratio.

study['case_prop'] = study['n_cases'] / study['n_initial']

'oddsr_ci_upper']] = top_loci.apply(extract_effect_sizes, axis=1).apply(pd.Series)

is_beta = pd.isnull(row['case_prop'])

Contributor Author

You are right. So you mean that when the cases are non-zero but the controls are zero, I should resolve the cases to zero? (I think that would be conceptually correct.) However, the code that processes these values should be ready to handle such edge cases.

Contributor

Yes, I think when one of them is 0 the other should be too.

After this we should probably also change the "is_beta = pd.isnull(row['case_prop'])" logic?!

I can think of a way to handle that.
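The rule being discussed could be sketched like this; it is my own illustration of the proposal, not code from the PR, and it replaces the pandas `pd.isnull` check with a plain-Python equivalent:

```python
def case_proportion(n_cases, n_initial):
    """Sketch of the proposed rule: if either count is missing or 0,
    treat both as unusable and return None instead of dividing, so
    downstream code falls back to the beta interpretation."""
    if not n_cases or not n_initial:
        return None
    return n_cases / n_initial

def is_beta(case_prop):
    # Plain-Python stand-in for `pd.isnull(row['case_prop'])`:
    # the effect is a plain beta (not an odds ratio) when there is
    # no usable case proportion.
    return case_prop is None
```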

