Schema-based query planning and response merging validation

A key component of TranQL is its ability to synthesize a virtual global schema describing the shape of Translator knowledge as a graph of BioLink-Model nodes mapped to KGS endpoints. This network is used for query planning, the process of decomposing a query into its constituent sub-queries. The answers received are merged into an overall answer graph, which is ultimately returned to the caller. This infrastructure exists as a prototype with preliminary automated testing; this work will add the rigorous automated testing required of a production-quality component.
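To make the planning and merging steps concrete, the sketch below shows one way a schema graph could drive decomposition of a linear concept path into per-edge sub-queries and the merging of their answers. The schema contents, endpoint paths, function names, and answer format are illustrative assumptions, not TranQL's actual internals.

```python
# Minimal sketch of the planning/merging idea only; SCHEMA, plan_query, and
# merge_answers are illustrative names, not TranQL's actual internals.
from typing import Dict, List, Tuple

# Virtual schema: BioLink-Model concept transitions mapped to KGS endpoints
# able to answer them (endpoints shown are placeholders).
SCHEMA: Dict[Tuple[str, str], List[str]] = {
    ("chemical_substance", "gene"): ["/graph/gamma/quick"],
    ("gene", "disease"): ["/clinical/cohort/disease_to_gene"],
}

def plan_query(concept_path: List[str]) -> List[Tuple[Tuple[str, str], str]]:
    """Decompose a linear concept path into per-edge sub-queries, each routed
    to an endpoint that covers that edge of the schema graph."""
    plan = []
    for source, target in zip(concept_path, concept_path[1:]):
        endpoints = SCHEMA.get((source, target))
        if not endpoints:
            raise ValueError(f"no endpoint covers {source} -> {target}")
        plan.append(((source, target), endpoints[0]))
    return plan

def merge_answers(answers: List[dict]) -> dict:
    """Merge sub-query answers into one answer graph, de-duplicating nodes by
    identifier and accumulating edges."""
    merged = {"nodes": {}, "edges": []}
    for answer in answers:
        for node in answer.get("nodes", []):
            merged["nodes"][node["id"]] = node
        merged["edges"].extend(answer.get("edges", []))
    return merged

print(plan_query(["chemical_substance", "gene", "disease"]))
```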
Due by March 2, 2020 • 0/3 issues closed

We will improve the TranQL language's usability for building general translational applications and better position ourselves to integrate tools such as the logic of the Broad's Gene List Sharpener by making lists a first-class entity in the language. We will also facilitate queries in which, for example, an initial group of genes is associated with diseases and those diseases are, in turn, associated with a second list of genes; we need to distinguish between the two lists in both the question and the response. To accomplish these goals, we will extend the syntax of the TranQL query language to allow 1) the specification of lists of CURIEs in the where clause and 2) the use of uniquely named concepts in the select clause.
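A query using the proposed extensions might look like the string below. The surface syntax is only a sketch of the proposal, not the final grammar, and the endpoint URL and CURIEs are illustrative assumptions.

```python
# Illustrative only: the query string sketches the proposed syntax (named
# concepts in SELECT, a CURIE list in WHERE); the deployment URL is assumed.
import requests

TRANQL_URL = "https://tranql.renci.org/tranql/query"  # hypothetical endpoint

# g1 and g2 name the two gene concepts, so the response can report which
# list each returned gene belongs to; g1 is pinned to an explicit CURIE list.
query = """
SELECT gene:g1->disease->gene:g2
  FROM '/schema'
 WHERE g1 = ['HGNC:6871', 'HGNC:7029']
"""

response = requests.post(
    TRANQL_URL,
    data=query,
    headers={"Content-Type": "text/plain"},
)
answer_graph = response.json()
print(answer_graph.keys())
```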
Due by March 2, 2020 • 2/4 issues closed

Implement a unified Spark backend pipeline, based on the FHIR PIT pipeline tool, for querying the knowledge graph, ETL of clinical, environmental, and socio-economic data, machine learning model training, and model serving. The unified Spark pipeline lets us provide an end-to-end workflow:
1. Start from raw clinical, environmental, and socio-economic data, together with curated knowledge graph data.
2. Extract features for model training.
3. Serve model output via a query.

The machine learning model will provide a way to transform relational tables into n-ary predicates by learning a function estimator of the joint distribution of the rows in the table, which can be used to incorporate data from ICEES and other sources such as COHD and clinical profiles. This in turn allows us to encode contextual information about our knowledge, such as cohort definitions, in a uniform manner. For example, the ICEES and COHD KGS APIs currently depend on the ad hoc "query_options" field for contextual information such as cohort definition and cohort selection. Even though the query graph and the generated knowledge graph are interoperable between the two services, TranQL has to include service-specific code to handle "query_options", which is not interoperable. With n-ary predicates, we can generalize the representation of contextual information and handle it uniformly in TranQL. For example, to say that for patients of age A, features B and C are associated, we can denote this relation as P(A, B, C), since age is already a feature. This approach generalizes to other clinical data sets for which patient- or visit-level data is available.
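As a sketch of how the n-ary predicate view removes the need for out-of-band context, the example below contrasts a "query_options"-style message with a relation in which the cohort's age bin is just another argument of the predicate. The field names and values are illustrative placeholders, not the actual ICEES or COHD schemas.

```python
# Illustrative sketch: field names and values do not reflect the real ICEES
# or COHD schemas.
from typing import NamedTuple

# Today, cohort context rides along in an ad hoc query_options blob that each
# service interprets in its own way:
legacy_message = {
    "query_graph": {"nodes": [], "edges": []},
    "query_options": {"cohort": {"AgeStudyStart": ">=65"}},
}

class Association(NamedTuple):
    """P(A, B, C): the age bin A is an ordinary argument of the predicate
    rather than out-of-band context."""
    age_bin: str
    feature_b: str
    feature_c: str
    p_value: float

# The same knowledge expressed as rows of an n-ary relation, which TranQL can
# handle uniformly regardless of which service produced them (dummy values):
rows = [
    Association(">=65", "TotalEDInpatientVisits", "AvgDailyPM2.5Exposure", 0.01),
    Association("<18", "TotalEDInpatientVisits", "AvgDailyOzoneExposure", 0.05),
]

for row in rows:
    print(f"P({row.age_bin}, {row.feature_b}, {row.feature_c}) p={row.p_value}")
```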
Due by March 2, 2020 • 0/3 issues closed

TranQL will demonstrate a prototype in-memory query capability built entirely with open source software operating over Translator BioLink-Model compliant graphs. This design will serve as the basis for a horizontally scalable Translator query architecture, enabling it to execute computations over general query graphs without constraining the query to specific identifiers, as is currently necessary. Current scalability constraints result from the cost-versus-scale trade-offs entailed by our dependence on proprietary software. For the prototype we will demonstrate ingesting data generated by the KGX data transformation toolkit. We will run Spark on the many-core architecture of the Arrival server at RENCI, which is part of our Kubernetes cluster and has 160 logical cores and terabytes of SSD storage. This work entails exporting data sets from existing BioLink-Model compliant databases, investigating approaches for importing those data sets into Spark, evaluating existing Spark-based graph query libraries, and creating customized installation procedures for installing Spark with the libraries required for a KGX data integration pipeline. This will serve as the basis of a KGS OpenAPI providing query capability to the consortium. At each step, we will compare Spark's graph query capabilities, maturity, commitment to openness, and tool ecosystem against RedisGraph and other emerging alternatives to determine the most robust course towards reliable, scalable, open source graph query.
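One possible shape for the prototype is sketched below, assuming GraphFrames turns out to be the Spark graph library selected during the evaluation and that the KGX export uses node TSVs with id/category columns and edge TSVs with subject/predicate/object columns. The file paths and category values are placeholders.

```python
# Sketch under stated assumptions: GraphFrames as the graph query library,
# KGX TSVs with 'id'/'category' node columns and 'subject'/'predicate'/'object'
# edge columns, and placeholder file paths.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("kgx-spark-prototype").getOrCreate()

# KGX node and edge TSVs exported from a BioLink-Model compliant database.
nodes = (spark.read.option("sep", "\t").option("header", True)
         .csv("/data/kgx/nodes.tsv"))              # already has an 'id' column
edges = (spark.read.option("sep", "\t").option("header", True)
         .csv("/data/kgx/edges.tsv")
         .withColumnRenamed("subject", "src")      # GraphFrames expects src/dst
         .withColumnRenamed("object", "dst"))

graph = GraphFrame(nodes, edges)

# A general query graph (chemical -> gene -> disease) with no identifiers
# pinned, expressed as a GraphFrames motif over the whole graph.
motif = graph.find("(c)-[e1]->(g); (g)-[e2]->(d)")
answers = (motif
           .filter("c.category = 'chemical_substance'")
           .filter("g.category = 'gene'")
           .filter("d.category = 'disease'"))
answers.select("c.id", "e1.predicate", "g.id", "e2.predicate", "d.id") \
       .show(10, truncate=False)
```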
Due by March 2, 2020 • 0/6 issues closed