-
Notifications
You must be signed in to change notification settings - Fork 0
Merging Hierarchical Data
When merging two databases, we sometimes assume the challenge is syntactic (formatting dates, cleaning strings). However, the true challenge is often semantic alignment. When both databases describe concepts using different hierarchies (taxonomies), simple "find and replace" strategies fail.
In agricultural and biological data, concepts are rarely isolated or simple. They frequently exist in a polyhierarchy (a directed acyclic graph), where a concept like "spring wheat" has multiple parents (e.g., "wheat" and "spring crops").
This guide outlines the set theory logic and the practical construction of a mapping table required to merge such data safely, whether the goal is statistical aggregation (summing values) or classification (assigning labels) to items, e.g. instances of land use.
To understand the complexity of merging these structures, it is helpful to look at a simplified problem. Real-world data is messy; the following model isolates the specific edge cases that cause data loss or double-counting errors.
graph TD
classDef dbS fill:#E27878,stroke:#888,color:#C2C2C2;
classDef dbT fill:#5E9CF2,stroke:#888,color:#C2C2C2;
classDef both fill:#B597C3,stroke:#888,color:#C2C2C2;
classDef none fill:#00000000,stroke:#888,color:#C2C2C2;
cereals(cereals):::dbS
springcrops(spring crops):::dbT
wheat(wheat):::both
rye(rye):::dbT
maize(maize):::dbT
silagemaize(silage maize):::dbS
sweetcorn(sweetcorn):::dbS
springwheat(spring wheat):::dbS
rice(rice):::dbS
soy(soy):::dbT
crops(crops):::none
autumnwheat(autumn wheat):::none
crops --> soy
crops --> springcrops
crops --> cereals
cereals --> wheat
cereals --> rye
cereals --> maize
maize --> silagemaize
maize --> sweetcorn
cereals --> rice
springcrops --> springwheat
wheat --> springwheat
wheat --> autumnwheat
subgraph Legend
direction TB
L0("unused<br>concept"):::none
L1("concept<br>in DB<sub>S</sub>"):::dbS
L2("concept<br>in DB<sub>T</sub>"):::dbT
L3("concept in<br>DB<sub>S</sub> and DB<sub>T</sub>"):::both
end
style Legend fill:none,stroke:#888,stroke-width:1px,stroke-dasharray: 5 5
springwheat ~~~ Legend
sweetcorn ~~~ Legend
In the diagram above, we visualise concepts in two databases:
- the source database (DBS, red).
- the target database (DBT, blue).
Note that these two are only called "source" and "target" because the problem here is framed that way: Our goal is to create a mapping from the concepts in DBS to the ones in DBT.
Note
Directionality matters! With hierarchical data, a mapping from DBS
The usage of a concept in both databases is colored purple. Note how the scheme used here is polyhierarchic, as "spring wheat" has two parents, "spring crops" and "wheat".
To map two databases, we must move beyond string matching and look at the extension of the concepts.
The extension is the set of all real-world objects that fall under a concept definition.
Let
Identity is the anchor point of any merge; the semantic definition is identical in both systems. The sets are equivalent.
In our toy example, this is the case for skos:exactMatch).
Practically, this means we copy the values directly from DBS to DBT.
This is also a desirable relationship for data merging, often called a "roll-up". The source is a proper subset of the target.
For example, skos:broadMatch)
This is the most dangerous relationship. The target is a subset of the source:
For example, skos:narrowMatch).
We must split the value of
This occurs when concepts share a distant ancestor but are mutually exclusive. The intersection is empty
For example,
This occurs when a concept belongs to multiple hierarchies simultaneously.
In our example, SUM(wheat) + SUM(spring crops) because
Here is the table using standard set theory notation summarizing the relations in our toy example.
| Source |
Relation | Target |
Notes | |
|---|---|---|---|---|
| Identity | wheat | wheat | Safe. Direct copy for both aggregation and classification. | |
|
Subsumption (Roll-up) |
sweetcorn, silage maize | maize |
Aggregation: Sum Classification: Valid to label sweetcorn as "maize". |
|
|
Super-ordination (Ambiguity) |
cereals | rye, maize, wheat |
Aggregation: Requires splitting by weights (e.g., Classification: Cannot assign specific label. However, the label is constrained to the set |
|
| Disjointness | rice | — |
Excluded. No valid parent in DBT. |
|
| Polyhierarchy | spring wheat | spring crops, wheat |
Seasonality axis: Valid context for seasonal reports. Biology axis: Valid context for biological reports. Classification: Valid to assign both labels ("spring crop" AND "wheat"). |
In order to not invent a custom language for above's relations, it's best to use the SKOS Simple Knowledge Organization System , the W3C standard for thesauri.
To handle the complexities of directed acyclic graphs and polyhierarchies, the mapping table should be multidimensional, which would include an "axis" column to differentiate which parent we are mapping to, respectively in which context we are mapping a concept to a parent.
In the end, we want to automatically generate a table as the one below.
| Source | SKOS relation | Target | Weight | Axis |
|---|---|---|---|---|
| wheat | skos:exactMatch |
wheat | 1.0 | default |
| sweetcorn | skos:broadMatch |
maize | 1.0 | default |
| silage maize | skos:broadMatch |
maize | 1.0 | default |
| cereals | skos:narrowMatch |
rye | 0.5 | taxonomy |
| cereals | skos:narrowMatch |
wheat | 0.3 | taxonomy |
| cereals | skos:narrowMatch |
maize | 0.2 | taxonomy |
| spring wheat | skos:broadMatch |
spring crops | 1.0 | seasonality |
| spring wheat | skos:broadMatch |
wheat | 1.0 | taxonomy |
| rice | 0.0 | default | ||
| soy | 0.0 | default |
Note that this table doesn't only show us what concepts in DBT a concept in DBS maps (or could map) to, but it also show where there are no targets or no sources that map to a concept.
Finding weights and axes could be quite difficult, but would become necessary for statistical aggregations.
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX : <https://agriculture.ld.admin.ch/crops/>
PREFIX ct: <https://agriculture.ld.admin.ch/crops/cultivationtype/>
PREFIX schema: <http://schema.org/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT
?S ?S_name ?S_id
?relation
?T ?T_name ?T_id
WHERE
{
# 1. DEFINE THE DATE FOR THE VALIDITY CHECK
# We can use NOW() for the current date, or set a fixed date,
# e.g., BIND("2024-01-01"^^xsd:date AS ?reference_date)
BIND(xsd:date(NOW()) AS ?reference_date)
# 2. WRAP THE ORIGINAL QUERY IN A SUBQUERY
# This lets us apply the final filter to the entire result set,
# avoiding repetition.
{
SELECT * WHERE
{
{
# 1. FINDING A MATCH FOR EVERY SOURCE CONCEPT
# In this first block, we start out from the
# source concepts and try to find a matching
# target concept, describing the relation.
{
# 1.1 Find source concepts
?S a :CultivationType ;
schema:name ?S_name ;
:partOf+ ct:14 ;
:hasMembership ?S_membership .
FILTER(LANG(?S_name) = "de")
?S_membership schema:name "AGIS" ;
schema:identifier ?S_id ;
schema:validFrom ?S_validFrom .
OPTIONAL {
?S_membership schema:validTo ?S_validTo .
}
# 1.2 Find target concepts and determine the
# source-to-target relation
OPTIONAL
{
{
# 1.2.1 Identity: The two concepts match
?S :hasMembership ?T_membership .
?T_membership schema:name "NAEBI" ;
schema:identifier ?T_id ;
schema:validFrom ?T_validFrom .
OPTIONAL {
?T_membership schema:validTo ?T_validTo .
}
BIND(skos:exactMatch AS ?relation)
BIND(?S AS ?T)
?T schema:name ?T_name .
FILTER(LANG(?T_name) = "de")
}
UNION
{
# 1.2.2 Subsumption: The source concept is a
# proper subset of the target concept
FILTER NOT EXISTS {
?S :hasMembership / schema:name "NAEBI"
}
?S :partOf+ ?T .
?T schema:name ?T_name ;
:hasMembership ?T_membership .
?T_membership schema:name "NAEBI" ;
schema:identifier ?T_id ;
schema:validFrom ?T_validFrom .
OPTIONAL {
?T_membership schema:validTo ?T_validTo .
}
FILTER(LANG(?T_name)="de")
BIND(skos:broadMatch AS ?relation)
}
UNION
{
# 1.2.3 Super-ordination: The source concept is
# a superset of the target concept
FILTER NOT EXISTS {
?S :hasMembership/schema:name "NAEBI"
}
?T :partOf+ ?S ;
schema:name ?T_name ;
:hasMembership ?T_membership .
FILTER(LANG(?T_name) = "de")
?T_membership schema:name "NAEBI" ;
schema:identifier ?T_id ;
schema:validFrom ?T_validFrom .
OPTIONAL {
?T_membership schema:validTo ?T_validTo .
}
BIND(skos:narrowMatch AS ?relation)
}
}
}
}
UNION
{
# 2. RIGHT-SIDE ORPHANS
# Because there are still some concepts in the
# target database to which no source concept leads,
# we collect them separately and join the previously
# unknown ones to our table.
{
# 2.1 Start with ALL Potential Targets
?T a :CultivationType ;
schema:name ?T_name ;
:partOf+ ct:14 ;
:hasMembership ?T_membership .
?T_membership schema:name "NAEBI" ;
schema:identifier ?T_id ;
schema:validFrom ?T_validFrom .
OPTIONAL {
?T_membership schema:validTo ?T_validTo .
}
FILTER(LANG(?T_name) = "de")
# 2.2 Subtract the "Used" Targets using a Sub-Query
MINUS
{
SELECT DISTINCT ?T
WHERE
{
?S_Inner a :CultivationType ;
:partOf+ ct:14 ;
:hasMembership [ schema:name "AGIS" ] .
{
# 2.2.1 Identity
?S_Inner :hasMembership [ schema:name "NAEBI" ] .
BIND(?S_Inner AS ?T)
}
UNION
{
# 2.2.2 Broad
?S_Inner :partOf+ ?T .
?T :hasMembership [ schema:name "NAEBI" ] .
}
UNION
{
# 2.2.3 Narrow
?T :partOf+ ?S_Inner .
?T :hasMembership [ schema:name "NAEBI" ] .
}
}
}
}
}
}
}
# 3. FILTER THE RESULTS BASED ON THE REFERENCE DATE
# A concept is valid if validFrom <= date AND (validTo is not
# set OR validTo >= date)
# 3.1 Filter for the validity of the source concept ?S (if it exists)
FILTER(
!BOUND(?S) ||
(
?S_validFrom <= ?reference_date &&
(!BOUND(?S_validTo) || ?S_validTo >= ?reference_date)
)
)
# 3.2 Filter for the validity of the target concept ?T (if it exists)
FILTER(
!BOUND(?T) ||
(
?T_validFrom <= ?reference_date &&
(!BOUND(?T_validTo) || ?T_validTo >= ?reference_date)
)
)
}
ORDER BY ?relation ?S ?TRDF master and reference data about crops