Merging Hierarchical Data

Introduction

When merging two databases, we sometimes assume the challenge is syntactic (formatting dates, cleaning strings). However, the true challenge is often semantic alignment. When both databases describe concepts using different hierarchies (taxonomies), simple "find and replace" strategies fail.

In agricultural and biological data, concepts are rarely isolated or simple. They frequently exist in a polyhierarchy (a directed acyclic graph), where a concept like "spring wheat" has multiple parents (e.g., "wheat" and "spring crops").

This guide outlines the set theory logic and the practical construction of a mapping table required to merge such data safely, whether the goal is statistical aggregation (summing values) or classification (assigning labels) to items, e.g. instances of land use.

A toy polyhierarchy

To understand the complexity of merging these structures, it is helpful to look at a simplified problem. Real-world data is messy; the following model isolates the specific edge cases that cause data loss or double-counting errors.

graph TD

    classDef dbS fill:#E27878,stroke:#888,color:#C2C2C2;
    classDef dbT fill:#5E9CF2,stroke:#888,color:#C2C2C2;
    classDef both fill:#B597C3,stroke:#888,color:#C2C2C2;
    classDef none fill:#00000000,stroke:#888,color:#C2C2C2;

    cereals(cereals):::dbS
    springcrops(spring crops):::dbT
    wheat(wheat):::both
    rye(rye):::dbT
    maize(maize):::dbT
    silagemaize(silage maize):::dbS
    sweetcorn(sweetcorn):::dbS
    springwheat(spring wheat):::dbS
    rice(rice):::dbS
    soy(soy):::dbT
    crops(crops):::none
    autumnwheat(autumn wheat):::none

    crops --> soy
    crops --> springcrops
    crops --> cereals
    cereals --> wheat
    cereals --> rye
    cereals --> maize
    maize --> silagemaize
    maize --> sweetcorn
    cereals --> rice
    springcrops --> springwheat
    wheat --> springwheat
    wheat --> autumnwheat

    subgraph Legend
        direction TB
        L0("unused<br>concept"):::none
        L1("concept<br>in DB<sub>S</sub>"):::dbS
        L2("concept<br>in DB<sub>T</sub>"):::dbT
        L3("concept in<br>DB<sub>S</sub> and DB<sub>T</sub>"):::both
    end

    style Legend fill:none,stroke:#888,stroke-width:1px,stroke-dasharray: 5 5
    springwheat ~~~ Legend
    sweetcorn ~~~ Legend

In the diagram above, we visualise concepts in two databases:

the source database (DB_S, red).
the target database (DB_T, blue).

Note that these two are only called "source" and "target" because the problem here is framed that way: Our goal is to create a mapping from the concepts in DB_S to the ones in DB_T.

Note

Directionality matters! With hierarchical data, a mapping from DB_S $\to$ DB_T is not the mathematical inverse of DB_T $\to$ DB_S. A "roll-up" in one direction becomes a "split" in the other.

The usage of a concept in both databases is colored purple. Note how the scheme used here is polyhierarchic, as "spring wheat" has two parents, "spring crops" and "wheat".

Theoretical framework

To map two databases, we must move beyond string matching and look at the extension of the concepts. The extension is the set of all real-world objects that fall under a concept definition. Let $S$ be a concept in our source database (DB_S) and let $T$ be a concept in our target database (DB_T). Here, we are evaluating the set-theoretical relationship between the data points contained in $S$ and $T$.

Identity

Identity is the anchor point of any merge; the semantic definition is identical in both systems. The sets are equivalent.

$$ S = T $$

In our toy example, this is the case for $S = T =$ "wheat", which is also implied by the purple coloring. Since both databases define the crop exactly the same way, we can make a direct translation (skos:exactMatch). Practically, this means we copy the values directly from DB_S to DB_T.

Subsumption

This is also a desirable relationship for data merging, often called a "roll-up". The source is a proper subset of the target.

$$ S \subset T $$

For example, $S=$ "silage maize" (DB_S), $T=$ "maize" (DB_T). This means that every instance of silage maize is necessarily maize, but not vice versa. We can safely aggregate (skos:broadMatch) $S$ into $T$. Doing this, we lose granularity (specificity), but maintain data accuracy.

Super-ordination

This is the most dangerous relationship. The target is a subset of the source:

$$ T \subset S $$

For example, $S=$ "cereals" (DB_S), $T=$ "rye" (DB_T). The source ($S$) contains rye, but it also contains wheat, rice, and maize. Rice doesn't exist in DB_T, but we're still left with the possibility of wheat, rye or maize. We cannot map $S \to T$ without inventing false data. Hence, we need probabilistic disaggregation (skos:narrowMatch). We must split the value of $S$ across multiple targets ($T_1, T_2...$). Either we use a heuristic weight (e.g., "rye represents 15% of cereals in this region"), or in the case of labelling items, we can at least know that the label must be constraint to a smaller subset.

Disjointness

This occurs when concepts share a distant ancestor but are mutually exclusive. The intersection is empty

$$ S \cap T = \emptyset $$

For example, $S=$ "rice" (DB_S) and $T=$ "rye" (DB_T) are completely disjoint, meaning that if an element is in $S$, it is guaranteed to not be in $T$. Even though both are cereals, no rice is oats. In our example, we'd have no possible target for "rice".

Polyhierarchy

This occurs when a concept belongs to multiple hierarchies simultaneously.

$$ S \subset (T_1 \cap T_2) $$

In our example, $S=$ "spring wheat", $T_1=$ "wheat" (biological taxonomy), and $T_2=$ "spring crops" (seasonality). To deal with this, we need faceted mapping. We cannot use a simple lookup. We must introduce an axis column to represent context information. If we were to sum $S$ into both $T_1$ and $T_2$, we create accurate subtotals for specific reports. We cannot calculate SUM(wheat) + SUM(spring crops) because $S$ would be double-counted. However, for the purpose of labellig/multi-class classification, we can safely assign both $T_1$ and $T_2$.

Summary table

Here is the table using standard set theory notation summarizing the relations in our toy example.

	Source $S$	Relation	Target $T$	Notes
Identity	wheat	$=$	wheat	Safe. Direct copy for both aggregation and classification.
Subsumption (Roll-up)	sweetcorn, silage maize	$\subset$	maize	Aggregation: Sum $S$ into $T$. Loss of specificity. Classification: Valid to label sweetcorn as "maize".
Super-ordination (Ambiguity)	cereals	$\supset$	rye, maize, wheat	Aggregation: Requires splitting by weights (e.g., $0.5, 0.3, 0.2$). Classification: Cannot assign specific label. However, the label is constrained to the set ${\texttt{rye}, \texttt{maize}, \texttt{wheat}}$.
Disjointness	rice	$\cap = \emptyset$	—	Excluded. No valid parent in DB_T. $S \cap T = \emptyset$. Data is dropped or moved to some generic "other".
Polyhierarchy	spring wheat	$\subset$	spring crops, wheat	Seasonality axis: Valid context for seasonal reports. Biology axis: Valid context for biological reports. Classification: Valid to assign both labels ("spring crop" AND "wheat").

The practical implementation

In order to not invent a custom language for above's relations, it's best to use the SKOS Simple Knowledge Organization System , the W3C standard for thesauri.

To handle the complexities of directed acyclic graphs and polyhierarchies, the mapping table should be multidimensional, which would include an "axis" column to differentiate which parent we are mapping to, respectively in which context we are mapping a concept to a parent.

In the end, we want to automatically generate a table as the one below.

Source	SKOS relation	Target	Weight	Axis
wheat	`skos:exactMatch`	wheat	1.0	default
sweetcorn	`skos:broadMatch`	maize	1.0	default
silage maize	`skos:broadMatch`	maize	1.0	default
cereals	`skos:narrowMatch`	rye	0.5	taxonomy
cereals	`skos:narrowMatch`	wheat	0.3	taxonomy
cereals	`skos:narrowMatch`	maize	0.2	taxonomy
spring wheat	`skos:broadMatch`	spring crops	1.0	seasonality
spring wheat	`skos:broadMatch`	wheat	1.0	taxonomy
rice			0.0	default
		soy	0.0	default

Note that this table doesn't only show us what concepts in DB_T a concept in DB_S maps (or could map) to, but it also show where there are no targets or no sources that map to a concept.

Finding weights and axes could be quite difficult, but would become necessary for statistical aggregations.

Reproducing these steps in an executable query

Link to the query execution.

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX : <https://agriculture.ld.admin.ch/crops/>
PREFIX ct: <https://agriculture.ld.admin.ch/crops/cultivationtype/>
PREFIX schema: <http://schema.org/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT DISTINCT
   ?S ?S_name ?S_id
   ?relation
   ?T ?T_name ?T_id

WHERE
{
   # 1. DEFINE THE DATE FOR THE VALIDITY CHECK
   #    We can use NOW() for the current date, or set a fixed date, 
   #    e.g., BIND("2024-01-01"^^xsd:date AS ?reference_date)
   BIND(xsd:date(NOW()) AS ?reference_date)

   # 2. WRAP THE ORIGINAL QUERY IN A SUBQUERY
   #    This lets us apply the final filter to the entire result set,
   #    avoiding repetition.
   {
      SELECT * WHERE
      {
         {
            # 1. FINDING A MATCH FOR EVERY SOURCE CONCEPT
            #    In this first block, we start out from the
            #    source concepts and try to find a matching
            #    target concept, describing the relation.
            {
               # 1.1 Find source concepts
            ?S a :CultivationType ;
               schema:name ?S_name ;
               :partOf+ ct:14 ;
                  :hasMembership ?S_membership .
               FILTER(LANG(?S_name) = "de")
               ?S_membership schema:name "AGIS" ;
                  schema:identifier ?S_id ;
                  schema:validFrom ?S_validFrom .
               OPTIONAL {
                  ?S_membership schema:validTo ?S_validTo .
               }

               # 1.2 Find target concepts and determine the
               #     source-to-target relation
               OPTIONAL
               {
                  {
                     # 1.2.1 Identity: The two concepts match
                     ?S :hasMembership ?T_membership .
                     ?T_membership schema:name "NAEBI" ;
                        schema:identifier ?T_id ;
                        schema:validFrom ?T_validFrom .
                     OPTIONAL {
                        ?T_membership schema:validTo ?T_validTo .
                     }
                     BIND(skos:exactMatch AS ?relation)
                     BIND(?S AS ?T)
                     ?T schema:name ?T_name .
                     FILTER(LANG(?T_name) = "de")
                  }

                  UNION
                  
                  {
                     # 1.2.2 Subsumption: The source concept is a
                     #       proper subset of the target concept
                     FILTER NOT EXISTS { 
                        ?S :hasMembership / schema:name "NAEBI"
                     }
                     ?S :partOf+ ?T .
                     ?T schema:name ?T_name ;
                        :hasMembership ?T_membership .
                     ?T_membership schema:name "NAEBI" ;
                        schema:identifier ?T_id ;
                        schema:validFrom ?T_validFrom .
                     OPTIONAL {
                        ?T_membership schema:validTo ?T_validTo .
                     }
                     FILTER(LANG(?T_name)="de")
                     BIND(skos:broadMatch AS ?relation)
                  }

                  UNION
                  
                  {
                     # 1.2.3 Super-ordination: The source concept is
                     #       a superset of the target concept
                     FILTER NOT EXISTS {
                        ?S :hasMembership/schema:name "NAEBI"
                     }
                     ?T :partOf+ ?S ;
                        schema:name ?T_name ;
                        :hasMembership ?T_membership .
                     FILTER(LANG(?T_name) = "de")
                     ?T_membership schema:name "NAEBI" ;
                        schema:identifier ?T_id ;
                        schema:validFrom ?T_validFrom .
                     OPTIONAL {
                        ?T_membership schema:validTo ?T_validTo .
                     }
                     BIND(skos:narrowMatch AS ?relation)
                  }
               }
            }
         }
         UNION
         {
            # 2. RIGHT-SIDE ORPHANS
            #    Because there are still some concepts in the
            #    target database to which no source concept leads,
            #    we collect them separately and join the previously
            #    unknown ones to our table.
            {
               # 2.1 Start with ALL Potential Targets
               ?T a :CultivationType ;
                  schema:name ?T_name ;
                  :partOf+ ct:14 ;
                  :hasMembership ?T_membership .
               ?T_membership schema:name "NAEBI" ;
                  schema:identifier ?T_id ;
                  schema:validFrom ?T_validFrom .
               OPTIONAL {
                  ?T_membership schema:validTo ?T_validTo .
               }
               FILTER(LANG(?T_name) = "de")

               # 2.2 Subtract the "Used" Targets using a Sub-Query
               MINUS
               {
                  SELECT DISTINCT ?T 
                  WHERE
                  {
                     ?S_Inner a :CultivationType ;
                        :partOf+ ct:14 ;
                        :hasMembership [ schema:name "AGIS" ] .

                     {
                        # 2.2.1 Identity
                        ?S_Inner :hasMembership [ schema:name "NAEBI" ] .
                        BIND(?S_Inner AS ?T)
                     }
                     UNION
                     {
                        # 2.2.2 Broad
                        ?S_Inner :partOf+ ?T .
                        ?T :hasMembership [ schema:name "NAEBI" ] .
                     }
                     UNION
                     {
                        # 2.2.3 Narrow
                        ?T :partOf+ ?S_Inner .
                        ?T :hasMembership [ schema:name "NAEBI" ] .
                     }
                  }
               }
            }
         }
      }
   }

   # 3. FILTER THE RESULTS BASED ON THE REFERENCE DATE
   #    A concept is valid if validFrom <= date AND (validTo is not
   #    set OR validTo >= date)

   # 3.1 Filter for the validity of the source concept ?S (if it exists)
   FILTER(
      !BOUND(?S) || 
      (
         ?S_validFrom <= ?reference_date &&
         (!BOUND(?S_validTo) || ?S_validTo >= ?reference_date)
      )
   )

   # 3.2 Filter for the validity of the target concept ?T (if it exists)
   FILTER(
      !BOUND(?T) ||
      (
         ?T_validFrom <= ?reference_date &&
         (!BOUND(?T_validTo) || ?T_validTo >= ?reference_date)
      )
   )
}

ORDER BY ?relation ?S ?T

RDF master and reference data about crops

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Merging Hierarchical Data

Introduction

A toy polyhierarchy

Theoretical framework

Identity

Subsumption

Super-ordination

Disjointness

Polyhierarchy

Summary table

The practical implementation

Reproducing these steps in an executable query

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally