Wrong interpretation of xml files analysis configuration #147

@kleag

Describe the bug
The configuration for XML file analysis allows document ids to be set either from a tag's content (its text) or from tag attributes. However, if the attributeNames list is absent, or does not contain any attribute for the document tag (even when the id actually comes from a dedicated tag), the doc id is set wrongly: it leaks into the enclosing docset tag.

To Reproduce
Let the document be:

<?xml version="1.0" ?>
<DOCSET>
<TEI>
<idno>abcdef</idno>
<p>A text</p>
</TEI>
</DOCSET>

This config should work:

    <group name="identPrpty" class="StandardDocumentPropertyType">
      <param key="storageType" value="string"/>
      <param key="cardinality" value="mandatory"/>
      <list name="elementNames">
        <item value="idno"/>
      </list>
    </group>

but it does not. We have to add an otherwise useless attributeNames list, as below:

    <group name="identPrpty" class="StandardDocumentPropertyType">
      <param key="storageType" value="string"/>
      <param key="cardinality" value="mandatory"/>
      <list name="elementNames">
        <item value="idno"/>
      </list>
      <list name="attributeNames">
        <item value="TEI id="/>
      </list>
    </group>

With the config that "works", the decoded mult file ends with:

     <properties>
        <property name="ContentId" type="int" value="0"/>
        <property name="NodeId" type="int" value="1"/>
        <property name="StructureId" type="int" value="2"/>
        <property name="offBegPrpty" type="int" value="37"/>
        <property name="offEndPrpty" type="int" value="379"/>
        <property name="encodPrpty" type="string" value="UTF8"/>
        <property name="identPrpty" type="string" value="abcdef"/>
        <property name="srcePrpty" type="string" value="…"/>
        <property name="indexDatePrpty" type="date" value="20220926"/>
      </properties>
    </node>
    <properties>
      <property name="ContentId" type="int" value="0"/>
      <property name="NodeId" type="int" value="1"/>
      <property name="StructureId" type="int" value="1"/>
      <property name="offBegPrpty" type="int" value="31"/>
      <property name="offEndPrpty" type="int" value="386"/>
      <property name="encodPrpty" type="string" value="UTF8"/>
      <property name="srcePrpty" type="string" value="…"/>
      <property name="indexDatePrpty" type="date" value="abcdef"/>
    </properties>
  </node>
</MultimediaDocuments>

while with the config that fails, we get:

     <properties>
        <property name="ContentId" type="int" value="0"/>
        <property name="NodeId" type="int" value="1"/>
        <property name="StructureId" type="int" value="2"/>
        <property name="offBegPrpty" type="int" value="37"/>
        <property name="offEndPrpty" type="int" value="72"/>
        <property name="encodPrpty" type="string" value="UTF8"/>
        <property name="identPrpty" type="string" value="abcdef"/>
        <property name="srcePrpty" type="string" value="…"/>
        <property name="indexDatePrpty" type="date" value="20220926"/>
      </properties>
    </node>
    <properties>
      <property name="ContentId" type="int" value="0"/>
      <property name="NodeId" type="int" value="1"/>
      <property name="StructureId" type="int" value="1"/>
      <property name="offBegPrpty" type="int" value="31"/>
      <property name="offEndPrpty" type="int" value="79"/>
      <property name="encodPrpty" type="string" value="UTF8"/>
      <property name="identPrpty" type="string" value="abcdef"/>
      <property name="srcePrpty" type="string" value="…"/>
      <property name="indexDatePrpty" type="date" value="20220926"/>
    </properties>
  </node>
</MultimediaDocuments>

The identPrpty property should not be present in the last properties block, which belongs to the enclosing docset node (StructureId 1).
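
A quick way to check which nodes carry identPrpty in a decoded file is sketched below. This is only an illustration, not part of LIMA: the function name nodes_with_property and the SAMPLE string are hypothetical, and the nesting (a docset node enclosing a document node, each with its own properties child) is assumed from the tail of the output shown above. A real decoded file contains more elements and properties.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for a decoded mult file. Only the nesting
# (docset node enclosing a document node) and two properties are kept.
SAMPLE = """<MultimediaDocuments>
  <node>
    <node>
      <properties>
        <property name="StructureId" type="int" value="2"/>
        <property name="identPrpty" type="string" value="abcdef"/>
      </properties>
    </node>
    <properties>
      <property name="StructureId" type="int" value="1"/>
      <property name="identPrpty" type="string" value="abcdef"/>
    </properties>
  </node>
</MultimediaDocuments>"""


def nodes_with_property(xml_text, prop_name="identPrpty"):
    """Return (StructureId, value) for each node whose own properties hold prop_name."""
    root = ET.fromstring(xml_text)
    found = []
    # iter() walks every <node> element in document order, outer node first.
    for node in root.iter("node"):
        # find() only looks at direct children, so nested nodes are not mixed in.
        props = node.find("properties")
        if props is None:
            continue
        values = {p.get("name"): p.get("value") for p in props.findall("property")}
        if prop_name in values:
            found.append((values.get("StructureId"), values[prop_name]))
    return found


if __name__ == "__main__":
    for structure_id, value in nodes_with_property(SAMPLE):
        print(f"StructureId={structure_id} identPrpty={value}")
```

On the sample, which mimics the failing case, both the docset node (StructureId 1) and the document node (StructureId 2) are reported. With the expected behaviour, only the document node would be listed.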

@benlabbe, you may be interested in this information.
