Suggestion: Implement html_to_docbook() function in EML package #353

@sannegovaert

Overview

We have developed the html_to_docbook() function in the R package movepub, which converts HTML-formatted text to DocBook markup. We propose to implement this functionality directly within the EML package.

Rationale

Movepub streamlines the publication of animal tracking data from Movebank to the Global Biodiversity Information Facility (GBIF). When converting Movebank metadata to EML with movepub::write_eml(), we aim to support rich text formatting (e.g., bold, hyperlinks) in dataset descriptions.

The only consistent way to provide rich text that:

  • passes EML validation,
  • is accepted by the Integrated Publishing Toolkit (IPT), and
  • displays correctly on GBIF.org

is to follow the EML specification for <para> elements, using a subset of DocBook syntax.

While EML::set_TextType() addresses this partially, it is difficult to use and only works for external files, not for inline text strings.

See related discussion and evaluation of alternatives in movepub issue #101.

Proposed Solution

After reviewing several options, we implemented a custom converter for movepub that transforms HTML syntax into DocBook, splitting paragraphs and headers into separate elements.

Benefits of integrating this into EML:

  • It is a better fit for EML than for movepub, because it is specifically designed for EML <para> elements.
  • It lets more users prepare EML with rich text descriptions using familiar HTML syntax.
  • It ensures interoperability with IPT and GBIF.org.
  • It reduces duplicated effort across EML-related packages.

Implementation reference:

Reproducible example

library(movepub)
html <- "This is <b>bold</b>.\nParagraph 1\n\nParagraph 2<p></p>What follows is a list: <ul><li>Item 1</li><li>Item 2</li></ul>"
html_to_docbook(html)
#> [1] "This is <emphasis>bold</emphasis>."                                                                                                   
#> [2] "Paragraph 1"                                                                                                                          
#> [3] "Paragraph 2"                                                                                                                          
#> [4] "What follows is a list: <itemizedlist><listitem><para>Item 1</para></listitem><listitem><para>Item 2</para></listitem></itemizedlist>"

Created on 2025-10-15 with reprex v2.1.1
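
As a rough illustration of how this output might be consumed, the sketch below wraps each returned string in a <para> element, assuming that one vector element corresponds to one paragraph. The helper wrap_in_para() is hypothetical and is not part of movepub or EML; it uses base R only, and the input is copied by hand from the output above rather than produced by calling html_to_docbook().

```r
# Hypothetical helper (not part of movepub or EML): wrap each DocBook
# paragraph string in a <para> element and join them into one XML fragment.
wrap_in_para <- function(paragraphs) {
  paste0("<para>", paragraphs, "</para>", collapse = "\n")
}

# Sample input copied from elements 1 and 3 of the reprex output above.
docbook <- c(
  "This is <emphasis>bold</emphasis>.",
  "Paragraph 2"
)

abstract_xml <- wrap_in_para(docbook)
cat(abstract_xml, "\n")
```

If the converter were integrated into EML, this wrapping step would presumably happen inside the package, so that users only pass an HTML string.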

Next Steps

  • Would the maintainers be receptive to integrating this converter?
  • We can provide a PR or collaborate on adapting the code to EML’s conventions and requirements.
