Lutaml integration #66

Draft
andrew2net wants to merge 35 commits into main from lutaml-integration

Conversation

@andrew2net
Contributor

No description provided.

…w URI

- Changed ID formats in YAML files for CGPM, CIPM, and JCRB meetings to a more descriptive format.
- Updated XML fixture for cctf_recommendation_2009_02.xml to include a new URI element.
- Modified data_outcomes_parser.rb to use Array instead of Relaton.array when handling PDF links.
- Adjusted date parsing in article_parser_spec.rb to ensure the date is compared as a string.
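
For readers unfamiliar with the built-in conversion, here is a minimal sketch of how Ruby's `Kernel#Array` normalizes a value that may be nil, a single item, or a list. The behavior of `Relaton.array` is not shown in this conversation, so no equivalence is claimed; `pdf_links` is a hypothetical helper name used only for illustration.

```ruby
# Kernel#Array returns [] for nil, wraps a single value,
# and returns an existing array unchanged.
# NOTE: `pdf_links` is a hypothetical name, not taken from data_outcomes_parser.rb.
def pdf_links(value)
  Array(value)
end

pdf_links(nil)                  # => []
pdf_links("a.pdf")              # => ["a.pdf"]
pdf_links(["a.pdf", "b.pdf"])   # => ["a.pdf", "b.pdf"]
```

One caveat: `Array` converts a Hash into an array of key/value pairs, so it suits values that are strings, arrays, or nil.
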
Context:

GitHub issue #28 (Metrologia parsing) is mostly implemented, but two items from
the comment thread remain unresolved:

1. <back><ref-list> not parsed as relations — Some Metrologia XML files contain
   a <back><ref-list> section with bibliographic references. @ronaldtse confirmed these
   should be parsed as relations (type "cites").

2. Implicit deduplication — The same article can appear in multiple date-stamped archives.
   Currently, Dir glob ordering implicitly overwrites older copies with newer ones, but
   this is fragile and not guaranteed. @ronaldtse said to take the newest copy based on
   the archive date in the folder name.
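
The date-based selection described in item 2 could be sketched as follows. This is an illustration under stated assumptions, not the fetcher's actual code: the folder layout (`<root>/<YYYY-MM-DD>/<file>`), the file names, and the helper names `archive_date` and `newest_copies` are all hypothetical.

```ruby
require "date"

# ASSUMPTION: the archive folder name contains an ISO date (YYYY-MM-DD).
# The real folder naming scheme is not shown in this conversation.
ARCHIVE_DATE = /(\d{4}-\d{2}-\d{2})/

# Returns the Date embedded in the parent folder name, or nil if there is none.
def archive_date(path)
  folder = File.basename(File.dirname(path))
  match = ARCHIVE_DATE.match(folder)
  match && Date.iso8601(match[1])
end

# Groups paths by file name and keeps, for each name, the copy whose
# archive folder carries the latest date. Undated paths are skipped.
def newest_copies(paths)
  dated = paths.filter_map do |path|
    date = archive_date(path)
    [path, date] if date
  end
  dated.group_by { |path, _| File.basename(path) }
       .transform_values { |copies| copies.max_by { |_, date| date }.first }
end

paths = [
  "data/2023-06-01/met_46_1_1.xml",
  "data/2022-01-15/met_46_1_1.xml",
  "data/2022-01-15/met_46_1_2.xml",
]
newest_copies(paths)
```

Because the winner is chosen by comparing parsed dates, the result does not depend on the order in which Dir glob happens to return the paths.
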

Changes:

1. Parse <back><ref-list> as "cites" relations
2. Explicit date-based deduplication in fetcher
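
As a rough illustration of change 1, the sketch below walks a `<back><ref-list>` section and emits one "cites" relation per `<ref>`. Several things here are assumptions: the sample XML, the `mixed-citation` child element, the `Relation` struct, and the `cites_relations` name are hypothetical, and REXML is used only to keep the example self-contained (the parser in this PR appears to use a different XML API).

```ruby
require "rexml/document"

# Hypothetical sample input; real Metrologia files are not shown here.
SAMPLE_XML = <<~XML
  <article>
    <back>
      <ref-list>
        <ref id="r1"><mixed-citation>Author A 2001 Title one</mixed-citation></ref>
        <ref id="r2"><mixed-citation>Author B 2005 Title two</mixed-citation></ref>
      </ref-list>
    </back>
  </article>
XML

# Stand-in for the library's relation class (an assumption for this sketch).
Relation = Struct.new(:type, :ref_id, :text, keyword_init: true)

# Builds one "cites" relation for each <ref> under <back><ref-list>.
def cites_relations(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//back/ref-list/ref").map do |ref|
    citation = ref.elements["mixed-citation"]
    Relation.new(type: "cites", ref_id: ref.attributes["id"], text: citation&.text)
  end
end

cites_relations(SAMPLE_XML).map(&:ref_id)   # => ["r1", "r2"]
```

A document without a `<back>` section simply yields an empty list, so callers need no special case.
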
- Updated contributor organization structure in multiple YAML files to include a subdivision for the committee.
- Added description fields for roles in YAML files to specify the type as "committee".
- Modified the SI brochure YAML and RXL files to reflect the new organization structure.
- Adjusted tests in data_outcomes_parser_spec.rb to validate the new structure and ensure proper parsing of committee details.
next unless from

owner = l.at("./copyright-statement").text.split(" & ").map do |c|
/(?<name>[A-z]+(?:\s[A-z]+)*)/ =~ c

Check warning

Code scanning / CodeQL

Overly permissive regular expression range Medium

Suspicious character range that is equivalent to [A-Z\[\\\]^_`a-z].

Copilot Autofix

AI 21 days ago

In general, the fix is to replace ambiguous or overly broad character ranges like A-z with explicit ranges that only cover the intended characters, such as A-Za-z. This removes the unintended extra characters between Z and a in the ASCII table.

In this specific case, on line 294 in lib/relaton/bipm/rawdata_bipm_metrologia/article_parser.rb, the regex:

/(?<name>[A-z]+(?:\s[A-z]+)*)/ =~ c

is clearly intended to match alphabetic words separated by single spaces. To keep the same semantics but avoid the overly permissive range, change both A-z instances to A-Za-z:

/(?<name>[A-Za-z]+(?:\s[A-Za-z]+)*)/ =~ c

This keeps the behavior (names composed of letters and spaces) while ensuring no stray punctuation characters are accidentally matched as part of the name. No additional methods or imports are needed; this is a local change to the regex in the parse_copyright method.
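
To make the difference concrete, this small sketch lists the six ASCII characters that sit between Z and a and checks them against both ranges. The constant names `LOOSE`, `STRICT`, and `BETWEEN` are introduced here for illustration only.

```ruby
# Anchored variants of the two character classes, for single-string checks.
LOOSE  = /\A[A-z]+\z/
STRICT = /\A[A-Za-z]+\z/

# The ASCII code points strictly between "Z" (90) and "a" (97).
BETWEEN = (("Z".ord + 1)..."a".ord).map(&:chr)
# => ["[", "\\", "]", "^", "_", "`"]

BETWEEN.all?  { |ch| LOOSE.match?(ch) }   # => true
BETWEEN.none? { |ch| STRICT.match?(ch) }  # => true

LOOSE.match?("foo_bar")    # => true  (underscore slips through)
STRICT.match?("foo_bar")   # => false
```
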

Suggested changeset 1
lib/relaton/bipm/rawdata_bipm_metrologia/article_parser.rb

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/lib/relaton/bipm/rawdata_bipm_metrologia/article_parser.rb b/lib/relaton/bipm/rawdata_bipm_metrologia/article_parser.rb
--- a/lib/relaton/bipm/rawdata_bipm_metrologia/article_parser.rb
+++ b/lib/relaton/bipm/rawdata_bipm_metrologia/article_parser.rb
@@ -291,7 +291,7 @@
           next unless from
 
           owner = l.at("./copyright-statement").text.split(" & ").map do |c|
-            /(?<name>[A-z]+(?:\s[A-z]+)*)/ =~ c
+            /(?<name>[A-Za-z]+(?:\s[A-Za-z]+)*)/ =~ c
             org_name = Relaton::Bib::TypedLocalizedString.new(content: name, language: "en", script: "Latn")
             org = Relaton::Bib::Organization.new name: [org_name]
             Relaton::Bib::ContributionInfo.new(organization: org)
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
next unless from

owner = l.at("./copyright-statement").text.split(" & ").map do |c|
/(?<name>[A-z]+(?:\s[A-z]+)*)/ =~ c

Check warning

Code scanning / CodeQL

Overly permissive regular expression range Medium

Suspicious character range that is equivalent to [A-Z\[\\\]^_`a-z].

Copilot Autofix

AI 21 days ago

In general, overly permissive ranges like [A-z] should be replaced with explicit ranges that include only the desired characters, most commonly [A-Za-z] for ASCII letters. This avoids unintentionally matching punctuation between Z and a in the ASCII table.

Here, the problematic line is in parse_copyright in lib/relaton/bipm/rawdata_bipm_metrologia/article_parser.rb:

/(?<name>[A-z]+(?:\s[A-z]+)*)/ =~ c

This appears to intend to capture a person's or organization's name made of alphabetic words separated by single spaces. To fix the issue without changing higher-level functionality, we should tighten both occurrences of [A-z] to [A-Za-z]. The rest of the pattern (+, (?:\s...)*, and the named capture) can remain unchanged. No new imports or methods are required; we are only adjusting the regex literal.

Concretely, in that file, update line 294 so that [A-z] is replaced with [A-Za-z] in both locations within the pattern.
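
The tightened pattern can be tried in isolation with a sketch like the one below. It mirrors the split-and-match shape of the quoted code but returns plain strings instead of Relaton objects; the method name `owner_names` and the sample statement are hypothetical.

```ruby
# Extracts owner names from a copyright statement such as
# "2009 BIPM & IOP Publishing Ltd" (a made-up sample).
def owner_names(statement)
  statement.split(" & ").map do |part|
    # A regex literal on the left of =~ assigns the named capture to a
    # local variable; `name` is nil when the pattern does not match.
    /(?<name>[A-Za-z]+(?:\s[A-Za-z]+)*)/ =~ part
    name
  end.compact
end

owner_names("2009 BIPM & IOP Publishing Ltd")   # => ["BIPM", "IOP Publishing Ltd"]
owner_names("2009")                             # => []
```

Note that the original code passes `name` on without a nil check, so a part with no letters would produce an organization with a nil name; the `compact` call here is an addition of this sketch.
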

Suggested changeset 1
lib/relaton/bipm/rawdata_bipm_metrologia/article_parser.rb

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/lib/relaton/bipm/rawdata_bipm_metrologia/article_parser.rb b/lib/relaton/bipm/rawdata_bipm_metrologia/article_parser.rb
--- a/lib/relaton/bipm/rawdata_bipm_metrologia/article_parser.rb
+++ b/lib/relaton/bipm/rawdata_bipm_metrologia/article_parser.rb
@@ -291,7 +291,7 @@
           next unless from
 
           owner = l.at("./copyright-statement").text.split(" & ").map do |c|
-            /(?<name>[A-z]+(?:\s[A-z]+)*)/ =~ c
+            /(?<name>[A-Za-z]+(?:\s[A-Za-z]+)*)/ =~ c
             org_name = Relaton::Bib::TypedLocalizedString.new(content: name, language: "en", script: "Latn")
             org = Relaton::Bib::Organization.new name: [org_name]
             Relaton::Bib::ContributionInfo.new(organization: org)
EOF

2 participants