Skip to content

feat(RDF): expose access rights and rights in RDF#3667

Open
maudetes wants to merge 18 commits intofeat/harvest-and-expose-access-rightsfrom
feat/expose-access-rights
Open

feat(RDF): expose access rights and rights in RDF#3667
maudetes wants to merge 18 commits intofeat/harvest-and-expose-access-rightsfrom
feat/expose-access-rights

Conversation

@maudetes
Copy link
Contributor

@maudetes maudetes commented Feb 13, 2026

Related to datagouv/data.gouv.fr#1942

Builds on #3666

Expose access rights and rights in DCAT following ERPD guidelines.

All those information are exposed at the distribution and dataset level, since everything is for now consolidated at the dataset level.

In my understanding, there are conflicting visions (ex: between DCAT guidelines, DCAT-AP guidelines, GeoDCAT-AP cardinalities and ERPD guidelines) regarding exposition of legal information at the dataset level or not.
As explained, we only expose consolidated information, thus we don't fall on the pitfall of having conflicting legal information at different levels.

@maudetes maudetes requested a review from streino February 13, 2026 18:27
DCT.description,
Literal(InspireLimitationCategory(dataset.access_type_reason_category).definition),
)
yield node, DCT.rights
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not DCT.accessRights here? https://inspire.ec.europa.eu/metadata-codelist/LimitationsOnPublicAccess is only about access.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my first attempt, but I think DCT.accessRights is only 0..1 and we already have the one exposed based on dataset.access_type?

Copy link
Contributor

@streino streino Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right...

IIRC the GeoDCAT-AP spec, we should then put dataset.access_type_reason_category in DCT.accessRights instead of dataset.access_type (when available):

https://semiceu.github.io/GeoDCAT-AP/releases/3.1.0/#Dataset.accessrights

For INSPIRE metadata, this property SHOULD be used with the URIs of the "Limitations on public access" code list operated by the INSPIRE Registry [INSPIRE-LPA].

But then I don't know how compatible it is with DCAT, which lists dct:accessRights among "a number of properties are listed with controlled vocabularies that MUST be used for the listed properties", but follows with "implementers are encouraged to publish and to use further region or domain-specific vocabularies that are available online". The fact that GeoDCAT-AP says that the "considerations for expressing legal information follow the guidelines and approach of DCAT-AP" (and therefore I assume don't depart from it) suggests it might be OK to use a more precise vocabulary when available?

If the main target is "pure DCAT", I guess we can't do much more than what you do now. Question is: are we targetting DCAT or GeoDCAT-AP?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're targeting GeoDCAT-AP in the end. However, it should be compatible with DCAT-AP and this is not the case here in my understanding. We can raise this point in GeoDCAT-AP discussions.
In the meantime, we do need to use the EU publication office vocabulary for the European Register for Protected Data (ERPD) (guidelines).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making a note of this for the GeoDCAT-AP pilot discussions.

The difficult choices will be if/when DCAT (or DCAT-AP) and GeoDCAT-AP differ. But let's hope it won't happen too often.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if frequency := frequency_to_rdf(dataset.frequency):
d.set(DCT.accrualPeriodicity, frequency)

for access_right, predicate in rights_to_rdf(dataset, graph):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you mention in #3667 (comment) it's not clear what should go at the dataset level, so I would suggest adding only what draws consensus, to avoid having to remove stuff later?

In addition to the sources you cited, there is an interesting discussion in SEMICeu/GeoDCAT-AP#156, especially the conclusion from SEMICeu/GeoDCAT-AP#156 (comment):

Conclusion (DCAT-AP-VL / Metadata Vlaanderen advice)

For stability and cross-border interoperability, we recommend sticking to the current hierarchy:

Dataset → dct:accessRights (PUBLIC / NON_PUBLIC) for access classification.
Distribution → dct:license + dct:rights (and always an dcat:accessURL).
DataService → dct:license (+ in DCAT-AP-VL also dct:rights) and (in DCAT-AP-VL) dct:accessRights.
Offline/restricted/no-download → model a proxy Distribution via dcat:accessURL and attach rights/licence there.

I would therefore restrain from adding anything other than actual accessRights at the Dataset level. rights_to_rdf in this PR might output license-title-as-rights, which would be inconsistent with having the license only at the resource level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried splitting the legal information accordingly in 5d0bb15

node.set(RDF.type, DCT.RightsStatement)
yield node, DCT.accessRights
if dataset.license and dataset.license.id != DEFAULT_LICENSE["id"]:
yield Literal(dataset.license.title), DCT.rights
Copy link
Contributor

@streino streino Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? Seems the URL in DCT.licence should be enough?
Or could this be added as a description of the licence node, like you do with INSPIRE rights below?

I'm worried about having two different independent nodes providing the same information. We have similar cases in ISO-19139 with {access,use}Constraints and otherConstraints and it's a pain to reconcile.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept the existing logic from previous resource_to_rdf implementation with a r.add(DCT.rights, Literal(dataset.license.title)).
I think it is useful for licenses without URLs? For example License Not Specified (there are others but, I guess they could have a URL?).
In any case, I think we shouldn't add a DCT.license if we are in the case of not-specified.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the new logic in 5d0bb15

Copy link
Contributor

@streino streino Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, in the case of not-specified we don't want a DCT.licence. Are we OK too for all the other entries without URIs in https://www.data.gouv.fr/api/1/datasets/licenses/?

My worry is incorrectly outputting a "true" license with only a label but no URI. In which case I think we should still be outputting it as a DCT.license instead of a DCT.rights. But I'm not sure this case can exist?

So if the definition of a license in udata is "has an URI" then we're good with the current implem. Otherwise we may have to be more careful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I kept the existing logic as I wasn't sure how to improve the situation at the moment.
I can ask for a juridical advise on this but I believe keeping close to what exists is fair for now.

maudetes and others added 2 commits February 17, 2026 10:59
Use these at the appropriate level, ie:
- accessRights for Dataset
- license + rights for Distribution
Running `uvx ty check` raises a few real problems.
- Before 3479 errors
- After 1286 errors (mainly factory generic missing in
[cbe6dfd](cbe6dfd)
and assert200… in
[4be0519](4be0519))

---------

Co-authored-by: maudetes <maudet.estelle@gmail.com>
nicolaskempf57 and others added 10 commits February 18, 2026 14:34
Co-authored-by: maudetes <maudet.estelle@gmail.com>
The real commit is
3220c99
(adding the type logic to `field()`). Other commits are simple find and
replace the dynamic `db.*Field` to the correct direct import.

1. Add correct return type for `field()` wrapper based on the `inner`
Mongo type.
2. Converting dynamic `db.StringField` to simple Mongo `StringField` so
it will work

Allow having the correct type inside models, it's even better than
simple `db.StringField` or even `StringField` during usage. For example:

```python
class DatasetAPI(API):
    @api.doc("get_dataset")
    @api.marshal_with(dataset_fields)
    def get(self, dataset: Dataset):
        """Get a dataset given its identifier"""
        if not dataset.permissions["edit"].can():
            if dataset.private: # <- Here VSCode know that .private is a bool
                api.abort(404)
            elif dataset.deleted:
                api.abort(410, "Dataset has been deleted")
        return dataset
```

With `uvx ty check`: 
- before `Found 3479 diagnostics`
- after `Found 3212 diagnostics`

#### TODO

- [x] Convert all db.* fields
- [ ] Improve `field()` typechecking by creating an object instead of a
dict (so that we can be check inside the API generation)
- [ ] Support `require=True` to add `| None` to nullable fields in Mongo
…3678)

It prevents conflict in case of values that may impact tests in
udata.cfg.
I have some crashes in #3674
(only on Python 3.11, reproducted twice with a retry at the same point
in tests). See
https://app.circleci.com/pipelines/github/opendatateam/udata/8980/workflows/8fdff4fd-aceb-4cc9-8746-4a6667808de8?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-checks-link&utm_content=summary

This solution is also clearly faster! Win between 1 and 2 minutes on the
tests.

I've tried:
- [x] ~~Dropping the collections from `get_all_models()` instead of
droping the db (but some dynamic fake models weren't taken into
consideration)~~
- [x] ~~Dropping the collections from `__sub_classes__` of `Document`
but I still got crashes~~
- [x] Dropping the items inside collections (there was one problem with
one `Fake` model that create an index for `slug=Unique` so other `Fake`
models with the same collection name cannot insert two document with
`slug: null`.)
@streino
Copy link
Contributor

streino commented Feb 23, 2026

@maudetes there's something wrong with the PR (target branch?): 120-file diff 😬

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants