feat(RDF): expose access rights and rights in RDF#3667
maudetes wants to merge 18 commits into `feat/harvest-and-expose-access-rights`
Conversation
udata/core/dataset/rdf.py
Outdated
```python
    DCT.description,
    Literal(InspireLimitationCategory(dataset.access_type_reason_category).definition),
)
yield node, DCT.rights
```
Why not DCT.accessRights here? https://inspire.ec.europa.eu/metadata-codelist/LimitationsOnPublicAccess is only about access.
That was my first attempt, but I think DCT.accessRights is only 0..1 and we already have the one exposed based on dataset.access_type?
Ah right...
IIRC from the GeoDCAT-AP spec, we should then put dataset.access_type_reason_category in DCT.accessRights instead of dataset.access_type (when available):
https://semiceu.github.io/GeoDCAT-AP/releases/3.1.0/#Dataset.accessrights
For INSPIRE metadata, this property SHOULD be used with the URIs of the "Limitations on public access" code list operated by the INSPIRE Registry [INSPIRE-LPA].
But then I don't know how compatible it is with DCAT, which lists dct:accessRights among "a number of properties are listed with controlled vocabularies that MUST be used for the listed properties", but follows with "implementers are encouraged to publish and to use further region or domain-specific vocabularies that are available online". The fact that GeoDCAT-AP says that the "considerations for expressing legal information follow the guidelines and approach of DCAT-AP" (and therefore I assume don't depart from it) suggests it might be OK to use a more precise vocabulary when available?
If the main target is "pure DCAT", I guess we can't do much more than what you do now. Question is: are we targeting DCAT or GeoDCAT-AP?
I think we're targeting GeoDCAT-AP in the end. However, it should be compatible with DCAT-AP and this is not the case here in my understanding. We can raise this point in GeoDCAT-AP discussions.
In the meantime, we do need to use the EU publication office vocabulary for the European Register for Protected Data (ERPD) (guidelines).
Making a note of this for the GeoDCAT-AP pilot discussions.
The difficult choices will be if/when DCAT (or DCAT-AP) and GeoDCAT-AP differ. But let's hope it won't happen too often.
udata/core/dataset/rdf.py
Outdated
```python
if frequency := frequency_to_rdf(dataset.frequency):
    d.set(DCT.accrualPeriodicity, frequency)

for access_right, predicate in rights_to_rdf(dataset, graph):
```
As you mention in #3667 (comment) it's not clear what should go at the dataset level, so I would suggest adding only what draws consensus, to avoid having to remove stuff later?
In addition to the sources you cited, there is an interesting discussion in SEMICeu/GeoDCAT-AP#156, especially the conclusion from SEMICeu/GeoDCAT-AP#156 (comment):
Conclusion (DCAT-AP-VL / Metadata Vlaanderen advice)
For stability and cross-border interoperability, we recommend sticking to the current hierarchy:
- Dataset → dct:accessRights (PUBLIC / NON_PUBLIC) for access classification.
- Distribution → dct:license + dct:rights (and always a dcat:accessURL).
- DataService → dct:license (+ in DCAT-AP-VL also dct:rights) and (in DCAT-AP-VL) dct:accessRights.
- Offline/restricted/no-download → model a proxy Distribution via dcat:accessURL and attach rights/licence there.
I would therefore refrain from adding anything other than actual accessRights at the Dataset level. rights_to_rdf in this PR might output the license title as rights, which would be inconsistent with having the license only at the resource level.
I've tried splitting the legal information accordingly in 5d0bb15
udata/core/dataset/rdf.py
Outdated
```python
node.set(RDF.type, DCT.RightsStatement)
yield node, DCT.accessRights
if dataset.license and dataset.license.id != DEFAULT_LICENSE["id"]:
    yield Literal(dataset.license.title), DCT.rights
```
Is this needed? It seems the URL in DCT.license should be enough?
Or could this be added as a description of the licence node, like you do with INSPIRE rights below?
I'm worried about having two different independent nodes providing the same information. We have similar cases in ISO-19139 with {access,use}Constraints and otherConstraints and it's a pain to reconcile.
I kept the existing logic from the previous resource_to_rdf implementation, with a `r.add(DCT.rights, Literal(dataset.license.title))`.
I think it is useful for licenses without URLs? For example "License Not Specified" (there are others, but I guess they could have a URL?).
In any case, I think we shouldn't add a DCT.license if we are in the case of not-specified.
Right, in the case of not-specified we don't want a DCT.license. Are we also OK with all the other entries without URIs in https://www.data.gouv.fr/api/1/datasets/licenses/?
My worry is incorrectly outputting a "true" license with only a label but no URI. In which case I think we should still be outputting it as a DCT.license instead of a DCT.rights. But I'm not sure this case can exist?
So if the definition of a license in udata is "has an URI" then we're good with the current implem. Otherwise we may have to be more careful.
That's why I kept the existing logic as I wasn't sure how to improve the situation at the moment.
I can ask for legal advice on this, but I believe keeping close to what exists is fair for now.
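A minimal sketch of the decision logic discussed above, in plain Python (`license_triples` and `DEFAULT_LICENSE_ID` are hypothetical stand-ins, not udata's helpers): emit dct:license only when a URL exists, fall back to a dct:rights literal for label-only licenses, and emit nothing for the not-specified license:

```python
# Assumption: mirrors the intent of udata's DEFAULT_LICENSE["id"]
DEFAULT_LICENSE_ID = "notspecified"


def license_triples(license_id, title, url=None):
    """Return (predicate, value) pairs describing a license."""
    if license_id == DEFAULT_LICENSE_ID:
        # neither dct:license nor dct:rights for "License Not Specified"
        return []
    if url:
        return [("dct:license", url)]
    # label-only licenses fall back to a literal dct:rights
    return [("dct:rights", title)]
```

This avoids emitting two independent nodes carrying the same information, which was the reconciliation concern raised above.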
Use these at the appropriate level, i.e.:
- accessRights for Dataset
- license + rights for Distribution
Co-authored-by: maudetes <maudet.estelle@gmail.com>
There is a lot of variability in our tests but here are some stats.

#### [main](https://app.circleci.com/pipelines/github/opendatateam/udata/8966/workflows/7cc6d4e3-1bab-48fd-ac08-4711b236e5b2?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-checks-link&utm_content=summary)
- Around 9min/9min30 for tests
- Total is 11min30 (including deploy)

#### [some branch right now](https://app.circleci.com/pipelines/github/opendatateam/udata/8963/workflows/3b176b23-e787-4738-8e4e-ed860f8ffe55?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-checks-link&utm_content=summary)
- Between 7min30 and 9min30 for tests
- Total 9min52

#### [this branch](https://app.circleci.com/pipelines/github/opendatateam/udata/8974/workflows/9696d75b-506c-4ec8-ae21-4294873ace81?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-checks-link&utm_content=summary)
- Between 5min40 and 6min45 for tests
- Total 7min16
The real commit is 3220c99 (adding the type logic to `field()`). The other commits are simple find-and-replace of the dynamic `db.*Field` to the correct direct import.

1. Add the correct return type for the `field()` wrapper based on the `inner` Mongo type.
2. Convert dynamic `db.StringField` to plain Mongo `StringField` so it will work.

This allows having the correct type inside models, which is even better than plain `db.StringField` or even `StringField` during usage. For example:

```python
class DatasetAPI(API):
    @api.doc("get_dataset")
    @api.marshal_with(dataset_fields)
    def get(self, dataset: Dataset):
        """Get a dataset given its identifier"""
        if not dataset.permissions["edit"].can():
            if dataset.private:  # <- Here VSCode knows that .private is a bool
                api.abort(404)
            elif dataset.deleted:
                api.abort(410, "Dataset has been deleted")
        return dataset
```

With `uvx ty check`:
- before: `Found 3479 diagnostics`
- after: `Found 3212 diagnostics`

#### TODO
- [x] Convert all `db.*` fields
- [ ] Improve `field()` typechecking by creating an object instead of a dict (so that it can be checked inside the API generation)
- [ ] Support `require=True` to add `| None` to nullable fields in Mongo
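The return-type idea behind that commit can be sketched like this (a simplified stand-in, not udata's actual `field()` implementation or metadata storage):

```python
from typing import TypeVar

F = TypeVar("F")


def field(inner: F, **metadata) -> F:
    # Return `inner` unchanged; the TypeVar lets the type checker see the
    # concrete Mongo field class instead of an erased wrapper type.
    inner._field_metadata = metadata  # illustrative storage only
    return inner


class StringField:
    """Stand-in for mongoengine's StringField."""


# The checker infers `name: StringField` rather than a generic wrapper
name = field(StringField(), description="The dataset name")
```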
…3678) It prevents conflicts in case of values that may impact tests in udata.cfg.
I have some crashes in #3674 (only on Python 3.11, reproduced twice with a retry at the same point in tests). See https://app.circleci.com/pipelines/github/opendatateam/udata/8980/workflows/8fdff4fd-aceb-4cc9-8746-4a6667808de8?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-checks-link&utm_content=summary

This solution is also clearly faster! It wins between 1 and 2 minutes on the tests.

I've tried:
- [x] ~~Dropping the collections from `get_all_models()` instead of dropping the db (but some dynamic fake models weren't taken into consideration)~~
- [x] ~~Dropping the collections from `__sub_classes__` of `Document`, but I still got crashes~~
- [x] Dropping the items inside collections (there was one problem with one `Fake` model that creates an index for `slug=Unique`, so other `Fake` models with the same collection name cannot insert two documents with `slug: null`)
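The retained approach, emptying each collection rather than dropping the database, can be sketched as follows (using a tiny in-memory stand-in for a pymongo-like handle so the sketch is self-contained; real code would call the same two methods on the actual driver):

```python
class FakeCollection:
    """Minimal stand-in for a pymongo collection."""

    def __init__(self):
        self.docs = []

    def delete_many(self, _filter):
        # With an empty filter, this removes every document
        self.docs.clear()


class FakeDB(dict):
    """Minimal stand-in for a pymongo database handle."""

    def list_collection_names(self):
        return list(self.keys())


def clean_collections(db):
    # Emptying collections keeps their indexes (e.g. the unique `slug`
    # index mentioned above), unlike dropping the whole database.
    for name in db.list_collection_names():
        db[name].delete_many({})
```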
This reverts commit 729eb66.
@maudetes there's something wrong with the PR (target branch?): 120-file diff 😬
Related to datagouv/data.gouv.fr#1942
Builds on #3666
Expose access rights and rights in DCAT following ERPD guidelines.
All this information is exposed at both the distribution and dataset levels, since everything is for now consolidated at the dataset level.
In my understanding, there are conflicting visions (e.g. between the DCAT guidelines, DCAT-AP guidelines, GeoDCAT-AP cardinalities and ERPD guidelines) regarding whether or not legal information should be exposed at the dataset level.
As explained, we only expose consolidated information, so we avoid the pitfall of having conflicting legal information at different levels.