Add a "delete_missing" option to CKAN harvester #542

@danielcoelhocgu

The Brazilian government has a very decentralized structure in which several entities run their own CKAN instances. We collect all data from these entities through the harvest extension.

We have quite a lot of trouble when a dataset is deleted in one of those harvested CKAN portals, because the CKAN harvester does not delete it in our CKAN. As a result, our portal keeps showing many datasets with broken links or out-of-date information.

We propose to add an option to the CKAN harvester called delete_missing (boolean type), which will check for datasets that no longer exist in the harvested CKAN portal and delete them.
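As a rough illustration of how the option could be read, here is a minimal sketch that assumes the flag lives in the harvest source's JSON configuration string. The helper name `parse_delete_missing` is hypothetical and is not part of ckanext-harvest; defaulting to `False` is also an assumption, chosen so that existing sources keep their current behavior.

```python
import json


def parse_delete_missing(config_str):
    """Return the delete_missing flag from a JSON config string.

    Hypothetical helper: assumes the option is stored as a boolean
    under the key "delete_missing" and is off when absent.
    """
    if not config_str:
        return False
    config = json.loads(config_str)
    value = config.get("delete_missing", False)
    # Reject strings such as "yes" so the option stays strictly boolean.
    if not isinstance(value, bool):
        raise ValueError("delete_missing must be boolean")
    return value
```

With this sketch, a source configured as `{"delete_missing": true}` would opt in, and a source with an empty configuration would not.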

A nearly identical request was reported in issue #396 about two years ago. The author of that issue said he wrote some custom code to solve it but never shared it, so I am opening this new issue with the aim of submitting a pull request.

My idea is to copy the same logic from the DCAT JSON harvester from ckanext-dcat:

  1. Inside the gather_stage function:
    1.2. List all dataset UIDs that were imported through the current harvest source (by querying the harvest_object table).
    1.3. List all remote CKAN datasets, then check for local UIDs that are missing from the remote CKAN list.
    1.4. Create harvest objects with a delete state for all of those missing datasets.
  2. Inside the import_stage function:
    2.1. Delete (but do not purge) all of those missing datasets.
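The steps above can be sketched in plain Python, independent of the real harvester classes. Everything here is an assumption made for illustration: the function names, the dictionaries standing in for harvest objects and datasets, and the `"status": "delete"` marker. Soft deletion is modeled by setting the dataset's `state` to `"deleted"` rather than removing the record.

```python
def find_missing_ids(local_ids, remote_ids):
    """Return local dataset IDs that no longer exist remotely."""
    return sorted(set(local_ids) - set(remote_ids))


def gather_deletions(local_ids, remote_ids):
    """Gather stage: build one delete-marked object per missing dataset."""
    return [
        {"guid": dataset_id, "status": "delete"}
        for dataset_id in find_missing_ids(local_ids, remote_ids)
    ]


def import_deletions(objects, datasets):
    """Import stage: soft-delete datasets flagged for deletion.

    `datasets` maps dataset ID to a dict with a "state" key. Records
    are kept (not purged); only their state changes.
    """
    for obj in objects:
        if obj["status"] == "delete" and obj["guid"] in datasets:
            datasets[obj["guid"]]["state"] = "deleted"
    return datasets


# Usage: "b" was harvested earlier but is gone from the remote portal.
local = {"a": {"state": "active"}, "b": {"state": "active"}}
to_delete = gather_deletions(local.keys(), ["a", "c"])
import_deletions(to_delete, local)
```

Using a set difference keeps the comparison linear in the number of datasets, which matters when a source holds many thousands of records.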

About step 1.2, I don't know whether it would be better to query the harvest_object table or to look for datasets whose extra field harvest_source_id matches the harvest source of the job. It seems that the extension normally uses the harvest_object table, but that approach won't work if we run the clear_history command on the source.
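To make the trade-off concrete, here is a small sketch of both lookups over in-memory data. The record shapes (`package_id`, `harvest_source_id`, `extras`) and the function names are assumptions for illustration, not the extension's actual schema or API.

```python
def ids_from_harvest_objects(harvest_objects, source_id):
    """Option A: collect dataset IDs from harvest object records."""
    return {
        obj["package_id"]
        for obj in harvest_objects
        if obj["harvest_source_id"] == source_id and obj.get("package_id")
    }


def ids_from_extras(datasets, source_id):
    """Option B: collect dataset IDs from the dataset extra field."""
    return {
        ds["id"]
        for ds in datasets
        if ds.get("extras", {}).get("harvest_source_id") == source_id
    }


datasets = [
    {"id": "a", "extras": {"harvest_source_id": "src-1"}},
    {"id": "b", "extras": {"harvest_source_id": "src-2"}},
]
harvest_objects = [{"package_id": "a", "harvest_source_id": "src-1"}]

# Both options agree while the harvest history is intact.
before = ids_from_harvest_objects(harvest_objects, "src-1")

# Simulate cleared history: option A finds nothing, option B still works.
after_clear = ids_from_harvest_objects([], "src-1")
via_extras = ids_from_extras(datasets, "src-1")
```

If the history concern described above holds, the extras-based lookup would be the more robust of the two, at the cost of depending on that extra field being present on every harvested dataset.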

I would appreciate any feedback on this implementation idea, since this is my first contribution to the project.
