Add fix script for rewriting STA ScheduledStopPoint IDs #114
leonardehrenfried wants to merge 5 commits into
def _update_refs(obj: Any, id_map: dict[str, str]) -> bool:
def _update_refs(obj: Any) -> bool:
Is your intention here to iterate over all references?

Yes, everything that can have a reference to the ScheduledStopPoint.
I agree that it's super generic and handles lots of cases that I don't have in my data set.

I think you can omit most of the code by using def only_references(deserialized: Tid, serializer: Serializer) -> Generator[tuple[type[EntityStructure], str, str], None, None], which does the recursive stuff.
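To make the idea of a recursive reference walk concrete, here is a minimal sketch. It is not the project's only_references: the dataclasses, `iter_refs` and `update_refs` below are hypothetical stand-ins, and the real generator yields tuples rather than reference objects. The point is only that one generic generator can replace per-type traversal code.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, Iterator


@dataclass
class VersionOfObjectRef:
    """Hypothetical stand-in for a NeTEx reference element."""
    ref: str


@dataclass
class ScheduledStopPointRef(VersionOfObjectRef):
    pass


@dataclass
class Call:
    order: int
    stop_ref: ScheduledStopPointRef


@dataclass
class ServiceJourney:
    id: str
    calls: list[Call] = field(default_factory=list)


def iter_refs(obj: Any) -> Iterator[VersionOfObjectRef]:
    """Recursively yield every reference object reachable from obj."""
    if isinstance(obj, VersionOfObjectRef):
        yield obj
    elif is_dataclass(obj) and not isinstance(obj, type):
        # Descend into every field of a dataclass instance.
        for f in fields(obj):
            yield from iter_refs(getattr(obj, f.name))
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            yield from iter_refs(item)


def update_refs(obj: Any, id_map: dict[str, str]) -> bool:
    """Rewrite refs found in id_map in place; return True if anything changed."""
    changed = False
    for ref in iter_refs(obj):
        if ref.ref in id_map:
            ref.ref = id_map[ref.ref]
            changed = True
    return changed
```

With a walk like this, the update function shrinks to a loop over the generator, and any new container type only needs to be handled in one place.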
e874a07 to fbf72e2
Can you also add a test to this pull request that loads and fixes a file?

The current feed this operates on is 500 MB. Do you know of a tool for shrinking NeTEx feeds down to a single journey?
See conv.filter_db_to_db :)

Actually, if the code is good, I would prefer to merge this now. I will send a follow-up with a test.

Nope, I want to see how it behaves, and prevent regressions.

Can you point towards an example that I should emulate? All I can see are tests that appear to be reading from places like

I want the code to be running. Hence I don't care at this point about asserting; I care about the code path being touched. So a small subset of 10 stops in a file, going into mdbx, fixing the result, and exporting to XML would be good enough.

I used the following filter to get a single ServiceJourney: It produced this: https://p.ip.fi/zbul Should I commit this to the repo?

@skinkie Can you look at the test?
def fix_ssp_ids(database: Path) -> None:
    with MdbxStorage(database, readonly=False) as db:
        with db.env.rw_transaction() as txn:
            # TODO: delete the old ScheduledStopPoint objects (no delete API available yet)
For deleting, the following steps must be carried out in this order:
- the id of the object itself must be renamed
- all internal references must be updated, hence at least ScheduledStopPointRef, TimingPointRef nameOfRefClass="ScheduledStopPoint", ObjectRef (NoticeAssignment); rewriting should cause the referencing to be updated
- the old relationship between objects must be deleted
- the key with the old object must be deleted
We have avoided such operations, so we fill a new database with the context, and do not try to do such invasive operations in place.
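The ordering above can be illustrated on a toy in-memory store. This is only a sketch under assumptions: `objects` (id to record) and `referrers` (id to the ids of records that reference it) are hypothetical stand-ins for the database tables, and `rename_in_place` is not an existing function in the project.

```python
def rename_in_place(
    objects: dict[str, dict],
    referrers: dict[str, set[str]],
    old_id: str,
    new_id: str,
) -> None:
    """Toy in-place rename that follows the four steps in order."""
    if old_id == new_id:
        return
    # 1. Rename the id of the object itself and expose it under the new key.
    obj = objects[old_id]
    obj["id"] = new_id
    objects[new_id] = obj
    # 2. Update every internal reference that points at the old id.
    for referrer_id in referrers.get(old_id, set()):
        record = objects[referrer_id]
        record["refs"] = [new_id if r == old_id else r for r in record["refs"]]
    # 3. Delete the old relationships and recreate them under the new id.
    referrers[new_id] = referrers.pop(old_id, set())
    for sources in referrers.values():
        if old_id in sources:
            sources.discard(old_id)
            sources.add(new_id)
    # 4. Delete the key that still holds the old object.
    del objects[old_id]
```

Even in this toy form, four structures have to stay consistent within one transaction, which hints at why the in-place route is considered invasive.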
So you never really delete, but simply filter them out when copying to a new database?

There are two facets here. The way we have worked was always to transfer from database to database when doing any transformation, so from NeTEx to NeTEx. The (inline) fix operations work well on attribute level, like the projection of all coordinates from a national grid to WGS84.
What you are doing here would match something like the EPIP conversion. Do all the transformations, write the output into the second database, and copy_map everything that remains stable. https://github.com/MMTIS/badger/blob/binary_relation_serializer/conv/epip_db_to_db.py#L181
The effect is that anything related to referential relationships is never updated, only created.
So in effect, the code to achieve such a thing is virtually the same, but the source is copied and transformed, then written to the target.
The second facet is that we have always overwritten the key. This is not the case when the id is changed, because then the key changes.
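A minimal sketch of the database-to-database approach, assuming plain dicts in place of the source and target databases. `copy_transformed` and the record shape (`id`, `refs`) are hypothetical and only illustrate the pattern; the linked epip_db_to_db.py is the actual reference.

```python
def copy_transformed(
    source: dict[str, dict],
    id_map: dict[str, str],
) -> dict[str, dict]:
    """Build a fresh target store; the source is never modified."""
    target: dict[str, dict] = {}
    for old_id, record in source.items():
        new_record = {
            **record,
            # Rename the object itself if its id is in the map.
            "id": id_map.get(old_id, old_id),
            # Rewrite outgoing references; untouched ids pass through.
            "refs": [id_map.get(r, r) for r in record["refs"]],
        }
        # Keys are only ever created in the target, never renamed or deleted.
        target[new_record["id"]] = new_record
    return target
```

Because the target starts empty, there is no old key or old relationship to clean up; records whose ids are not in the map are simply copied over unchanged.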
This script rewrites some IDs in STA's EPIP feed.