Here's our current architecture:

In #185 we added preliminary support for anchor tags / fragments.
We discussed that supporting fragments in URLs (e.g. https://foo/bar.html#frag as in option 3 in the link types) won't be a small change.
We'd probably have to make a few changes to the architecture, because right now there's no way for the check step to "ask" for all links of a given input, or even just to ask whether a link occurs in a given input (e.g. "is https://foo/bar.html#frag valid?").
One way to go about it might be to fully decouple input handling from link checking (where inputs can be files or websites).
The fragment cache is a basic version of that, but it's limited to fragments. I think we need a bigger cache for all inputs we encounter and a central entity that manages this cache. We can think of it as an abstraction on top of the network and the file system, purpose-built for our use case.
It could lazy-load resources on demand and store the parsed information from inputs, which the rest of the system would then use; our parsed representation would be the ground truth for the rest of the link checking. It would contain a big map from the URI of each input (i.e. the path or URL) to its parsed links/fragments.
It should be fully async, and we will need read/write access throughout the program's runtime.
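To make the idea more concrete, here is a rough sketch of such a cache in Python with asyncio. It is only an illustration of the shape, not a proposal for the actual implementation: `InputCache`, `ParsedInput`, the loader callback, and `demo` are all hypothetical names, and the fake loader stands in for real file or network access.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Dict, Set


@dataclass
class ParsedInput:
    """Parsed view of one input (hypothetical shape)."""
    links: Set[str] = field(default_factory=set)
    fragments: Set[str] = field(default_factory=set)


# A loader fetches and parses one input; a file loader and a
# network loader could both fit this signature (assumption).
Loader = Callable[[str], Awaitable[ParsedInput]]


class InputCache:
    """Maps the URI of an input (path or URL) to its parsed links/fragments."""

    def __init__(self, loader: Loader) -> None:
        self._loader = loader
        self._tasks: Dict[str, "asyncio.Future[ParsedInput]"] = {}

    async def get(self, uri: str) -> ParsedInput:
        # Store the in-flight task rather than the result, so that
        # concurrent callers asking for the same URI share one load.
        # Note: a failed load stays cached in this sketch.
        task = self._tasks.get(uri)
        if task is None:
            task = asyncio.ensure_future(self._loader(uri))
            self._tasks[uri] = task
        return await task

    async def has_link(self, uri: str, link: str) -> bool:
        parsed = await self.get(uri)
        return link in parsed.links

    async def has_fragment(self, uri: str, fragment: str) -> bool:
        parsed = await self.get(uri)
        return fragment in parsed.fragments


async def demo():
    calls = []

    async def fake_loader(uri: str) -> ParsedInput:
        # Stand-in for fetching and parsing; records each call.
        calls.append(uri)
        await asyncio.sleep(0)
        return ParsedInput(links={"https://example.com/"}, fragments={"frag"})

    cache = InputCache(fake_loader)
    results = await asyncio.gather(
        cache.has_fragment("https://foo/bar.html", "frag"),
        cache.has_fragment("https://foo/bar.html", "missing"),
    )
    return results, calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Caching the pending load instead of the finished result is one way to get lazy loading without fetching the same input twice when several checks ask for it at once.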
Maybe this is even a graph problem, but I don't feel comfortable going down that route.
In any case, we will need a lot of discussions to come up with a solid design.
However we model it, a check to see if an input contains a link or fragment should be trivial from other parts of the program. We should not deal with ad-hoc resource-fetching within the checking code.
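From the checker's side, the query could then be as small as the following sketch. The names `InputStore`, `InMemoryStore`, and `fragment_is_valid` are made up for illustration, and the in-memory store is just a stand-in for whatever central entity ends up owning the parsed inputs. Treating a URL without a fragment as trivially valid is an assumption of this sketch.

```python
import asyncio
from typing import Dict, Protocol, Set
from urllib.parse import urldefrag


class InputStore(Protocol):
    """What the checking code needs from the central store (hypothetical)."""

    async def fragments(self, uri: str) -> Set[str]: ...


class InMemoryStore:
    """Stand-in store backed by a plain dict; no fetching happens here."""

    def __init__(self, data: Dict[str, Set[str]]) -> None:
        self._data = data

    async def fragments(self, uri: str) -> Set[str]:
        return self._data.get(uri, set())


async def fragment_is_valid(store: InputStore, url: str) -> bool:
    # Split "https://foo/bar.html#frag" into the input URI and the fragment.
    base, fragment = urldefrag(url)
    if not fragment:
        # Assumption: with no fragment there is nothing to verify here.
        return True
    known = await store.fragments(base)
    return fragment in known


if __name__ == "__main__":
    store = InMemoryStore({"https://foo/bar.html": {"frag"}})
    print(asyncio.run(fragment_is_valid(store, "https://foo/bar.html#frag")))
```

The point is that the checking code only depends on the small query interface and never fetches or parses anything itself.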
I'd be happy for any design feedback.