Conversation
…loaded, added documentation to easily reconstruct the final destination in the storage
wvengen
left a comment
There was a problem hiding this comment.
Thanks, also for including documentation! Some small comments.
| The joblogs section is used to configure the job logs feature. It is not enabled by default, but can be activated if you | ||
| choose to use it. The job logs feature allows you to collect logs from the Kubernetes cluster and store them in a specified | ||
| directory. The logs are collected from the pods running on the cluster and can be compressed using a specified method. | ||
| The job logs feature is implemented only for Kubernetes and is not available for Docker. |
CONFIG.md
Outdated
| **Important**: if you used the `compression_method` parameter, the extension of the log file then have two extensions. The | ||
| first extension is `.log` and the second extension is extracted to the label `org.scrapy.extension`. |
There was a problem hiding this comment.
The main motivation for the current issue, is that when the config parameter compression_method is changed, you cannot find the logs anymore without iterating files. Now including the compression extension solves this, which is nice. But it doesn't tell us what the compression is that was used for the logfile - you still need to analyse the extension and derive the compression method.
So I would rather add a compression_method label, and derive the extension. Or, include both the extension and the compression method (so that an external consumer of these labels doesn't need to maintain a mapping between compression method and extension - though it does need to retain a mapping between compression method and decompressor, already, so perhaps this is not an extra burden).
There was a problem hiding this comment.
If I follow your logic, it would be better to have two labels: extension which is .log and compression which is either one of the methods or none, right?
There was a problem hiding this comment.
(previous comment deleted) - yes!
There was a problem hiding this comment.
and I don't think we'd need the extension label then
There was a problem hiding this comment.
why we wouldn't need the extension label? how then a person should assume the final destination without knowing if the file was or was not compressed? I mean it is possible to check the config but since we may rollout update the managing pod and put a new config that has a different compression method or do not use compression at all, it would be pretty hard to trace back the extension of the file. I can entirely drop it if you like, just concerned if it makes the life of the user harder since we can reconfigure things without killing the jobs that sit on the cluster in the completed state.
it is not possible to provide extension as .log.gz because the dot . is not allowed in labels according to the k8s documentation (mentioned in the ticket)
There was a problem hiding this comment.
It is not convenient to return a compression method since it is a separate class and we have a compression extension, it would force us to have a mapping in a log handler class which is not its responsibility, I would still vore for putting the extension in the label and not the method. For the user converting one to another or vice versa is the same amount of mental load. Since we already have the final destination it would be cleaner to use the given info. But if it's important I can add a mapping there.
There was a problem hiding this comment.
For now I add two labels: extension and compression (compression extension, not method) but as I mentioned if this is something must be (method instead of the extension) then we can add some code, although I do not think it is a clean code
There was a problem hiding this comment.
I now I understand that my request for adding a compression method label instead of extension, would make the code more complex. I need to look further into this to understand it better, and see if there could be another way. Thanks for explaining!
There was a problem hiding this comment.
If you are concerned that we won't be able to decompress a file with "a quasi random compression extension", there are python packages like https://github.com/pycompression/xopen that automatically assume the compression method from the extension of the file and decompress it.
There was a problem hiding this comment.
If there are other technical concerns, feel free to share, I can search for possible solutions.
…alue); rewrite config to move the info about compression to its own section
|
I edited some logic and changed the config, if you have something else in mind I can adjust it |
There was a problem hiding this comment.
Thanks for your work on this! Seems mostly fine.
There are some things I would consider changing from an architectural point of view (going beyond this PR, but within the scope of joblogs).
-
I'm confused by
LibcloudObjectStoragereferencing joblogs. If it is joblogs-specific, I would expect it being in thescrapyd_k8s/joblogsdirectory. So there is a decision to make: consider this to be part of joblogs, and move it; or consider it to be really just object storage, and move out all joblog references.
I think that object storage will be used in other places too, in the future (for example to store the items on container storage), so I would tend to refactor out the joblogs references.
Just passingconfig.joblogs()instead ofconfigtoLibcloudObjectStorage's constructor could be an idea. -
The compression method from the configuration is kind of the 'input' of the compression, and the path on the object storage is derived from that. That is a reason why I'd prefer storing the compression method as the label, and let the path be derived from it, leaving out the extension. The compression_method value can be taken from
LibcloudObjectStorageif the approach as described in 1. is taken. -
Another consideration to take into account in the design: in #12 we will want to access the logs from within scrapyd-k8s. Here we just have the job resource in Kubernetes, and from that we'd like to figure out how to obtain the logs. This would check the
(log_file_)uploadedlabel an either look at local or object storage. Here, the job labels need to be consulted, especially regarding compression (because scrapyd-k8s's compression method may have changed from the time this log was stored). I would expect code of storing the logs to be re-used here, especially handling the compression method (though it would need additional methods for reading). Just something to keep in the back of our heads.
| * org.scrapy.spider | ||
| * org.scrapy.job_id | ||
| * org.scrapy.extension | ||
| * org.scrapy.compression |
There was a problem hiding this comment.
Shall we call this compression_method, so it is the same as the key used in the scrapyd-k8s config?
| rules: | ||
| - apiGroups: [""] | ||
| resources: ["pods"] | ||
| # add "patch" if you use joblogs feature |
There was a problem hiding this comment.
let's make it add "patch" to verbs if you use the joblogs feature (to make it even clearer that it is about the next line)
| * org.scrapy.compression | ||
|
|
||
| How can you build the end location from these labels? Complete the following line | ||
| * logs/`org.scrapy.project`/`org.scrapy.spider`/`org.scrapy.job_id`.`org.scrapy.extension`.`org.scrapy.compression` |
| "labels": { | ||
| "org.scrapy.extension": extension, | ||
| "org.scrapy.compression": compression, | ||
| "org.scrapy.log_file_uploaded": "true" |
There was a problem hiding this comment.
If there could be multiple uses of object storage, I propose include joblogs in the labels:
org.scrapy.joblogs.compression_method
org.scrapy.joblogs.uploaded
|
Hey W could you elaborate on joblogs reference in Libcloud class? The only one I see is a section joblogs in the config file, is this what you meant by cleaning out the joblogs from the class? |
|
Hi, thanks for looking into my remarks :) These are joblogs references I could find in the libcloud class:
|
#53
This PR adds the following changes:
A new private method was added to the KubernetesJobLogHandler class: _add_object_name_label.
This method is called after a log file from a corresponding pod was uploaded to the remote storage. This method adds an extension as a label (always the last extension of the file without a dot, so if a file was compressed, it adds a compression extension, if not, then it adds
logextension. It also adds a labelorg.scrapy.log_file_uploaded=trueto indicate that the file was successfully uploaded.The CONFIG md has a detailed documentation on how a user can reconstruct the final destination in the remote storage to find a specific log file.