Skip to content

Add joblog label to job#55

Open
vlerkin wants to merge 2 commits intoq-m:mainfrom
vlerkin:job_log_label
Open

Add joblog label to job#55
vlerkin wants to merge 2 commits intoq-m:mainfrom
vlerkin:job_log_label

Conversation

@vlerkin
Copy link
Copy Markdown
Collaborator

@vlerkin vlerkin commented Apr 28, 2025

#53
This PR adds the following changes:

A new private method was added to the KubernetesJobLogHandler class: _add_object_name_label.
This method is called after a log file from a corresponding pod was uploaded to the remote storage. This method adds an extension as a label (always the last extension of the file without a dot, so if a file was compressed, it adds a compression extension, if not, then it adds log extension. It also adds a label org.scrapy.log_file_uploaded=true to indicate that the file was successfully uploaded.

The CONFIG md has a detailed documentation on how a user can reconstruct the final destination in the remote storage to find a specific log file.

…loaded, added documentation to easily reconstruct the final destination in the storage
@vlerkin vlerkin self-assigned this Apr 28, 2025
@vlerkin vlerkin requested a review from wvengen April 28, 2025 08:53
@vlerkin vlerkin linked an issue Apr 28, 2025 that may be closed by this pull request
Copy link
Copy Markdown
Member

@wvengen wvengen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, also for including documentation! Some small comments.

The joblogs section is used to configure the job logs feature. It is not enabled by default, but can be activated if you
choose to use it. The job logs feature allows you to collect logs from the Kubernetes cluster and store them in a specified
directory. The logs are collected from the pods running on the cluster and can be compressed using a specified method.
The job logs feature is implemented only for Kubernetes and is not available for Docker.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's reference #37 here.

CONFIG.md Outdated
Comment on lines +67 to +68
**Important**: if you used the `compression_method` parameter, the extension of the log file then have two extensions. The
first extension is `.log` and the second extension is extracted to the label `org.scrapy.extension`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main motivation for the current issue, is that when the config parameter compression_method is changed, you cannot find the logs anymore without iterating files. Now including the compression extension solves this, which is nice. But it doesn't tell us what the compression is that was used for the logfile - you still need to analyse the extension and derive the compression method.

So I would rather add a compression_method label, and derive the extension. Or, include both the extension and the compression method (so that an external consumer of these labels doesn't need to maintain a mapping between compression method and extension - though it does need to retain a mapping between compression method and decompressor, already, so perhaps this is not an extra burden).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I follow your logic, it would be better to have two labels: extension which is .log and compression which is either one of the methods or none, right?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(previous comment deleted) - yes!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and I don't think we'd need the extension label then

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we wouldn't need the extension label? how then a person should assume the final destination without knowing if the file was or was not compressed? I mean it is possible to check the config but since we may rollout update the managing pod and put a new config that has a different compression method or do not use compression at all, it would be pretty hard to trace back the extension of the file. I can entirely drop it if you like, just concerned if it makes the life of the user harder since we can reconfigure things without killing the jobs that sit on the cluster in the completed state.

it is not possible to provide extension as .log.gz because the dot . is not allowed in labels according to the k8s documentation (mentioned in the ticket)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not convenient to return a compression method since it is a separate class and we have a compression extension, it would force us to have a mapping in a log handler class which is not its responsibility, I would still vore for putting the extension in the label and not the method. For the user converting one to another or vice versa is the same amount of mental load. Since we already have the final destination it would be cleaner to use the given info. But if it's important I can add a mapping there.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I add two labels: extension and compression (compression extension, not method) but as I mentioned if this is something must be (method instead of the extension) then we can add some code, although I do not think it is a clean code

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now I understand that my request for adding a compression method label instead of extension, would make the code more complex. I need to look further into this to understand it better, and see if there could be another way. Thanks for explaining!

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are concerned that we won't be able to decompress a file with "a quasi random compression extension", there are python packages like https://github.com/pycompression/xopen that automatically assume the compression method from the extension of the file and decompress it.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are other technical concerns, feel free to share, I can search for possible solutions.

@wvengen wvengen mentioned this pull request May 6, 2025
…alue); rewrite config to move the info about compression to its own section
@vlerkin
Copy link
Copy Markdown
Collaborator Author

vlerkin commented May 7, 2025

I edited some logic and changed the config, if you have something else in mind I can adjust it

Copy link
Copy Markdown
Member

@wvengen wvengen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work on this! Seems mostly fine.
There are some things I would consider changing from an architectural point of view (going beyond this PR, but within the scope of joblogs).

  1. I'm confused by LibcloudObjectStorage referencing joblogs. If it is joblogs-specific, I would expect it being in the scrapyd_k8s/joblogs directory. So there is a decision to make: consider this to be part of joblogs, and move it; or consider it to be really just object storage, and move out all joblog references.
    I think that object storage will be used in other places too, in the future (for example to store the items on container storage), so I would tend to refactor out the joblogs references.
    Just passing config.joblogs() instead of config to LibcloudObjectStorage's constructor could be an idea.

  2. The compression method from the configuration is kind of the 'input' of the compression, and the path on the object storage is derived from that. That is a reason why I'd prefer storing the compression method as the label, and let the path be derived from it, leaving out the extension. The compression_method value can be taken from LibcloudObjectStorage if the approach as described in 1. is taken.

  3. Another consideration to take into account in the design: in #12 we will want to access the logs from within scrapyd-k8s. Here we just have the job resource in Kubernetes, and from that we'd like to figure out how to obtain the logs. This would check the (log_file_)uploaded label an either look at local or object storage. Here, the job labels need to be consulted, especially regarding compression (because scrapyd-k8s's compression method may have changed from the time this log was stored). I would expect code of storing the logs to be re-used here, especially handling the compression method (though it would need additional methods for reading). Just something to keep in the back of our heads.

* org.scrapy.spider
* org.scrapy.job_id
* org.scrapy.extension
* org.scrapy.compression
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we call this compression_method, so it is the same as the key used in the scrapyd-k8s config?

rules:
- apiGroups: [""]
resources: ["pods"]
# add "patch" if you use joblogs feature
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make it add "patch" to verbs if you use the joblogs feature (to make it even clearer that it is about the next line)

* org.scrapy.compression

How can you build the end location from these labels? Complete the following line
* logs/`org.scrapy.project`/`org.scrapy.spider`/`org.scrapy.job_id`.`org.scrapy.extension`.`org.scrapy.compression`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't extension always log?

"labels": {
"org.scrapy.extension": extension,
"org.scrapy.compression": compression,
"org.scrapy.log_file_uploaded": "true"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there could be multiple uses of object storage, I propose include joblogs in the labels:
org.scrapy.joblogs.compression_method
org.scrapy.joblogs.uploaded

@vlerkin
Copy link
Copy Markdown
Collaborator Author

vlerkin commented Jun 23, 2025

Hey W could you elaborate on joblogs reference in Libcloud class? The only one I see is a section joblogs in the config file, is this what you meant by cleaning out the joblogs from the class?

@wvengen
Copy link
Copy Markdown
Member

wvengen commented Jun 24, 2025

Hi, thanks for looking into my remarks :) These are joblogs references I could find in the libcloud class:

  • config section
  • getting the job id from the log filename
  • constructing the object name

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add joblog label to job

2 participants