Conversation
|
Things that we think still need to be done to this:
|
b9280c6 to
067b71c
Compare
|
I think this is getting there. There's some documentation on what file storage backends now look like at https://github.com/inveniosoftware/invenio-files-rest/blob/9bc8a83163fc50d9b407e29a82d7ef6701770f48/docs/new-storage-backend.rst I've moved the old implementation into The general idea was:
Things still to do:
|
This adds overridable config variables for mapping storage backend names to classes (by default drawn from declared entry points) and setting a default backend for new files. It also adds a pyfs storage backend entry point, and makes it the default. Storage backends should be `django_files_rest.storage:FileStorage` subclasses, and declared with a `django_files_rest.storage` entrypoint.
…gress At the moment, the storage backend is responsible for computing checksums and reporting the size of files based on the number of bytes read. However, these are common concerns across all storage backends, and so this functionality can be pulled the other side of the storage backend interface. This class is therefore intended to wrap any file-like object, and to checksum the file as it is read. It also counts bytes and calls the progress callback if one is provided. In a subsequent commit we'll integrate it with FileInstance, and remove the functionality from the PyFS storage backend.
This new storage factory uses a `FileInstance.storage_backend` attribute to determine which backend to use. It uses the `FILES_REST_DEFAULT_STORAGE_BACKEND` config value when the file instance doesn't already have a storage backend. It can be extended to customize how the backend is chosen, and how it is initialised. The file path is chosen based on the `id` of the file instance, but this can be customised too. It's not set as the default storage factory to preserve backwards compatibility.
…nit classes This property can now instantiate class-based storage factories, i.e. the new factory in `invenio_files_rest.storage.factory`. It's possible the deprecation should be in the pyfs storage factory itself.
These aren't Py3.6-compatible, so will probably need to be removed later, or replaced with comment-based annotations.
This will mean we can later remove this functionality from the storage backend.
This is exposed as a class property on FileStorage and can be accessed as `FileStorage.backend_name`. It returns the associated backend name in the application config, which can then be used to find the class.
It no longer needs to do checksumming, and the filepath can be moved to the superclass.
All together because this has been very exploratory.
There's always a better option than metaclasses. In this case, a classmethod, which also simplifies getting the backend name on instances (`type(instance).backend_name` → `instance.get_backend_name()`)
lnielsen
left a comment
There was a problem hiding this comment.
Impressive job! Thanks a lot for working on this and keeping backward compatibility. I've made a couple of comments
I think the main comment is probably the removal of copy(). It might be easiest to discuss over a telecon. Honestly, I think we can go ahead without adding it back in, but perhaps we make a comment in the code, that this could optimized in the code at a later stage.
| FILES_REST_DEFAULT_STORAGE_BACKEND = 'pyfs' | ||
| """The default storage backend name.""" | ||
|
|
||
| FILES_REST_STORAGE_BACKENDS = { |
There was a problem hiding this comment.
Minor: I see some potential troubles here.
How are users expected to provide storage backend?
- Only via specifying entry points? In this case, we should probably not make it config, but just a method on the Flask extension.
- Allow developers/instances to specify a dict of storage backends? in this case, it makes sense to have a config. However when
FILES_REST_STORAGE_BACKENDSis overwritten, then below code will still execute causing all storage backends to be loaded, even if we don't want them.
Thus in both cases, I think it makes sense to move the entry point iteration to the Flask extension to avoid unnecessarily loading storage backends that we don't need.
There was a problem hiding this comment.
I was supporting both so that there was an obvious default behaviour (entry points), but that they could override this if desired. The "overriding" use case was that they wanted to change the behaviour of a backend they'd already used, but which they don't maintain (e.g. one in invenio-files-rest, or invenio-s3). In this case defining a new entrypoint with the same name wouldn't work, so they'd have to define the name→backend mapping explicitly as config.
However, yes, it's not great to do the entrypoint dereferencing here regardless of whether it's overridden. If you're happy with the rationale above I can move this to the extension loading code.
I should also write documentation for this.
| else (self.checksum == real_checksum)) | ||
| self.last_check_at = datetime.utcnow() | ||
| return self.last_check | ||
| return current_files_rest.storage_factory(fileinstance=self, **kwargs) |
There was a problem hiding this comment.
Minor: Two returns - seems like a merge problem I think?
There was a problem hiding this comment.
Yep, it does look that way 😬. Will fix!
| preferred_location = ( # type: typing.Optional[Location] | ||
| Location(uri=default_location) if default_location else None | ||
| ) | ||
| return self.initialize( |
There was a problem hiding this comment.
Question: Previously the init_contents() didn't have a return value. I assume init_contents() is intended to be replaced completely with initialize(), hence would it make sense to keep it like that, or is the return value need somewhere?
There was a problem hiding this comment.
You're right; there's no return from initialize(), so the return in init_contents() is unnecessary. I'll get it removed :-)
There was a problem hiding this comment.
And yep, I was intending that init_contents() be deprecated.
| self.set_uri( | ||
| *self.storage(**kwargs).initialize(size=size), | ||
| readable=False, writable=True) | ||
| if hasattr(current_files_rest.storage_factory, 'initialize'): |
There was a problem hiding this comment.
Comment: Perhaps better with isinstance(current_files_rest.storage_factory, StorageBackend)?
There was a problem hiding this comment.
Yep. I think this was a hangover from when I'd considered keeping the same base class between old and new approaches, but then decided against it.
| 'uri', 'size', 'checksum', 'writable', 'readable', 'storage_class' | ||
| } | ||
|
|
||
| def update_file_metadata( |
There was a problem hiding this comment.
Question: Should set_uri() be deprecated?
There was a problem hiding this comment.
Yes please! Its use of positional arguments didn't seem particularly clear for people using it, and seems inflexible when trying to omit attribute names or add new attribute names later.
| ) | ||
| self._size = result['size'] | ||
| if not result['checksum']: | ||
| result['checksum'] = self.checksum(chunk_size=chunk_size) |
There was a problem hiding this comment.
Question: is this for the case where the checksum is not calculated as part of write stream and we want to ask the backend to do it instead?
There was a problem hiding this comment.
Yep. The idea was to make the backend implementation as light-touch as possible and for it to fill in the gaps around what you've provided, unless you want to take control of those things yourself.
In this case, setting the hash name to None will result in checksum not being set from the default write implementation, at which point it'll ask the backend to calculate the checkum explicitly.
| 'uri': self.uri, | ||
| 'readable': True, | ||
| 'writable': False, | ||
| 'storage_class': 'S', |
There was a problem hiding this comment.
Question: Should this be taken from FILES_REST_DEFAULT_STORAGE_CLASS?
| """ | ||
| fp = src.open(mode='rb') | ||
| try: | ||
| warnings.warn( |
There was a problem hiding this comment.
Minor: The reason for having copy on the storage backend is that it could potentially perform the copy using the backend. Say if both files reside in the same storage system, it would be a lot more efficient to do an internal copy or that system, than having to pass all data through Invenio.
There was a problem hiding this comment.
That's true. It shouldn't be too much effort for me to reinstate it.
| def __init__(self, *args, clean_dir=True, **kwargs): | ||
| """Storage initialization.""" | ||
| self.fileurl = fileurl | ||
| # if isinstance(args[0], str): |
tests/test_cli.py
Outdated
| ], obj=script_info) | ||
| assert 0 == result.exit_code | ||
|
|
||
| print(tmpdir.listdir()) |
|
@lnielsen All replies done. Will work on the improvements soon, if you're happy with the suggested approach. I'll do everything as fixups for easier reviewing, and then squash them once we're happy. |
abd43b6 to
d1fb4d2
Compare
Currently for preliminary review.
See inveniosoftware/rfc#37 for background.
Not all tests pass yet, as the changes to
PyFSFileStoragehaven't been reflected in the tests.