
muhanadz/omero2pandas

 
 


omero2pandas

A convenience package to download data from OMERO.tables into Pandas dataframes.

Installation

omero2pandas can be installed with pip on Python 3.9+:

pip install omero2pandas

omero2pandas also supports authentication using tokens generated by omero-user-token. Compatible versions can be installed as follows:

pip install omero2pandas[token]

See the omero-user-token documentation for more information.

Usage

import omero2pandas
df = omero2pandas.read_table(file_id=402)
df.head()

Tables can be referenced based on their OriginalFile's ID or their Annotation's ID. These can be easily obtained by hovering over the relevant table in OMERO.web, which shows a tooltip with these IDs.

To avoid loading data directly into a dataframe, you can also download directly into a CSV:

import omero2pandas
omero2pandas.download_table("/path/to/output.csv", file_id=2, chunk_size=1000)

chunk_size can be specified when both reading and downloading tables. It determines how many rows are loaded from the server in a single operation.
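As a rough illustration of what chunking means for a table, the sketch below splits a row count into the consecutive row ranges that would be requested one at a time. The `chunk_ranges` helper is hypothetical and is not part of omero2pandas; it only mirrors the behaviour described above.

```python
def chunk_ranges(total_rows, chunk_size):
    """Yield (start, stop) row ranges covering total_rows in steps of chunk_size.

    Illustrative only: this is not omero2pandas code, just the arithmetic
    behind "how many rows are loaded in a single operation".
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, total_rows, chunk_size):
        # The final range is shortened so it never runs past the table end.
        yield start, min(start + chunk_size, total_rows)


# A 2,500-row table with chunk_size=1000 needs three server operations.
ranges = list(chunk_ranges(2500, 1000))
```

Larger chunks mean fewer round trips to the server, at the cost of more memory per operation.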

Supplying credentials

Multiple modes of connecting to OMERO are supported. If you're already familiar with omero-py, you can supply a premade client:

import omero
import omero2pandas
my_client = omero.client(host="myserver", port=4064)
df = omero2pandas.read_table(file_id=402, omero_connector=my_client)
df.head()

Alternatively, your connection and login details can be provided via arguments:

import omero2pandas
df = omero2pandas.read_table(file_id=402, server="omero.mysite.com", port=4064,
                             username="myuser", password="mypass")
df.head()

If you have omero_user_token installed, an existing token will be automatically detected and used to connect:

import omero2pandas
df = omero2pandas.read_table(file_id=402)
df.head()

You can also generate the connection object separately using the built-in wrapper:

import omero2pandas
connector = omero2pandas.connect_to_omero(server="myserver", port=4064)
# User will be prompted for any missing connection info. 

df = omero2pandas.read_table(file_id=402, omero_connector=connector)
df.head()

When prompting for missing connection information, the package automatically detects whether it is running in a Jupyter environment. If so, you'll get a login widget to complete the details. Otherwise, a command-line prompt will be provided.

This behaviour can be disabled by supplying interactive=False to the connect call.
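For scripts that run unattended, one option is to collect the connection details up front and disable prompting. The sketch below is a minimal example under some assumptions: the environment variable names are hypothetical, and it assumes connect_to_omero accepts the same username and password keywords shown for read_table above.

```python
import os


def connection_kwargs_from_env(env=None):
    """Build connection keyword arguments from environment variables.

    The variable names (OMERO_SERVER, OMERO_PORT, OMERO_USER,
    OMERO_PASSWORD) are hypothetical; choose whatever suits your setup.
    """
    env = os.environ if env is None else env
    kwargs = {
        "server": env.get("OMERO_SERVER"),
        "port": int(env.get("OMERO_PORT", "4064")),
        "username": env.get("OMERO_USER"),
        "password": env.get("OMERO_PASSWORD"),
    }
    missing = [key for key, value in kwargs.items() if value is None]
    if missing:
        raise ValueError("Missing connection settings: " + ", ".join(missing))
    return kwargs


def connect_without_prompts():
    """Connect using environment settings, never prompting the user."""
    # Imported lazily so the helper above can be used without the package.
    import omero2pandas

    # Assumption: connect_to_omero takes username/password like read_table.
    return omero2pandas.connect_to_omero(
        interactive=False, **connection_kwargs_from_env()
    )
```

Failing early with a clear error is usually preferable in batch jobs, where an interactive prompt would otherwise hang.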

Reading data

Several utility methods are provided for working with OMERO.tables. These all support the full range of connection modes.

Fetch the names of the columns in a table:

import omero2pandas
columns = omero2pandas.get_table_columns(annotation_id=142)
# Returns a list of column names

Fetch the dimensions of a table:

import omero2pandas
num_rows, num_cols = omero2pandas.get_table_size(annotation_id=12)
# Returns a tuple containing row and column count.

You can read out specific rows and/or columns:

import omero2pandas
my_dataframe = omero2pandas.read_table(file_id=10, 
                                       column_names=['object', 'intensity'],
                                       rows=range(0, 100, 10))
my_dataframe.head()
# Returns object and intensity columns for every 10th row in the table

Returned dataframes also come with a pandas index column, representing the original row numbers from the OMERO.table.
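The size and row-selection calls can be combined. As a sketch, the hypothetical helpers below read only the last rows of a table by first asking for its size. Only get_table_size and read_table come from the examples above; `tail_rows` and `read_table_tail` are illustrative names.

```python
def tail_rows(num_rows, n):
    """Return the row numbers of the last n rows of a num_rows-long table."""
    start = max(0, num_rows - n)
    return range(start, num_rows)


def read_table_tail(n=100, **table_kwargs):
    """Read the last n rows of a table.

    table_kwargs is passed to both calls unchanged, e.g. annotation_id=12
    plus any connection arguments.
    """
    # Imported lazily so tail_rows can be used without the package installed.
    import omero2pandas

    num_rows, _num_cols = omero2pandas.get_table_size(**table_kwargs)
    return omero2pandas.read_table(rows=tail_rows(num_rows, n), **table_kwargs)
```

Because the dataframe index holds the original row numbers, the result of `read_table_tail(n=100, annotation_id=12)` should be indexed by the table's final row numbers rather than starting from zero.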

Non-OMERO.tables Tables

Sometimes users attach a CSV file as a FileAnnotation rather than uploading it as an OMERO.tables object. omero2pandas can still try to read these using dedicated methods:

import omero2pandas
my_dataframe = omero2pandas.read_csv(file_id=101, 
                                     column_names=['object', 'intensity'])
my_dataframe.head()
# Returns dataframe with selected columns

Note that this interface supports fewer features than full OMERO.tables, so queries and row selection are unavailable. However, it is also possible to load gzip-compressed CSV files (.csv.gz) directly with these methods.

You can also directly download the OriginalFile as follows:

import omero2pandas
omero2pandas.download_csv("/path/to/output.csv", file_id=201)

In both these cases the chunk_size parameter controls the number of bytes loaded in each server call rather than the row count. Take care when specifying this parameter as using small values (e.g. 10) will make the download very slow.
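To see why small values are slow, it helps to count the server calls involved. The helper below is a hypothetical back-of-the-envelope calculation, not part of omero2pandas.

```python
def server_calls_needed(file_size_bytes, chunk_size):
    """Number of server calls needed to fetch a file chunk_size bytes at a time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    # Integer ceiling division: a partial final chunk still costs one call.
    return -(-file_size_bytes // chunk_size)


# A 50 MB CSV fetched 10 bytes at a time versus 1 MiB at a time.
tiny_chunks = server_calls_needed(50_000_000, 10)
large_chunks = server_calls_needed(50_000_000, 1024 * 1024)
```

With 10-byte chunks the same file takes five million calls instead of 48, so per-call network latency dominates the download time.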

By default the downloader will only accept csv/csv.gz files, but it can technically be used with most OriginalFile objects. Supply the check_type=False argument to bypass that restriction.

N.b. OMERO.tables cannot be downloaded with this method. Use omero2pandas.download_table instead.

Writing data

Pandas dataframes can also be written back as new OMERO.tables. N.b. It is currently not possible to modify a table on the server.

Connection handling works just as it does with downloading: you can provide credentials, a token or a connection object.

To upload data, the user needs to specify which OMERO object(s) the table will be associated with. This can be achieved with the parent_id and parent_type arguments. Supported objects are Dataset, Well, Plate, Project, Screen and Image.

import pandas
import omero2pandas
my_data = pandas.read_csv("/path/to/my_data.csv")
ann_id = omero2pandas.upload_table(my_data, "Name for table", 
                                   parent_id=142, parent_type="Image")
# Returns the annotation ID of the uploaded FileAnnotation object

Once uploaded, the table will be accessible on OMERO.web under the file annotations panel of the parent object. Using unique table names is advised.

OMERO ID columns

OMERO.tables support some special column types which associate tabular data with objects on the server. These are defined as integer columns with the following names: project, dataset, image, screen, plate, well and roi. These names are case-insensitive. For example, a row with an Image column with the value 1033 will be associated with Image 1033.

To display this in omero-web, the table should be linked either to the object itself or to a parent container. For example, if you have an image column referencing several images in a dataset, attach the table to the parent dataset and the relevant row data will be visible when viewing the individual images in omero-web.
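Before uploading, it can be useful to check which columns will be treated as OMERO ID columns. The sketch below applies the case-insensitive name rule described above. `find_id_columns` is a hypothetical helper, and it checks names only, not whether the column holds integers.

```python
# Special column names listed in this section (matched case-insensitively).
OMERO_ID_COLUMNS = {"project", "dataset", "image", "screen", "plate", "well", "roi"}


def find_id_columns(column_names):
    """Return the column names that match an OMERO ID column name."""
    return [name for name in column_names if name.lower() in OMERO_ID_COLUMNS]


# "Image" and "ROI" match; "area" and "image_name" are ordinary columns.
id_columns = find_id_columns(["Image", "area", "ROI", "image_name"])
```

With a dataframe, the same check could be run as `find_id_columns(df.columns)`.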

Linking to multiple objects

To link to multiple objects, you can supply a list of (<type>, <id>) tuples to the links parameter. The resulting table's FileAnnotation will be linked to all objects in the links parameter (plus parent_type:parent_id if provided).

import omero2pandas
ann_id = omero2pandas.upload_table(
    "/path/to/my.csv", "My table", 
    links=[("Image", 101), ("Dataset", 2), ("Roi", 1923)])
# Uploads with Annotation links to Image 101, Dataset 2 and ROI 1923 

Links allow OMERO.web to display the resulting table as an annotation associated with those objects.
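When the target IDs already live in another structure, the list of tuples can be generated rather than typed out. A minimal sketch, using a hypothetical `build_links` helper:

```python
def build_links(ids_by_type):
    """Flatten {"Image": [101, 102], ...} into [("Image", 101), ("Image", 102), ...]."""
    return [
        (object_type, object_id)
        for object_type, object_ids in ids_by_type.items()
        for object_id in object_ids
    ]


links = build_links({"Image": [101, 102], "Dataset": [2]})
# The result can then be passed on, e.g.:
# omero2pandas.upload_table("/path/to/my.csv", "My table", links=links)
```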

Large Tables

The first argument to upload_table can be a pandas dataframe or a path to a .csv file containing the table data. In the latter case the table will be read in chunks corresponding to the chunk_size argument. This will allow you to upload tables which are too large to load into system memory.

import omero2pandas
ann_id = omero2pandas.upload_table("/path/to/my.csv", "My table", 
                                   142, chunk_size=100)
# Reads and uploads the file to Image 142, loading 100 lines at a time 

The chunk_size argument sets how many rows to send with each call to the server. If not specified, omero2pandas will attempt to automatically optimise chunk size to send ~2 million table cells per call (up to a max of 50,000 rows per message for narrow tables).
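The automatic sizing rule can be approximated as follows. This is a sketch of the heuristic as described in the paragraph above, not the library's actual implementation. The function name and the exact rounding are assumptions.

```python
def estimate_chunk_size(num_columns, target_cells=2_000_000, max_rows=50_000):
    """Estimate rows per call so that rows * columns stays near target_cells.

    Mirrors the description above (~2 million cells per call, capped at
    50,000 rows); the real library may round or bound this differently.
    """
    if num_columns <= 0:
        raise ValueError("num_columns must be a positive integer")
    rows = target_cells // num_columns
    # Never exceed the row cap, and always send at least one row.
    return max(1, min(max_rows, rows))


narrow = estimate_chunk_size(10)    # capped at the 50,000-row maximum
wide = estimate_chunk_size(1000)    # 2,000 rows of 1,000 columns each
```

Narrow tables hit the row cap long before they reach two million cells, while wide tables are limited by the cell budget instead.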

Advanced Usage

This package also contains utility functions for managing an OMERO connection.

omero2pandas.connect_to_omero() takes many of the arguments from the other functions and returns an OMEROConnection object.

The OMEROConnection handles your OMERO login and session, cleaning everything up automatically on exit. This has some accessory methods to access useful API calls:

import omero2pandas
connector = omero2pandas.OMEROConnection()
connector.connect()
client = connector.get_client()
blitz = connector.get_gateway()

When a client is active within the OMEROConnection object, calls to this wrapper class will also be forwarded directly to the client object.

OMEROConnection objects can also be used as a context manager:

import omero2pandas
with omero2pandas.OMEROConnection(server='my.server', port=4064, 
                                  username='test.user',) as connector:
    blitz = connector.get_gateway()
    image = blitz.getObject('Image', id=100)
    # Continue using the standard OMERO API.

The context manager will handle session creation and cleanup automatically.

Connection Management

omero2pandas keeps track of any active connector objects and shuts them down safely when Python exits. Deleting all references to a connector will also handle closing the connection to OMERO gracefully. You can also call connector.shutdown() to close a connection manually.

By default omero2pandas also keeps active connections alive by pinging the server once per minute (otherwise the session may time out and require reconnecting). This can be disabled as follows:

omero2pandas.connect_to_omero(keep_alive=False)

Querying tables

You can also supply PyTables condition syntax to the read_table and download_table functions. Returned tables will only include rows which pass this filter.

Basic syntax

Select rows representing objects with area greater than 20:

omero2pandas.read_table(file_id=10, query='(area>20)')

Multiple conditions

Select rows representing objects with an even ID number lower than 50:

omero2pandas.read_table(file_id=10, query='(id%2==0) & (id<50)')

Complex conditions

Select rows representing objects which originated from an ROI named 'Nucleus':

omero2pandas.read_table(file_id=10, query='x=="Nucleus"', variables={'x': omero.rtypes.rstring('Roi Name')})

N.b. Column names containing spaces aren't supported by the native syntax, but can be supplied as variables which are provided by the variables parameter.

The variables map needs to be a dictionary mapping string variables to OMERO rtypes objects rather than raw Python objects. These should match the relevant column type. Mapped variables are substituted into the query during processing.

A variables map usually isn't needed for simple queries. The basic condition string should automatically get converted to a meaningful type, but when this fails replacing tricky elements with a variable may help.
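When several filters are assembled programmatically, the condition string can be built from its parts. The sketch below wraps each condition in parentheses and joins them with `&`, matching the "Multiple conditions" example. `build_query` is a hypothetical helper, not part of omero2pandas.

```python
def build_query(conditions):
    """Join individual conditions into one '(a) & (b)' style query string."""
    if not conditions:
        raise ValueError("At least one condition is required")
    # Parentheses keep each comparison grouped when combined with '&'.
    return " & ".join(f"({condition})" for condition in conditions)


query = build_query(["id%2==0", "id<50"])
# The string can then be supplied as the query argument, e.g.:
# omero2pandas.read_table(file_id=10, query=query)
```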

Remote registration

For OMERO Plus installations which support TileDB as the OMERO.tables backend it is possible to register tables in-place in a similar manner to in-place image imports (otherwise table data is stored in the ManagedRepository).

This is a two-step process:

  1. Convert the dataframe into a TileDB file
  2. Register the remote converted table with OMERO

If you don't know what table backend your OMERO Plus server is using, you probably don't have this feature available. If you have access to the server machine, you can check by running omero config get omero.tables.module. If the response is omero_plus.run_tables_pytables_or_tiledb then TileDB is available.

For this mode to be available, extra dependencies must also be installed as follows:

pip install omero2pandas[remote]

To activate this mode use omero2pandas.upload_table with arguments as follows:

import omero2pandas
db_path = omero2pandas.upload_table("/path/to/my_data.csv", "Name for table", 
                                    local_path="/path/to/mytable.tiledb")
# Returns the path to the created tiledb file

Similar to regular table uploads, the input can be a dataframe in memory or a csv file on disk. The input will be copied into a new TileDB database and registered to OMERO in-place.

To perform this kind of registration you need to provide the local_path argument to the standard omero2pandas.upload_table function (alongside the required parameters for a "normal" upload, e.g. server connection details). The local path is where the TileDB file will be written and from which it will be registered with OMERO. If you provide a directory instead, the TileDB file will be named based on the table_name argument.

Naturally, the OMERO server will need to be able to access the resulting TileDB file in order to register it. If the local_path is also visible from the server machine (e.g. you're running the upload on the server itself) then that's sufficient. Otherwise a remote_path argument is also available to tell the server where it should find the table. This is typically needed if the TileDB file ends up mounted at a different location on the local machine and on the OMERO server.

For example, if registering from a Windows machine with a network drive to an OMERO server on Linux:

omero2pandas.upload_table(
    df, "My Custom Table",
    local_path="J:\\data\\tables\\my_omero_table.tiledb",
    remote_path="/network_data/tables/my_omero_table.tiledb"
)

Effectively, local_path is where the current machine should write the data to, remote_path is where that file will be from the OMERO server's point of view. No remote path implies that both machines will see the file at the local path.
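If the same share is mounted at different roots on the two machines, the remote path can be derived from the local one instead of being written out by hand. The sketch below assumes a Windows client and a Linux server. The `to_remote_path` helper and the two mount roots are hypothetical.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_remote_path(local_path, local_root, remote_root):
    """Translate a Windows path under local_root into a POSIX path under remote_root.

    Raises ValueError if local_path is not inside local_root.
    """
    relative = PureWindowsPath(local_path).relative_to(PureWindowsPath(local_root))
    return str(PurePosixPath(remote_root).joinpath(*relative.parts))


local = "J:\\data\\tables\\my_omero_table.tiledb"
remote = to_remote_path(local, "J:\\data", "/network_data")
# local and remote can then be passed as local_path and remote_path.
```

Using the pure path classes means the translation works regardless of which operating system the script itself runs on.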

Note that when a table is registered remotely it is not part of the Managed Repository used to store OMERO data. This means that it becomes the user's responsibility to update the table object on the OMERO server if the file is moved/deleted.

How it works

Remote registration is a two-step process: conversion to TileDB format followed by registration using an HTTP API.

The TileDB conversion is handled automatically by omero2pandas. This largely involves creating a TileDB database from your dataframe and adding a few details to the converted table array metadata. Most native pandas column types are supported.

The actual registration involves telling the server that we'd like to register a remote table and providing it with the TileDB location. There is then a security check to ensure that the user is able to read the file that they've asked the API to register. This is achieved by asking the user to provide a "SecretToken" which must also be present in the TileDB array metadata. omero2pandas will manage the creation of this token automatically. When using omero2pandas this process also implicitly confirms that the table seen by the server is the same one written by this library.

While it is possible to manually create and register tables without a SecretToken, this is strongly discouraged as other users could potentially register and access the same table without permission. With that in mind the implementation within omero2pandas could be considered as an example of "best practice" for handling remote table registration.

If the registration succeeds the tables API will create all the necessary OMERO objects and return a FileAnnotation ID just as if we'd uploaded the table normally.
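The idea behind the SecretToken check can be illustrated with a small, purely conceptual sketch. It does not use TileDB or the OMERO API: a plain dictionary stands in for the array metadata, and the function names and the metadata key are assumptions made for illustration only.

```python
import hmac
import secrets

# Hypothetical metadata key; the real key used by the server may differ.
TOKEN_KEY = "SecretToken"


def write_table_metadata():
    """Simulate the writer: generate a random token and store it in the metadata."""
    token = secrets.token_urlsafe(32)
    metadata = {TOKEN_KEY: token}
    return metadata, token


def server_accepts(metadata, supplied_token):
    """Simulate the server-side check: the supplied token must match the stored one."""
    stored = metadata.get(TOKEN_KEY)
    if stored is None:
        return False
    # Constant-time comparison avoids leaking how much of the token matched.
    return hmac.compare_digest(stored, supplied_token)


metadata, token = write_table_metadata()
accepted = server_accepts(metadata, token)
rejected = server_accepts(metadata, "wrong-token")
```

Only someone who can read the file's metadata, or who wrote the file in the first place, can present the matching token. That is what ties the registration request to read access on the file.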

Converting to TileDB format without registration

While the processes of tiledb conversion and remote registration are intended to be used together, it is possible to only convert a table to an OMERO Plus-compatible TileDB file. This can be achieved as follows:

import pandas as pd
from omero2pandas.remote import create_tiledb
df = pd.read_csv("/path/to/table.csv")
secret_token = create_tiledb(df, "/path/to/output.tiledb")

This will convert an input dataframe or CSV file path into a TileDB file with appropriate metadata for remote registration.

For convenience the creation function will return the SecretToken needed to perform remote registration securely. That token could also be retrieved from the TileDB file metadata if necessary.
