Merged
Commits
37 commits
c8446ad
Start snapshot directory.
sampottinger Jan 3, 2025
3e8a870
Additional README edits.
sampottinger Jan 3, 2025
6aec7cb
Add combine shards.
sampottinger Jan 3, 2025
01d4664
Fixes for #114 initial code.
sampottinger Jan 3, 2025
8312d7a
Fix test_changed in test_combine_shards.
sampottinger Jan 3, 2025
d5d2373
Additional bash scripts.
sampottinger Jan 3, 2025
7c02f93
Fix in test for #114.
sampottinger Jan 3, 2025
559166c
Type fixes for #114.
sampottinger Jan 3, 2025
400643a
Viz updates for expanded types checks.
sampottinger Jan 3, 2025
4ea9b4b
Additional type fixes #114.
sampottinger Jan 3, 2025
3368e5c
Style fixes for #114.
sampottinger Jan 3, 2025
efefb1a
Add generate_indicies.
sampottinger Jan 3, 2025
2915804
Move some values to const for #114.
sampottinger Jan 3, 2025
6ad249b
Add check for presence only index.
sampottinger Jan 3, 2025
cdd9916
Additional fixes for #115.
sampottinger Jan 3, 2025
d3043e5
Add additional explanation for #115.
sampottinger Jan 3, 2025
6ca70a6
Update tests for #115.
sampottinger Jan 3, 2025
367844b
Merge pull request #115 from SchmidtDSE/v2-presence-only-fix
sampottinger Jan 3, 2025
f540475
Update coverage targets.
sampottinger Jan 4, 2025
0539c91
Expand tests for ignorable re #111.
sampottinger Jan 4, 2025
8a40820
Add tests for generate_indicies.
sampottinger Jan 4, 2025
131c09e
Add additional bash scripts for #114.
sampottinger Jan 4, 2025
b8f8ed5
Type fixes for tests in #114.
sampottinger Jan 4, 2025
e51a38c
Partial implementation of render_flat.
sampottinger Jan 4, 2025
bf49a8b
Docstring complete on render_flat.
sampottinger Jan 4, 2025
51db277
Fix unit tests for #114.
sampottinger Jan 4, 2025
65b5ce9
Through pyflakes for render_flat.
sampottinger Jan 4, 2025
6f2aaa9
Style fixes in flat_index_util.
sampottinger Jan 4, 2025
2426f18
Additional style fixes on #114.
sampottinger Jan 4, 2025
43b19de
Through mypy in render_flat.
sampottinger Jan 4, 2025
2d081ca
Add tests for render flat.
sampottinger Jan 4, 2025
f576e0f
Test fixes for mypy as part of #114.
sampottinger Jan 4, 2025
7adeee9
Add initial request_source implementation.
sampottinger Jan 4, 2025
294325d
Type fixes for request source.
sampottinger Jan 4, 2025
a46ca21
Add docstring for request source.
sampottinger Jan 4, 2025
426d9d6
Fixes for #114.
sampottinger Jan 4, 2025
d03b171
Add write main index.
sampottinger Jan 4, 2025
6 changes: 5 additions & 1 deletion .github/workflows/build.yml
Expand Up @@ -17,14 +17,18 @@ jobs:
run: pip install -e .[dev]
- name: Install dev dependencies for app
run: pip install -r afscgapviz/requirements.txt
- name: Install dev dependencies for snapshot
run: pip install -r snapshot/requirements.txt
- name: Install afscgap
run: pip install .
- name: Unit tests main
run: nose2
- name: Unit test app
run: nose2 --start-dir=afscgapviz
- name: Unit test snapshot
run: nose2 --start-dir=snapshot
- name: Check types
run: mypy **/*.py
run: mypy **/*.py --check-untyped-defs
- name: Check errors
run: pyflakes **/*.py
- name: Check style
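The `--check-untyped-defs` flag added to the type check step matters because mypy, by default, does not check the bodies of functions whose signatures carry no annotations. A minimal sketch of the kind of code this affects follows; the function names are hypothetical and do not come from this repository.

```python
def collect_surveys():
    # No annotations on the signature, so mypy skips this body by default.
    surveys = ['GOA', 'AI']
    # mypy infers list[str]; this append is reported only with --check-untyped-defs.
    surveys.append(2025)
    return surveys


def collect_surveys_typed() -> list:
    # Annotated signature: mypy checks this body with or without the flag.
    surveys = ['GOA', 'AI']
    surveys.append('2025')
    return surveys
```

Both functions run without error at runtime, which is why a mismatch like the one in the first function can go unnoticed until the flag is enabled.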
8 changes: 5 additions & 3 deletions CONTRIBUTING.md
Expand Up @@ -14,9 +14,9 @@ Thank you for your contribution. We appreciate the community's help in any capac
In order to ensure the conceptual integrity and readability of our code, we have a few guidelines for Python code under the `afscgap` library itself:

- Please try to follow the conventions laid out by the project in existing code. In cases of ambiguity, please refer to the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html) where possible.
- Tests are encouraged and we aim for 80% coverage where feasible.
- Type hints are encouraged and we aim for 80% coverage where feasible.
- Docstrings are encouraged and we aim for 80% coverage. Please use the [Google-style](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) to ensure that our automated documentation system can use your work.
- Tests are encouraged.
- Type hints are encouraged.
- Docstrings are encouraged. Please use the [Google-style](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) to ensure that our automated documentation system can use your work.
- Please check that you have no mypy errors when contributing.
- Please check that you have no linting (pycodestyle, pyflakes) errors when contributing.
- As contributors may be periodic, please do not re-write history / squash commits for ease of fast forward.
Expand All @@ -27,6 +27,8 @@ The `afscgap` library itself requires a very high rigor. For other sections incl

Of course, **do not worry if you aren't sure that you met all of our guidelines!** We encourage pull requests and are happy to work through any necessary outstanding tasks with you.

Previous versions of this guide indicated specific coverage targets but those are removed for the `2.x` release as the codebase spans more modalities where different approaches may be more appropriate in different areas.

<br>
<br>

7 changes: 4 additions & 3 deletions README.md
Expand Up @@ -133,9 +133,10 @@ at UC Berkeley](https://dse.berkeley.edu) where [Kevin Koy](https://github.com/k
<br>

## Open Source
We are happy to be part of the open source community.
We are happy to be part of the open source community. We use the following:

At this time, the only open source dependency used by this microlibrary is [Requests](https://docs.python-requests.org/en/latest/index.html) which is available under the [Apache v2 License](https://github.com/psf/requests/blob/main/LICENSE) from [Kenneth Reitz and other contributors](https://github.com/psf/requests/graphs/contributors).
- [Requests](https://docs.python-requests.org/en/latest/index.html) which is available under the [Apache v2 License](https://github.com/psf/requests/blob/main/LICENSE) from [Kenneth Reitz and other contributors](https://github.com/psf/requests/graphs/contributors).
- [fastavro](https://fastavro.readthedocs.io/en/latest/) by Miki Tebeka and Contributors under the [MIT License](https://github.com/fastavro/fastavro/blob/master/LICENSE).

In addition to Github-provided [Github Actions](https://docs.github.com/en/actions), our build and documentation systems also use the following but are not distributed with or linked to the project itself:

Expand All @@ -149,7 +150,7 @@ In addition to Github-provided [Github Actions](https://docs.github.com/en/actio
- [sftp-action](https://github.com/Creepios/sftp-action) under the [MIT License](https://github.com/Creepios/sftp-action/blob/master/LICENSE) from Niklas Creepios.
- [ssh-action](https://github.com/appleboy/ssh-action) under the [MIT License](https://github.com/appleboy/ssh-action/blob/master/LICENSE) from Bo-Yi Wu.

Next, the visualization tool has additional dependencies as documented in the [visualization readme](https://github.com/SchmidtDSE/afscgap/blob/main/afscgapviz/README.md).
Next, the visualization tool has additional dependencies as documented in the [visualization readme](https://github.com/SchmidtDSE/afscgap/blob/main/afscgapviz/README.md). Similarly, the community flat files snapshot updater has additional dependencies as documented in the [snapshot readme](https://github.com/SchmidtDSE/afscgap/blob/main/snapshot/README.md).

Finally, note that the website uses assets from [The Noun Project](https://thenounproject.com/) under the NounPro plan. If used outside of https://pyafscgap.org, they may be subject to a [different license](https://thenounproject.com/pricing/#icons).

37 changes: 34 additions & 3 deletions afscgap/flat_index_util.py
Expand Up @@ -457,6 +457,7 @@ def get_matches(self, value: MATCH_TARGET) -> bool:

FIELD_DATA_TYPE_OVERRIDES = {'date_time': 'datetime'}

# These fields, when indexed, ignore zero values. If not presence only, these need to be included.
PRESENCE_ONLY_FIELDS = {'species_code', 'common_name', 'scientific_name'}


Expand All @@ -479,6 +480,38 @@ def decorate_filter(field: str, original: IndexFilter) -> IndexFilter:
return UnitConversionIndexFilter(original, user_units, system_units)


def determine_if_ignorable(field: str, param: afscgap.param.Param, presence_only: bool) -> bool:
    """Determine if a field parameter is ignored for pre-filtering.

    Determine if a field parameter is ignored for pre-filtering, turning it into a noop because
    pre-filtering isn't possible or precomputed indices are not available.

    Args:
        field: The name of the field for which filters should be made.
        param: The parameter to apply for the field.
        presence_only: Flag indicating if the query is for presence so zero inference records can be
            excluded.

    Returns:
        True if ignorable and false otherwise.
    """
    if param.get_is_ignorable():
        return True

    # If the field index is presence only and this isn't a presence only request, the index must be
    # ignored (cannot be used to pre-filter results).
    zero_inference_required = not presence_only
    field_index_excludes_zeros = field in PRESENCE_ONLY_FIELDS
    if zero_inference_required and field_index_excludes_zeros:
        return True

    filter_type = param.get_filter_type()
    if filter_type == 'empty':
        return True

    return False


def make_filters(field: str, param: afscgap.param.Param,
presence_only: bool) -> typing.Iterable[IndexFilter]:
"""Make filters for a field describing a backend-agnostic parameter.
Expand All @@ -494,12 +527,10 @@ def make_filters(field: str, param: afscgap.param.Param,
be approximated such that all matching results are included but some included results may
not match, requiring re-evaluation locally.
"""
if param.get_is_ignorable():
if determine_if_ignorable(field, param, presence_only):
return []

filter_type = param.get_filter_type()
if filter_type == 'empty':
return []

if field in FIELD_DATA_TYPE_OVERRIDES:
data_type = FIELD_DATA_TYPE_OVERRIDES[field]
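The decision made by `determine_if_ignorable` in this hunk can be sketched in isolation as follows. This is a simplified restatement for illustration, not the library code: `FakeParam` and `is_ignorable` are hypothetical stand-ins so the snippet runs without `afscgap` installed, and only `PRESENCE_ONLY_FIELDS` is copied from the diff.

```python
import dataclasses

# Copied from the diff: indexes for these fields leave out zero catch records.
PRESENCE_ONLY_FIELDS = {'species_code', 'common_name', 'scientific_name'}


@dataclasses.dataclass
class FakeParam:
    """Hypothetical stand-in for afscgap.param.Param (illustration only)."""
    ignorable: bool
    filter_type: str

    def get_is_ignorable(self) -> bool:
        return self.ignorable

    def get_filter_type(self) -> str:
        return self.filter_type


def is_ignorable(field: str, param: FakeParam, presence_only: bool) -> bool:
    """Mirror the three checks shown in determine_if_ignorable."""
    if param.get_is_ignorable():
        return True
    # A presence-only index cannot pre-filter a query that needs zero inference.
    if not presence_only and field in PRESENCE_ONLY_FIELDS:
        return True
    return param.get_filter_type() == 'empty'


cases = [
    ('species_code', False),  # zero inference needed, so the index is skipped
    ('species_code', True),   # presence only, so the index can be used
    ('count', False),         # not a presence-only field, so the index can be used
]
results = {
    (field, presence_only): is_ignorable(field, FakeParam(False, 'str'), presence_only)
    for field, presence_only in cases
}
```

The three cases correspond to `test_require_zero_unsupported`, `test_presence_only_unsupported`, and `test_require_zero_supported` in the test file changed by this pull request.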
9 changes: 9 additions & 0 deletions afscgap/test/test_convert.py
@@ -1,3 +1,12 @@
"""
Tests for unit conversion.

(c) 2025 Regents of University of California / The Eric and Wendy Schmidt Center
for Data Science and the Environment at UC Berkeley.

This file is part of afscgap released under the BSD 3-Clause License. See
LICENSE.md.
"""
import unittest
import unittest.mock

49 changes: 49 additions & 0 deletions afscgap/test/test_flat_index_util.py
Expand Up @@ -347,6 +347,50 @@ def test_decorate_filter_active_none(self):
self.assertFalse(decorated.get_matches(None))


class DetermineIfIgnorableTests(unittest.TestCase):

    def test_explicit_ignorable_require_zero(self):
        param = self._make_test_param(True, 'int')
        ignorable = afscgap.flat_index_util.determine_if_ignorable('test', param, False)
        self.assertTrue(ignorable)

    def test_explicit_ignorable_presence_only(self):
        param = self._make_test_param(True, 'int')
        ignorable = afscgap.flat_index_util.determine_if_ignorable('test', param, True)
        self.assertTrue(ignorable)

    def test_require_zero_supported(self):
        param = self._make_test_param(False, 'int')
        ignorable = afscgap.flat_index_util.determine_if_ignorable('count', param, False)
        self.assertFalse(ignorable)

    def test_require_zero_unsupported(self):
        param = self._make_test_param(False, 'str')
        ignorable = afscgap.flat_index_util.determine_if_ignorable('species_code', param, False)
        self.assertTrue(ignorable)

    def test_presence_only_unsupported(self):
        param = self._make_test_param(False, 'str')
        ignorable = afscgap.flat_index_util.determine_if_ignorable('species_code', param, True)
        self.assertFalse(ignorable)

    def test_empty(self):
        param = afscgap.param.EmptyParam()
        ignorable = afscgap.flat_index_util.determine_if_ignorable('count', param, True)
        self.assertTrue(ignorable)

    def test_plain_not_ignorable(self):
        param = afscgap.param.IntRangeParam(1, None)
        ignorable = afscgap.flat_index_util.determine_if_ignorable('count', param, True)
        self.assertFalse(ignorable)

    def _make_test_param(self, ignorable, filter_type):
        param = unittest.mock.MagicMock()
        param.get_is_ignorable = unittest.mock.MagicMock(return_value=ignorable)
        param.get_filter_type = unittest.mock.MagicMock(return_value=filter_type)
        return param


class MakeFilterTests(unittest.TestCase):

def test_empty(self):
Expand All @@ -365,6 +409,11 @@ def test_string_false(self):
filters = afscgap.flat_index_util.make_filters('common_name', param, True)
self.assertEqual(len(filters), 1)
self.assertFalse(filters[0].get_matches('other'))

def test_presence_only(self):
param = afscgap.param.StrEqualsParam('test')
filters = afscgap.flat_index_util.make_filters('common_name', param, False)
self.assertEqual(len(filters), 0)

def test_int_true(self):
param = afscgap.param.IntEqualsParam(1)
14 changes: 7 additions & 7 deletions afscgapviz/afscgapviz.py
Expand Up @@ -14,7 +14,7 @@
import sqlite3
import typing

import flask
import flask # type: ignore

import data_util
import model
Expand Down Expand Up @@ -265,7 +265,7 @@ def render_page():
with conn_generator() as con:
return flask.render_template(
'viz.html',
displays=get_display_info(con, state)['state'],
displays=get_display_info(con, state)['state'], # type: ignore
get_species_select_content=get_species_select_content
)

Expand Down Expand Up @@ -387,7 +387,7 @@ def download_geohashes():
else:
base_sql = sql_util.get_sql('query')
query_sql = base_sql % (geohash_size + 1, species_filter[0])
query_args = (year, survey, species_filter[1])
query_args = (year, survey, species_filter[1]) # type: ignore

output_io = io.StringIO()
writer = csv.DictWriter(
Expand Down Expand Up @@ -416,7 +416,7 @@ def download_geohashes():
writer.writerows(results_dict_final)

full_filename_pieces = comparison_filename_pieces + filename_pieces
filename_spaces = '_'.join(full_filename_pieces)
filename_spaces = '_'.join(full_filename_pieces) # type: ignore
filename = filename_spaces.replace(' ', '_')

if FILENAME_REGEX.match(filename) is None:
Expand Down Expand Up @@ -556,7 +556,7 @@ def try_float(target: str) -> float:
species_filter[0],
geohash_size + 1
)
query_args = (year, survey, species_filter[1])
query_args = (year, survey, species_filter[1]) # type: ignore

with conn_generator() as connection:
cursor = connection.cursor()
Expand Down Expand Up @@ -586,7 +586,7 @@ def try_float(target: str) -> float:
max_temp,
first_cpue,
second_cpue
) = result_float
) = result_float # type: ignore

ret_object = {
'cpue': {
Expand All @@ -602,7 +602,7 @@ def try_float(target: str) -> float:
}

if is_comparison:
ret_object['cpue']['second'] = {
ret_object['cpue']['second'] = { # type: ignore
'name': other_species_filter[1],
'year': other_year,
'value': second_cpue
4 changes: 2 additions & 2 deletions afscgapviz/build_database.py
Expand Up @@ -327,8 +327,8 @@ def download_main(args):
for year in years:
for survey in SURVEYS:

with connection as cursor:
download_and_persist_year(survey, year, cursor, geohash_size)
with connection as cursor: # type: ignore
download_and_persist_year(survey, year, cursor, geohash_size) # type: ignore

print('Completed %d for %s.' % (year, survey))
time.sleep(SLEEP_TIME)
65 changes: 65 additions & 0 deletions snapshot/README.md
@@ -0,0 +1,65 @@
# Snapshot Updater
Scripts to update the community Avro flat files as described at [data.pyafscgap.org](https://data.pyafscgap.org/).

## Purpose
Due to API limitations that prevent filtering joined data prior to downloading locally, community flat files in [Avro format](https://avro.apache.org/) offer pre-joined data with indices which `pyafscgap` can use to avoid downloading all catch data or specifying individual hauls. This directory contains the scripts used to update those resources, which are available at [data.pyafscgap.org](https://data.pyafscgap.org/).

## Usage
The updater can be executed with individual scripts or in its entirety through bash. Note that some of these steps use environment variables specified in local setup.

### Python library
These community files are used by default when interacting with the `pyafscgap` library. See [pyafscgap.org](https://pyafscgap.org/) for instructions. These Avro files will be requested and iterated by the client without the user needing to understand the underlying file format. Only the `pyafscgap` interface is intended to be maintained across major versions for backwards compatibility.

### Prebuilt payloads
Prebuilt Avro files are available via HTTPS through [data.pyafscgap.org](https://data.pyafscgap.org/). There are two subdirectories of files.

First, [index](https://data.pyafscgap.org/index) contains "index data files" which indicate where catch data can be found. These indices include filenames that can be found in `joined`. Each file maps from a value for the filename's variable to the set of joined flat files where those data can be found. Each value refers to a specific haul where floating point values are rounded to two decimal places. Note that, due to this rounding, more precise filters will have to further sub-filter after collecting relevant data from the `joined` subdirectory.
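The interaction between rounded index values and sub-filtering can be sketched as follows. This is an illustration under stated assumptions rather than the actual file layout: the file names, the `depth_m` field, and the in-memory dictionaries are all hypothetical, and the real index files are Avro rather than Python objects.

```python
import collections

# Hypothetical joined files: file name -> records for that haul.
JOINED = {
    'haul_a.avro': [{'depth_m': 12.338}],
    'haul_b.avro': [{'depth_m': 12.342}],
    'haul_c.avro': [{'depth_m': 15.0}],
}


def build_index(joined, field):
    """Map each value, rounded to two decimal places, to the files containing it."""
    index = collections.defaultdict(set)
    for name, records in joined.items():
        for record in records:
            index[round(record[field], 2)].add(name)
    return dict(index)


def candidate_files(index, low, high):
    """Use the index to find files which may hold values in [low, high]."""
    low_rounded, high_rounded = round(low, 2), round(high, 2)
    return {
        name
        for value, names in index.items()
        if low_rounded <= value <= high_rounded
        for name in names
    }


def query(joined, index, field, low, high):
    """Pre-filter with the index and then sub-filter on the unrounded values."""
    names = sorted(candidate_files(index, low, high))
    return [
        (name, record)
        for name in names
        for record in joined[name]
        if low <= record[field] <= high
    ]


index = build_index(JOINED, 'depth_m')
candidates = candidate_files(index, 12.341, 12.344)
matches = query(JOINED, index, 'depth_m', 12.341, 12.344)
```

Here the index alone returns both `haul_a.avro` and `haul_b.avro` because 12.338 and 12.342 both round to 12.34, while the sub-filter keeps only the record from `haul_b.avro`.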

Second, [joined](https://data.pyafscgap.org/joined) includes all catch data joined against the species list and hauls table to create a single "flat" file which fully describes all information available for each catch. Each record is a single catch and each file is a single haul where a haul takes place within a specific year and survey.
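As a rough sketch of what "flat" means here, the snippet below joins catch records against hauls and species and then groups the results by haul. All field names and values are made up for illustration and do not reflect the actual schema of the Avro files.

```python
# Hypothetical source tables with made-up identifiers and values.
HAULS = {1: {'year': 2021, 'survey': 'GOA'}}
SPECIES = {10: {'common_name': 'example fish'}}
CATCHES = [
    {'haul_id': 1, 'species_code': 10, 'count': 4},
    {'haul_id': 1, 'species_code': 10, 'count': 2},
]


def flatten(catch, hauls, species):
    """Combine one catch with the attributes of its haul and species."""
    record = dict(catch)
    record.update(hauls[catch['haul_id']])
    record.update(species[catch['species_code']])
    return record


def group_by_haul(catches, hauls, species):
    """Return a mapping from haul ID to the flat records for that haul."""
    files = {}
    for catch in catches:
        files.setdefault(catch['haul_id'], []).append(flatten(catch, hauls, species))
    return files


flat_files = group_by_haul(CATCHES, HAULS, SPECIES)
```

Each key of `flat_files` stands in for one joined file and each list element for one catch record within it.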

Note that, while provided as a service to the community, these Avro files and directory structure may be changed in the future. These files exist to serve the `pyafscgap` functionality as the NOAA APIs change over time. Therefore, for a long term stable interface with documentation and further type annotation, please consider using the `pyafscgap` library instead.

### Manual execution
In order to build the Avro files yourself by requesting, joining, and indexing original upstream API data, you can simply execute `bash execute_all.sh` after local setup. This will build the files on S3, from which they can be deployed to an SFTP server trivially.

## Local setup
Local environment setup varies depending on how these files are used.

### Python library setup
Simply install `pyafscgap` normally to have the library automatically use the flat files for queries.

### Prebuilt payloads environment
These files may be used by any programming language or environment supporting Avro. For more information, see the official [Avro documentation](https://avro.apache.org/docs/) though [fastavro](https://fastavro.readthedocs.io/en/latest/) is recommended for use in Python.

### Environment for manual execution
To perform manual execution, these scripts expect to use [AWS S3](https://aws.amazon.com/s3/) prior to deployment to a simple SFTP server. In order to use these scripts, the following environment variables need to be set after installing dependencies (optionally within a virtual environment) via `pip install -r requirements.txt`:

- `AWS_ACCESS_KEY`: This is the access key used to upload completed payloads to AWS S3 or to request those data as part of distributed indexing and processing.
- `AWS_ACCESS_SECRET`: This is the secret associated with the access key used to upload completed payloads to AWS S3 or to request those data as part of distributed indexing and processing.
- `BUCKET_NAME`: This is the name of the bucket within S3 where completed payloads should be uploaded or requested.

These may be set within `.bashrc` files or similar through `export` commands. Finally, these scripts expect [Coiled](https://www.coiled.io/) to perform distributed tasks.
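Before running the scripts, it may help to confirm that the three variables listed above are present. The following is a minimal sketch of such a check; `find_missing` is a hypothetical helper and is not part of the snapshot scripts.

```python
import os

# Variable names as listed in this README.
REQUIRED_VARS = ('AWS_ACCESS_KEY', 'AWS_ACCESS_SECRET', 'BUCKET_NAME')


def find_missing(environ=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == '__main__':
    missing = find_missing()
    if missing:
        print('Missing environment variables: ' + ', '.join(missing))
    else:
        print('All required environment variables are set.')
```

Passing a plain dictionary in place of `os.environ` makes the helper easy to exercise in unit tests.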

## Testing
Unit tests can be executed by running `nose2` within the `snapshot` directory.

## Deployment
Files generated in S3 can be trivially deployed to an SFTP server or accessed directly from AWS.

## Development
These scripts follow the same development guidelines as the overall `pyafscgap` project. Note that style and type checks are enforced through CI / CD systems. See [contributors documentation](https://github.com/SchmidtDSE/afscgap/blob/main/CONTRIBUTING.md).

## Open source
The snapshots updater uses the following open source packages:

- [bokeh](https://docs.bokeh.org/en/latest/) from Bokeh Contributors and NumFocus under the [BSD License](https://github.com/bokeh/demo.bokeh.org/blob/main/LICENSE.txt).
- [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) under the [Apache v2 License](https://github.com/boto/boto3/blob/develop/LICENSE).
- [dask](https://www.dask.org/) from Anaconda and Contributors under the [BSD License](https://github.com/dask/dask/blob/main/LICENSE.txt).
- [fastavro](https://fastavro.readthedocs.io/en/latest/) by Miki Tebeka and Contributors under the [MIT License](https://github.com/fastavro/fastavro/blob/master/LICENSE).
- [requests](https://docs.python-requests.org/en/latest/index.html) which is available under the [Apache v2 License](https://github.com/psf/requests/blob/main/LICENSE) from [Kenneth Reitz and other contributors](https://github.com/psf/requests/graphs/contributors).
- [toolz](https://toolz.readthedocs.io/en/latest/) under a [BSD License](https://github.com/pytoolz/toolz/blob/master/LICENSE.txt).

We thank these projects for their contribution. Note that we also use [coiled](https://www.coiled.io/).

## License
Code to generate these flat files is released alongside the rest of the pyafscgap project under the [BSD License](https://github.com/SchmidtDSE/afscgap/blob/main/LICENSE.md). See [data.pyafscgap.org](https://data.pyafscgap.org/) for further license details regarding prebuilt files.