189 commits
10fa806
Start a modified document describing the Dataset graph pattern.
mbjones Mar 19, 2021
456bb45
Port the essential services from Docker to Kubernetes
ThomasThelen Mar 20, 2021
683d625
Add the Dockerfiles for building the K8 requirements and add a Readme…
ThomasThelen Mar 22, 2021
7e7e402
Run 2to3 -w on all .py files
amoeba Feb 9, 2021
fc45196
Add a basic Dockerfile to run tests
amoeba Feb 9, 2021
d85e04a
Upgrade for Python3
amoeba Feb 10, 2021
70e63dc
Add py.bak files to gitignore
amoeba Feb 10, 2021
ff27fdf
Fix dockerfile merge conflicts
ThomasThelen Mar 23, 2021
b414825
Add Pipfile for worker service
amoeba Mar 2, 2021
75fd8e5
Add the docker folder to the helm ignore
ThomasThelen Mar 23, 2021
9ef251e
Merge pull request #22 from DataONEorg/feature_kubernetes_deployment
amoeba Mar 24, 2021
2fd8b20
Add docs for how to install d1lod on macOS+pyenv
amoeba Mar 25, 2021
e385b02
Port K8 to Python 3
ThomasThelen Mar 25, 2021
7142084
Merge branch 'develop' into feature_1_python3
ThomasThelen Mar 25, 2021
ff7a02b
Add instructions for adding authentication to the SPARQL endpoint
ThomasThelen Mar 25, 2021
0077585
Merge pull request #25 from DataONEorg/feature_1_python3
amoeba Mar 26, 2021
d598ad8
Merge pull request #26 from DataONEorg/feature_auth_sparql
amoeba Mar 26, 2021
b7ef1bc
WIP: Partially update mappings document
amoeba Apr 8, 2021
8e6cda2
Finish first draft of mappings documents
amoeba Apr 10, 2021
e018a08
Remove 2to3.txt from Python 2 to 3 migration
amoeba Apr 13, 2021
d9e67c1
Remove unneeded void.ttl file from d1lod folder
amoeba Apr 13, 2021
05741d9
Update d1lod package readme and setup.py
amoeba Apr 13, 2021
8ca22e0
Tweak mappings doc and add full example
amoeba Apr 21, 2021
da57b25
Fix URL in mappings doc
amoeba Apr 22, 2021
4cfebac
Create initial prototype of web app
amoeba Apr 22, 2021
109bc54
WIP Begin overhaul of classes for Slinky
amoeba Apr 24, 2021
8378a34
Merge branch 'feature_web_app' into feature_update_graph_pattern
amoeba Apr 24, 2021
478b62f
Completely refactor d1lod package
amoeba Apr 30, 2021
46009a7
Remove top-level makefile
amoeba Apr 30, 2021
062c562
Convert web front-end to use a SlinkyClient
amoeba Apr 30, 2021
2fca953
Remove unused variable in SparqlTripleStore.get_ua_string
amoeba Apr 30, 2021
a782ff9
Fix broken URIs in eml_processor
amoeba Apr 30, 2021
e1dd51f
Add note about Exceptions in d1lod readme
amoeba May 1, 2021
85163f6
Wrap up last bit of work for full EML processing
amoeba May 4, 2021
fd6f7cb
Create a Blazegraph connector to replace Virtuoso
amoeba May 4, 2021
ce15120
Alignment graphic for OBOE/SSN-EXT/schema.org.
mbjones May 12, 2021
ac73868
Add a main method in cli.py so the cli can be debugged
amoeba May 12, 2021
104e190
Create an ISOProcessor class to process ISO docs
amoeba May 12, 2021
777cf1f
Make SlinkyClient's choice of store an argument
amoeba May 12, 2021
04f9407
Add more tests to Blazegraph and SparqlStore
amoeba May 12, 2021
be4d94a
Hook up new classes for easy testing
amoeba May 15, 2021
9783fb6
Create new Virtuoso-specific store model
amoeba May 18, 2021
08cbd7a
Change BlazegraphStore's default port
amoeba May 18, 2021
025933b
Remove unused Exception from client.py
amoeba May 18, 2021
cb2ff80
Do a cleanup pass over the entire test suite
amoeba May 18, 2021
414fce3
Adjust logic for when update_job runs or doesn't
amoeba May 18, 2021
9daeef9
Re-use module-level global in jobs add_dataset_job
amoeba May 18, 2021
65879b9
Add --debug argument to work command in cli
amoeba May 18, 2021
729d615
Make get_new_datasets_since query range-exclusive
amoeba May 18, 2021
776f842
Use response.content instead of response.text
amoeba May 19, 2021
dd233d9
Add blazegraph to d1lod package's docker compose
amoeba May 19, 2021
21e4323
Begin work refactoring setups/environments
amoeba May 19, 2021
d1af30a
Remove unused code from cli.py
amoeba May 26, 2021
aab1c30
Add SPARQL DELETE support to VirtuosoStore
amoeba May 26, 2021
a6ef121
Prevent EMLProcessor from re-inserting identifier blank nodes
amoeba May 26, 2021
ae07392
Make VirtuosoStore's count method support patterns
amoeba May 26, 2021
1c315ed
Capitalize 'select' in VirtuosoStore.all
amoeba May 26, 2021
9b35c2a
Add remaining pieces of VirtuosoStore delete impl
amoeba May 26, 2021
9958cf3
Remove extra trailing slash from VirtuosoStore endpoint
amoeba May 26, 2021
49d78bd
Fix bug in Processor's handling of sysmeta 'obsoletes'
amoeba May 26, 2021
7eee7bd
Add a datatype for isAccessibleForFree triples (boolean)
amoeba May 26, 2021
2a8586b
Fix bug in schema:byteSize routine
amoeba May 26, 2021
c3eadc4
Guard against unset accessPolicy in Processor
amoeba May 26, 2021
7f64dbe
Add .strip() calls to all ElementTree .text calls
amoeba May 26, 2021
c954903
Remove unused code from ISOProcessor
amoeba May 26, 2021
1d43a69
Fix logic bug in handling datePublished
amoeba May 26, 2021
ee1907c
Add schema:distribution triples
amoeba May 28, 2021
9eaec1e
Add insert, insertall, clear, and count commands to CLI
amoeba May 28, 2021
6c60a8f
Add architecture diagram to readme
amoeba May 28, 2021
da28ab4
Finish up support for semantic annotations
amoeba Jun 3, 2021
dffb273
Finish spdx:Checksum support
amoeba Jun 3, 2021
ee4e1a6
Tweak style of slinky-architecture diagram a tad
amoeba Jun 3, 2021
fa74fa5
Move lookup* functions around in eml_processor
amoeba Jun 4, 2021
c154540
Clean up whitespace in readme
amoeba Jun 4, 2021
7482b19
Add in support for SOSO PropertyValue model for attributes
amoeba Jun 4, 2021
6d15143
Rename variable in test_eml220_processor
amoeba Jun 4, 2021
d4ac71c
Add count and format options to cli's get method
amoeba Jun 4, 2021
30e06a6
Remove unused pagination code in filtered_d1_client
amoeba Jun 5, 2021
aeec699
Finish up implementation of EML attributes
amoeba Jun 5, 2021
471dda8
Remove RQ Dashboard from compose file
amoeba Jun 5, 2021
8f1217d
Change update schedule from 5min to 1min
amoeba Jun 5, 2021
59ca2cb
Remove test for double-processing
amoeba Jun 5, 2021
91aa958
Re-organize code between client and jobs module
amoeba Jun 5, 2021
d89cf23
Add start of test suite for client
amoeba Jun 5, 2021
aa24cf2
Fix broken imports from previous refactor
amoeba Jun 5, 2021
3928b6d
Fix bug in FilteredCoordinatingNodeClient logic
amoeba Jun 5, 2021
81f81c3
Fix test regressions in for FilteredD1Client
amoeba Jun 23, 2021
72b00fa
Switch d1lod test suite's docker-compose to use official VOS image
amoeba Jun 23, 2021
60ca5d9
Remove the persistent volume decleration & rename d1lod folder
ThomasThelen Jun 3, 2021
7ba9b9c
Fix invalid EML doc in d1lod test suite
amoeba Jul 8, 2021
55ca3bf
Use a ClusterIP rather than NodePort for the Virtuoso service
ThomasThelen Jul 8, 2021
04c2487
Reorder ClusterIP and add instructions for connecting
ThomasThelen Jul 9, 2021
8d664dc
Merge remote-tracking branch 'origin/develop' into feature_update_gra…
ThomasThelen Aug 19, 2021
4b78c5c
Create two separate worker deployments that can be individually scaled
ThomasThelen Nov 5, 2021
16c9632
Add a step to the Dockerfile to install d1lod to the image
ThomasThelen Nov 5, 2021
50aa722
Refactor the Scheduler and SlinkyClient interactions to support servi…
ThomasThelen Nov 5, 2021
8e03b92
Add __init__.py to the iso folder to let the python packager know we …
ThomasThelen Nov 5, 2021
3051b99
Refactor the scheduler to always pull an image to avoid using old cac…
ThomasThelen Nov 5, 2021
832c7a2
Change the name of 'redis-main' deployment to just 'redis'.
ThomasThelen Nov 5, 2021
ae271ea
Remove the 'docker' folder since the d1lod image is now being used by…
ThomasThelen Nov 5, 2021
b938f59
Remove helm chart fils and simplify the deployment directory structure
ThomasThelen Nov 5, 2021
913791c
Combine the enable-update feature with the virtuoso image
ThomasThelen Nov 5, 2021
49e9c72
Remove debug flags from the worker deployments
ThomasThelen Nov 5, 2021
6d6306c
Add a ReadinessProbbe to the Virtuoso deployment
ThomasThelen Nov 6, 2021
a266898
Add ReadinessProbe to redis
ThomasThelen Nov 6, 2021
145b22f
Reduce startup time
ThomasThelen Nov 6, 2021
5f15ce7
Add a Makefile for ordered deployments
ThomasThelen Nov 6, 2021
97c5441
Use CephFS for storage
ThomasThelen Nov 6, 2021
f58aaff
Use the slinky dockerhub account for pulling images
ThomasThelen Nov 6, 2021
72ddb38
Create kubernetes architecture diagrams and update the Readme
ThomasThelen Nov 6, 2021
01461c7
Add a unit test for checking the problematic EML document
ThomasThelen Nov 6, 2021
1dcad7f
Fix eml path
ThomasThelen Nov 6, 2021
10e6b12
Preserve the ElementTree.Element identifier
ThomasThelen Nov 6, 2021
1ad7647
Merge pull request #43 from DataONEorg/feature_update_graph_pattern
amoeba Nov 10, 2021
5857673
Merge branch 'develop' into deployment_upgrades
ThomasThelen Nov 10, 2021
5e15885
Merge branch 'develop' into 37_fix
ThomasThelen Nov 10, 2021
9d1134d
Merge pull request #48 from DataONEorg/deployment_upgrades
amoeba Nov 11, 2021
9fde47b
Merge pull request #50 from DataONEorg/37_fix
amoeba Nov 11, 2021
6da693c
Make the worker and scheduler wait for redis and virtuoso before star…
ThomasThelen Nov 18, 2021
66f1f3f
Add more deployment options to the makefile
ThomasThelen Nov 18, 2021
ea6feb4
Use a configMap for storing the networking environmental variables
ThomasThelen Nov 18, 2021
3ff13e3
Remove the cli, development, and production environemnts and use env …
ThomasThelen Nov 18, 2021
bfb607b
Generalize the graph database endpoint so that others like blazegraph…
ThomasThelen Nov 18, 2021
aef613b
Remove initContainer
ThomasThelen Dec 8, 2021
dd21017
Add missing unit test changes
ThomasThelen Dec 8, 2021
666e0e0
Fix import
ThomasThelen Dec 8, 2021
65b3f33
Remove virtuoso env var from the dockerfile
ThomasThelen Dec 8, 2021
c921777
Use the REDIS_HOST env var for running the scheduler
ThomasThelen Dec 9, 2021
d53fafa
Remove --debug flags
ThomasThelen Dec 9, 2021
0be270f
Update the readme with dockerized testing instructions
ThomasThelen Dec 9, 2021
ebf2c1f
Add BLAZEGRAPH_ environmental variables for unit testing
ThomasThelen Dec 9, 2021
e3a8e62
Remove the legacy 'Graph' class
ThomasThelen Dec 15, 2021
e6a62be
Change the default Redis location to localhost
ThomasThelen Dec 16, 2021
def0f9c
Remove unused reference to the Graph class in tests
ThomasThelen Dec 17, 2021
3d82991
Add a flag to the cli arguments to enable using LocalStore
ThomasThelen Dec 17, 2021
57dd95e
Always use localstore for 'get'
ThomasThelen Jan 18, 2022
81ff31a
Add example turtle output
amoeba Feb 3, 2022
d6b8067
Force CLI's get method to use local RDF store
amoeba Feb 24, 2022
7033442
Remove http:// from REDIS_HOST fallback value
amoeba Feb 24, 2022
243a823
Merge pull request #54 from DataONEorg/feature_deployment_ordering
amoeba Feb 24, 2022
55c0a21
Remove legacy codebase
amoeba Feb 24, 2022
6dd715c
Apply Black formatting to repo
amoeba Feb 25, 2022
5a6b6f5
Add note in d1lod readme about using black
amoeba Feb 25, 2022
bee2a66
Set up flake8 and fix issues in d1lod package
amoeba Feb 25, 2022
cf439e3
Split out unit and integration tests
amoeba Feb 25, 2022
db2fce1
Merge pull request #21 from DataONEorg/feature_14_graph_pattern
amoeba Feb 26, 2022
7d2bdbc
Fix bug in adding checksumAlgorithm triples
amoeba Feb 26, 2022
5723f5d
Re-add the starlette web frontend to the slinky stack
ThomasThelen Mar 3, 2022
199a973
Remove unused test api endpoint
ThomasThelen Mar 3, 2022
b70aea5
Merge pull request #68 from DataONEorg/web-frontend
ThomasThelen Mar 4, 2022
715bd52
Add support for EML 2.0.0 and up
ThomasThelen Jun 11, 2022
13e97cc
Remove debug print statements
ThomasThelen Jul 9, 2022
40e8016
Fix the EML 2.2.0 format ID
ThomasThelen Jul 9, 2022
0833ce5
Merge pull request #69 from DataONEorg/expanded_eml_support
amoeba Jul 11, 2022
fe564a4
Tidy update python imports
amoeba Aug 1, 2022
69dab0b
Refactor LocalClient impl and handling
amoeba Aug 1, 2022
d806ccb
Run black on test_eml_processor.py
amoeba Aug 1, 2022
cc77a5e
Separate our compose files and add devcontainer
amoeba Aug 1, 2022
9ec686b
Add sparql proxy to web service
amoeba Aug 1, 2022
a6dc3b9
Update k8s yaml files to match recent changes
amoeba Aug 1, 2022
7cfe25c
Update Slinky readme to cover docker compose
amoeba Aug 1, 2022
d2f43b1
Create first working but rough version of Helm chart
amoeba Aug 2, 2022
86fc579
Finish a first full version of Slinky helm chart
amoeba Aug 2, 2022
4ca3789
Fix issues in Helm chart and related files
amoeba Aug 3, 2022
8844e3d
Fix issue in Helm chart where sparql queries don't work
amoeba Aug 3, 2022
c244c3f
Make Slinky web proxy actually proxy queries
amoeba Aug 3, 2022
50fcd59
Update root docker-compose file to match repo
amoeba Aug 3, 2022
1a06a10
Update redlands install docs
amoeba Aug 6, 2022
7aa9d94
Add full support for SOSO award structure
amoeba Aug 7, 2022
d8fbec1
Fix predicate in EML220 processor for funding
amoeba Aug 7, 2022
102941f
Update readme an arch diagram
amoeba Aug 7, 2022
77309b3
Tweak identifier logic in processor.py
amoeba Aug 8, 2022
8f62b43
Add units support to variableMeasured
amoeba Aug 8, 2022
931d266
Fix bug in handling of userId
amoeba Aug 8, 2022
c5e2979
Refactor slinky to use rdflib.Namespace
amoeba Aug 8, 2022
17718d0
Update all SDO references to http://
amoeba Aug 8, 2022
6aefcec
Add missing str.strip() call in eml220_processor
amoeba Aug 8, 2022
f2faac5
Add support for seriesIds
amoeba Aug 8, 2022
4600081
Add pytest GHA workflow
amoeba Aug 9, 2022
da665c5
Make CLI's get command output more compact
amoeba Aug 9, 2022
ac31bd5
Make GHA Workflow use :latest tag
amoeba Aug 9, 2022
9ced01e
Add sphinx documentation to the python package
amoeba Aug 10, 2022
a47c4b4
Rename python package to slinky
amoeba Aug 10, 2022
32decbf
Add GHA workflow for sphinx build
amoeba Aug 10, 2022
dc93656
Switch sphinx GHA workflow to only trigger on main
amoeba Aug 10, 2022
f878c5a
v0.3.0
amoeba Aug 10, 2022
88a8900
Remove the old k8 deployment files in favor of the Helm deployment
ThomasThelen Aug 12, 2022
56eb842
Add the dynamically provisioned PVC configuration to the Helm chart
ThomasThelen Aug 12, 2022
63658d4
bitnami/redis -> bitnamilegacy/redis
artntek Oct 30, 2025
27 changes: 27 additions & 0 deletions .github/workflows/docs.yaml
@@ -0,0 +1,27 @@
name: docs

on:
  push:
    branches:
      - main

jobs:
  document:
    runs-on: ubuntu-latest
    container:
      image: ghcr.io/dataoneorg/slinky:latest
    steps:
      - name: Install rsync 📚
        run: |
          apt-get update && apt-get install -y rsync
      - uses: actions/checkout@v3
      - name: pip install
        working-directory: ./slinky
        run: python -m pip install .[docs]
      - name: Build documentation
        run: sphinx-build . _build
        working-directory: ./slinky/docs
      - name: Deploy 🚀
        uses: JamesIves/github-pages-deploy-action@v4
        with:
          folder: ./slinky/docs/_build
17 changes: 17 additions & 0 deletions .github/workflows/pytest.yaml
@@ -0,0 +1,17 @@
name: pytest

on: push

jobs:
  test:
    runs-on: ubuntu-latest
    container:
      image: ghcr.io/dataoneorg/slinky:latest
    steps:
      - uses: actions/checkout@v3
      - name: pip install
        working-directory: ./slinky
        run: python -m pip install .[test]
      - name: pytest
        working-directory: ./slinky
        run: python -m pytest
5 changes: 5 additions & 0 deletions .gitignore
@@ -1,6 +1,7 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*.py.bak

# C extensions
*.so
@@ -22,6 +23,7 @@ var/
*.egg-info/
.installed.cfg
*.egg
_build

# PyInstaller
# Usually these files are written by a python script from a template
@@ -66,3 +68,6 @@ webapps/
.idea/
.venv/
*.logs

# macOS Specifics
.DS_Store
11 changes: 0 additions & 11 deletions Makefile

This file was deleted.

123 changes: 66 additions & 57 deletions README.md
@@ -1,90 +1,99 @@
# Slinky, the DataONE Graph Store

## Overview
Service for the DataONE Linked Open Data graph.
[![pytest](https://github.com/dataoneorg/slinky/actions/workflows/pytest.yaml/badge.svg)](https://github.com/dataoneorg/slinky/actions/workflows/pytest.yaml)

A Linked Open Data interface to [DataONE](https://dataone.org) designed to run on [Kubernetes](https://kubernetes.io).

This repository contains a deployable service that continuously updates the [DataOne](https://www.dataone.org/) [Linked Open Data](http://linkeddata.org/) graph. It was originally developed as a provider of data for the [GeoLink](http://www.geolink.org/) project, but now is a core component of the DataONE services. The service uses [Docker Compose](https://docs.docker.com/compose/) to manage a set of [Docker](https://www.docker.com/) containers that run the service. The service is intended to be deployed to a virtual machine and run with [Docker Compose](https://docs.docker.com/compose/).
## Overview

The main infrastructure of the service is composed of four [Docker Compose](https://docs.docker.com/compose/) services:
Slinky is essentially a background job system hooked up to an RDF triplestore that converts DataONE's holdings into Linked Open Data.

1. `web`: An [Apache httpd](https://httpd.apache.org/) front-end serving static files and also reverse-proxying to an [Apache Tomcat](http://tomcat.apache.org/) server running a [GraphDB](http://graphdb.ontotext.com/display/GraphDB6/Home) Lite instance which is bundled with [OpenRDF Sesame](http://rdf4j.org) Workbench.
2. `scheduler`: An [APScheduler](https://apscheduler.readthedocs.org) process that schedules jobs (e.g., update graph with new datasets) on the `worker` at specified intervals
3. `worker`: An [RQ](http://python-rq.org/) worker process to run scheduled jobs
4. `redis`: A [Redis](http://redis.io) instance to act as a persistent store for the `worker` and for saving application state
It's made up of five main components:

In addition to the core infrastructure services (above), a set of monitoring/logging services are spun up by default. As of writing, these are mostly being used for development and testing but they may be useful in production:
1. `web`: Provides a public-facing API over Slinky
2. `virtuoso`: Acts as the backend graph store
3. `scheduler`: An [RQScheduler](https://github.com/rq/rq-scheduler) process that enqueues repeated jobs in a cron-like fashion
4. `worker`: One or more [RQ](http://python-rq.org/) processes that run enqueued jobs
5. `redis`: A [Redis](http://redis.io) instance to act as a persistent store for the `worker` and for saving application state

1. `elasticsearch`: An [ElasticSearch](https://www.elastic.co/products/elasticsearch) instance to store, index, and support analysis of logs
2. `logstash`: A [Logstash](https://www.elastic.co/products/logstash) instance to facilitate the log pipeline
3. `kibana`: A [Kibana](https://www.elastic.co/products/kibana) instance to search and visualize logs
4. `logspout`: A [Logspout](https://github.com/gliderlabs/logspout) instance to collect logs from the [Docker](https://www.docker.com/) containers
5. `cadvisor`: A [cAdvisor](https://github.com/google/cadvisor) instance to monitor resource usage on each [Docker](https://www.docker.com/) container
6. `rqdashboard`: An [RQ Dashboard](https://github.com/nvie/rq-dashboard) instance to monitor jobs.
![slinky architecture diagram showing the components in the list above connected with arrows](./docs/slinky-architecture.png)

As the service runs, the graph store will be continuously updated as datasets are added/updated on [DataONE](https://www.dataone.org/). Another scheduled job exports the statements in the graph store and produces a Turtle dump of all statements at [http://dataone.org/d1lod.ttl](http://dataone.org/d1lod.ttl).
As the service runs, the graph store will be continuously updated as datasets are added/updated on [DataONE](https://www.dataone.org/).
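
The scheduler/worker/queue flow described above can be sketched in plain Python. This is an in-memory stand-in for the Redis-backed RQ pipeline, for illustration only; the real services use `rq` and `rq-scheduler`, and the job and function names here are made up:

```python
from collections import deque

# In-memory stand-in for the Redis-backed RQ queue (illustration only).
queue = deque()


def schedule_update_job(since):
    """Scheduler side: enqueue a job describing which datasets to update."""
    queue.append({"job": "update_datasets", "since": since})


def work():
    """Worker side: drain the queue, returning what was 'processed'."""
    processed = []
    while queue:
        job = queue.popleft()
        if job["job"] == "update_datasets":
            processed.append(job["since"])
    return processed


schedule_update_job("2022-08-01T00:00:00Z")
schedule_update_job("2022-08-02T00:00:00Z")
print(work())  # → ['2022-08-01T00:00:00Z', '2022-08-02T00:00:00Z']
```

In the real deployment, the queue lives in the `redis` service, so the `scheduler` and any number of `worker` replicas can share it across pods.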

### Contents of This Repository

```text
.
├── d1lod # Python package which supports other services
├── docs # Detailed documentation beyond this file
├── logspout # Custom Dockerfile for logspout
├── logstash # Custom Dockerfile for logstash
├── redis # Custom Dockerfile for Redis
├── rqdashboard # Custom Dockerfile for RQ Dashboard
├── scheduler # Custom Dockerfile for APScheduler process
├── web # Apache httpd + Tomcat w/ GraphDB
├── worker # Custom Dockerfile for RQWorker process
└── www # Local volume holding static files
├── slinky # Python package used by services
├── docs # Documentation
├── helm # A Helm chart for deploying on Kubernetes
```

Note: In order to run the service without modification, you will need to create a 'webapps' directory in the root of this repository containing 'openrdf-workbench.war' and 'openrdf-sesame.war':
## What's in the graph?

```
.
├── webapps
│   ├── openrdf-sesame.war
└   └── openrdf-workbench.war
```
For an overview of what concepts the graph contains, see the [mappings](/docs/mappings.md) documentation.

These aren't included in the repository because we're using GraphDB Lite which doesn't have a public download URL. These WAR files can just be the base Sesame WAR files which support a variety of backend graph stores but code near https://github.com/ec-geolink/d1lod/blob/master/d1lod/d1lod/sesame/store.py#L90 will need to be modified correspondingly.
## Deployment

Slinky is primarily designed for deployment on the DataONE [Kubernetes](https://kubernetes.io/) cluster.
However, a [Docker Compose](https://docs.docker.com/compose/) file has been provided for anyone who doesn't have a cluster readily available but still wants to run Slinky.

## What's in the graph?
### Deployment on Kubernetes

For an overview of what concepts the graph contains, see the [mappings](/docs/mappings.md) documentation.
To make installing Slinky straightforward, we provide a [Helm](https://helm.sh) chart.

Pre-requisites are:

## Getting up and running
- A [Kubernetes](https://kubernetes.io) cluster
- [Helm](https://helm.sh)

Assuming you are set up to use [Docker](https://www.docker.com/) (see the [User Guide](https://docs.docker.com/engine/userguide/) to get set up):
Install the Chart by running:

```sh
cd helm
helm install $YOUR_NAME .
```
git clone https://github.com/DataONEorg/slinky
cd slinky
# Create a webapps folder with openrdf-sesame.war and openrdf-workbench.war (See above note)
docker-compose up # May take a while

See the [README](./helm/README.md) for more information, including how to customize installation of the Chart to support Ingress and persistent storage.

### Local Deployment with Docker Compose

To deploy Slinky locally using [Docker Compose](https://docs.docker.com/compose/), run:

```sh
docker compose up
```

After running the above `docker-compose` command, the services should be started and available (if appropriate) on their respective ports:
1. Apache httpd → `$DOCKER_HOST:80`
2. OpenRDF Workbench → `$DOCKER_HOST:8080/openrdf-workbench/`
3. Kibana (logs) → `$DOCKER_HOST:5601`
4. cAdvisor → `$DOCKER_HOST:8888`
After a few minutes, you should be able to visit http://localhost:9181 to see the worker management interface and see work being done or http://localhost:8080 to send SPARQL queries to the endpoint.
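
A query can be sent to the endpoint over HTTP once the stack is up. The sketch below builds such a request with the standard library; the endpoint path (`/sparql`) and the result format parameter are assumptions based on common Virtuoso defaults, so adjust them to match your deployment:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint for a local Slinky stack; the /sparql path and port
# are common Virtuoso defaults, not guaranteed by this repo.
ENDPOINT = "http://localhost:8080/sparql"


def build_query_request(query):
    """Return a GET Request carrying `query` as a URL parameter."""
    params = urlencode(
        {"query": query, "format": "application/sparql-results+json"}
    )
    return Request(
        ENDPOINT + "?" + params,
        headers={"Accept": "application/sparql-results+json"},
    )


req = build_query_request("SELECT * WHERE { ?s ?p ?o } LIMIT 5")
print(req.full_url)
# Against a running stack: urllib.request.urlopen(req).read()
```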

Where `$DOCKER_HOST` is `localhost` if you're running [Docker](https://www.docker.com/) natively or some IP address if you're running [Docker Machine](https://docs.docker.com/machine/). Consult the [Docker Machine](https://docs.docker.com/machine/) documentation to find this IP address. When deployed on a Linux machine, [Docker](https://www.docker.com/) is able to bind to localhost under the default configuration.
### Virtuoso

The Virtuoso deployment is a custom image that includes a runtime script
for enabling SPARQL updates. The script runs in a separate process
alongside the Virtuoso startup script and completes once the Virtuoso
server comes online. This subsystem is fully automated and shouldn't need
manual intervention during deployments.

## Testing
#### Protecting the Virtuoso SPARQL Endpoint

Tests are written using [PyTest](http://pytest.org/latest/). Install [PyTest](http://pytest.org/latest/) with
In order to protect the `sparql/` endpoint that Virtuoso exposes, follow
[this](http://vos.openlinksw.com/owiki/wiki/VOS/VirtSPARQLProtectSQLDigestAuthentication)
guide from Open Link. While performing 'Step 6', use the `Browse` button
to locate the authentication function rather than copy+pasting
`DB.DBA.HP_AUTH_SQL_USER;`, which is suggested by the guide. _This
should be done for all new production deployments_.

### Scaling Workers

To scale the number of workers processing datasets beyond the default, run:

```sh
kubectl scale --replicas=3 deployments/{dataset-pod-name}
```
```
pip install pytest
cd d1lod
py.test
```

As of writing, only tests for the supporting Python package (in directory './d1lod') have been written.
Note: The test suite assumes you have an instance of [OpenRDF Sesame](http://rdf4j.org) running at http://localhost:8080, which means the Workbench is located at http://localhost:8080/openrdf-workbench and the Sesame interface is available at http://localhost:8080/openrdf-sesame.
## Testing

A test suite is provided for the `slinky` Python package used by workers.
Tests are written using [pytest](http://pytest.org).

See the [slinky README](./slinky/README.md) for more information.
11 changes: 0 additions & 11 deletions d1lod/Makefile

This file was deleted.

24 changes: 0 additions & 24 deletions d1lod/README.md

This file was deleted.

18 changes: 0 additions & 18 deletions d1lod/d1lod/__init__.py

This file was deleted.
