Commit 2d3ac9e

Merge pull request #243 from DataBiosphere/dev
PR for 0.4.7 release
2 parents 9313ef5 + b7bdd73 commit 2d3ac9e

29 files changed: 469 additions and 101 deletions

README.md

Lines changed: 75 additions & 38 deletions
@@ -17,24 +17,17 @@ and Azure Batch.
 
 ## Getting started
 
-You can install `dsub` from [PyPI](https://pypi.org/project/dsub/), or you can clone and
-install from [github](https://github.com/DataBiosphere/dsub).
+`dsub` is written in Python and requires Python 3.6 or higher.
 
-### Sunsetting Python 2 support
-
-Python 2 support ended in January 2020.
-See Python's official [Sunsetting Python 2 announcement](https://www.python.org/doc/sunset-python-2/) for details.
-
-Automated `dsub` tests running on Python 2 have been disabled.
-[Release 0.3.10](https://github.com/DataBiosphere/dsub/releases/tag/v0.3.10) is
-the last version of `dsub` that supports Python 2.
-
-Use Python 3.6 or greater. For earlier versions of Python 3, use `dsub` 0.4.1.
+* For earlier versions of Python 3, use `dsub` [0.4.1](https://github.com/DataBiosphere/dsub/releases/tag/v0.4.11).
+* For Python 2, use `dsub`[0.3.10](https://github.com/DataBiosphere/dsub/releases/tag/v0.3.10).
 
 ### Pre-installation steps
 
+#### Create a Python virtual environment
+
 This is optional, but whether installing from PyPI or from github,
-you are encouraged to use a
+you are strongly encouraged to use a
 [Python virtual environment](https://docs.python.org/3/library/venv.html).
 
 You can do this in a directory of your choosing.
@@ -56,9 +49,27 @@ virutalenv before calling `dsub`, `dstat`, and `ddel`. They are in the
 use these scripts if you don't want to activate the virtualenv explicitly in
 your shell.
 
+#### Install the Google Cloud SDK
+
+While not used directly by `dsub` for the `google-v2` or `google-cls-v2` providers, you are likely to want to install the command line tools found in the [Google
+Cloud SDK](https://cloud.google.com/sdk/).
+
+If you will be using the `local` provider for faster job development,
+you *will* need to install the Google Cloud SDK, which uses `gsutil` to ensure
+file operation semantics consistent with the Google `dsub` providers.
+
+1. [Install the Google Cloud SDK](https://cloud.google.com/sdk/)
+2. Run
+
+        gcloud init
+
+
+    `gcloud` will prompt you to set your default project and to grant
+    credentials to the Google Cloud SDK.
+
 ### Install `dsub`
 
-Choose one of the following:
+Choose **one** of the following:
 
 #### Install from PyPI

@@ -167,12 +178,7 @@ The steps for getting started differ slightly as indicated in the steps below:
 
 [Enable the Cloud Life Sciences, Storage, and Compute APIs](https://console.cloud.google.com/flows/enableapi?apiid=lifesciences.googleapis.com,storage_component,compute_component&redirect=https://console.cloud.google.com)
 
-1. [Install the Google Cloud SDK](https://cloud.google.com/sdk/) and run
-
-        gcloud init
-
-    This will set up your default project and grant credentials to the Google
-    Cloud SDK. Now provide [credentials](https://developers.google.com/identity/protocols/application-default-credentials)
+1. Provide [credentials](https://developers.google.com/identity/protocols/application-default-credentials)
     so `dsub` can call Google APIs:
 
         gcloud auth application-default login
@@ -423,57 +429,88 @@ specified and they can be specified in any order.
 
 #### Mounting "resource data"
 
-If you have one of the following:
+While explicitly specifying inputs improves tracking provenance of your data,
+there are cases where you might not want to expliclty localize all inputs
+from Cloud Storage to your job VM.
+
+For example, if you have:
+
+- a large set of resource files
+- your code only reads a subset of those files
+- runtime decisions of which files to read
+
+OR
+
+- a large input file over which your code makes a single read pass
+
+OR
+
+- a large input file that your code does not read in its entirety
 
-1. A large set of resource files, your code only reads a subset of those files,
-and the decision of which files to read is determined at runtime, or
-2. A large input file over which your code makes a single read pass or only
-needs to read a small range of bytes,
+then you may find it more efficient or convenient to access this data by
+mounting read-only:
 
-then you may find it more efficient at runtime to access this resource data via
-mounting a Google Cloud Storage bucket read-only or mounting a persistent disk
-created from a
-[Compute Engine Image](https://cloud.google.com/compute/docs/images) read-only.
+- a Google Cloud Storage bucket
+- a persistent disk that you pre-create and populate
+- a persistent disk that gets created from a
+[Compute Engine Image](https://cloud.google.com/compute/docs/images) that you
+pre-create.
 
-The `google-v2` and `google-cls-v2` providers support these two methods of providing access to
-resource data. The `local` provider supports mounting a local directory in a
-similar fashion to support your local development.
+The `google-v2` and `google-cls-v2` providers support these methods of
+providing access to resource data.
+
+The `local` provider supports mounting a
+local directory in a similar fashion to support your local development.
+
+##### Mounting a Google Cloud Storage bucket
 
 To have the `google-v2` or `google-cls-v2` provider mount a Cloud Storage bucket using
 Cloud Storage FUSE, use the `--mount` command line flag:
 
-    --mount MYBUCKET=gs://mybucket
+    --mount RESOURCES=gs://mybucket
 
 The bucket will be mounted into the Docker container running your `--script`
 or `--command` and the location made available via the environment variable
-`${MYBUCKET}`. Inside your script, you can reference the mounted path using the
+`${RESOURCES}`. Inside your script, you can reference the mounted path using the
 environment variable. Please read
 [Key differences from a POSIX file system](https://cloud.google.com/storage/docs/gcs-fuse#notes)
 and [Semantics](https://github.com/GoogleCloudPlatform/gcsfuse/blob/master/docs/semantics.md)
 before using Cloud Storage FUSE.
 
+##### Mounting an existing peristent disk
+
+To have the `google-v2` or `google-cls-v2` provider mount a persistent disk that
+you have pre-created and populated, use the `--mount` command line flag and the
+url of the source disk:
+
+    --mount RESOURCES="https://www.googleapis.com/compute/v1/projects/your-project/global/images/your-image 50"
+
+##### Mounting a persistent disk, created from an image
+
 To have the `google-v2` or `google-cls-v2` provider mount a persistent disk created from an image,
 use the `--mount` command line flag and the url of the source image and the size
 (in GB) of the disk:
 
-    --mount MYDISK="https://www.googleapis.com/compute/v1/projects/your-project/global/images/your-image 50"
+    --mount RESOURCES="https://www.googleapis.com/compute/v1/projects/your-project/global/images/your-image 50"
 
 The image will be used to create a new persistent disk, which will be attached
 to a Compute Engine VM. The disk will mounted into the Docker container running
 your `--script` or `--command` and the location made available by the
-environment variable `${MYDISK}`. Inside your script, you can reference the
+environment variable `${RESOURCES}`. Inside your script, you can reference the
 mounted path using the environment variable.
 
 To create an image, see [Creating a custom image](https://cloud.google.com/compute/docs/images/create-delete-deprecate-private-images).
 
+##### Mounting a local directory (`local` provider)
+
 To have the `local` provider mount a directory read-only, use the `--mount`
 command line flag and a `file://` prefix:
 
-    --mount LOCAL_MOUNT=file://path/to/my/dir
+    --mount RESOURCES=file://path/to/my/dir
 
 The local directory will be mounted into the Docker container running your
 `--script`or `--command` and the location made available via the environment
-variable `${LOCAL_MOUNT}`. Inside your script, you can reference the mounted
+variable `${RESOURCES}`. Inside your script, you can reference the mounted
 path using the environment variable.
 
 ### Setting resource requirements
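
The README changes above say that a mounted location is exposed to the job through the environment variable named in `--mount`. As a rough illustration, a Python `--script` could read that variable to find its resource files. This is a minimal sketch, not part of the commit: the variable name `RESOURCES` follows the README examples, while the function name `list_resource_files` is hypothetical.

```python
import os


def list_resource_files(env_var="RESOURCES"):
  """Return sorted paths, relative to the mount root, of all mounted files."""
  # The mount root is whatever path the environment variable holds.
  mount_root = os.environ.get(env_var)
  if not mount_root:
    raise RuntimeError(f"{env_var} is not set; was --mount specified?")
  found = []
  for dirpath, _, filenames in os.walk(mount_root):
    for filename in filenames:
      full_path = os.path.join(dirpath, filename)
      found.append(os.path.relpath(full_path, mount_root))
  return sorted(found)
```

Because the script only depends on the variable name, the same code could run against a bucket, a disk, or a local directory mount.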

dsub/_dsub_version.py

Lines changed: 1 addition & 1 deletion
@@ -26,4 +26,4 @@
 0.1.3.dev0 -> 0.1.3 -> 0.1.4.dev0 -> ...
 """
 
-DSUB_VERSION = '0.4.6'
+DSUB_VERSION = '0.4.7'

dsub/commands/ddel.py

Lines changed: 0 additions & 1 deletion
@@ -1,4 +1,3 @@
-# Lint as: python3
 # Copyright 2016 Google Inc. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");

dsub/commands/dstat.py

Lines changed: 0 additions & 1 deletion
@@ -1,4 +1,3 @@
-# Lint as: python3
 # Copyright 2016 Google Inc. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");

dsub/commands/dsub.py

Lines changed: 0 additions & 1 deletion
@@ -1,4 +1,3 @@
-# Lint as: python3
 # Copyright 2016 Google Inc. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");

dsub/lib/dsub_errors.py

Lines changed: 0 additions & 1 deletion
@@ -1,4 +1,3 @@
-# Lint as: python3
 # Copyright 2017 Google Inc. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");

dsub/lib/dsub_util.py

Lines changed: 0 additions & 1 deletion
@@ -1,4 +1,3 @@
-# Lint as: python3
 # Copyright 2016 Google Inc. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");

dsub/lib/job_model.py

Lines changed: 8 additions & 1 deletion
@@ -1,4 +1,3 @@
-# Lint as: python3
 # Copyright 2017 Google Inc. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
@@ -399,6 +398,14 @@ def __new__(cls, name, value, docker_path, disk_size, disk_type):
         cls, name, value, docker_path, disk_size=disk_size, disk_type=disk_type)
 
 
+class ExistingDiskMountParam(MountParam):
+  """A MountParam representing an existing Google Persistent Disk."""
+
+  def __new__(cls, name, value, docker_path):
+    return super(ExistingDiskMountParam, cls).__new__(cls, name, value,
+                                                      docker_path)
+
+
 class LocalMountParam(MountParam):
   """A MountParam representing a path on the local machine."""
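
The new class follows the same shape as the other mount parameter types: a subclass that fixes which fields its constructor accepts. The sketch below shows that pattern in isolation. The base class is not shown in this diff, so the `MountParam` definition here (a `namedtuple` subclass whose `disk_size` and `disk_type` default to `None`) is an assumption, and `filter_mounts` is a hypothetical stand-in for the filtering helper used elsewhere in the commit.

```python
import collections


class MountParam(
    collections.namedtuple(
        'MountParam',
        ['name', 'value', 'docker_path', 'disk_size', 'disk_type'])):
  """Assumed base class; the real definition lies outside this diff."""
  __slots__ = ()

  def __new__(cls, name, value, docker_path, disk_size=None, disk_type=None):
    return super(MountParam, cls).__new__(cls, name, value, docker_path,
                                          disk_size, disk_type)


class ExistingDiskMountParam(MountParam):
  """Mirrors the class added in this commit: no size or type is passed."""

  def __new__(cls, name, value, docker_path):
    return super(ExistingDiskMountParam, cls).__new__(cls, name, value,
                                                      docker_path)


def filter_mounts(mounts, mount_type):
  """Hypothetical helper: keep only the mounts of the given class."""
  return [m for m in mounts if isinstance(m, mount_type)]
```

Under these assumptions, an existing-disk mount carries no `disk_size` or `disk_type`, which fits a disk that already exists rather than one created for the job.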

dsub/lib/output_formatter.py

Lines changed: 0 additions & 1 deletion
@@ -1,4 +1,3 @@
-# Lint as: python3
 # Copyright 2019 Verily Life Sciences Inc. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");

dsub/lib/param_util.py

Lines changed: 43 additions & 6 deletions
@@ -1,4 +1,3 @@
-# Lint as: python3
 # Copyright 2016 Google Inc. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
@@ -237,8 +236,32 @@ class MountParamUtil(object):
   def __init__(self, docker_path):
     self._relative_path = docker_path
 
-  def _parse_image_uri(self, raw_uri):
-    """Return a valid docker_path from a Google Persistent Disk url."""
+  def _is_gce_disk_uri(self, raw_uri):
+    """Returns true if we can parse the URI as a GCE disk path."""
+
+    # Full disk URI should look something like:
+    # https://www.googleapis.com/compute/<api_version>/projects/<project>/regions/<region>/disks/<disk>
+    # https://www.googleapis.com/compute/<api_version>/projects/<project>/zones/<zone>/disks/<disk>
+    #
+    # This function only returns True if we were able to recognize the path
+    # as clearly for a GCE disk. This is different than the Image path parsing
+    # in "make_param" below, which was made very forgiving.
+
+    if raw_uri.startswith('https://www.googleapis.com/compute'):
+      parts = raw_uri.split('/')
+
+      # Parts will look something like
+      # ['https:', '', 'www.googleapis.com', 'compute', '<version>', 'projects',
+      # '<project>', '[regions/zones]', '<region/zone>', 'disks', '<disk>']
+      #
+      return ((parts[0] == 'https:') and (not parts[1]) and
+              (parts[2] == 'www.googleapis.com') and (parts[3] == 'compute') and
+              (parts[5] == 'projects') and (parts[9] == 'disks'))
+
+    return False
+
+  def _gce_uri_to_docker_uri(self, raw_uri):
+    """Return a valid docker_path from a GCE disk or image url."""
     # The string replace is so we don't have colons and double slashes in the
     # mount path. The idea is the resulting mount path would look like:
     # /mnt/data/mount/http/www.googleapis.com/compute/v1/projects/...
@@ -263,13 +286,22 @@ def _parse_gcs_uri(self, raw_uri):
     return docker_uri
 
   def make_param(self, name, raw_uri, disk_size):
-    """Return a MountParam given a GCS bucket, disk image or local path."""
-    if raw_uri.startswith('https://www.googleapis.com/compute'):
+    """Return a MountParam given a GCS bucket, disk uri, image uri or local path."""
+
+    if self._is_gce_disk_uri(raw_uri):
+      docker_path = self._gce_uri_to_docker_uri(raw_uri)
+      return job_model.ExistingDiskMountParam(name, raw_uri, docker_path)
+    elif raw_uri.startswith('https://www.googleapis.com/compute'):
+      # In retrospect, this function should have been more precise to only
+      # treat a raw_uri as being for an "Image" if the path followed a known
+      # format. Just checking for the googleapis.com/compute prefix is too
+      # forgiving.
+
       # Full Image URI should look something like:
       # https://www.googleapis.com/compute/v1/projects/<project>/global/images/
       # But don't validate further, should the form of a valid image URI
       # change (v1->v2, for example)
-      docker_path = self._parse_image_uri(raw_uri)
+      docker_path = self._gce_uri_to_docker_uri(raw_uri)
       return job_model.PersistentDiskMountParam(
           name, raw_uri, docker_path, disk_size, disk_type=None)
     elif raw_uri.startswith('file://'):
@@ -374,6 +406,11 @@ def get_persistent_disk_mounts(mounts):
   return _get_filtered_mounts(mounts, job_model.PersistentDiskMountParam)
 
 
+def get_existing_disk_mounts(mounts):
+  """Returns the existing disk mounts from mounts."""
+  return _get_filtered_mounts(mounts, job_model.ExistingDiskMountParam)
+
+
 def get_local_mounts(mounts):
   """Returns the local mounts from mounts."""
   return _get_filtered_mounts(mounts, job_model.LocalMountParam)
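
To see how `make_param` tells the two kinds of Compute Engine URLs apart, the disk check can be exercised on its own. The sketch below is a standalone variation of `_is_gce_disk_uri`, not the committed code. The function name `is_gce_disk_uri` and the project, zone, and disk names are illustrative, and the explicit length check is an added assumption so that short URLs return `False` rather than raising `IndexError`.

```python
def is_gce_disk_uri(raw_uri):
  """Return True only for URLs shaped like a zonal or regional GCE disk."""
  if not raw_uri.startswith('https://www.googleapis.com/compute'):
    return False
  parts = raw_uri.split('/')
  # Expected parts:
  # ['https:', '', 'www.googleapis.com', 'compute', '<version>', 'projects',
  #  '<project>', 'regions' or 'zones', '<region/zone>', 'disks', '<disk>']
  if len(parts) < 11:
    # Added guard (an assumption): too short to contain a disk name.
    return False
  return (parts[0] == 'https:' and not parts[1] and
          parts[2] == 'www.googleapis.com' and parts[3] == 'compute' and
          parts[5] == 'projects' and parts[9] == 'disks')


disk = ('https://www.googleapis.com/compute/v1/projects/your-project'
        '/zones/us-central1-a/disks/your-disk')
image = ('https://www.googleapis.com/compute/v1/projects/your-project'
         '/global/images/your-image')

print(is_gce_disk_uri(disk))
print(is_gce_disk_uri(image))
```

Under this check the zonal disk URL is classified as an existing disk, while the `global/images` URL is not, so `make_param` would fall through to the image branch for it.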
