Commit f34cce6

Merge pull request #298 from justinkadi/main
Updating folder hierarchy section
2 parents bb91f5c + 388edc1 commit f34cce6

2 files changed: +123 -35 lines changed

.github/workflows/bookdown.yaml

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ jobs:
           brew install pandoc
 
       - name: Cache bookdown results
-        uses: actions/cache@v2
+        uses: actions/cache@v4
         with:
           path: _bookdown_files
           key: bookdown-${{ hashFiles('**/*Rmd') }}

workflows/edit_data_packages/target_paths.Rmd

Lines changed: 122 additions & 34 deletions
@@ -2,15 +2,15 @@
 
 Sometimes, researchers will upload zip files that contain a nested file and folder structure that we would like to maintain. This reference section will walk you through how to re-upload the contents of the zip file to the Arctic Data Center such that the files and folders are preserved. Note that changing the locations of the files within the package can be tricky, so take these steps with care and try to make sure it is done correctly.
 
-With that, here are the steps assuming that the PI has uploaded one zip file to their dataset. You may need to modify the steps for other scenarios, but if you are not sure, feel free to ask the data coordinator. In particular, tar files (.tgz) or rar files (.rar) are also compressed archives that might be better to unpack using command line tools.
+With that, here are the steps assuming that the PI has uploaded one zip file to their dataset that holds all their files organized in their desired file hierarchy. You may need to modify the steps for other scenarios, but if you are not sure, feel free to ask the data coordinator. In particular, tar files (.tgz) or rar files (.rar) are also compressed archives that might be better to unpack using command line tools.
 
 ### Download the zip file to datateam
 
 First, we will download the file, using R.
 
 1. Navigate to the dataset landing page on the Arctic Data Center
-2. Right click the "download" button next to the zip file
-3. Select "copy link address"
+2. Right click the "Download" button next to the zip file
+3. Select "Copy Link Address"
 4. Run the following two lines of code to set the URL variable, and extract the pid
 
 Here is an example on the test site. Note that on production, you will need to change the URL that you are substituting in the second line of code.
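
Step 4 amounts to taking the last segment of the copied link and decoding its percent-encoded characters. The following is a minimal base-R sketch of that idea, offered as an alternative to the `gsub()` pipeline the tutorial uses rather than as the tutorial's own method. The URL is a made-up placeholder, and `pid_from_url` is a hypothetical helper name that does not exist in `dataone` or `arcticdatautils`.

```r
# Hypothetical helper: extract a DataONE pid from a copied "Download" link.
pid_from_url <- function(url) {
  # The pid is the last path segment of the object URL ...
  encoded <- basename(url)
  # ... with characters such as ":" percent-encoded (":" becomes "%3A").
  utils::URLdecode(encoded)
}

# Placeholder URL on the test node (not a real object).
url <- "https://test.arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3A00000000-0000-0000-0000-000000000000"
pid <- pid_from_url(url)
pid
```

Because the helper does not hard-code the host name, the same call should work for links copied from the test site and from production; it is still worth printing `pid` and comparing it with the identifier shown on the landing page.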
@@ -26,21 +26,21 @@ pid <- gsub("https://test.arcticdata.io/metacat/d1/mn/v2/object/", "", url) %>%
 Note that this will download the file to the location you specify as the second argument in `writeBin`. If you are organizing your scripts and data by submitter, it would look like the example below.
 
 ```{r, eval = FALSE}
-writeBin(getObject(d1c@mn, pid), "~/Submitter/example.zip")
+writeBin(getObject(d1c@mn, pid), "~/submitter/data/example.zip")
 ```
 
 6. Unzip the file into your submitter directory.
 
 ```{r, eval = FALSE}
-unzip("~/Submitter/example.zip", exdir = "~/Submitter")
+unzip("~/submitter/data/example.zip", exdir = "~/submitter/data")
 ```
 
 
 7. Delete the zip file (example.zip)
 
-Now if you look at the directory, you shuold see the unzipped contents of the file in a subdirectory of `~/Submitter`. The name of the directory will be the name of the folder the PI created the archive from.
+Now if you look at the directory, you should see the unzipped contents of the file in a sub-directory of `~/submitter/data`. The name of the directory will be the name of the folder the PI created the archive from. In this case, that folder is titled `final_image_set`.
 
-Right now, you should stop and examine each file in the directory closely (or each type of file). You may need to make some minor adjustments or ask for clarification from the PI. For example, we still may need to ask for csv versions of excel files, you may need to re-zip certain directories (for example: a zip which contains 5 different sets of shapefiles should be turned into 5 different zips). Evaluate the contents of the directory alongside the data coordinator.
+Right now, you should stop and examine each file in the directory closely (or each type of file). You may need to make some minor adjustments or ask for clarification from the PI. For example, we still may need to ask for CSV versions of Excel files, you may need to re-zip certain directories (for example: a zip which contains 5 different sets of shapefiles should be turned into 5 different zips). Evaluate the contents of the directory alongside the data coordinator.
 
 ### Re-upload the contents to the Arctic Data Center
 
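
Before anything is added to the package, the "examine each file (or each type of file)" step can be started from R by counting the unzipped files per extension. The sketch below uses only base R and runs against a throwaway directory. `inventory_by_extension` and the demo file names are hypothetical and chosen for illustration; they are not part of the tutorial or of `arcticdatautils`.

```r
# Hypothetical helper: count the files under a directory by file extension.
inventory_by_extension <- function(dir) {
  files <- list.files(dir, recursive = TRUE)
  ext <- tolower(tools::file_ext(files))
  ext[ext == ""] <- "(none)"   # files without an extension
  table(extension = ext)
}

# Demonstration on a temporary directory that mimics an unzipped submission.
demo_dir <- file.path(tempdir(), "inventory_demo")
dir.create(file.path(demo_dir, "final_image_set", "photos"),
           recursive = TRUE, showWarnings = FALSE)
invisible(file.create(
  file.path(demo_dir, "final_image_set", "level1.png"),
  file.path(demo_dir, "final_image_set", "photos", "level2_1.png"),
  file.path(demo_dir, "final_image_set", "notes.xlsx")
))

inv <- inventory_by_extension(demo_dir)
inv
```

A count of `xlsx` or `zip` entries in the result is a prompt to check whether CSV versions or re-zipped shapefile sets are needed. It does not replace opening the files and reviewing them with the data coordinator.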
@@ -49,59 +49,60 @@ Once you have confirmed everything is all good, we can upload the files to the A
 First, get the data package loaded into your R session as usual. I recommend not attempting any EML edits while you do these steps; this update adding the files is best done on its own, and EML edits can be done on the next version.
 
 ```{r, eval = FALSE}
-dp <- getDataPackage(d1c, "identifier")
+dp <- getDataPackage(d1c, identifier = resourceMapId, lazyLoad = TRUE, quiet = FALSE) # Gather data package
 ```
 
-Next, we will set up two types of paths describing each of the objects. The first will be an absolute path, so we can be sure the R function finds the files. The second will be a relative path, which will be what shows up on the landing page (or in the download all result) of the data package.
+Next, we will set up two types of paths describing each of the objects. The first will be an absolute path, so we can be sure the R function finds the files. The second will be a relative path, which will be what shows up on the landing page (or in the "Download All" result) of the data package.
 
 ```{block, type = "warning"}
 If you don't know the difference between an absolute and relative path, read on. It is SUPER IMPORTANT!
 
 A path is a location of a file or folder on a computer. There are two types of paths in computing: absolute paths and relative paths.
 
-An absolute path always starts with the root of your file system and locates files from there. The absolute path to my example submitter zip is: `/home/jclark/Submitter/example.zip`. The generic shortcut `~/` is often used to replace the location of your home directory (`/home/username`) to save typing, but your path is still an absolute path if it starts with `~/`. Note that a relative path will **always** start with either `~/` or `/`.
+*Absolute paths* always start with the root of your file system and locate files from there. The absolute path to my example submitter zip is: `/home/jclark/submitter/data/example.zip`. The generic shortcut `~/` is often used to replace the location of your home directory (`/home/username`) to save typing, but your path is still an absolute path if it starts with `~/`. Note that an absolute path will **always** start with either `~/` or `/`.
 
-Relative paths start from some location in your file system that is below the root. Relative paths are combined with the path of that location to locate files on your system. R (and some other languages like MATLAB) refer to the location where the relative path starts as our working directory. If our working directory is set to `~/Submitter`, the relative path to the zip would be just `example.zip`. Note that a relative path will **never** start with either `~/` or `/`.
+*Relative paths* start from some location in your file system that is below the root. Relative paths are combined with the path of that location to locate files on your system. R (and some other languages like MATLAB) refer to the location where the relative path starts as our working directory. If our working directory is set to `~/submitter`, the relative path to the zip would be just `data/example.zip`. Note that a relative path will **never** start with either `~/` or `/`.
 ```
 
-Getting these paths right is very important because we don't want submitters to download a folder of data, and have the paths look like `/home/internname/ticket_27341/important_folder/important_file.csv`. The first part of that path is particular to however the person processing the ticket organized the data, and is not how the submitter of the data intented to organize the data. Follow the steps below to make sure this does not happen.
+Getting these paths right is very important because we don't want submitters to download a folder of data, and have the paths look like `/home/internname/ticket_27341/important_folder/important_file.csv`. The first part of that absolute path is particular to however the person processing the ticket organized the data, and is not how the submitter of the data intended to organize the data. Follow the steps below to make sure this does not happen.
 
-1. get a list of absolute paths for each file in the directory. **NOTE** The "PI_dir_name" here represents whatever directory you retrieved after running `unzip` in the previous step. The actual .zip file should not be in this directory.
+1. Get a list of absolute paths for each file in the directory. **NOTE** The "PI_dir_name" here represents whatever directory you retrieved after running `unzip` in the previous step. The actual .zip file should not be in this directory. In our example, this "PI_dir_name" is `final_image_set`.
 
-2. get a list of relative paths for each file in the directory. Note this is the same command, but with the argument `full.names` set to `FALSE`.
+2. Get a list of relative paths for each file in the directory. Note this is the same command, but with the argument `full.names` set to `FALSE`.
 
 ```{r, eval = FALSE}
-abs_paths <- list.files("~/Submitter", full.names = TRUE, recursive = TRUE)
-rel_paths <- list.files("~/Submitter", full.names = FALSE, recursive = TRUE)
+abs_paths <- list.files("~/submitter/data", full.names = TRUE, recursive = TRUE)
+rel_paths <- list.files("~/submitter/data", full.names = FALSE, recursive = TRUE)
 ```
 
 ```{block, type = "warning"}
-Make sure that these paths look correct! They should contain ONLY the files that were unzipped. If you have other scripts or metadata files you might want to rearrange your directories to get the correct paths. The relative paths should start with the submitter's directory name. In this example they will look like the below:
+Make sure that these paths look correct! They should contain ONLY the files that were unzipped. If you have other scripts or metadata files you might want to rearrange your directories to get the correct paths. The *relative paths* should start with the submitter's directory name. In this example, that submitter's top-level directory is titled `final_image_set`, so they will look like the names below:
 
 `"final_image_set/level1.png" "final_image_set/photos/level2_1.png" "final_image_set/photos/level2_2.png"`
 ```
 
 Now for each of these files, we can create a `dataObject` for them and add them to the package using a loop. Before running this, look at the values of your `abs_paths` and `rel_paths` and make sure they look correct based on what you know about both paths and the structure of the directory. Within this loop we will also create otherEntities for each item, just putting in the bare minimum of information that will help us make sure that we know what files are what.
 
 ```{r, eval = FALSE}
-metadataObj <- selectMember(dp, name="sysmeta@formatId", value="https://eml.ecoinformatics.org/eml-2.2.0")
+metadataId <- selectMember(dp, name="sysmeta@formatId", value="https://eml.ecoinformatics.org/eml-2.2.0") # Get metadata PID
+doc <- read_eml(getObject(d1c@mn, metadataId)) # Read in metadata EML file
 
-doc <- read_eml(getObject(d1c@mn, metadataObj))
-doc$dataset$otherEntity <- NULL
 oes <- list()
 
-for (i in 1:length(abs_paths)){
+for (i in 1:length(abs_paths)) {
   formatId <- arcticdatautils::guess_format_id(abs_paths[i])
   id <- generateIdentifier(d1c@mn, scheme = "uuid")
   dataObj <- new("DataObject", format = formatId, filename = abs_paths[i], targetPath = rel_paths[i], id = id)
-  dp <- addMember(dp, dataObj, metadataObj)
-  oes[[i]] <- eml$otherEntity(entityName = rel_paths[i], entityType = formatId, id = id)
+  dp <- addMember(dp, dataObj, metadataId)
+  oes[[i]] <- eml$otherEntity(entityName = rel_paths[i], entityType = formatId, id = id) # Can add entityDescription in this command also
 }
 
-doc$dataset$otherEntity <- oes
+doc$dataset$otherEntity <- NULL # Removing otherEntity of zip file
+doc$dataset$otherEntity <- oes # Adding otherEntities to section
 
+eml_validate(doc)
 write_eml(doc, "~/metadata.xml")
-dp <- replaceMember(dp, metadataObj, replacement="~/metadata.xml")
+dp <- replaceMember(dp, metadataId, replacement="~/metadata.xml")
 ```
 
 Once this is finished you can examine the relationships by running `View(dp@relations$relations)`. If everything worked out well you should see rows that look like this:
@@ -127,24 +128,111 @@ Finally, check your work. Go to the Arctic Data Center and see if the package di
 Here is the example code all put together. Make sure you change all of the relevant bits, and check your work carefully!!
 
 ```{r, eval = FALSE}
+### Downloading ZIP file
 url <- "https://arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3A8fee5046-1a8f-4ccc-80f2-70c557a66338"
 pid <- gsub("https://arcticdata.io/metacat/d1/mn/v2/object/", "", url) %>% gsub("%3A", ":", .)
 
-writeBin(getObject(d1c@mn, pid), "~/Submitter/example.zip")
+writeBin(getObject(d1c@mn, pid), "~/submitter/data/example.zip")
+unzip("~/submitter/data/example.zip", exdir = "~/submitter/data")
 
-unzip("~/Submitter/example.zip", exdir = "~/Submitter")
+### Re-uploading contents to data package
+d1c <- dataone::D1Client("STAGING", "urn:node:mnTestARCTIC") # Setting the Member Node
+dp <- getDataPackage(d1c, identifier = resourceMapId, lazyLoad = TRUE, quiet = FALSE) # Gather data package
 
-dp <- getDataPackage(d1c, "identifier")
+##### Gather Metadata EML
+metadataId <- selectMember(dp, name="sysmeta@formatId", value="https://eml.ecoinformatics.org/eml-2.2.0") # Get metadata PID
+doc <- read_eml(getObject(d1c@mn, metadataId)) # Read in metadata EML file
 
-abs_paths <- list.files("~/Submitter/PI_dir_name/", full.names = TRUE, recursive = TRUE)
-rel_paths <- list.files("~/Submitter/PI_dir_name/", full.names = FALSE, recursive = TRUE)
+##### Get paths
+abs_paths <- list.files("~/submitter/data", full.names = TRUE, recursive = TRUE)
+rel_paths <- list.files("~/submitter/data", full.names = FALSE, recursive = TRUE)
 
-metadataObj <- selectMember(dp, name="sysmeta@formatId", value="https://eml.ecoinformatics.org/eml-2.2.0")
+oes <- list()
+for (i in 1:length(abs_paths)) {
+  formatId <- arcticdatautils::guess_format_id(abs_paths[i])
+  id <- generateIdentifier(d1c@mn, scheme = "uuid")
+  dataObj <- new("DataObject", format = formatId, filename = abs_paths[i], targetPath = rel_paths[i], id = id)
+  dp <- addMember(dp, dataObj, metadataId)
+  oes[[i]] <- eml$otherEntity(entityName = rel_paths[i], entityType = formatId, id = id) # Can add entityDescription in this command also
+}
+
+doc$dataset$otherEntity <- NULL # Removing otherEntity of zip file
+doc$dataset$otherEntity <- oes # Adding otherEntities to section
+
+### Validate and save EML
+eml_validate(doc)
+write_eml(doc, "~/metadata.xml")
+
+### Upload Dataset
+dp <- replaceMember(dp, metadataId, replacement="~/metadata.xml") # Replace metadata file
+
+myAccessRules <- data.frame(subject="CN=arctic-data-admins,DC=dataone,DC=org", permission="changePermission")
+packageId <- uploadDataPackage(d1c, dp, public=FALSE, accessRules=myAccessRules, quiet=FALSE)
+```
+
+One note: if you have files you want to keep in your dataset along with the ZIP file contents you're adding, you can run `doc$dataset$otherEntity <- c(doc$dataset$otherEntity, oes)` so that you'll just be adding your new entities instead of replacing the older ones.
+
+### Example with multiple ZIP files
+
+Here is an example script for a dataset in which the PI uploaded multiple ZIP files they wanted to represent different folders of their dataset.
+
+```{r, eval=FALSE}
+
+### Downloading and unzipping 3 ZIP files
+url1 <- "https://arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3A5f177997-841b-4081-bc91-65521016b205"
+pid1 <- gsub("https://arcticdata.io/metacat/d1/mn/v2/object/", "", url1) %>% gsub("%3A", ":", .)
 
-for (i in 1:length(abs_paths)){
+url2 <- "https://arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3A44f29c80-0f27-48be-a04e-fc171c1e5088"
+pid2 <- gsub("https://arcticdata.io/metacat/d1/mn/v2/object/", "", url2) %>% gsub("%3A", ":", .)
+
+url3 <- "https://arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3A9a0048b4-c24b-410d-b695-ee95bda97063"
+pid3 <- gsub("https://arcticdata.io/metacat/d1/mn/v2/object/", "", url3) %>% gsub("%3A", ":", .)
+
+writeBin(getObject(d1c@mn, pid1), "~/datasets/submitter/data/Regional_estimates.zip")
+writeBin(getObject(d1c@mn, pid2), "~/datasets/submitter/data/Terminus_Ablation.zip")
+writeBin(getObject(d1c@mn, pid3), "~/datasets/submitter/data/Terminus_Mass_Error.zip")
+
+unzip("~/datasets/submitter/data/Regional_estimates.zip", exdir = "~/datasets/submitter/data/Regional_estimates")
+unzip("~/datasets/submitter/data/Terminus_Ablation.zip", exdir = "~/datasets/submitter/data/Terminus_Ablation")
+unzip("~/datasets/submitter/data/Terminus_Mass_Error.zip", exdir = "~/datasets/submitter/data/Terminus_Mass_Error")
+
+######################################################
+
+### Set up node and gather data package
+d1c <- dataone::D1Client("STAGING", "urn:node:mnTestARCTIC") # Setting the Member Node
+resourceMapId <- "..." # Get data package PID (resource map ID)
+dp <- getDataPackage(d1c, identifier = resourceMapId, lazyLoad = TRUE, quiet = FALSE) # Gather data package
+
+### Gather Metadata EML
+metadataId <- selectMember(dp, name="sysmeta@formatId", value="https://eml.ecoinformatics.org/eml-2.2.0") # Get metadata PID
+doc <- read_eml(getObject(d1c@mn, metadataId)) # Read in metadata EML file
+
+### Paths
+abs_paths <- list.files("~/datasets/submitter/data", full.names = TRUE, recursive = TRUE)
+rel_paths <- list.files("~/datasets/submitter/data", full.names = FALSE, recursive = TRUE)
+
+### Uploading files to data package and saving otherEntities for each file
+oes <- list()
+for (i in 1:length(abs_paths)) {
   formatId <- arcticdatautils::guess_format_id(abs_paths[i])
-  dataObj <- new("DataObject", format = formatId, filename = abs_paths[i], targetPath = rel_paths[i])
-  dp <- addMember(dp, dataObj, metadataObj)
+  id <- generateIdentifier(d1c@mn, scheme = "uuid")
+  dataObj <- new("DataObject", format = formatId, filename = abs_paths[i], targetPath = rel_paths[i], id = id)
+  dp <- addMember(dp, dataObj, metadataId)
+  oes[[i]] <- eml$otherEntity(entityName = rel_paths[i], entityType = formatId, id = id) # Can add entityDescription in this command also
 }
+
+doc$dataset$otherEntity <- oes # Replace otherEntity section with new otherEntities
+
+### Check and save the metadata
+eml_validate(doc)
+eml_path <- arcticdatautils::title_to_file_name(doc$dataset$title)
+eml_path <- paste("/home/intern/datasets/submitter/", eml_path, sep="")
+write_eml(doc, eml_path)
+
+### Upload Dataset
+dp <- replaceMember(dp, metadataId, replacement=eml_path) # Replace metadata file
+
+myAccessRules <- data.frame(subject="CN=arctic-data-admins,DC=dataone,DC=org", permission="changePermission")
+packageId <- uploadDataPackage(d1c, dp, public=FALSE, accessRules=myAccessRules, quiet=FALSE)
 ```
 
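
Both combined scripts write the downloaded archive into the same directory that `list.files()` scans afterwards. Unless the archive is deleted first (step 7 of the walkthrough), it would show up in `abs_paths` and `rel_paths` and be uploaded again. The sketch below reproduces that layout in a temporary directory and shows one possible guard: dropping top-level `.zip` files from both vectors and checking the result before any `DataObject` is built. The directory layout and the filtering rule are assumptions for illustration; nested archives that were deliberately re-zipped (for example shapefile sets) are left untouched.

```r
# Build a throwaway directory that mimics ~/submitter/data after unzipping,
# with the downloaded archive still sitting next to the extracted folder.
data_dir <- file.path(tempdir(), "paths_demo", "data")
dir.create(file.path(data_dir, "final_image_set", "photos"),
           recursive = TRUE, showWarnings = FALSE)
invisible(file.create(
  file.path(data_dir, "example.zip"),
  file.path(data_dir, "final_image_set", "level1.png"),
  file.path(data_dir, "final_image_set", "photos", "level2_1.png")
))

# The same two calls as in the scripts: absolute paths for R to find the
# files, relative paths for what the package should display.
abs_paths <- list.files(data_dir, full.names = TRUE, recursive = TRUE)
rel_paths <- list.files(data_dir, full.names = FALSE, recursive = TRUE)

# Assumed rule: a .zip directly inside data_dir is a downloaded archive.
is_top_level_zip <- !grepl("/", rel_paths, fixed = TRUE) &
  grepl("\\.zip$", rel_paths, ignore.case = TRUE)
abs_paths <- abs_paths[!is_top_level_zip]
rel_paths <- rel_paths[!is_top_level_zip]

# Sanity checks before looping over the files.
stopifnot(length(abs_paths) == length(rel_paths))
stopifnot(all(file.exists(abs_paths)))
stopifnot(!any(startsWith(rel_paths, "/") | startsWith(rel_paths, "~")))
rel_paths
```

Filtering both vectors with the same logical index keeps `abs_paths[i]` and `rel_paths[i]` pointing at the same file, which the upload loop relies on.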