library(EDIutils) # Handy tools for interacting with EDI's API
library(tidyverse) # For inspecting data
Downloading data programmatically
Overview of download methods
EDI provides several tools and methods for accessing data. These include Point and Click methods from the data package landing page as well as Data Download scripts in R, Python and MATLAB that are displayed with each package.
The EDIutils R package also provides excellent documentation on how to Search and Access Data directly from R.
Download programmatically using R
Find what you want
If you know the identifier of the data package you want, it’s easy to find the latest revision:
scope <- "knb-lter-nwt" # Niwot scope
identifier <- "314" # Dataset of interest
# ask EDI to tell me what the most current version is
revision <- list_data_package_revisions(scope, identifier, filter = "newest")
# build the full identifier; this is referred to as the "packageID"
packageID <- paste(scope, identifier, revision, sep = ".")
packageID
[1] "knb-lter-nwt.314.4"
Download by package
If you want to download the entire data package, use the function read_data_package_archive. Warning: some datasets are large, so you may not want to download the entire package.
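Since some packages are large, it can help to estimate the total size before pulling the whole archive. The sketch below is a minimal illustration, not part of the original workflow: `total_size_mb` is a hypothetical helper, the toy sizes are made up, and the assumption that `EDIutils::read_data_entity_sizes()` returns a data frame with a `size` column in bytes should be checked against the package documentation.

```r
# Hypothetical helper: total size in megabytes from a table of entity sizes.
# Assumes a `size` column holding bytes (an assumption about the shape of
# what EDIutils::read_data_entity_sizes() returns).
total_size_mb <- function(sizes) {
  sum(as.numeric(sizes$size)) / 1e6
}

# Offline illustration with made-up sizes (NOT real values for this package)
toy_sizes <- data.frame(
  entityId = c("a", "b"),
  size = c("1500000", "500000")
)
total_size_mb(toy_sizes)

# With network access, the real call would look something like:
# sizes <- EDIutils::read_data_entity_sizes(packageID)
# total_size_mb(sizes)
```

If the total is larger than you are comfortable with, downloading only the entities you need may be the better route.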
# Download the ENTIRE package to a temporary directory
read_data_package_archive(packageID, path = tempdir())
# Inspect results
list.files(tempdir())
# Download the ENTIRE package to a place you intend to store it for repeated use
# Note you must FIRST create the directory 'some_real_path_on_your_computer'
# before downloading
if (!dir.exists("./some_real_path_on_your_computer")) {
  dir.create("./some_real_path_on_your_computer")
}
read_data_package_archive(packageID, path = "./some_real_path_on_your_computer")
# Inspect results
list.files("./some_real_path_on_your_computer")
Download select entities
It is also possible to read only select portions of the dataset into your analysis pipeline. I find this helpful for sharing code among collaborators - everyone does not need to reinvent the discovery aspect, and there is no need to email files around (which inevitably ends up with someone working on the wrong file).
# List data entities of the data package
res <- read_data_entity_names(packageID)
res
                          entityId
1 fd533b5b9f3ae79862a33bad964d0c0c
2 e786cdbe1ac83579f69a0e088cccc1c9
                                                   entityName
1              Homogenized, gap-filled, daily air temperature
2 Homogenized, gap-filled, daily air temperature full methods
Using the above mapping information, find the entityId of the data table you want to analyze:
entityId <- res$entityId[res$entityName == 'Homogenized, gap-filled, daily air temperature']
The read_data_entity_resource_metadata function provides additional information about each dataset entity. In particular, the resourceId provides the URL that links directly to the data table:
entity_resources <- read_data_entity_resource_metadata(packageID, entityId)
url_of_the_table <- entity_resources$resourceId
name_of_the_table <- entity_resources$fileName
With the URL of an entity, you can download individual data tables:
download.file(url = url_of_the_table,
              destfile = file.path('./some_real_path_on_your_computer', name_of_the_table))
This method then allows you to read back the data tables you want without the slow step of downloading each time you fix your code.
my_file <- read.csv(file.path('./some_real_path_on_your_computer', name_of_the_table))
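One way to make the "download once, read many times" idea explicit is to skip the download whenever the file is already on disk. This is a minimal sketch using base R only; `download_if_missing` is a hypothetical helper name, not an EDIutils function.

```r
# Hypothetical helper: download a file only when it is not already on disk.
# Returns the destination path (invisibly) so it can be passed on to a reader.
download_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # Create the target directory if needed
    dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
    # Binary mode avoids line-ending translation on Windows
    download.file(url = url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# Usage with the objects defined earlier (requires network on the first run):
# local_copy <- download_if_missing(
#   url_of_the_table,
#   file.path("./some_real_path_on_your_computer", name_of_the_table)
# )
# my_file <- read.csv(local_copy)
```

The trade-off is that a cached copy will not pick up a newer revision on its own; deleting the local file forces a fresh download.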
# analyze away
Read data directly into R without a separate download step
Alternatively, you can read the data table straight into your R session and never store the file locally.
my_awesome_data <- read_data_entity(packageID, entityId) |>
  # pipe the result to readr::read_csv() to read the data into R
  # readr::read_csv() will guess the column types
  # but sometimes you need to have it scan a larger portion of the file than the default
  # first 1000 lines to get the right type
  readr::read_csv(guess_max = 100000)
Rows: 13906 Columns: 112
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (48): LTER_site, local_site, logger, flag_1, flag_2, flag_3, source_sta...
dbl (63): year, airtemp_max_homogenized, airtemp_min_homogenized, airtemp_a...
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
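The message above suggests specifying column types instead of relying on guessing. The sketch below shows what that could look like on a tiny inline CSV standing in for the real table: the column names are taken from the output above, but the values (including the site code) are made up for illustration, and it assumes readr 2.0 or later, where literal data is wrapped in `I()`.

```r
library(readr)

# Toy CSV standing in for the real table; values are made up,
# only the column names come from the column specification above.
toy_csv <- I(paste(
  "LTER_site,date,airtemp_max_homogenized",
  "NWT,2020-01-01,-3.5",
  "NWT,2020-01-02,NA",
  sep = "\n"
))

# Declaring the types up front removes the guessing step and the message
toy <- read_csv(
  toy_csv,
  col_types = cols(
    LTER_site = col_character(),
    date = col_date(format = "%Y-%m-%d"),
    airtemp_max_homogenized = col_double()
  )
)
str(toy)
```

For a table with 112 columns, spelling out every type is tedious; `cols()` also accepts a `.default` argument, so you can pin down only the columns that tend to be guessed wrong.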
# analyze away
Authentication
EDI recently announced that it intends to require login for data download in the near future. This will also require adding EDI credentials and/or providing an EDI authentication token in API requests (such as those in the examples above). EDI is in the process of updating the EDIutils package and documentation for seamless authentication. Stay tuned.