Why We Harvest
The Ag Data Commons serves as the central registry and repository for open, USDA-funded data resources, primarily datasets, databases, software tools, APIs, and related services. Researchers may deposit their data with any repository that meets federal open data requirements, as long as they create a record of that data in the Ag Data Commons for cataloging purposes. Harvests allow the Ag Data Commons to ingest large groups of records from a given repository all at once, and to repeat the import on a recurring basis to capture additions or changes to those records. When datasets are harvested, their resources are added as remote files, that is, as links to the original files on the remote server.
DKAN Harvest Module
The DKAN Harvest module provides a common harvesting framework for DKAN. It supports custom extensions and adds Drush commands and a web UI for managing harvest sources and jobs. To “harvest” data is to use the public feed or API of another data portal to import items from that portal’s catalog into your own. For example, Data.gov harvests all of its datasets from the data.json files of hundreds of U.S. federal, state, and local data portals.
DKAN Harvest is built on top of the widely used Migrate framework for Drupal. It follows a two-step process to import datasets:
- Process a source URI and save resulting data locally to disk as JSON
- Perform migrations into DKAN with the locally cached JSON files, using mappings provided by the DKAN Migrate Base module
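The two steps above can be sketched in a few lines of Python. This is only an illustration of the cache-then-migrate pattern, not DKAN Harvest's actual implementation (which is a Drupal module): the function names `cache_source` and `migrate_cached`, the `FIELD_MAP` mapping, and the sample record are all hypothetical, and the source catalog is assumed to be an already-parsed data.json document with a top-level `dataset` list.

```python
import json
import tempfile
from pathlib import Path


def cache_source(catalog: dict, cache_dir: Path) -> list:
    """Step 1: save each dataset record from a parsed source catalog to disk as JSON."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for record in catalog.get("dataset", []):
        # Build a file name from the identifier, replacing path-unsafe characters.
        safe_name = "".join(
            c if c.isalnum() or c in "-_" else "_" for c in record["identifier"]
        )
        path = cache_dir / f"{safe_name}.json"
        path.write_text(json.dumps(record, indent=2), encoding="utf-8")
        paths.append(path)
    return paths


# Hypothetical mapping from source field names to local catalog field names.
FIELD_MAP = {"title": "title", "description": "body", "identifier": "source_id"}


def migrate_cached(cache_dir: Path, field_map: dict = FIELD_MAP) -> list:
    """Step 2: build local records from the cached JSON files only."""
    local_records = []
    for path in sorted(cache_dir.glob("*.json")):
        source = json.loads(path.read_text(encoding="utf-8"))
        local_records.append({dst: source.get(src) for src, dst in field_map.items()})
    return local_records


if __name__ == "__main__":
    # Made-up sample catalog standing in for a fetched data.json document.
    sample = {
        "dataset": [
            {"identifier": "example/1", "title": "Example", "description": "Demo record"}
        ]
    }
    with tempfile.TemporaryDirectory() as tmp:
        cache_source(sample, Path(tmp))
        print(migrate_cached(Path(tmp)))
```

Because the second step reads only from the local cache, the migration can be re-run or debugged without contacting the remote portal again.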
The following are examples of the types of data sources the Ag Data Commons harvests:
- NCBI Bioprojects, which contain genomics datasets. Only USDA-funded NCBI records are selected for inclusion in the Ag Data Commons catalog.
- NAL GeoData, which contains geospatial datasets. Only USDA-funded GeoData records are selected for inclusion in the Ag Data Commons catalog.
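As a rough illustration of selecting only USDA-funded records before import, the sketch below filters a list of records by funder. How the Ag Data Commons actually identifies USDA funding in NCBI or GeoData records is not described here, so the `funder` field, the substring match, and the sample records are all assumptions made for the example.

```python
def select_usda_funded(records, funder_key="funder", funder_name="USDA"):
    """Return only the records whose funder list mentions the given funder.

    `funder_key` is a hypothetical field name; real source records may
    encode funding information differently.
    """
    selected = []
    for record in records:
        funders = record.get(funder_key, [])
        # Case-insensitive substring match against each listed funder.
        if any(funder_name.lower() in funder.lower() for funder in funders):
            selected.append(record)
    return selected


if __name__ == "__main__":
    # Made-up records for demonstration only.
    sample_records = [
        {"title": "Wheat genome", "funder": ["USDA Agricultural Research Service"]},
        {"title": "Human cohort", "funder": ["NIH"]},
        {"title": "Unlabeled record"},
    ]
    for record in select_usda_funded(sample_records):
        print(record["title"])
```

A plain substring match is the simplest possible rule; a production filter would more likely rely on a controlled funder identifier if the source provides one.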
See our Harvest Policy for more details on our criteria for harvesting metadata records from outside repositories.