This dataset contains all documents (text and PDF files) and the code used to carry out the term analysis of agriculturally relevant organisms in GBIF.
The Global Biodiversity Information Facility (GBIF) is an international network and research infrastructure funded by the world's governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth.
The National Agricultural Library Thesaurus (NALT) is an online vocabulary of agricultural terms.
My task was to use the agricultural terms from the NALT to analyze the agriculturally relevant organisms in GBIF. Some of the goals were:
- Get descriptive statistics about Agrobiodiversity Data (AgData) in GBIF.
- Create visualizations to view occurrence trends of the GBIF corpus and AgData in GBIF to determine gaps or biases.
- Provide examples of and code for how agricultural researchers can work with GBIF data.
Details about the process and the methodologies used to carry out this analysis
I started by trying to extract names from the NALT. I ran into problems extracting them from the thesaurus's RDF format, and an employee at the Library later provided me with the names in a text file.
I then extracted the scientific names from that text file and ran them through the GBIF API. Because there were so many names, the API would throw a connection error: it can handle only a limited number of requests in a given interval of time. To handle this, I used exception handling in Python. Every time the API threw an error, the script waited 5 seconds and then resumed sending requests. Although this took a long time, it let me retrieve data such as year of occurrence and coordinate values for the agriculturally relevant organisms from the API.
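The wait-and-retry approach described above can be sketched roughly as follows. This is a minimal illustration, not the script from this dataset: it uses the standard-library urllib instead of requests so that it has no dependencies, and the function names, the max_attempts cap, and the choice of the GBIF species match endpoint are assumptions made for the example.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

# Public GBIF name-matching endpoint (assumed here; the original script may have used others).
GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"


def match_name(name, timeout=30):
    """Send one scientific name to GBIF and return the decoded JSON response."""
    url = GBIF_MATCH_URL + "?" + urllib.parse.urlencode({"name": name})
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def fetch_all(names, fetch=match_name, wait_seconds=5, max_attempts=10, sleep=time.sleep):
    """Call fetch(name) for every name, pausing and retrying after connection errors.

    max_attempts is a hypothetical safety cap so that one bad name cannot
    stall the run forever; names that never succeed are stored as None.
    """
    results = {}
    for name in names:
        for attempt in range(1, max_attempts + 1):
            try:
                results[name] = fetch(name)
                break  # success, move on to the next name
            except (urllib.error.URLError, ConnectionError, TimeoutError):
                if attempt == max_attempts:
                    results[name] = None  # give up on this name
                else:
                    sleep(wait_seconds)  # wait, then resume sending requests
    return results
```

Calling fetch_all(["Zea mays", "Oryza sativa"]) would query GBIF once per name and wait 5 seconds after each failed attempt. The fetch and sleep parameters exist only so the retry logic can be exercised without network access.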
I used Python because it supports both web scraping and data analysis, both of which were needed for this project. I used Jupyter notebooks, run through Anaconda. Project Jupyter is a non-profit, open-source project that supports interactive data analysis and scientific computing. It lets users write code right in the browser, removes the need to install a separate Integrated Development Environment, and makes it very convenient to share code.
The main packages used in this project are pandas for data manipulation, requests and json to interact with the GBIF API, and NumPy for array and matrix operations.
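As a rough illustration of how pandas can turn occurrence records into descriptive statistics, the sketch below counts records per year and computes the share of records with coordinates. The sample records are made up for the example; the field names year, decimalLatitude, and decimalLongitude follow the GBIF occurrence format, while the summarize function is a hypothetical helper and not code from this dataset.

```python
import pandas as pd

COLUMNS = ["scientificName", "year", "decimalLatitude", "decimalLongitude"]

# Made-up sample records, shaped like entries in a GBIF occurrence search result.
sample_records = [
    {"scientificName": "Zea mays", "year": 2015, "decimalLatitude": 19.4, "decimalLongitude": -99.1},
    {"scientificName": "Zea mays", "year": 2015, "decimalLatitude": None, "decimalLongitude": -98.2},
    {"scientificName": "Oryza sativa", "year": 2018, "decimalLatitude": 13.7, "decimalLongitude": 100.5},
    {"scientificName": "Oryza sativa", "year": None, "decimalLatitude": 14.1, "decimalLongitude": 101.0},
]


def summarize(records):
    """Return (occurrences per year, share of records with both coordinates)."""
    df = pd.DataFrame(records, columns=COLUMNS)
    # Records without a year are left out of the per-year counts.
    per_year = df["year"].dropna().astype(int).value_counts().sort_index()
    # A record counts as georeferenced only if latitude and longitude are both present.
    has_coords = df[["decimalLatitude", "decimalLongitude"]].notna().all(axis=1)
    return per_year, float(has_coords.mean())


per_year, coord_share = summarize(sample_records)
print(per_year)
print(f"Share of records with coordinates: {coord_share:.2f}")
```

Running the same summary on the full GBIF corpus and on the AgData subset would give comparable per-year series for spotting gaps or biases.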
Tableau and matplotlib were used to create visualizations after the analysis was performed in Python.