This dataset contains all the documents (as text and PDF files) and the code used to carry out a sentiment analysis of the USDA Dietary Guidelines.
The scope of the project, and of the dataset uploaded here, is a sentiment analysis of the USDA Dietary Guidelines from 1980 to 2015 (released every 5 years).
The motivation behind this project was that recommendations regarding different nutrients have changed over the years. For example, in the past fats were usually presented in a negative tone, but over time the guidelines have also noted that some types of fat, unlike others, are good for the body.
The goal was to create visualizations that convey complex information easily. The analysis covers all the statements dealing with a particular nutrient and determines not just whether each statement's sentiment is positive or negative but also the extent to which it is positive or negative. The individual statement sentiments were then averaged for each guideline year to generate trendline visualizations.
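The averaging step can be sketched as follows. This is a minimal illustration rather than the code from the notebook: the function name `average_polarity_by_year` and the sample values are made up for the example, and it assumes each statement has already been reduced to a (year, polarity) pair.

```python
from collections import defaultdict
from statistics import mean


def average_polarity_by_year(rows):
    """Return {year: mean polarity} for an iterable of (year, polarity) pairs."""
    by_year = defaultdict(list)
    for year, polarity in rows:
        by_year[year].append(polarity)
    # Sort by year so the result reads as a trend over time.
    return {year: mean(values) for year, values in sorted(by_year.items())}


# Made-up values for illustration only; they are not taken from the dataset.
sample = [(2015, 0.25), (1980, -0.5), (1980, -0.25), (2015, 0.75)]
trend = average_polarity_by_year(sample)
```

One mean per guideline year gives at most eight points per nutrient, which is enough to draw a trendline.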
There are 3 resources added along with this dataset:
The Code file: This file contains the actual code used to carry out the analysis.
The Corpus: This contains the official USDA Dietary Guidelines in both the original PDF and the converted text formats (the analysis was carried out using the files in text format).
Sentiment Polarity Value CSVs: The polarity value CSVs, which have been added as a zipped file. They contain the individual statement polarities as calculated by the NLP package. Each nutrient has a separate file for each of the 8 years under consideration (1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015), and each file holds the statements in which that nutrient appears as a word, together with their sentiment scores. Each CSV file name has 2 parts: NutrientName_Year. For example, a file named 'Fat_2015' has all the statements and the corresponding polarity values for the nutrient Fat in the Dietary Guidelines for the year 2015.
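The NutrientName_Year naming convention can be split back into its two parts with a few lines of Python. This is a sketch under assumptions: the helper name `parse_polarity_filename` is hypothetical, the `.csv` extension is assumed, and splitting on the last underscore is a choice made here so that a nutrient name containing an underscore would still parse.

```python
from pathlib import Path


def parse_polarity_filename(path):
    """Split a name such as 'Fat_2015.csv' into ('Fat', 2015)."""
    stem = Path(path).stem  # drops the directory and the extension, if any
    # Split on the LAST underscore so the year is always the final part.
    nutrient, _, year = stem.rpartition("_")
    if not nutrient or not year.isdigit():
        raise ValueError(f"not a NutrientName_Year file name: {path!r}")
    return nutrient, int(year)


nutrient, year = parse_polarity_filename("Fat_2015.csv")
```

The same helper can be mapped over the unzipped directory listing to group files by nutrient before loading them.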
Here are a few details about the process and the methodologies used to carry out this analysis:
I started by converting the PDFs to text data. I had to use Optical Character Recognition for that, and Google Docs did a fairly good job of converting everything to the textual data required for the analysis. One just needs to upload the original PDF file to Google Drive and then open the same file with Google Docs, and Google Docs does the rest.
As far as data cleaning goes, I had to remove erroneous newline characters and special characters that played no part in the sentiment. I also had to develop regular expressions to identify the beginning of a new statement, and these were later used to separate the different statements effectively. Finally, after separating the text at the individual statement level, I used the relevant package methods to obtain the sentiment scores.
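The cleaning and statement-splitting steps might look roughly like the sketch below. The actual regular expressions used in the notebook are not given in this description, so the patterns and the function names `clean_text` and `split_statements` are assumptions chosen for illustration.

```python
import re


def clean_text(raw):
    """Remove OCR line-break noise and stray special characters."""
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"-\n(?=[a-z])", "", raw)
    # Replace remaining newlines (and surrounding whitespace) with one space.
    text = re.sub(r"\s*\n\s*", " ", text)
    # Drop anything that is not a letter, digit, whitespace or common punctuation.
    text = re.sub(r"""[^A-Za-z0-9\s.,;:!?'"()%/-]""", "", text)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s{2,}", " ", text).strip()


def split_statements(text):
    """Split where sentence-ending punctuation is followed by a capital or digit."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", text)
    return [part.strip() for part in parts if part.strip()]


raw = "Eat less fat.\nChoose a diet low in satu-\nrated fat • and cholesterol. Avoid too much sugar!"
statements = split_statements(clean_text(raw))
```

A simple boundary rule like this one will mis-split on abbreviations such as "U.S. Department", so real guideline text would likely need extra patterns.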
As for the technologies used, I used Python, which is a very popular language in the data science world. I have personally used both Python and R, and went with Python in this case because I felt it had a variety of packages for data manipulation as well as Natural Language Processing.
Jupyter notebooks were used; they allow us to create and share documents that contain live code, equations, and visualizations. They let us code right in the browser, eliminate the need to install a separate Integrated Development Environment, and make it very convenient to share code. I also used Anaconda, an open source distribution that simplifies package management and deployment.
The 2 Python packages used for sentiment analysis are TextBlob and VADER. The packages differ in what they are tuned for: VADER is more or less tuned to social media data, whereas TextBlob is a beginner-level package not tuned to any specific type of data. I have used both and given the end user the option of using either package to carry out the sentiment analysis. Each package performs better than the other on some kinds of statements.
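Offering a choice between the two packages could be arranged as in the sketch below. It is not the notebook's code: the names `SCORERS` and `score_statements` are hypothetical, and it assumes VADER is installed from the `vaderSentiment` distribution. Both `TextBlob(text).sentiment.polarity` and VADER's `compound` score fall in the range -1 to 1. The packages are imported lazily so that only the chosen one needs to be installed.

```python
from functools import lru_cache


def textblob_polarity(text):
    """Polarity in [-1, 1] from TextBlob (imported only when called)."""
    from textblob import TextBlob
    return TextBlob(text).sentiment.polarity


@lru_cache(maxsize=1)
def _vader_analyzer():
    # Build the analyzer once; it loads its lexicon on construction.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer()


def vader_polarity(text):
    """Compound score in [-1, 1] from VADER."""
    return _vader_analyzer().polarity_scores(text)["compound"]


# Hypothetical registry mapping a user-facing option to a scoring function.
SCORERS = {"textblob": textblob_polarity, "vader": vader_polarity}


def score_statements(statements, method="textblob", scorers=SCORERS):
    """Return (statement, polarity) pairs using the chosen package."""
    if method not in scorers:
        raise ValueError(f"unknown method {method!r}; choose from {sorted(scorers)}")
    scorer = scorers[method]
    return [(statement, scorer(statement)) for statement in statements]
```

Calling `score_statements(statements, method="vader")` switches packages without changing the rest of the pipeline.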
For visualizations I ended up using plotly because of its ease of use and the quality of the visualizations it produces with minimal code. It is important to note, though, that only a limited number of generated visualizations are free and that it is not free for commercial use.
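A trendline of yearly averages takes only a few lines with plotly's `graph_objects` interface. The sketch below is an assumption about how such a chart could be built, not the notebook's plotting code; the helper names `trend_series` and `build_trend_figure` and the axis titles are made up for the example.

```python
def trend_series(yearly_means):
    """Turn {year: mean polarity} into parallel, year-sorted x and y lists."""
    years = sorted(yearly_means)
    return years, [yearly_means[year] for year in years]


def build_trend_figure(yearly_means, nutrient):
    """Build a line chart of mean polarity per guideline year."""
    import plotly.graph_objects as go  # imported lazily; requires plotly

    years, means = trend_series(yearly_means)
    fig = go.Figure(go.Scatter(x=years, y=means, mode="lines+markers", name=nutrient))
    fig.update_layout(
        title=f"Average sentiment polarity: {nutrient}",
        xaxis_title="Guideline year",
        yaxis_title="Mean polarity",
    )
    return fig
```

Calling `build_trend_figure(means, "Fat").show()` would render the chart in a notebook cell.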
- Data Dictionary (csv): Dataset data dictionary
Defines variables and properties for sentiment analysis data for...
This zipped file contains the corpus that was used to carry out the analysis...
Python notebook with the entire code which has everything from cleaning the...
- Nutrient Sentiment Polarities (zip)
This is a zipped file which once unzipped will have 6 different directories...
Ag Data Commons
Agricultural Research Service
University of Maryland
- Food & Nutrition