Ag Data Commons

USDA Dietary Guidelines Sentiment Analysis

dataset
posted on 2023-12-14, 14:12 authored by Shivam Saith

This dataset contains all the documents (both the text and the PDF files) as well as the code used to carry out the sentiment analysis on the USDA Dietary Guidelines.

The project, and the resulting dataset uploaded here, covers sentiment analysis of the USDA Dietary Guidelines from 1980 to 2015 (released every 5 years).

The motivation behind this project was that recommendations regarding the different nutrients have changed over the years. For example, fats have usually been presented in a negative tone in the past, but over time the guidelines have also mentioned that some types of fats, unlike others, are good for the body.

The goal was to create visualizations that convey complex information easily. The analysis covers all the statements dealing with a particular nutrient, and it determines not just whether each statement's sentiment is positive or negative but also the extent to which it is positive or negative. The individual statement sentiments have been averaged over time to generate trendline visualizations.
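
The averaging step can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: it assumes the average is taken per guideline year, and the function name `yearly_mean_polarity` and the sample values are hypothetical.

```python
from statistics import mean


def yearly_mean_polarity(polarities_by_year):
    """Average statement-level polarities for each guideline year.

    polarities_by_year maps a year to the list of polarity scores of the
    statements that mention one nutrient in that year's guidelines.
    Years with no matching statements are skipped.
    """
    return {
        year: mean(scores)
        for year, scores in sorted(polarities_by_year.items())
        if scores
    }


# Made-up values for illustration only; they are not taken from the dataset.
example = {1980: [-0.5, -0.25], 1985: [0.0, 0.5], 1990: []}
trend = yearly_mean_polarity(example)
print(trend)
```

Each value in `trend` would then be one point on the trendline for that nutrient.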

There are 3 resources added along with this dataset:

  1. The Code file: This file contains the actual code used to carry out the analysis.

  2. The Corpus: This contains the official USDA Dietary Guidelines in both the original PDF and the converted text formats (the analysis was carried out using the files in the text format).

  3. Sentiment Polarity Value CSVs: The polarity value CSVs, which have been added as a zipped file. These contain the individual statement polarities as calculated by the NLP package. The files are arranged so that each nutrient has a separate file for each of the 8 years under consideration (1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015), and each file holds the statements that contain that nutrient as a word, together with their sentiment scores. Each CSV file name has 2 parts: NutrientName_Year. For example, a file named 'Fat_2015' has all the statements and the corresponding polarity values for the nutrient Fat in the Dietary Guidelines for the year 2015.
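
The NutrientName_Year naming convention can be unpacked with a few lines of Python. The sketch below is only an illustration: the helper `parse_polarity_filename` is hypothetical, and handling of a trailing `.csv` extension is an assumption, since the description gives the name only as 'Fat_2015'.

```python
from pathlib import Path

# The 8 guideline years listed in the dataset description.
GUIDELINE_YEARS = (1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015)


def parse_polarity_filename(filename):
    """Split a name such as 'Fat_2015' or 'Fat_2015.csv' into (nutrient, year)."""
    stem = Path(filename).stem  # drops a trailing '.csv' if one is present
    nutrient, _, year_text = stem.rpartition("_")
    year = int(year_text)  # raises ValueError if the year part is not a number
    if not nutrient or year not in GUIDELINE_YEARS:
        raise ValueError(f"unexpected file name: {filename!r}")
    return nutrient, year


print(parse_polarity_filename("Fat_2015"))
```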

Here are a few details about the process and the methodologies used to carry out this analysis:

I started by converting the PDFs to text data. I had to use Optical Character Recognition (OCR) for that, and Google Docs did a fairly good job of converting everything to the textual data required for the analysis. One just needs to upload the original PDF file to Google Drive and then open the same file with Google Docs, which does the rest.

As far as data cleaning goes, I had to remove erroneous newline characters and special characters that played no part in the sentiment. I also had to develop regular expressions to identify the beginning of a new statement, which were later used to separate the different statements effectively. Finally, after separating the text at the individual statement level, I used the relevant package methods to obtain the sentiment scores.
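
The cleaning and statement-splitting steps might look something like the sketch below. The actual regular expressions live in the notebook and are not reproduced here, so the patterns and the function names `clean_text` and `split_statements` are assumptions made for illustration.

```python
import re


def clean_text(raw):
    """Tidy OCR output before splitting it into statements."""
    # Rejoin words hyphenated across a line break, e.g. "satu-\nrated".
    text = re.sub(r"-\n(?=[a-z])", "", raw)
    # Treat the remaining line breaks as ordinary spaces.
    text = text.replace("\n", " ")
    # Replace characters outside a small allowed set (letters, digits,
    # basic punctuation) with a space.
    text = re.sub(r"[^A-Za-z0-9.,;:!?%()'/ -]", " ", text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


def split_statements(text):
    """Split after sentence-ending punctuation that is followed by a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


# Hypothetical OCR fragment used only to exercise the functions.
raw = "Eat a variety of foods.\nAvoid too much fat, satu-\nrated fat, and cholesterol. \u2022 Drink water!"
statements = split_statements(clean_text(raw))
print(statements)
```

A simple split like this will mis-handle abbreviations and headings, which is presumably why the statement-boundary expressions needed to be developed by hand.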

As far as the technologies go, I used Python, which is a very popular language in the data science world. I have personally used both Python and R, and I went with Python in this case because I felt it had a variety of packages for data manipulation as well as Natural Language Processing.

Jupyter notebooks were used; they allow us to create and share documents that contain live code, equations, and visualizations. They let us code right in the browser, eliminate the need to install any other Integrated Development Environment, and make it very convenient to share code. I also used Anaconda, an open source distribution that simplifies package management and deployment.

The 2 Python packages used for sentiment analysis are TextBlob and VADER. The two differ in what they are tuned for: VADER is more or less tuned to social media data, while TextBlob is a beginner-level package not tuned to any specific type of data. I have used both of them and given the end user the option to use either package to carry out the sentiment analysis. Each package performs better than the other on some kinds of statements.
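
As a rough illustration of how either package could be plugged into the same scoring step, consider the sketch below. It is not the notebook's code: the function names and the `SCORERS` mapping are hypothetical, the `vaderSentiment` distribution of VADER is an assumption (the notebook may load it differently), and the whole-word matching rule is a guess at how "present as a word" was implemented.

```python
import re


def textblob_polarity(statement):
    """Polarity in [-1, 1] from TextBlob (requires the textblob package)."""
    from textblob import TextBlob
    return TextBlob(statement).sentiment.polarity


def vader_polarity(statement):
    """Compound score in [-1, 1] from VADER (requires the vaderSentiment package)."""
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer().polarity_scores(statement)["compound"]


# Lets the caller choose a package by name.
SCORERS = {"textblob": textblob_polarity, "vader": vader_polarity}


def score_nutrient_statements(statements, nutrient, scorer):
    """Return (statement, polarity) pairs for statements containing the nutrient.

    The nutrient must appear as a whole word, ignoring case; plural forms
    such as 'fats' are not matched by this sketch.
    """
    pattern = re.compile(rf"\b{re.escape(nutrient)}\b", re.IGNORECASE)
    return [(s, scorer(s)) for s in statements if pattern.search(s)]
```

With both packages installed, a call such as `score_nutrient_statements(statements, "Fat", SCORERS["vader"])` would score only the statements that mention fat.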

For visualizations I ended up using plotly because of its ease of use and the quality of the visualizations it produces with minimal code. Note, however, that only a limited number of generated visualizations are free and that it is not free for commercial use.


Resources in this dataset:

  • Resource Title: Nutrient Sentiment Polarities.

    File Name: NutrientSentimentStatementPolarities.zip

    Resource Description: This is a zipped file which, once unzipped, contains 6 directories, one for each of the 6 nutrients: Fat, Water, Protein, Mineral, Carbohydrate and Vitamin. Each directory contains 8 CSV files, one for each guideline year from 1980 to 2015 (at 5-year intervals).


  • Resource Title: Code.

    File Name: USDA_DietaryGuidelinesSentimentAnalysis.ipynb.html

    Resource Description: Python notebook with the entire code which has everything from cleaning the data to carrying out the analysis to generating the visualizations.


  • Resource Title: Corpus.

    File Name: Corpus.zip

    Resource Description: This zipped file contains the corpus that was used to carry out the analysis. It has both the PDF versions and the converted text files.


  • Resource Title: Data Dictionary.

    File Name: USDA_DietaryGuidelinesSentimentAnalysis_DataDictionary.csv

    Resource Description: Defines variables and properties for sentiment analysis data for carbohydrates, fats, minerals, proteins, vitamins, and water.

Funding

Agricultural Research Service

University of Maryland

History

Data contact name

Saith, Shivam

Data contact email

shivam.saith@gmail.com

Publisher

Ag Data Commons

Temporal Extent Start Date

1980-01-01

Temporal Extent End Date

2015-12-31

Theme

  • Not specified

ISO Topic Category

  • health

National Agricultural Library Thesaurus terms

Dietary Guidelines; nutrients; vitamins; lipids; carbohydrates; guidelines; computer software; data visualization; consumer attitudes

Pending citation

  • No

Public Access Level

  • Public

Preferred dataset citation

Saith, Shivam (2018). USDA Dietary Guidelines Sentiment Analysis. Ag Data Commons. https://doi.org/10.15482/USDA.ADC/1438034
