The goal of this study was to take a deep dive into the state of politics
in America. We were interested in whether Reddit could be representative
of national discourse and national sentiment, in how divisive different
political subreddits actually were, and in how common topics and sentiment
varied across subreddits of different ideologies. We performed robust
textual analysis of American political discourse on Reddit - a highly
trafficked and anonymous social media platform whose subcommunities of
dialogue make it a perfect case study of political factions. To try to
pinpoint differences between left-leaning and right-leaning political
communities in America, we analyzed three different subreddits:
r/politics (which theoretically has users of all political ideologies),
r/Republican (which caters to the political right), and r/democrats
(which caters to the political left). Through visualizations, Natural
Language Processing models, and Machine Learning models, this analysis
aimed to better understand textual patterns of different political
Reddit users and tie these trends to the state of the economy.
Through Exploratory Data Analysis we identified patterns in
engagement and Reddit user behavior across subreddits and topics. For
anyone looking to get a political message across, or to organize an event
around a political issue, and hoping to reach the largest audience on
Reddit, the best time to post in any of these political subreddits is the
afternoon (12pm - 6pm). Avoid posting in the middle of the night or early
in the morning on these platforms! Additionally, engagement can be
maximized by optimizing text length: titles in the range of 50 to 100
words achieved the highest engagement in political subreddits.
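As a rough sketch of how these engagement patterns can be surfaced, the snippet below groups submissions by posting hour and by title length with pandas. The file name and column names (created_utc, score, title) are assumptions for illustration, not the project's actual schema.

```python
import pandas as pd

# Hypothetical export of subreddit submissions; file and column names are assumed.
posts = pd.read_csv("submissions.csv")

# Posting hour (0-23) derived from the Unix timestamp on each submission.
posts["hour"] = pd.to_datetime(posts["created_utc"], unit="s").dt.hour

# Mean score by hour of day; afternoon hours (12-18) showed the highest engagement.
print(posts.groupby("hour")["score"].mean().sort_index())

# Title length in words, bucketed to compare engagement across length ranges.
posts["title_words"] = posts["title"].astype(str).str.split().str.len()
buckets = pd.cut(posts["title_words"], bins=[0, 25, 50, 100, 200])
print(posts.groupby(buckets, observed=True)["score"].mean())
```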
Through Natural Language Processing, our primary finding was
how sentiment varies across subreddits. Both sides of the
political spectrum are grappling with intense and controversial issues –
e.g. gun violence, healthcare, education, the strength of our democratic
systems – and we were curious to see how people of different ideologies
were talking about these issues. Were people more civil and positive
when talking to like-minded individuals? Did discourse on r/politics, a
melting pot of ideas, spawn more debate and negative sentiment? How did
this vary by topic? Were some topics, potentially more controversial and
high profile topics like the police, met with positive, neutral, or
negative sentiment? Our NLP analysis revealed that the comments of
r/Republican skew positive, while the comments of r/democrats and
r/politics skew slightly negative (46%, 50.8%, and 62%,
respectively). Additionally, discussion of four main topics - Climate,
Economy, Healthcare, and the Police - was mostly negative, at 61.5%,
65.9%, 66.5%, and 66.7% negative, respectively.
Climate had the highest rate of positive discourse (33.8%),
which is hopefully indicative of problem solving and solution building
around such an important global issue. Through NLP we also pulled
out the keywords in each subreddit that sparked the most discussion:
"vote", "civil", and "plea" for r/politics; "republican",
"vote", and "concern" for r/Republican; and "democrat", "reddit",
and "rule" for r/democrats. These term frequencies, as well as
comment frequencies, offer a small but important glimpse into which
subjects are most interesting to people of each political group.
[Fig 1] Sentiment of Text Broken Down by Subreddit
[Table 1] Sentiment of Text By Topic
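To illustrate how the per-subreddit and per-topic breakdowns in Fig 1 and Table 1 can be produced, here is a minimal sketch using NLTK's VADER analyzer. The report does not specify which sentiment model was used, and the input file, column names, and compound-score thresholds below are assumptions.

```python
import nltk
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

# Hypothetical export of subreddit comments with 'body' and 'subreddit' columns.
comments = pd.read_csv("comments.csv")

def label(text: str) -> str:
    """Map VADER's compound score to a positive/neutral/negative label."""
    score = sia.polarity_scores(str(text))["compound"]
    if score >= 0.05:
        return "positive"
    if score <= -0.05:
        return "negative"
    return "neutral"

comments["sentiment"] = comments["body"].apply(label)

# Share of positive/neutral/negative comments within each subreddit (cf. Fig 1).
print(comments.groupby("subreddit")["sentiment"].value_counts(normalize=True))
```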
Through Machine Learning, we solidified our earlier assumptions
about division across the subreddits. External research suggested that
America was more divided than ever into a two-party system. We aimed to
test that assumption by classifying text into our three subreddits – to
see how well a well-tuned Random Forest classifier could distinguish
text from each. After hyperparameter tuning, we were able to achieve a
classification accuracy of 45% for comments and 50% for titles. Though
imperfect, it was promising that textual classification alone was
indicative of political leaning at 50% accuracy, well above the roughly
33% baseline of random guessing across three subreddits. The ways in
which people across the political spectrum get their news and talk about
certain issues are distinct, and if you're a non-profit trying to get a
message across about investments in climate change, understanding these
distinctions is critical. ML also allowed us to investigate the
relationship between the economy and politics more deeply. Two robust
supervised regression analyses did not find a strong correlation between
the text of Reddit posts and the Dow index, indicating that the state of
the economy may be less correlated with political discourse than
originally suspected. Further research using other economic indicators,
across a wider timeframe and with daily granularity, would be needed to
confirm this result.
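The report does not detail how the regressions were set up, but one plausible arrangement is sketched below: each day's comment text is vectorized with TF-IDF and regressed against that day's Dow close with a Ridge model. The file names, column names, and choice of Ridge regression are all assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical inputs: a comment export with a 'date' column and a table of
# daily Dow closes; both file and column names are placeholders.
comments = pd.read_csv("comments.csv")
dow = pd.read_csv("dow_daily.csv")  # columns: date, close

# Concatenate each day's comment text so every row represents one trading day.
daily_text = comments.groupby("date")["body"].apply(
    lambda s: " ".join(s.astype(str))
).reset_index()
data = daily_text.merge(dow, on="date")

X = TfidfVectorizer(max_features=5000, stop_words="english").fit_transform(data["body"])
y = data["close"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = Ridge(alpha=1.0).fit(X_train, y_train)

# A low R^2 on held-out days would be consistent with the weak text-Dow
# relationship reported above.
print("R^2 on held-out days:", r2_score(y_test, model.predict(X_test)))
```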
[Table 2] Classification Outcomes
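To make the classification pipeline behind Table 2 concrete, the sketch below pairs TF-IDF features with scikit-learn's RandomForestClassifier. The column names and hyperparameter values are placeholders, not the tuned settings from this study.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical comment export with text ('body') and the subreddit label.
comments = pd.read_csv("comments.csv")

X_train, X_test, y_train, y_test = train_test_split(
    comments["body"].astype(str),
    comments["subreddit"],
    test_size=0.2,
    random_state=0,
    stratify=comments["subreddit"],
)

# TF-IDF features feeding a Random Forest; hyperparameters here are illustrative.
clf = make_pipeline(
    TfidfVectorizer(max_features=20000, stop_words="english"),
    RandomForestClassifier(n_estimators=300, random_state=0),
)
clf.fit(X_train, y_train)

# With three roughly balanced subreddits, random guessing sits near 33%,
# so the reported 45-50% accuracy is well above chance.
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```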
Future Work
Beyond this study, there are a few areas of future research we hope to
expand into. Firstly, expanding the granularity of our key economic
indicators to the day level, when available, would allow for much more
robust regression analysis. Though we examined the Dow, CPI, and
unemployment rates, it may be worth investigating other indicators as
well - potentially something like gas prices, which directly impact
constituents on a day-to-day basis. Additionally, in future analysis we
plan to look at a longer time frame. When studying long-term trends such
as economic health or the effects of an election year on text sentiment,
a two-year period is not enough time to generate defensible results.
Being able to examine Reddit posts across a 10- or 20-year period that
spans multiple midterm and presidential elections would offer a much
better view into temporal trends. Finally, future analysis could
investigate other text-based social media platforms where politics is
discussed - such as Twitter or Facebook - in order to compare the metrics
of interest across different public text sources.
Appendix
In the final iteration of this study, we improved our analysis by
expanding our ML work. In the interim, we ran Random Forest subreddit
classification on the ‘submissions’ data as well, to understand how
titles across subreddits can be classified. We were able to pick up
differences in classification accuracy between comment classification
and title classification, which expanded our understanding of our key
business questions.