Text analysis is a skill I have long looked forward to acquiring and developing. In this project, I use several R packages and text analysis tools to investigate data collected from Twitter. As a social media platform, Twitter offers an easy-to-use medium to share information and spread ideas. I decided to explore climate-related issues, and was interested to see what I would find by analyzing tweets that focus on recycling.
Collecting tweets
To collect tweet data, the first thing we need is access to the Twitter API (plenty of others have explained how to do it, see this example). After getting approved and adding the relevant keys, I used the twitteR package to pull 4000 tweets that were shared during one week in the middle of May 2021.
Within climate issues, I was interested in the actions people take and the information they share in the context of recycling. To cover as much ground as possible, my Twitter search focused on the term #recycling.
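A minimal sketch of what this collection step could look like with twitteR. The helper name `collect_recycling_tweets`, its arguments, and the exact date bounds are my own placeholders for illustration, not the code used in the project, and the function is only defined here, not called, since it needs valid API keys and network access.

```r
# Hypothetical helper: wraps the twitteR calls needed to pull tweets for one hashtag.
# The four key arguments are placeholders that must come from your own Twitter developer account.
collect_recycling_tweets <- function(consumer_key, consumer_secret,
                                     access_token, access_secret,
                                     n = 4000) {
  # Authenticate the R session against the Twitter API
  twitteR::setup_twitter_oauth(consumer_key, consumer_secret,
                               access_token, access_secret)

  # Search for the hashtag; the since/until dates are assumed bounds for the week
  raw_tweets <- twitteR::searchTwitter("#recycling", n = n,
                                       since = "2021-05-13",
                                       until = "2021-05-20")

  # Convert the list of status objects into a data frame
  # (columns include text, created and favoriteCount)
  twitteR::twListToDF(raw_tweets)
}
```

Calling the helper with real keys would return one row per tweet, which is the shape assumed by the later steps.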
Timing
The first question I address is about time. I collected tweets that were sent between May 13th and 20th, 2021. One way to visualize the data is to plot a timeline. In the figure below, it appears that more tweets were sent during the latter half of the week, although there are fluctuations after May 17th as well.
Moving on from the full timeline, another question I looked at is the time of day at which most tweets are sent. To unpack it further, I split the data and compared tweets that attracted other users’ attention and received more than three "likes" (the average for the whole dataset is a little less than two "likes") with all those that were ‘less popular’. The figure below displays the distribution of tweets over an average 24-hour day.
As a whole, there are no major differences: most tweets are sent from the late morning until the late afternoon. However, the 'popular' tweets seem to peak within a more condensed time frame, compared to the less popular ones, which are dispersed throughout the day. The higher density of tweets about #recycling during the workday may come from professional organizations and firms that share information about their actions (a social media department at work), or perhaps from individuals who tweet about the activities they participate in during that time.
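The popular versus less popular split described above can be sketched roughly as follows. The toy data frame is made up for illustration; I assume the `created` and `favoriteCount` column names that twitteR's data frame output uses, and the plotting choices are my own.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the collected tweets (values are invented for illustration)
tweets <- tibble(
  created = as.POSIXct(
    c("2021-05-14 09:30:00", "2021-05-14 11:15:00", "2021-05-15 12:45:00",
      "2021-05-16 03:20:00", "2021-05-17 14:05:00", "2021-05-18 21:50:00"),
    tz = "UTC"
  ),
  favoriteCount = c(5, 8, 4, 0, 1, 0)
)

tweets_by_hour <- tweets %>%
  mutate(
    # Time of day as a decimal hour (e.g. 09:30 becomes 9.5)
    hour_of_day = as.numeric(format(created, "%H")) +
      as.numeric(format(created, "%M")) / 60,
    # Tweets with more than three likes are labelled "popular"
    popularity = if_else(favoriteCount > 3, "popular", "less popular")
  )

# Density of tweets over a 24-hour day, one curve per group
hour_plot <- ggplot(tweets_by_hour, aes(hour_of_day, fill = popularity)) +
  geom_density(alpha = 0.5) +
  scale_x_continuous(limits = c(0, 24), breaks = seq(0, 24, 4)) +
  labs(x = "Hour of day", y = "Density")
print(hour_plot)
```

The three-likes cutoff is the one mentioned in the text; with real data the timestamps may also need converting to the local time zone of interest.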
Patterns in tweets – top hashtags
One approach to searching for interesting patterns in the data, and exploring whether tweets on a topic share common themes, is to examine the hashtags attached to them. Hashtags allow users to relate their content to more general topics and expose their tweets to a larger audience.
To explore the content of tweets, I leverage tools from the tidytext package. First, I use the function unnest_tokens, which breaks text down into words and arranges them into a dataset. Then, using data management functions from dplyr, I filter the terms associated with a hashtag (#) and keep only the 15 most frequently used hashtags.
The results are displayed in the table below. One super-cool package for creating beautiful tables is reactable, which offers many features for displaying results. For my results, I removed #recycling, which (unsurprisingly) is the most frequent hashtag, and I present the other frequent terms that are used in tweets about recycling.
Another highly frequent term in these tweets is #sustainability. It may be part of tweets in which users encourage the adoption of innovative solutions to recycling, or try to emphasize its benefits. #plastic is directly related to recycling, as a significant amount of global effort is devoted to identifying solutions that help reduce the damage caused by plastic. The popularity of terms such as #homeschooling and #parents seems to indicate the important role that education plays in sharing stories about recycling and teaching the younger generation about its importance.
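Here is a rough sketch of the hashtag count, run on a few made-up tweets. One assumption to flag: the default word tokenizer in unnest_tokens strips the # sign, so this version splits on whitespace (token = "regex") and then extracts the hashtag with stringr. That is my own workaround and may differ from the tokenizer settings used in the original analysis.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Made-up tweets for illustration
tweets <- tibble(text = c(
  "Great tips on #recycling and #sustainability!",
  "#Recycling #plastic bottles at home #sustainability",
  "Teaching kids about #recycling #homeschooling",
  "Less #plastic, more #recycling. #sustainability"
))

top_hashtags <- tweets %>%
  mutate(text = str_to_lower(text)) %>%
  # One row per whitespace-separated token, keeping the leading '#'
  unnest_tokens(word, text, token = "regex", pattern = "\\s+",
                to_lower = FALSE) %>%
  # Keep only the hashtag part of each token (drops trailing punctuation)
  mutate(hashtag = str_extract(word, "#\\w+")) %>%
  filter(!is.na(hashtag), hashtag != "#recycling") %>%
  count(hashtag, sort = TRUE) %>%
  slice_head(n = 15)

print(top_hashtags)
```

The resulting `top_hashtags` data frame is the kind of object that can be handed to `reactable::reactable()` to render an interactive table.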
Dig deeper into the text – top words
Exploring the more popular hashtags is a first step in learning about a topic. A more thorough examination of the data will focus on the content of tweets and assess various links in this textual information.
The first necessary step is to clean the text so that common elements such as blank spaces, punctuation marks, links, etc. do not clutter the analysis. After removing those elements, the next step is identifying and dropping stop words (words that are not useful for our analysis, for example “the”, “of”, etc.). There are several methods to conduct this step. For this task I used the Corpus method, which creates a dataframe with a specific column for the text itself. After cleaning the data of all irrelevant words, I created a wordcloud, a general visualization tool that shows the frequent words in the tweets.
The wordcloud plot shows some similarities with the previous analysis of the hashtags. Recycling is the most common word (not shown here since it scales badly), but terms like plastic, waste, sustainability and others are also common.
One potential conclusion at this stage is that tweets that focus on recycling tend to emphasize its solutions and benefits: how can we promote the topic and further educate people about it? What options are ‘out there’ to engage in this important task? (More on the positive angle of these tweets in the sections below.)
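The cleaning steps above can be sketched as follows. I am assuming here that the "Corpus method" refers to the tm package and that the cloud is drawn with the wordcloud package; the example tweets and the exact order of the cleaning steps are illustrative rather than the project's own code.

```r
library(tm)
library(wordcloud)

# Made-up tweets for illustration
tweets_text <- c(
  "The #recycling of plastic waste is great https://t.co/abc",
  "Plastic bottles and waste: reduce, reuse, recycle!"
)

corpus <- VCorpus(VectorSource(tweets_text))

# Custom transformer that drops links before punctuation is removed
remove_urls <- content_transformer(function(x) gsub("http\\S+", "", x))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, remove_urls)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
# Drop English stop words, plus "recycling" itself since it dominates the plot
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "recycling"))
corpus <- tm_map(corpus, stripWhitespace)

# Word frequencies across all tweets
tdm <- TermDocumentMatrix(corpus)
word_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

wordcloud(names(word_freq), word_freq, min.freq = 1, random.order = FALSE)
```

Removing links before punctuation matters: once the punctuation is gone, a URL turns into an ordinary-looking string that is much harder to filter out.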
What more can we learn by unpacking the content of tweets?
The data collected for this project focused on tweets over a week in the middle of May 2021. Therefore, we can search for interesting trends in the textual data over time. In particular, I ask: when do people share tweets about recycling and related topics/terms?
Once again, to conduct this analysis, we must begin by cleaning the data of any unnecessary text. This time I use a method like the one employed for the top hashtags analysis, i.e. the unnest_tokens function and several other tools from the tidytext package. After the initial cleaning, the list of most frequent words still includes a few that are not relevant (such as “http”, from URLs), so I ran a command to remove a few specific stop words and arranged the data by the most frequently used words in the tweets.
For the trends analysis, I chose the top 4 terms (recycling, plastic, sustainability and waste) and added two more related terms (packaging, bottles) that are likely to be part of information about recycling actions. For the analysis of this reduced data, I was mostly interested in the timing of tweets that contain the top words. Recycling is by far the most frequent word (over 2000 mentions in the data), so I plot it as a separate trend.
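A small sketch of this word-count step on made-up tweets. The contents of `custom_stop_words` are my guess at the kind of leftovers one would drop (URL fragments and the like), not the exact list used in the project.

```r
library(dplyr)
library(tidytext)

# Made-up tweets for illustration
tweets <- tibble(text = c(
  "The future of #recycling is plastic free https://t.co/abc",
  "Recycling plastic bottles reduces waste",
  "Why recycling matters for sustainability"
))

# Assumed list of extra terms to drop after the standard stop words
custom_stop_words <- tibble(word = c("http", "https", "t.co", "amp"))

top_words <- tweets %>%
  unnest_tokens(word, text) %>%                      # one row per word
  anti_join(stop_words, by = "word") %>%             # tidytext's built-in stop words
  anti_join(custom_stop_words, by = "word") %>%      # project-specific leftovers
  count(word, sort = TRUE)

print(top_words)
```

Inspecting the head of `top_words` after the first anti_join is a handy way to spot which leftover tokens still need to go into the custom list.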
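The trend counts can be sketched along these lines. The toy timestamps and tweets are invented, the `created` column name follows twitteR's output, and the faceted bar chart is just one possible way to draw the result.

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Made-up tweets with timestamps for illustration
tweets <- tibble(
  created = as.POSIXct(
    c("2021-05-15 10:05:00", "2021-05-15 10:40:00", "2021-05-18 14:10:00"),
    tz = "UTC"
  ),
  text = c(
    "Too much plastic packaging #recycling",
    "Plastic bottles everywhere",
    "Cut food waste and plastic waste"
  )
)

terms_of_interest <- c("plastic", "sustainability", "waste",
                       "packaging", "bottles")

term_trends <- tweets %>%
  unnest_tokens(word, text) %>%
  filter(word %in% terms_of_interest) %>%
  mutate(
    day  = as.Date(created),
    hour = as.integer(format(created, "%H"))
  ) %>%
  count(word, day, hour)   # mentions of each term per day and hour

# One panel per day, bars showing mentions by hour
trend_plot <- ggplot(term_trends, aes(hour, n, fill = word)) +
  geom_col() +
  facet_wrap(~ day) +
  labs(x = "Hour of day", y = "Mentions")
print(trend_plot)
```

Filtering for "recycling" alone and plotting it the same way gives the separate trend for the dominant term.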
In the figure above, we can see the number of times the term recycling was tweeted. Throughout the timeline, there is an inverted-U-shaped trend in which people primarily tweet between 9-10 am and 2-4 pm. In the early part of the week (the top row of the figure), there is a large number of tweets in the late afternoon hours. In the latter half of the week, most tweets shift to an earlier hour in the day (closer to noon).
The second part of this analysis focused on the other terms in the data. The figure below shows when and how many times the other frequent terms were tweeted during the week. The terms plastic and waste have a 'shallower' inverted-U trend compared to recycling, and the remaining terms do not seem to follow a specific pattern in an average day. Looking beyond the overall distribution of the tweets and the different terms, it seems that many people shared content with the term plastic on May 15th (Saturday), while the term waste was more dominant on May 18th (Tuesday).
Sustainability, which was the second most tweeted term after recycling, does not seem to follow a certain pattern (other than a larger number of tweets on May 19th, very early in the morning).
One potential explanation for these patterns is the release of the “Plastic waste index” report on May 17th. The report details the companies that contribute the most to generating throwaway plastic items. The spread of this news starting from the 17th may have contributed to the increase in tweets about these topics.
Text also conveys emotions
The analysis of tweets can also benefit from tools that investigate the textual data and identify certain emotional reactions (very common in many social media platforms). This can be accomplished by employing sentiment analysis.
This type of analysis, also termed opinion mining, captures the tone of the text and whether it lends itself to either positive or negative emotions. There are a variety of methods to conduct sentiment analysis and here I talk about two of them.
First, I employ the syuzhet package with the NRC sentiment lexicon. The procedure classifies the words in the text into binary values ('yes' / 'no') for categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. Based on this data, I create a plot that aggregates the overall sentiments in all the tweets about #recycling. While my early intuition was that the general tone of the tweets would be negative, it seems that positive emotions are more prevalent, including trust, joy and anticipation (in addition to the general positive category). Negative sentiments do exist but seem to have lower scores.
Going beyond the general sentiment of the tweets data, I ran one more test. The objective is to explore the text and identify which words ‘drive’ those sentiments (both positive and negative). To accomplish that, I went back to the tidytext package and used the unnest_tokens function to unpack the tweets data, and then removed stop words.
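A minimal sketch of the NRC step with syuzhet, on two made-up tweets. The base R barplot is my own stand-in for the aggregate plot, not the figure from the project.

```r
library(syuzhet)

# Made-up tweets for illustration
tweets_text <- c(
  "I love recycling, it gives me hope for a clean future",
  "Toxic waste is a terrible problem"
)

# One row per tweet, one column per NRC category (eight emotions plus
# negative and positive); each cell counts the matching words in that tweet
nrc_scores <- get_nrc_sentiment(tweets_text)

# Aggregate across all tweets and sort from most to least frequent category
sentiment_totals <- sort(colSums(nrc_scores), decreasing = TRUE)

barplot(sentiment_totals, las = 2, ylab = "Word count")
```

Dividing `sentiment_totals` by its sum would turn the counts into shares, which can be easier to compare across datasets of different sizes.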
Then, I employed a second approach to sentiment analysis. Using the bing lexicon (part of tidytext), I counted the most common words and displayed the top 10 positive and negative words in the figure below, including their frequency.
The most frequent words in the entire data play a central role in driving the sentiments expressed by users. Negative emotion is associated with terms such as waste, garbage and toxic, all of which represent the problems that push us to search for recycling solutions. On the other hand, the positive sentiment is related both to emotions such as excitement and to outcomes such as sustainability and improvement. Interestingly, innovation is a common word that drives the positive tone of the tweets. It may suggest that sharing information or learning about the role innovation plays in recycling efforts lends itself to a positive sentiment.
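The bing-lexicon count can be sketched like this, again on made-up tweets. The object names are mine, and the project's actual code may differ in its details.

```r
library(dplyr)
library(tidytext)

# Made-up tweets for illustration
tweets <- tibble(text = c(
  "Toxic waste is a growing problem",
  "Reduce waste with innovation",
  "Innovation keeps our oceans clean"
))

bing_word_counts <- tweets %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  # Keep only words that appear in the bing lexicon, with their label
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)

# Top 10 words within each sentiment (positive / negative)
top_sentiment_words <- bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%
  ungroup()

print(top_sentiment_words)
```

Words that are missing from the bing lexicon simply drop out at the inner join, so the counts reflect only terms the lexicon labels as positive or negative.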
Summary
Text analysis of tweets offers a method to explore many different questions and identify a variety of patterns related to topics that interest us.
My project focused on the environment. I collected tweets that contain the term #recycling, and found some interesting patterns. For example, an analysis of when people share this information reveals that weekdays are more popular, particularly normal working hours (between 9-10am and 3-4pm). I also looked at the most frequently used hashtags in the data. The analysis shows which terms are most common and are tweeted in tandem with #recycling. Those patterns were also evident in the top words analysis: terms such as sustainability, waste and plastic, along with education-related terms, were among the most frequently used words in the data. A trends analysis of top words showed a potential link to global news that relates to recycling, as the number of tweets using these terms spiked after the release of a major report about damaging plastics.
Finally, sentiment analysis demonstrated that when it comes to recycling, the emotional tone of Twitter users is mostly positive and driven by outcomes (sustainability, clean) and views about the solution (innovation).
There are other angles to explore when it comes to users’ tweets and textual data overall. By leveraging the ever-growing set of tools in packages such as tidytext, we can unpack all kinds of data and learn so much more from content that relates to topics we care about.
As always, the complete code for the analysis is available on my GitHub.