This repository provides the code and guidance to create a database of tweets about Mar Menor and perform a sentiment analysis to study how contamination affected the evolution of public opinion.
# First step: Generate the database with mar_menor.py
This code uses the snscrape library, which requires the Python 3.8 interpreter to run. Earlier versions cannot run this code, and later versions may lead to errors.
snscrape is a library for scraping historical posts from social media platforms such as Twitter, Facebook, Instagram, and Reddit. In this application, we focus on Twitter.
The code searches Twitter for all tweets containing the words “Mar Menor” posted on a given day. This process is repeated in a loop for every day from 1 January 2010 to 18 March 2022.
We can modify the parameters to change the maximum number of tweets we want to get per day, the starting date, and the ending date of the extraction. The code stores the date, the text, and the username of every tweet returned by the search, and creates one separate .xlsx file per year (a sketch of this loop follows the list of reasons below). We do this for three reasons:
First, the code takes a long time to run. If the execution is interrupted, we keep the database up to the last completed year before the interruption, so we only need to rerun the code for the remaining years instead of starting from the beginning.
Second, textual data takes up a lot of memory. If we stored everything in a single dataframe within Python, we would very likely run out of memory.
Third, splitting the sample into different files has several computational advantages. We can build or analyse the database in parallel, using many computers or threads at the same time.
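The following is a minimal sketch of this scraping loop, assuming snscrape's Python API; tweet attribute names (e.g. `tweet.content` vs `tweet.rawContent`) vary across snscrape releases, and the output file names are illustrative rather than the exact ones produced by mar_menor.py.

```python
# Sketch of the daily scraping loop (attribute names may differ
# between snscrape versions; file names are illustrative).
import datetime as dt
import pandas as pd
import snscrape.modules.twitter as sntwitter

MAX_TWEETS_PER_DAY = 1000          # adjustable cap on tweets kept per day
START = dt.date(2010, 1, 1)        # starting date of the extraction
END = dt.date(2022, 3, 18)         # ending date of the extraction

rows, current_year = [], START.year
day = START
while day <= END:
    # search all tweets containing "Mar Menor" posted on this day
    query = f'"Mar Menor" since:{day} until:{day + dt.timedelta(days=1)}'
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i >= MAX_TWEETS_PER_DAY:
            break
        rows.append({"date": tweet.date,
                     "text": tweet.content,
                     "username": tweet.user.username})
    next_day = day + dt.timedelta(days=1)
    if next_day.year != current_year or next_day > END:
        # one separate .xlsx per year, as explained above
        pd.DataFrame(rows).to_excel(f"mar_menor_{current_year}.xlsx", index=False)
        rows, current_year = [], next_day.year
    day = next_day
```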
# Second step: Classify the tweets with Sentiment_analysis.py
This program takes as input the files generated in the previous step. It runs a loop that, in each iteration, analyses all the tweets about Mar Menor in a given year and generates a new .xlsx file with the sentiment data for that year, containing the daily numbers of positive, negative, neutral, and total tweets about Mar Menor. First, it detects the language of each tweet with the langdetect library. Then it classifies the subjective polarity of the tweet. The program can filter Spanish and English tweets.
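As a minimal sketch, and assuming the langdetect API (the helper name `tweet_language` is ours), the language filter looks like this:

```python
# Sketch of the language-detection step with langdetect
from langdetect import detect

def tweet_language(text):
    # returns an ISO 639-1 code such as "es" or "en";
    # langdetect raises an exception on empty or undetectable text
    try:
        return detect(text)
    except Exception:
        return None

# keep a tweet only if tweet_language(text) is "es" or "en"
```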
To work with the Spanish tweets, we used the sentiment-analysis-spanish Python library, created by Hugo J. Bello (https://pypi.org/project/sentiment-analysis-spanish). It uses convolutional neural networks to predict the probability that a Spanish text is positive. We classified a tweet as positive, negative, or neutral if this probability is >0.6, <0.4, or between 0.4 and 0.6, respectively. This machine learning model was trained on over 800,000 user reviews from El Tenedor, Decathlon, TripAdvisor, FilmAffinity, and eBay. The model has a validation accuracy (accuracy on fresh data not used for training) of 88%.
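A sketch of this classification rule, following the usage shown on the library's PyPI page (the function `classify_spanish` is a hypothetical helper):

```python
# Sketch of the Spanish polarity rule with sentiment-analysis-spanish
from sentiment_analysis_spanish import sentiment_analysis

analyzer = sentiment_analysis.SentimentAnalysisSpanish()

def classify_spanish(text):
    p = analyzer.sentiment(text)   # probability that the text is positive
    if p > 0.6:
        return "positive"
    if p < 0.4:
        return "negative"
    return "neutral"               # 0.4 <= p <= 0.6
```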
The algorithm for English tweets is VADER (Valence Aware Dictionary for Sentiment Reasoning), which is available in the NLTK Python library. It is a lexicon (LEX) and rule-based model that has been shown to outperform other rule-based and machine learning models previously used in the literature (Hutto and Gilbert, 2014). One of the outputs of the VADER polarity score algorithm is called the compound score, and its sign tells us whether the tweet is positive, negative, or neutral (compound >0, <0, =0, respectively).
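The corresponding rule for English tweets, sketched with NLTK's VADER implementation (`classify_english` is a hypothetical helper):

```python
# Sketch of the English polarity rule with NLTK's VADER
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")     # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

def classify_english(text):
    compound = sia.polarity_scores(text)["compound"]
    if compound > 0:
        return "positive"
    if compound < 0:
        return "negative"
    return "neutral"               # compound == 0
```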
Machine Learning (ML) models are fed with a training sample: thousands of user reviews, each with its text and its rating of the reviewed service or product. ML models learn patterns in the texts and establish relationships between these patterns and the ratings of the reviews. In this process, the textual information is transformed into vectors indicating which patterns occurred in the input, which makes it hard to understand which patterns the ML model is using after it has been trained. ML models are therefore sometimes described as black boxes: they produce useful information without revealing anything about their internal workings.
Unlike ML models, LEX rule-based models do not need to be trained; they use a set of predefined linguistic rules to analyze the text and determine its polarity. While the classification performance of ML models relies on the quality of the training sample, LEX rule-based models depend on the quality of the linguistic resources used to build the rules. LEX rule-based models are also more transparent about their internal workings. Henriquez, Guzman, & Santamaria (2016) found that ML models work better for Spanish text, but Hutto and Gilbert (2014) concluded that LEX rule-based models are more efficient at classifying English textual data. These differing results may be due to differences in the training samples and linguistic resources available in the two languages.
# Third step: Merge the sentiment data into a single .xlsx with merge.py
Once we have all the sentiment data, we merge it into a single .xlsx file.
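A minimal sketch of this merge, assuming one sentiment file per year with hypothetical names of the form `sentiment_<year>.xlsx`:

```python
# Sketch of the merge step (file names are illustrative)
import pandas as pd

frames = [pd.read_excel(f"sentiment_{year}.xlsx") for year in range(2010, 2023)]
merged = pd.concat(frames, ignore_index=True)
merged.to_excel("sentiment_all.xlsx", index=False)
```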
Then we group the data by week, adding up the numbers of positive, negative, neutral, and total tweets within each week. We do this to reduce noise, since high-frequency data is very noisy; this way, the graphs are easier to interpret.
To group by week we use pseudo_week_transformation.py. The first 3 pseudo-weeks comprise 7 days each; the last one runs from day 22 until the end of the month. We do this so that every month always has 4 pseudo-weeks, which is standard in the economics literature; see, for example, Lewis et al. (2022). Otherwise, some months would have 4 weeks and other months would have 5 weeks, which creates problems when feeding this data into mathematical or statistical models.
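A sketch of the pseudo-week rule; the dataframe and column names are illustrative rather than the exact ones used in pseudo_week_transformation.py:

```python
# Days 1-7, 8-14 and 15-21 form pseudo-weeks 1-3; day 22 onwards is week 4
import pandas as pd

def pseudo_week(day_of_month):
    return min((day_of_month - 1) // 7 + 1, 4)

# hypothetical daily counts; the real input comes from the merged file
df = pd.DataFrame({
    "date": pd.date_range("2021-08-01", "2021-08-31", freq="D"),
    "positive": 1, "negative": 2, "neutral": 1, "total": 4,
})
df["pseudo_week"] = df["date"].dt.day.map(pseudo_week)
weekly = (df.groupby([df["date"].dt.to_period("M"), "pseudo_week"])
            [["positive", "negative", "neutral", "total"]].sum())
print(weekly)   # 4 pseudo-weeks per month; the last one covers days 22-31
```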
Since the last pseudo-week is slightly longer, the number of tweets may be greater in the last week of every month; but when we compute the proportions of negative, positive, and neutral tweets, we control for the different overall number of tweets in each week.
This last grouping step is optional; if it is not performed, the data remains available at a daily frequency.
# Bibliography
Henriquez Miranda, C. N., Guzman, J., & Santamaria, R. (2016). A review of Sentiment Analysis in Spanish. TECCIENCIA, 12(22), 40–47. Retrieved from https://revistas.ecci.edu.co/index.php/TECCIENCIA/article/view/320
Hutto, C., & Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the International AAAI Conference on Web and Social Media, 8(1), 216-225.
Lewis, D. J., Mertens, K., Stock, J. H., & Trivedi, M. (2022). Measuring real activity using a weekly economic index. Journal of Applied Econometrics, 37(4), 667-687.
# Contact information
Manuel Medina Magro (mmemag97@gmail.com)