How do we build an automated tool for tweet sentiment analysis that learns from crowdsourced human annotations? This is the challenge addressed by the Bayesian Classifier Combination with Words (BCCWords) model presented in the paper to appear in WWW 2015:


Simpson, Edwin, Venanzi, Matteo, Reece, Steven, Kohli, Pushmeet, Guiver, John, Roberts, Stephen and Jennings, Nicholas R. (2015) Language Understanding in the Wild: Combining Crowdsourcing and Machine Learning. In, 24th International World Wide Web Conference (WWW 2015)


This problem involves classifying the sentiment of a large corpus, i.e., hundreds of thousands, of tweets using only a small set of crowdsourced sentiment labels provided by human annotators. In particular, this problem is relevant to various text mining tasks such as weather sentiment classification from twitter [1] and disaster response applications, e.g., the Ushahidi-Haiti project where 40,000 emergency reports were received in the first week from victims of the 2010 Haiti earthquake [2].


To build such a system that classifies tweets based on crowdsourced judgments, we must deal with three key challenges. Firstly, each annotator may have different reliabilities of labelling tweets correctly depending on the content of the tweet. In fact, interpreting sentiment or relevance of a piece of text is highly subjective and, along with variations in annotators’ skill levels, it can result in disagreement amongst the annotators. Secondly, typically there are so many tweets that a small number of dedicated expert labellers will be overwhelmed as was the case during the Haiti earthquake. As a result, the human labels may or may not cover the whole set of tweets, so we may have tweets with only one label or multiple, perhaps conflicting, labels, or none. Thirdly, each distinct term of the dictionary has different probabilities to appear in tweets of different sentiment classes. For example, the terms “Good” and “Nice” are more likely to be used for tweets with a positive sentiment. Thus, we must be able to provide reliable classifications of each tweet by leveraging the language model inferred from the aggregated crowdsourced labels to classify the entire corpus (i.e., tweet set).


To solve this problem, BCCWords builds upon the core structure the Bayesian Classifier Combination model (BCC) to add a new feature relating to modelling human language in addition to aggregating crowdsourced labels. In detail, BCC can learn the reliability of each annotator through a confusion matrix expressing the labelling probabilities for each possible sentiment class. Here are examples of confusion matrices for two annotators rating tweets in five sentiment classes [neutral, positive, not related, unknown] from the CrowdFlower (CF) dataset described in the paper:




Thus, by combining the aggregation mechanism of BCC with language model, we are able to simultaneously inferring the confusion matrix of each worker, the true label of each tweet and the word probabilities of each sentiment class.


We have applied BCCWords to the CF dataset containing up to five weather sentiment annotations for tens of thousands of tweets annotated by thousands of workers. The model correctly found the correlation between positive, negative, neutral words in the related sentiment class. We can see this in the word clouds below. These show the most probable words in each class with word size proportional to the estimated probability of the word conditioned on the true label:


*The black boxes hide some swear words that were inferred by BCCWords within the feature set for the tweets with negative sentiment.

We can also identify the most discriminative words in each class by applying a standard normalisation of the estimated word probabilities (details in the paper):


These word clouds clearly show that the words such are “beautiful” and “perfect” are more discriminative for positive tweets, while the words “stayinghometweet” and “dammit” are more likely to occur in negative tweets. We also found that some words like “complain”, “snowstorm” and “warm” do not necessarily imply a particular positive or negative sentiment as their interpretation is highly context dependent and therefore most of the annotators classified the relative tweets as “unknown”. You can find out more about the classification accuracy of BCCWords and its ability to exploit the language model to predict labels for the entire set of tweets in the paper.


More trials of this technology is in progress. We are currently working with RescueGlobal (A UK-NGO specialised in professional disaster response) and the ORCHID project to use the developed model to analyse live streams of emergency tweets received during recent environmental disasters in the Philippines.


[1] See


[2] See