Abstract—Social media is a key channel for social networking and content sharing. For example, people often share their thoughts about movies, social causes and much more on Twitter, Facebook, Orkut, etc. Yet the data generated on these sites remains largely unexplored. In this paper, we show how social media data can be used to predict outcomes. In particular, we use Twitter as a platform to predict the box-office earnings of movies. Movies in India increasingly rely on digital platforms to create buzz before release. Sentiment Analysis can be used to analyze people’s reactions for various purposes by determining the contextual polarity of tweets in real time. We first cleanse the tweets to remove irrelevant and incomplete ones. We then propose two approaches for sarcasm detection after tweet cleansing. In the first approach, sarcasm detection is performed on the cleansed tweets, and sentiment analysis is then performed on the tweets that remain after sarcasm detection. In the second approach, sentiment analysis is performed on the literal tweets first; we then examine the classified tweets and mine patterns to separate sarcastic tweets from literal ones through hashtag analysis and the Bootstrap Algorithm. The performance of the movie is then calculated using a relevant matrix. The ultimate aim is to predict whether a movie will succeed at the box office. Gauging the mood of the public can give companies valuable insight into the dynamics of consumer preferences.
Keywords—Twitter; Sentiment Analysis; Sarcasm; Movie; Performance
I. INTRODUCTION
Social media has exploded as a category of online discourse where people create content, share it, bookmark it and network at a prodigious rate. Examples include Facebook, Twitter, Instagram and Snapchat. Because of its ease of use, speed and reach, social media is fast changing public discourse in society and setting trends and agendas in topics that range from the environment and politics to technology and the entertainment industry. Since social media can also be construed as a form of collective wisdom, we can investigate its power to predict real-world outcomes. Twitter is one of the world’s most popular platforms for self-expression. With more than 18 million registered users, India has the third highest number of accounts on Twitter. This gives marketers a tremendous opportunity to gauge public opinion and run precise, targeted campaigns. Thus, by collecting tweets about a movie from Twitter, we can predict the performance of that movie. The reviews can be based on various aspects of a movie, such as its songs, trailer, etc.
Sarcasm is very common in social media and difficult to analyze. It has an important effect on sentiment. In a sarcastic message, the literal wording is the opposite of what the person actually wants to convey, so sarcasm has the power to flip the polarity of the message.
II. RELATED WORK
A. Sentiment Analysis
Using social media to predict the future has become very popular in recent years. Sitaram Asur and Bernardo A. Huberman (2010)[1] set out to show that Twitter-based prediction of box-office revenue performs better than market-based prediction by analyzing various aspects of tweets sent during the movie release. Andrei Oghina, Mathias Breuss, Manos Tsagkias and Maarten de Rijke[2] use Twitter and YouTube data to predict the IMDb scores of movies. Sentiment analysis of Twitter data has also been a hot research topic in recent years. While sentiment analysis of documents has been studied for a long time, those techniques may not perform well on Twitter data because of the characteristics of tweets. The major difficulties in processing Twitter data are that tweets are short (up to 140 characters) and that their text is often ungrammatical. Some works investigate features for sentiment analysis of tweet data. However, few works directly use sentiment analysis results to predict the future.
Vasu Jain made the same kind of prediction by using the LingPipe sentiment analyzer to perform sentiment analysis on Twitter data[3]. That work also investigates related topics such as the relationship between the time tweets are sent and the number of tweets. Walaa Medhat, Ahmed Hassan and Hoda Korashy (2014) published a survey paper that gives a comprehensive overview of recent developments in the sentiment analysis field. Many recently proposed algorithm enhancements and various SA applications are investigated and presented briefly in this survey[7]. Most previous research on sentiment-based classification has been at least partially knowledge-based. Some of this work focuses on classifying the semantic orientation of individual words or phrases, using linguistic heuristics or a pre-selected set of seed words (Jain, V. (2013))[3]. Past work on sentiment-based categorization of entire documents has often involved the manual or semi-manual construction of discriminant-word lexicons (Taboada, M., Brooke, J., Tofiloski, M., Voll, K., & Stede, M. (2011))[8].
B. Sarcasm Detection
Because sarcastic tweets can affect the accuracy of our prediction, they should be removed. For this we can use simple hashtag analysis or other methods such as lexicon analytics, fact negation, or temporal knowledge extraction. Prior work on sarcasm detection on Twitter (González-Ibáñez, Muresan, and Wacholder 2011)[4] found that humans cannot easily judge the sarcasm of others; consequently, recent research exploits users’ self-declarations of sarcasm in the form of #sarcasm or #sarcastic tags on their own tweets. The same has been explained by Diana Maynard and Mark A. Greenwood (2014)[6]. Figure 1 gives one such example.
Figure 1: User self-reporting of sarcasm.
We follow the same technique here as well, identifying the tweets regarding a particular movie that mention #sarcasm, #sarcastic or #beingsarcastic.
The other method for sarcasm detection is to use the Bootstrap Algorithm, in which the system does not look for a specific hashtag.
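The hashtag-based self-report check can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function names `has_sarcasm_tag` and `split_by_hashtag` are hypothetical, and the only tags considered are the three named in the text.

```python
import re

# Self-declared sarcasm markers named in the text (compared in lower case).
SARCASM_TAGS = {"#sarcasm", "#sarcastic", "#beingsarcastic"}

HASHTAG_RE = re.compile(r"#\w+")


def has_sarcasm_tag(tweet: str) -> bool:
    """Return True if the tweet carries one of the self-declared sarcasm hashtags."""
    tags = {tag.lower() for tag in HASHTAG_RE.findall(tweet)}
    return not tags.isdisjoint(SARCASM_TAGS)


def split_by_hashtag(tweets):
    """Partition tweets into (literal, sarcastic) lists using hashtags only."""
    literal, sarcastic = [], []
    for tweet in tweets:
        (sarcastic if has_sarcasm_tag(tweet) else literal).append(tweet)
    return literal, sarcastic
```

A check of this kind only catches tweets whose authors label them, which is why a hashtag-independent method is considered as well.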
III. OUR MODEL
Since social media also represents a huge pool of collective wisdom, this knowledge can be used constructively to make predictions. In particular, we use Twitter to forecast box-office revenues for movies. Tweets are collected in real time from Twitter, stored, analyzed, and classified into sub-entities. The critical period, i.e., the period around the release, is used to assess the online hype generated. We use the sentiment analysis results of tweets sent during the movie release to predict the box-office success of the movie. Figure 2 shows the flow chart of our model.
Fig. 2: Flow Chart of Movie Performance Prediction Using Sentiment Analysis
A. Data Collection
The dataset is collected using the Twitter APIs. To connect to Twitter through the Streaming and REST APIs, we first request API keys. Using those keys, real-time tweets are fetched, and sentiment analysis is performed on them. A real-time dataset is required for predicting the success of a movie.
B. Data Cleansing
The tweets obtained from Twitter are real time, but they may contain noisy data and may also be irrelevant. The data cleansing process is therefore performed to remove incomplete, noisy and irrelevant data. The output of this preprocessing step is the set of literal tweets.
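A cleansing step of this kind might look like the following sketch. The concrete rules are assumptions made for illustration, since the text does not specify them: retweet prefixes, URLs and @mentions are treated as noise, tweets with fewer than `min_words` words are treated as incomplete, tweets that do not mention the movie keyword are treated as irrelevant, and exact duplicates are dropped. The movie name `MovieX` in the usage note is a placeholder.

```python
import re

RT_RE = re.compile(r"^RT\s+@\w+:?\s*")   # leading retweet marker
URL_RE = re.compile(r"https?://\S+")     # links
MENTION_RE = re.compile(r"@\w+")         # user mentions
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Strip retweet prefixes, URLs and mentions, then normalise whitespace."""
    text = RT_RE.sub("", text.strip())
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def cleanse(tweets, keyword, min_words=3):
    """Keep tweets that are relevant, complete enough and not duplicated."""
    seen = set()
    kept = []
    for raw in tweets:
        text = clean_tweet(raw)
        key = text.lower()
        if keyword.lower() not in key:
            continue  # irrelevant: does not mention the movie
        if len(text.split()) < min_words:
            continue  # incomplete: too short to carry an opinion
        if key in seen:
            continue  # noisy: repeated or retweeted content
        seen.add(key)
        kept.append(text)
    return kept
```

For instance, `cleanse(["RT @fan: MovieX trailer looks great http://t.co/x", "MovieX trailer looks great"], "MovieX")` keeps a single copy of the tweet text.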
C. Sentiment Analysis
Once the tweets are cleaned and pre-processed, we can perform sentiment analysis on the literal tweets. We can use a lexicon-based algorithm for the analysis, as it is an unsupervised method and thus does not require any training data. There are two approaches to lexicon-based sentiment analysis: the corpus-based approach and the dictionary-based approach. We use the dictionary-based approach because the corpus-based approach requires labeled training data, which is lacking for movie reviews. The literal tweets are classified into positive, negative and neutral tweets through the dictionary-based lexicon approach. This approach starts from a small set of opinion words with known orientations that is collected manually. This set is then grown by searching a well-known resource such as WordNet or a thesaurus for their synonyms and antonyms. The newly found words are added to the seed list, and the next iteration starts. The iterative process stops when no new words are found. After the process is completed, manual inspection can be carried out to remove or correct errors.
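The iterative growth of the seed list and the resulting classification can be illustrated with a small sketch. To keep it self-contained, a tiny hand-written `THESAURUS` dictionary stands in for WordNet; it, the seed words and the function names are all hypothetical. Synonyms inherit the orientation of the word they were found from, and antonyms receive the opposite orientation.

```python
import re

# Hypothetical toy thesaurus standing in for WordNet lookups.
THESAURUS = {
    "good": {"synonyms": {"great", "fine"}, "antonyms": {"bad"}},
    "great": {"synonyms": {"superb"}, "antonyms": {"terrible"}},
    "bad": {"synonyms": {"poor"}, "antonyms": {"good"}},
    "terrible": {"synonyms": {"awful"}, "antonyms": set()},
}


def expand_lexicon(seed_pos, seed_neg, thesaurus):
    """Grow the seed lists until an iteration adds no new word."""
    pos, neg = set(seed_pos), set(seed_neg)
    changed = True
    while changed:
        changed = False
        for word in list(pos | neg):
            entry = thesaurus.get(word)
            if entry is None:
                continue
            same, other = (pos, neg) if word in pos else (neg, pos)
            for syn in entry["synonyms"]:
                if syn not in pos and syn not in neg:
                    same.add(syn)   # synonym keeps the orientation
                    changed = True
            for ant in entry["antonyms"]:
                if ant not in pos and ant not in neg:
                    other.add(ant)  # antonym flips the orientation
                    changed = True
    return pos, neg


def classify(tweet, pos, neg):
    """Label a tweet by counting positive and negative lexicon words."""
    words = re.findall(r"[a-z']+", tweet.lower())
    score = sum(w in pos for w in words) - sum(w in neg for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


POS, NEG = expand_lexicon({"good"}, {"bad"}, THESAURUS)
```

This word-counting classifier ignores negation and intensity, so it should be read as the simplest possible instance of the dictionary-based approach.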
D. Sarcasm Detection
Sarcasm is “a sharp, bitter, or cutting expression or remark; a bitter gibe or taunt.” Sarcasm may employ ambivalence, although sarcasm is not necessarily ironic. “The distinctive quality of sarcasm is present in the spoken word and manifested chiefly by vocal inflections”. Sarcasm involves the expression of an insulting remark that requires the interpreter to understand the negative emotional connotation of the expresser within the context of the situation at hand.
Sarcasm is tough to detect in written form unless the context is known, which makes sarcasm detection a challenging task. In speech, the tone usually gives away the sarcasm; that is not the case in written text. Because of the complexities of social media, such as not knowing who the user’s actual audience is, understanding sarcasm on Twitter is even trickier. There is also a drawback to taking our data from Twitter: it is noisy. Some people use the #sarcasm hashtag to point out that their tweet was meant to be sarcastic, and a human would not have been able to guess that the tweet is sarcastic without the #sarcasm label.
Therefore, we use the Bootstrap algorithm, which identifies sarcastic tweets in the dataset without requiring #sarcasm, #sarcastic, etc. It uses three simple files to identify and remove the sarcastic tweets. Removing the sarcastic tweets can increase the accuracy of the prediction result.
There are two approaches to remove sarcastic tweets.
i. Before sentiment analysis
ii. After sentiment analysis
i. Sarcasm Detection before Sentiment Analysis
Sarcasm can be detected before sentiment analysis, that is, in the cleansing phase. Once the tweets are cleansed by removing irrelevant and noisy data, we perform sarcasm detection on the remaining dataset. Sarcasm might need context, but that context can depend on something that is not mentioned in the tweet.
For example, the tweet “Mysore had the country’s highest air pollution levels” can be recognized as sarcastic only if you know that Mysore was ranked as the second cleanest city in India and the cleanest in Karnataka by Urban Development. An algorithm would actually have to search the internet and understand news articles to get the full context.
Figure 3: Flow Chart for sarcasm detection before sentiment analysis
For sarcasm detection, a bootstrapping algorithm can be used that explicitly recognizes contexts containing a positive sentiment contrasted with a negative situation. This algorithm is similar to the lexical analysis approach. In the Bootstrap algorithm, three files are maintained: a positive sentiment file, a negative situation file, and a file containing commonly known facts. The key assumption that makes the algorithm work is that a tweet is sarcastic if it contains a positive sentiment together with a negative situation, or if it contradicts a known fact[10]. The positive sentiment file stores all the expected positive words, which can be taken from WordNet. Similarly, the negative situation file can also be constructed using WordNet. The fact file is constructed by analyzing the known facts about movies and movie-related entities.
For example, consider “I love waiting forever for the SRK movie release.” This tweet is recognized as sarcastic only because of the positive sentiment phrase “love” and the negative situation phrase “waiting forever”. The algorithm also detects whether #sarcasm or #sarcastic appears at the end of the sentence. When sarcasm detection is performed with the Bootstrap algorithm at this stage, the whole dataset has to be processed, so more time is required for pre-processing the tweets. We therefore prefer the second approach, i.e., sarcasm detection after sentiment analysis.
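The contrast rule and the fact check can be sketched as follows. This is a simplified illustration under stated assumptions: the three "files" are replaced by small hand-written in-memory collections (in the bootstrapped method of [10] the phrase lists are learned rather than written by hand), and the fact file is reduced to a hypothetical mapping from an entity to claims that contradict what is known about it.

```python
import re

# Stand-ins for the three files; the entries are illustrative only.
POSITIVE_SENTIMENTS = {"love", "enjoy", "adore", "excited"}
NEGATIVE_SITUATIONS = {"waiting forever", "standing in line", "being ignored"}
FACT_CONTRADICTIONS = {"mysore": {"highest air pollution"}}


def contradicts_known_fact(text: str) -> bool:
    """True if the text pairs a known entity with a claim that contradicts the fact file."""
    return any(
        entity in text and any(claim in text for claim in claims)
        for entity, claims in FACT_CONTRADICTIONS.items()
    )


def is_sarcastic(tweet: str) -> bool:
    """Flag a tweet with a positive sentiment plus a negative situation, or one that contradicts a known fact."""
    text = tweet.lower()
    words = set(re.findall(r"[a-z']+", text))
    has_positive = not words.isdisjoint(POSITIVE_SENTIMENTS)
    has_negative_situation = any(phrase in text for phrase in NEGATIVE_SITUATIONS)
    return (has_positive and has_negative_situation) or contradicts_known_fact(text)
```

With these toy lists, the two examples from the text are flagged, while a plain positive tweet such as "I love the songs in this movie" is not.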
ii. Sarcasm Detection after Sentiment Analysis
Sarcasm sentiment analysis is a rapidly growing area of NLP, with research ranging across word-, phrase- and sentence-level classification[9]. In this approach, once the tweets are preprocessed, we perform sentiment analysis on the resulting dataset. Sentiment analysis gives three sets of classified data, i.e., positive, negative and neutral tweets, and we then perform sarcasm detection on the positive and negative tweets obtained from sentiment analysis.
Figure 4: Flow Chart for sarcasm detection after sentiment analysis
To remove sarcastic tweets after sentiment analysis, we can use the same Bootstrap algorithm with the three files mentioned above, i.e., the positive sentiment file, the negative situation file and the file containing commonly known facts. In this case, however, the tweets have already been classified into positive and negative tweets by sentiment analysis, and these classified sets are the input to sarcasm detection.
By using the Bootstrap Algorithm, positive tweets are classified into actual positive (i.e., non-sarcastic) tweets and sarcastic tweets. Similarly, negative tweets are classified into actual negative (i.e., non-sarcastic) tweets and sarcastic tweets. Because the dataset is already classified, detection becomes easier: for a positive tweet we search the positive sentiment file and then look for a corresponding entry in the negative situation file, and vice versa for a negative tweet. If a positive tweet has no match in the positive file, we need not search the negative situation file, and the same holds the other way round for negative tweets and the negative file. This saves a lot of time and speeds up the analysis. The same can also be achieved by using hashtag analysis: after classifying the tweets into positive, negative and neutral, we can analyze them on the basis of the hashtags used, such as #sarcasm, #sarcastic or #beingsarcastic. The resulting literal tweets are then used for movie performance prediction.
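The skip-the-second-lookup idea can be made concrete with a short sketch. It assumes the tweets arrive already grouped by sentiment label, and it uses small hypothetical phrase tuples in place of the positive sentiment and negative situation files. Python's `and` evaluates its right-hand side only when the left-hand side is true, which gives exactly the early exit described above. Matching is by plain substring, which is crude but keeps the example short.

```python
# Illustrative stand-ins for the positive sentiment and negative situation files.
POSITIVE_PHRASES = ("love", "enjoy", "great")
NEGATIVE_SITUATIONS = ("waiting forever", "sold out", "cancelled")


def contains_any(text, phrases):
    """True if any phrase from the given file occurs in the text."""
    return any(phrase in text for phrase in phrases)


def split_sarcastic(tweets, first_file, second_file):
    """Split one classified set into (actual, sarcastic) tweets."""
    actual, sarcastic = [], []
    for tweet in tweets:
        text = tweet.lower()
        # The second file is searched only if the first file produced a match.
        if contains_any(text, first_file) and contains_any(text, second_file):
            sarcastic.append(tweet)
        else:
            actual.append(tweet)
    return actual, sarcastic


def refine(classified):
    """Refine sentiment classes into actual positive, actual negative and sarcastic tweets."""
    actual_pos, sarc_pos = split_sarcastic(
        classified.get("positive", []), POSITIVE_PHRASES, NEGATIVE_SITUATIONS
    )
    actual_neg, sarc_neg = split_sarcastic(
        classified.get("negative", []), NEGATIVE_SITUATIONS, POSITIVE_PHRASES
    )
    return {
        "actual_positive": actual_pos,
        "actual_negative": actual_neg,
        "sarcastic": sarc_pos + sarc_neg,
    }
```

Neutral tweets are deliberately left out of `refine`, since only the positive and negative sets are passed on to sarcasm detection.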
E. Result Prediction
After the tweets are classified into actual positive and actual negative, we take the ratio of actual positive tweets to actual negative tweets to predict the performance of the movie. A threshold value for this ratio is estimated so that the result can be predicted. We will make use of a relevant matrix for predicting the performance by deciding the threshold value. It must be noted that the ultimate performance of a movie can never be judged with 100 percent accuracy; outliers to the prediction are bound to be present.
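The ratio test can be written down as a small function. The threshold of 1.5 and the labels "hit" and "flop" are placeholders chosen for illustration; the text leaves the actual threshold to be estimated.

```python
def predict_performance(n_positive, n_negative, threshold=1.5):
    """Return (ratio, verdict) from counts of actual positive and actual negative tweets.

    The default threshold is an illustrative assumption, not an estimated value.
    """
    if n_positive < 0 or n_negative < 0:
        raise ValueError("tweet counts must be non-negative")
    if n_positive == 0 and n_negative == 0:
        raise ValueError("no classified tweets to base a prediction on")
    if n_negative == 0:
        ratio = float("inf")  # only positive tweets were observed
    else:
        ratio = n_positive / n_negative
    verdict = "hit" if ratio >= threshold else "flop"
    return ratio, verdict
```

For example, 300 actual positive and 100 actual negative tweets give a ratio of 3.0, which lies above the placeholder threshold.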
F. Result Visualization
Once the result is predicted, we can display our prediction through graphs. We can use matplotlib and pandas to develop the graphs. We consider visualization in Python easier than in R. Matplotlib is a Python plotting library with complete 2D support and limited 3D graphics support. It is useful for producing publication-quality figures in interactive environments across platforms, and it can also be used for animations. Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Python has long been great for data munging and preparation, but less so for data analysis and modeling. Pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language like R.
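As a rough sketch of how the two libraries fit together, the counts of actual positive and actual negative tweets can be held in a pandas DataFrame and drawn as a grouped bar chart with matplotlib. The movie names and counts below are made-up placeholders, not results, and `plot_counts` is a hypothetical helper.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Placeholder counts for two hypothetical movies.
counts = pd.DataFrame(
    {
        "movie": ["Movie A", "Movie B"],
        "positive": [120, 30],
        "negative": [40, 60],
    }
)
counts["ratio"] = counts["positive"] / counts["negative"]


def plot_counts(df):
    """Draw positive and negative tweet counts per movie as grouped bars."""
    fig, ax = plt.subplots(figsize=(6, 4))
    df.set_index("movie")[["positive", "negative"]].plot.bar(ax=ax, rot=0)
    ax.set_xlabel("Movie")
    ax.set_ylabel("Number of tweets")
    ax.set_title("Actual positive vs. actual negative tweets")
    fig.tight_layout()
    return fig


fig = plot_counts(counts)
```

Calling `fig.savefig("tweet_counts.png")` afterwards would write the chart to disk; the file name is arbitrary.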
IV. CONCLUSION
Twitter is one of the most widely used micro-blogging platforms. It is of great utility for harnessing a gigantic pool of data and gaining insight into the minds of consumers, and Sentiment Analysis is a powerful technique for doing so. Using it, one can predict the performance of movies and help improve box-office performance and ultimately profit. In result prediction, accuracy is one of the major factors to be taken into consideration, and in movie prediction sarcasm plays a major role in determining that accuracy. The accuracy of the prediction is expected to increase once sarcastic tweets are removed.
Sarcasm detection, however, is a difficult and challenging task. We presented a bootstrapped learning method that uses lists of positive sentiment phrases and negative activities and states, and described how these lists can be used to recognize sarcastic tweets. Our model identifies only two types of sarcasm that are common in tweets: a contrast between a positive sentiment and a negative situation, and sarcasm self-declared through hashtags. The algorithm used for sarcasm detection before sentiment analysis has to process a large dataset, which requires more time for pre-processing, because every tweet obtained has to be looked up in all three files mentioned above.
Sarcasm detection after sentiment analysis, in contrast, is implemented using conventional methods on smaller datasets that consist of either positive tweets or negative tweets. Here, some tweets have already been set aside by the classification, and a small dataset is easier to process than a large one. Because the tweets are classified, we first search only the file that corresponds to the class, i.e., the positive file for positive tweets and the negative file for negative tweets. If a match exists in that file, we search for the corresponding value in the other file to find the relation and determine the result more quickly. If there is no match in the first file, there is no need to search the other file, which saves time.
Comparing sarcasm detection before sentiment analysis with sarcasm detection after sentiment analysis, we expect the latter to be both faster and more accurate.
V. LIMITATIONS
• Limitations of the Twitter API: The Twitter API allows a maximum of 15 API requests per rate-limit window, and search is limited to 180 queries per 15-minute window. This considerably reduces the rate at which tweets can be collected and hence the operating speed of the algorithm.
• Noise, Promotion & Spam: Tweets, especially those in the run-up to the pre-release period, can be repetitive and contain irrelevant communication.
• The dictionary-based approach has a major disadvantage, which is its inability to find opinion words with domain- and context-specific orientations.
• In the context of movie reviews, it is not possible to have a specific context orientation, and thus a large corpus cannot be built. This is why the dictionary-based approach is used.
REFERENCES
[1] Sitaram Asur and Bernardo A. Huberman. (2010). Predicting the Future with Social Media. Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Volume 01, pp. 492-499.
[2] Andrei Oghina, Mathias Breuss, Manos Tsagkias and Maarten de Rijke. (2012). Predicting IMDB movie ratings using social media. Proceedings of the 34th European conference on Advances in Information Retrieval, pp. 503-507.
[3] Jain, V. (2013). Prediction of Movie Success using Sentiment Analysis of Tweets. The International Journal of Soft Computing and Software Engineering, 308-313.
[4] González-Ibáñez, R., Muresan, S., and Wacholder, N. (2011). Identifying sarcasm in Twitter: A closer look. In ACL.
[5] S. K. Bharti, K. S. Babu, S. K. Jena. Parsing-based sarcasm sentiment recognition in Twitter data. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), ACM, 2015, pp. 1373–1380.
[6] Diana Maynard and Mark Greenwood. Who cares about sarcastic tweets? Investigating the impact of sarcasm on sentiment analysis. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, May 2014. ELRA.
[7] Walaa Medhat, Ahmed Hassan, Hoda Korashy. (2014). Sentiment analysis algorithms and applications: A survey.
[8] Taboada, M., Brooke, J., Tofiloski, M., Voll, K., & Stede, M. (2011). Lexicon-Based Methods for Sentiment Analysis.
[9] E. Riloff, A. Qadir, P. Surve, L. De Silva, N. Gilbert, R. Huang. Sarcasm as contrast between a positive sentiment and negative situation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2013, pp. 704–714.
[10] Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert, Ruihong Huang. Sarcasm as Contrast between a Positive Sentiment and Negative Situation. EMNLP 2013.