Stock return predictors and social media sentiment

Project Definition

My thesis has two main objectives. First, I will use machine learning methods to identify the most important stock return predictors. I distinguish between two types of features. The first type consists of conventional predictors, such as, but not limited to, stock performance, valuation ratios and stock liquidity. A large number of these predictors have been analysed extensively in the mainstream financial literature. The second type reflects the sentiment or attitude of the general public and investors toward companies and the market, as measured by online (social) media activity on Google and Twitter. Given the abundance of variables and the potential interactions between them, the models will be judged on their predictive accuracy. That is, the predicted stock returns in month t will be compared to the realised stock returns in that same period. Second, this research will quantify the relative importance of the two types of predictors described above. In other words, what is the value of social media sentiment in predicting stock returns compared to conventional predictors, and how do they perform together?

Motivation

The behaviour of asset prices influences many important decisions for households. For example, the composition of retirement savings, whether to buy a house, and mortgage decisions all depend on assessing the risks and returns associated with these investments. Another example is the mispricing of assets, which may contribute to and amplify financial crises, as illustrated by the financial crisis of 2008. This, in turn, can have a long-lasting impact on economic growth. Given the fundamental role of asset prices in many decisions, Lars Peter Hansen, Eugene Fama and Robert J. Shiller received the Nobel Prize in Economics in 2013 for their work aimed at understanding how asset prices are determined.

An important aspect of understanding asset prices boils down to linking differences in the returns of different assets to differences in their exposure to common factors. More than fifty years of research in financial economics has revealed that modern asset pricing theory struggles to explain the cross-section of asset prices (i.e. the difference in asset prices at one point in time) (Fama & French, 2006). For example, empirical work has discovered more than a hundred violations of the Capital Asset Pricing Model (CAPM). Like any economic theory, the CAPM and the theories that evolved from it are based on a set of simplifying assumptions about investors' preferences, beliefs and constraints. However, it has been argued that economic theories should be judged by their ability to describe the world, as opposed to their assumptions.

The breakdown of the CAPM has encouraged the development of numerous asset pricing models in finance, such as the Fama-French three-factor (FF3) model. Despite the progress that has been made in understanding asset prices beyond the CAPM, new asset pricing models are introduced in the scientific literature at a rapid rate, with hundreds of predictors being published by researchers (Harvey, Liu, & Zhu, 2016). Given the (over)abundance of different factor models, conventional statistical techniques have proven both ineffective and inefficient in identifying the relevance of each respective factor in those models. Machine learning offers the possibility of incorporating a large number of features in one model, whereas conventional statistical models would lose reliability in such cases (Gu, Kelly, & Xiu, 2018).

Although many different predictors are used in asset pricing models, they are all strongly related to the direct performance of an asset (over time) and of the company in question. Cryptocurrencies, a new form of currency, have attracted the interest of the general public and investors alike since the inception of Bitcoin in 2009. Since then, the potential of the underlying technology has resulted in hundreds of new cryptocurrencies. The current market can be characterised as highly volatile, and discussion of these assets is prevalent on social media platforms such as Twitter and Reddit. Consequently, a niche field of research has dedicated itself to the prediction of cryptocurrency returns using the public's sentiment as the main factor. Although sentiment indicators are not a new phenomenon in asset pricing, those derived from social media are yet to be integrated. Given the positive results these indicators have had in the prediction of cryptocurrency returns, the question of how social media sentiment can be used for asset pricing logically follows.

Background

The relevant literature for this thesis can be subdivided into three categories. The first is the literature on the different types of conventional asset pricing models, which are primarily grounded in financial econometrics and some of which have been introduced above. The second is the expanding academic research on machine learning for asset pricing. The third is the research on the use of natural language processing (NLP) techniques for the prediction of asset prices (of different kinds).

There have been numerous research efforts to predict financial markets with deep learning. Wang et al. (2019) used a 1D convolutional neural network (CNN) to predict 6 futures from both the Chicago and New York Mercantile Exchange. They mention that the main advantage of deep learning methods is their ability to extract features automatically, which removes the hurdle of gathering many technical indicators. Additionally, the authors propose a weighted F-score measure for evaluation, as it is comparable to the average annual return and Sharpe ratio, which are common metrics in the academic finance literature. Neural networks have further been used by Arora et al. (2019) and Radityo et al. (2018). Although the authors report impressive results with low errors and high accuracies, there is no mention of what data was used or what the exact model architectures were. Therefore, such results must be taken with a grain of salt. Classification has also been used for stock prediction. Rather than predicting returns, Dixon et al. (2017) predicted price movement (i.e. up or down). An accuracy of 42% was reached, which can be considered low, but the model is still able to generate positive returns if its strategy is followed. Primarily, these papers show the potential of machine learning methods for predicting the returns of different financial assets. In these studies, traditional statistical methods have consistently been outperformed by machine learning models.

In the academic literature, social media sentiment analysis has been used to predict cryptocurrency prices, albeit less so for conventional and more stable assets, such as company stocks. Specifically, Kjellstadli et al. (2019) used Twitter data to gauge the hourly sentiment on a cryptocurrency with a relatively small market capitalization. Lamon et al. (2016) labelled daily news and social media posts with actual price changes, rather than sentiment. This way, the model is able to predict price changes directly from media sources, without having to do sentiment analysis. Rebane & Karlsson (2018) highlight the importance of including social media information as a predictor in asset pricing, as it was shown to increase the performance of machine learning models in the short term.

In conclusion, we see that there is little research on the combination of more traditional predictors of the difference in asset returns with the relatively novel sentiment analysis approach. Machine learning offers the potential of identifying which features are effective at explaining the cross-sectional difference in asset returns.

Data Description

My focus will be on the U.S. stock market, since it is the largest stock market and data is readily available. I will obtain daily stock returns from the Center for Research in Security Prices (CRSP) for all firms listed on the New York Stock Exchange (NYSE), American Stock Exchange (AMEX) and NASDAQ. Although data is available from 1963 onward, with over 25,000 samples, only data from the years for which social media data is available will be used (from approximately 2008 onward). A full set of conventional return predictors at the individual stock level will be acquired from Compustat and CRSP.
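
Since the returns are obtained at daily frequency while predictions are evaluated per month, the daily returns would need to be aggregated at some point. The following is a minimal sketch of one way to do this with pandas, assuming simple (not log) returns that compound within a calendar month. The column names `stock_id`, `date` and `ret` are hypothetical placeholders and not the actual CRSP field names.

```python
import pandas as pd


def daily_to_monthly(daily: pd.DataFrame) -> pd.DataFrame:
    """Compound daily simple returns into one return per stock and calendar month.

    Assumes hypothetical columns: 'stock_id', 'date' and 'ret' (simple daily return).
    """
    month = pd.to_datetime(daily["date"]).dt.to_period("M").rename("month")
    monthly = (
        (1.0 + daily["ret"])                  # gross daily returns
        .groupby([daily["stock_id"], month])  # one group per stock and month
        .prod()                               # compound within the month
        .sub(1.0)                             # back to a simple return
        .rename("ret_monthly")
        .reset_index()
    )
    return monthly


# Toy data, purely for illustration
example = pd.DataFrame(
    {
        "stock_id": ["A", "A", "A", "B"],
        "date": ["2010-01-04", "2010-01-05", "2010-02-01", "2010-01-04"],
        "ret": [0.10, 0.10, -0.05, 0.02],
    }
)
monthly = daily_to_monthly(example)
print(monthly)
```

Whether to aggregate this way, or to use monthly returns from the data provider directly, is a design choice that still has to be made.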

Google Trends is publicly available and can be used to see how often a specific company or stock has been searched for on the world's most popular search engine. Unofficial (pseudo) APIs exist that make it relatively easy to retrieve these insights with Python.

Getting access to historical tweets directly from Twitter is, unfortunately, not possible with the free and public API. As an alternative, I can make use of datasets of tweets regarding companies listed on the aforementioned stock exchanges made by other people. Even though options are out there, up until now, only datasets with a time frame of at most 4 years have been found. Therefore, I still need to look into all the other options I might have (e.g. making use of Twitter’s paid API, which gives access to all historical tweets).

Algorithms and Software

The code for this project will be developed in Python, since it is most accessible for machine learning and deep learning. As different machine learning methods will be used (ranging from random forests to recurrent neural networks), several libraries come specifically to mind, such as scikit-learn for the simpler models and Keras for the deep learning models.

Evaluation Method

Different asset pricing models, ranging from a select number of the traditional statistical models introduced above to different machine learning models, will be compared to each other to determine which is best at predicting asset returns. Additionally, model performance will be compared to the results of Gu et al. (2018), as they use a similar approach of pitting different models against each other, and to the results of other research efforts that are found in the process of conducting this research.
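
As an illustration of how predicted and realised returns could be compared, the sketch below computes an out-of-sample R² in which the benchmark is a forecast of zero rather than the historical mean, a convention that appears in the return prediction literature. The choice of this particular metric is an assumption on my part and not yet a settled part of the evaluation design; the function name and the toy numbers are purely illustrative.

```python
import numpy as np


def r2_out_of_sample(realised, predicted) -> float:
    """Out-of-sample R^2 against a zero-return benchmark.

    1 - sum((r - r_hat)^2) / sum(r^2); the denominator is NOT demeaned,
    so a forecast of zero for every stock scores exactly 0.
    """
    realised = np.asarray(realised, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    sse = np.sum((realised - predicted) ** 2)  # squared forecast errors
    sst_zero = np.sum(realised ** 2)           # squared errors of a zero forecast
    return 1.0 - sse / sst_zero


# Toy held-out returns, purely for illustration
realised = np.array([0.02, -0.01, 0.03])
print(r2_out_of_sample(realised, realised))                 # perfect forecast
print(r2_out_of_sample(realised, np.zeros_like(realised)))  # zero forecast
```

A negative value means the model does worse than simply predicting a return of zero for every stock, which makes the metric easy to interpret across models.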

A large number of features will be used, and I am interested in identifying which of those are most important in predicting asset returns. In other words, which factors contribute most to the model? For this, permutation importance can be used.
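
A minimal sketch of permutation importance with scikit-learn is given below. It runs on synthetic data only: the feature names are hypothetical placeholders, the target is constructed so that one feature matters most, and the random forest merely stands in for whichever model is eventually chosen. The idea is that shuffling an important feature in the held-out set should hurt the model's score more than shuffling an unimportant one.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 600

# Hypothetical feature names; the data itself is random noise
feature_names = ["momentum", "book_to_market", "tweet_sentiment"]
X = rng.normal(size=(n, 3))

# Synthetic target: strong effect of column 0, weaker effect of column 2,
# no effect of column 1
y = 1.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=n)

# Ordered split: first 400 rows for training, the remainder held out
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature in the held-out set and record the drop in R^2
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

order = np.argsort(result.importances_mean)[::-1]
ranking = [feature_names[i] for i in order]
for i in order:
    print(f"{feature_names[i]:>16}: {result.importances_mean[i]:.3f}")
```

To compare the two types of predictors as a whole, the same idea could be extended by permuting all features of one type together, although that is not something `permutation_importance` does out of the box.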

Milestone and Plan

TBD

TO DO

– More in-depth research on asset pricing models, such as CAPM, q-factor model, FF3 factor model, and FF5 factor model; what is concretely ‘wrong’ with them?

– In general, a more coherent literature review is needed.

– Look further into the evaluation methods and determine which are of use in the field of finance and in accordance with the modern machine learning literature.

– Gather the final data that will be used for the research, most importantly the Twitter data.

– Determine which traditional, machine learning and deep learning models will be compared.

– Set a milestone plan.

References

Arora, N., & M, P. (2019). Financial Analysis: Stock Market Prediction Using Deep Learning Algorithms. SSRN Electronic Journal, 2191–2197. https://doi.org/10.2139/ssrn.3358252

Chiong, R., Adam, M. T. P., Fan, Z., Lutz, B., Hu, Z., & Neumann, D. (2018). A sentiment analysis-based machine learning approach for financial market prediction via news disclosures. GECCO 2018 Companion – Proceedings of the 2018 Genetic and Evolutionary Computation Conference Companion, 278–279. https://doi.org/10.1145/3205651.3205682

Dixon, M., Klabjan, D., & Bang, J. H. (2017). Classification-based financial markets prediction using deep neural networks. Algorithmic Finance, 6(3–4), 67–77. https://doi.org/10.3233/AF-170176

Fama, E. F., & French, K. R. (2006). Profitability, investment and average returns. Journal of Financial Economics, 82(3), 491–518. https://doi.org/10.1016/j.jfineco.2005.09.009

Gu, S., Kelly, B. T., & Xiu, D. (2018). Empirical Asset Pricing via Machine Learning. SSRN Electronic Journal, 1–79. https://doi.org/10.2139/ssrn.3281018

Harvey, C. R., Liu, Y., & Zhu, H. (2016). ⋯ and the Cross-Section of Expected Returns. Review of Financial Studies, 29(1), 5–68. https://doi.org/10.1093/rfs/hhv059

Kjellstadli, J. T., Bering, E., Hendrick, M., Pradhan, S., & Hansen, A. (2019). Sentiment-Based Prediction of Alternative Cryptocurrency Price Fluctuations Using Gradient Boosting Tree Model. Frontiers in Physics, 7(July), 1–8. https://doi.org/10.3389/fphy.2019.00098

Lamon, C., Nielsen, E., & Redondo, E. (2016). Cryptocurrency Price Prediction Using News and Social Media Sentiment. Pdfs.Semanticscholar.Org.

Radityo, A., Munajat, Q., & Budi, I. (2018). Prediction of Bitcoin exchange rate to American dollar using artificial neural network methods. 2017 International Conference on Advanced Computer Science and Information Systems, ICACSIS 2017, 2018-Janua, 433–437. https://doi.org/10.1109/ICACSIS.2017.8355070

Rebane, J., & Karlsson, I. (2018). Seq2Seq RNNs and ARIMA models for Cryptocurrency Prediction: A Comparative Study. (August), 2–6.

Wang, J., Sun, T., Liu, B., Cao, Y., & Wang, D. (2019). Financial Markets Prediction with Deep Learning. Proceedings – 17th IEEE International Conference on Machine Learning and Applications, ICMLA 2018, 97–104. https://doi.org/10.1109/ICMLA.2018.00022
