STTM: an efficient approach to estimating news impact on stock movement direction
 Academic Editor
 Ana Maguitman
 Subject Areas
 Algorithms and Analysis of Algorithms, Data Mining and Machine Learning, Text Mining, Sentiment Analysis
 Keywords
 Stock movement, Topic modeling, Time series, Stock markets, Sharpe ratio, Granger causality test
 Copyright
 © 2022 Riabykh et al.
 Licence
 This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
 Cite this article
 Riabykh et al. 2022. STTM: an efficient approach to estimating news impact on stock movement direction. PeerJ Computer Science 8:e1156 https://doi.org/10.7717/peerjcs.1156
Abstract
Open text data, such as financial news, are thought to be able to affect or to describe stock market behavior; however, there are no widely accepted algorithms for extracting the relationship between stock quote time series and the fast-growing textual representation of economic information. The field remains challenging and understudied. In particular, topic modeling, a powerful tool for interpretable dimensionality reduction, has hardly ever been used for such tasks. We present a topic modeling framework for assessing the relationship between a financial news stream and stock prices in order to maximize the trader's gain. To do so, we use a dataset of the economic news sections of three Russian national media sources (Kommersant, Vedomosti, and RIA Novosti) containing 197,678 economic articles. They are used to predict 39 time series of the most liquid Russian stocks collected over eight years, from 2013 to 2021. Our approach shows the ability to detect significant return-predictive signals and outperforms 26 existing models in terms of the Sharpe ratio and the annual return of a simple long strategy. In particular, it shows a significant Granger-causal relationship for more than 70% of portfolio stocks. Furthermore, the approach produces highly interpretable results, requires no domain-specific dictionaries, and, unlike most existing industrial solutions, can be calibrated for individual time series. This makes it directly usable for trading strategies and analytical tasks. Finally, since topic modeling shows its efficiency for most European languages, our approach is expected to be transferable to European stock markets as well.
Introduction
The efficient market hypothesis (EMH) (Fama, 1970; Fama, 1991) argues that all publicly available information is immediately and fully reflected in stock market prices. Consequently, neither historical data nor the forecasts based on them are seen as usable for the development of efficient investment strategies. However, many approaches to stock market movement prediction that were developed since EMH was proposed (Fabozzi & Fabozzi, 2020) have demonstrated certain levels of efficiency. At the same time, as the task remains challenging due to the high volatility of stock quotes, new approaches are still needed. Overall, two main groups of approaches, technical and fundamental, are usually singled out by researchers, both nowadays employing machine learning methods (Dixon, Halperin & Bilokon, 2020). In technical analysis, the analyst uses past trends in share prices to predict their performance in the future, without inferring the causes of the observed trends. Fundamental analysis is based on the assumption that the market price of an asset tends to its intrinsic value, but always deviates from it, with the asset thus being either overvalued or undervalued. By inferring the intrinsic value to which the market is expected to correct, this approach aims to predict stock price behavior. For this, various external data are often used, including information disclosed by companies, such as revenues, earnings, or profit margin, and independent analytics.
One of the promising types of external information is unstructured textual data, notably financial news. Coupled with automated machine learning techniques, it allows investors to solve predictive and descriptive tasks, saving time and labor costs for finding important information in large amounts of text. Such data has been found able to generate interpretable and significant information signals that help investors minimize investment risks.
Shallow feature-based methods of text processing play a special role in predicting the direction of different types of financial movement, such as stock or commodity prices, from unstructured text data, such as news or user-generated content. Most often, these methods do not require markup (unlike approaches based on sentiment analysis) and do not need their parsing algorithms updated (unlike event extraction methods). The general procedure of building such algorithms begins with preprocessing of the source texts, then passes to constructing vector representations, or embeddings, of these texts (e.g., TF-IDF, BoW, Doc2Vec, DL-based embeddings), and finally incorporates these embeddings into machine learning (ML) techniques to predict stock trends. The main disadvantage of such approaches is the low interpretability of vector text representations as predictors. Meanwhile, topics generated by probabilistic topic models are easily interpreted by humans based on the lists of most probable words, but are mostly missing from the relevant literature reviews (Usmani & Shamsi, 2021; Jurczenko, 2020; Shah, Dev & Zulkernine, 2019). Other dimensionality reduction methods that do find their way into the financial movement prediction domain are mostly based on hard clustering approaches, e.g., K-Means (Babu, Geethanjali & Satyanarayana, 2011). This is suboptimal for the classification of texts, which usually belong to more than one topical cluster. Additionally, such clusters are difficult to interpret, as they are delivered unlabelled. Since topic modeling co-clusters both texts and words by topics, top words can be used as natural cluster labels, while simple clustering yields nothing except lists of items grouped into unlabelled clusters. Although K-Means-based approaches can in principle be adapted to fuzzy logic and to the logic of simultaneous co-clustering of items and their features, we are unaware of such applications in the sphere of stock market prediction.
In this article, we propose a new method for predicting stock price movement direction based on topic modeling. Our algorithm is highly interpretable, requires no fixed markup or preexisting sentiment dictionaries, and at the same time remains an end-to-end solution within the paradigm of machine learning techniques for stock prediction using numerical and textual data. Our approach achieves high predictive power in the weekly price trend prediction task, where stocks of the largest Russian companies are considered as time series (spanning eight years between 2013 and 2021), and economic news from the three largest Russian-language news agencies is used as textual data. We use the Granger causality test to evaluate the statistical significance of the obtained predictions. In addition, we consider a simple trading strategy and evaluate the success of a portfolio calibrated on the obtained predictions through the Sharpe ratio and annual return. In doing so, we consider portfolios derived from the predictions of various ML models (Random Forest, Logistic Regression, Gradient Boosting Machine, Support Vector Machine, 3-layer Neural Network) using different embeddings (average Word2Vec, Navec, Doc2Vec, FastText) of the news title, of its entire text, and of its first paragraph. We also evaluate the quality of strategies built by the mentioned ML models on endogenous data (five lags of the time series). We compare our approach to the SESTM model (Ke, Kelly & Xiu, 2020), which has shown promising results for the US stock market and English-language news and which, according to its authors, outperforms RavenPack algorithms (the industry-leading commercial vendor of financial news sentiment) in terms of Sharpe ratio scores. We show that our approach yields the best results more often than the others included in the comparison. As topic modeling performs universally well across all European languages, our approach is expected to be applicable to all European stock markets as well.
The rest of the article is structured as follows. The 'Related Work' section reviews approaches based on dictionary sentiment analysis, methods based on combinations of embeddings and ML models, and topic models that are conceptually close to our framework. The 'Methodology' section introduces the proposed method. The 'Datasets and preprocessing' section describes the data used in the current study. The 'Metrics' section describes the return and risk metrics for portfolios obtained using the various approaches discussed in this article. The 'Experiments' section describes the forecasting procedure and the construction of various schemes for stock trend modeling. The 'Numerical results' section contains the results of our experiments. The 'Discussion' section interprets the obtained results. The 'Conclusion' summarizes our findings and discusses possibilities for further framework improvements. Appendix A is devoted to a qualitative analysis of the topic modeling results: it first compares the results of different topic models with each other; second, it shows which topics are most frequently covered in the main federal media and in trading terminal news; finally, it illustrates the temporal saturation of the market with new information. Appendix B contains supplementary materials, such as illustrations of the data and models, the cumulative divergence of topic profiles, coherence scores, and tables with the Granger causality test values.
Related Work
Much research exploring the relationship between textual information and financial time series relies on sentiment dictionaries, such as the Harvard IV-4 dictionary and the Loughran–McDonald Financial Dictionary (Loughran & McDonald, 2016). For instance, Li et al. (2014) use both of the mentioned dictionaries to create a sentiment-based model for stock market prediction tasks. Kim, Jeong & Ghani (2014) assign a sentiment score to a textual data stream using a dictionary and rules, after which the authors assess the significance of correlations between this news stream and stock market fluctuations. Li et al. (2018) extract sentiment information using Loughran–McDonald, Harvard IV-4, and SenticNet 3.0 in their research. Picasso et al. (2019) use the Loughran–McDonald dictionary and AffectiveSpace 2 (Cambria et al., 2015) to evaluate sentiment information from financial news for the twenty most capitalized companies listed in the NASDAQ 100 index. However, the dictionary approach is hard to customize to specific data and prediction tasks. Existing dictionaries still require extension to the financial domain. Moreover, sentiment dictionaries remain underdeveloped for less-resourced languages, including Russian, and are the subject of recent research (Panchenko, 2014; Koltsova et al., 2020).
Other related works exploit various machine learning approaches (Rundo et al., 2019; Thakkar & Chaudhari, 2021) combined with different encoding procedures used to assign vector representations to documents; these procedures include TF-IDF features (Bing et al., 2017), word embeddings (Mahmoudi, Docherty & Moscato, 2018), and deep learning methods (Matsubara, Akita & Uehara, 2018; Xu et al., 2020), among others. These vector representations, sometimes combined with other financial numerical features (Gu, Kelly & Xiu, 2020; Li, Wu & Wang, 2020), are used as input for classification or regression models, depending on the time series problem statement (Henrique, Sobreiro & Kimura, 2019). For example, Khedr, Salama & Yaseen Hegazy (2017) propose a model that predicts the rise and fall of shares of companies traded on NASDAQ based on economic news. They combine stemming, n-grams, TF-IDF, and numerical features with Naïve Bayes and KNN algorithms. Manela & Moreira (2017) use n-gram features with an SVR model to estimate the relationship between the front-page text of The Wall Street Journal and the VIX volatility index. Weng et al. (2018) extract public information from Google and Wikipedia with a Random Forest model (while also testing NN, SVR, and boosted regression trees) to predict the 1-day-ahead prices of 19 additional stocks from different industries. Such approaches are difficult for a potential investor to interpret: it is often hard to understand why the vector representation model learned a certain word embedding and what effect it had on the final result, as well as to explain why the ML model chose a specific combination of non-transparent features as significant.
Latent Dirichlet Allocation (LDA) is one of the most popular topic modeling techniques, using a Bayesian approach to generate topics (Blei, Ng & Jordan, 2003). Topics derived from topic modeling can be good predictors of financial time series. For instance, Chester Curme & Preis (2017) show that forecasts of trading volume can be improved by accounting for news topical diversity, which they measure as the Shannon entropy of a topic distribution yielded by a topic modeling algorithm run over daily corpora of Financial Times news. There are also natural extensions of the LDA model to the temporal structure of texts: DTM (Blei & Lafferty, 2006) and DIM (Gerrish & Blei, 2010). These models allow tracing the temporal evolution of topics and their lexical composition and reveal the most influential documents. Other papers integrate text and time-series data into a single probabilistic model expanding DTM or LDA (Park, Lee & Moon, 2015; Kanungsukkasem & Leelanupab, 2019). In these papers, researchers carry out a qualitative analysis of topics associated with time series and evaluate the predictive power of the respective model. Kim et al. (2013) develop the Iterative Topic Modeling with Time Series Feedback (ITMTF) model, which infers topics iteratively while optimizing the strength and direction of their correlation with time-series data. The latter means that topics are gradually redefined so as to include only the words that affect the predicted time series in the same way (either negatively or positively). This approach yields more causal topics than the baseline LDA in terms of Granger causality and purer topics in terms of the coherence of the effect's direction. However, the ITMTF model uses time series data from a very limited pool of only three US companies (Apple and two airlines), and only for six months. Another important approach conceptually close to ours is SESTM, Sentiment Extraction via Screening and Topic Modeling (Ke, Kelly & Xiu, 2020).
Its authors use a supervised topic model with two topics, one assigned the words that have a positive impact on asset returns and the other the words having a negative impact, to calculate word-level predictive scores (termed sentiment scores) that are later transformed into text-level predictive scores. These latter scores are used to optimize investment portfolio construction, whose quality is assessed in terms of the Sharpe ratio and annual return metrics.
Methodology
We propose the Stock Tonal Topic Modeling (STTM) approach by introducing an index that reflects the association between the topics occurring in a news stream and stock price movement. This index, hereafter termed the STTM index, is positive if the overall association of all the topics in the news stream of a given time period with the stock movement is positive (i.e., it predicts stock growth), and negative in the opposite case. Further, we use the STTM index to optimize investment portfolio construction and show its efficiency.
The proposed procedure for computing the STTM index has several stages. First, we perform topic modeling of the news flow and calculate the salience of each topic at each time point of a preselected period, thus receiving a distribution of saliences for each topic over time. This distribution is further referred to as the topic stream. Similarly, we obtain the word stream for each word as a distribution of word frequencies in our news flow over time. Second, we compute the tone of each word as the value of an association measure (in our case, the Pearson correlation) between the word stream and the target time series (in our case, stock prices). Words found to be positively associated with the target time series are considered to have a positive tone, and vice versa. Third, topic-level tone is computed based on the tones of high-probability words from each topic, according to a procedure described further below. Finally, topic-level tones are aggregated over all topics into a distribution termed the tonal topic stream, which in turn is aggregated over time into a single value, the STTM index. This index thus reflects the strength and direction of the aggregated impact of all topics on the stock price movement.
More formally, let us denote the textual data stream as a collection D = (d_{1}, t_{d1}), …, (d_{m}, t_{dm}), where d_{i} is a document and t_{di} is the date and time of the document's release. Let us also denote the financial time series as p_{t} = (p_{1}, t_{1}), …, (p_{N}, t_{N}), where p_{i} is a value and t_{i} the corresponding time stamp. Our key problem is predicting ℙ(r_{t} ≥ 0), where ${r}_{t}=\frac{{p}_{t}-{p}_{t-1}}{{p}_{t-1}}$, with the STTM index based on the textual data flow between t − 1 and t as the feature. In our notation, we normalize the raw STTM index to the range [0;1], so that STTM → 1 and STTM → 0 mean that the textual information pulls the time series p_{t} up and down, respectively. We give a detailed description of the entire procedure, whose graphical representation can be found in Fig. 1.
Data preprocessing
First of all, we preprocess the input text data. Each text is subject to tokenization and lemmatization, as well as removal of stopwords and punctuation symbols. Next, we calculate the idf parameter (inverse document frequency, part of the TF-IDF feature) for each unique word w in D: (1)$\mathrm{idf}\left(w,D\right)=\log\frac{\left|D\right|}{\left|\left\{{d}_{i}\in D\mid w\in {d}_{i}\right\}\right|},$ where |D| is the number of documents in the collection and |{d_{i} ∈ D ∣ w ∈ d_{i}}| is the number of documents from the collection D in which the word w occurs. After that, we remove the words in the upper and lower quantiles of the α-level from the text data.^{1} Thus, we do not consider the most rare or the most frequent words. Then, we transform the resulting text data into a bag-of-words representation.
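The filtering step of Eq. (1) can be sketched as follows. This is a minimal stdlib-only illustration; the function name, the symmetric quantile cut, and the default α are our own illustrative choices, not the paper's exact implementation:

```python
import math
from collections import Counter

def idf_filter(docs, alpha=0.05):
    """Keep only words whose idf lies between the alpha and (1 - alpha)
    quantiles of all idf values; return per-document bags-of-words."""
    n_docs = len(docs)
    # document frequency: in how many documents each word occurs
    df = Counter()
    for d in docs:
        df.update(set(d))
    idf = {w: math.log(n_docs / df_w) for w, df_w in df.items()}
    # quantile bounds over the observed idf values
    values = sorted(idf.values())
    lo = values[int(alpha * (len(values) - 1))]
    hi = values[int((1 - alpha) * (len(values) - 1))]
    vocab = {w for w, v in idf.items() if lo <= v <= hi}
    # drop out-of-vocabulary tokens and build bag-of-words vectors
    bows = [Counter(w for w in d if w in vocab) for d in docs]
    return bows, vocab
```

Very frequent words (idf near zero) and very rare words (large idf) fall outside the quantile band and are discarded before the bag-of-words conversion.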
Topic modeling
We feed the preprocessed text data as input to the topic model for generating probabilistic topics: T_{1}, …, T_{n}. The topic model can be LDA, DTM, DIM, ITMTF, or any other technique.^{2} It can be pretrained in advance or online-trained on the textual data stream D. As a result of the topic modeling procedure, each document d is represented by an n-dimensional vector of topic probabilities: θ^{(d)} = (θ_{d,1}, …, θ_{d,n}). We numerically estimate the salience of each topic T_{j} in each time unit t_{i} of the textual data stream D as follows: (2.1)${\Theta}_{i}^{j}=\sum _{\forall d\in {t}_{i}}{\theta}_{d,j}.$ The topic stream TS_{j} of each topic T_{j} is thus defined as the set of all salience scores of this topic in a given time period: (2.2)$T{S}_{j}={\Theta}_{0}^{j},\dots ,{\Theta}_{N}^{j}.$ Consequently, we associate each topic T_{j} with the time series of its stream TS_{j}.
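Equations (2.1)–(2.2) amount to summing the document-topic matrix within each time unit. A minimal numpy sketch (function and argument names are ours):

```python
import numpy as np

def topic_streams(theta, doc_times, n_times):
    """Aggregate document-topic probabilities into per-topic time series.

    theta: (n_docs, n_topics) matrix of topic probabilities per document.
    doc_times: time-unit index (0 .. n_times - 1) of each document.
    Returns a (n_topics, n_times) matrix TS where TS[j, i] is the salience
    of topic j in time unit i, as in Eqs. (2.1)-(2.2).
    """
    theta = np.asarray(theta, dtype=float)
    doc_times = np.asarray(doc_times)
    ts = np.zeros((theta.shape[1], n_times))
    for i in range(n_times):
        # sum theta over all documents released in time unit i
        ts[:, i] = theta[doc_times == i].sum(axis=0)
    return ts
```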
Topic tonality based on wordlevel tone
By analogy with the topic stream, to define the word stream, for each word w we first calculate its frequency as the sum of its occurrences over all documents d that have appeared in our stream at a given time point t_{i}: (3.1)${c}_{i}^{w}=\sum _{\forall d\in {t}_{i}}c\left(w,d\right).$ The word stream WS_{w} is thus defined as the set of word frequencies in a given time period: (3.2)$W{S}_{w}={c}_{0}^{w},\dots ,{c}_{N}^{w}.$ Consequently, we associate each word w with the time series of its stream WS_{w}. In general, c(w, d) can be any additive function of the number of occurrences of the word w in the document d. The tone of the word w is determined as a function of the target time series and the word stream: (4.1)${\Omega}_{w}={f}_{w}\left({p}_{t},W{S}_{w}\right).$ The function f_{w} can be any regression evaluation metric or any time series proximity metric. We use the Pearson correlation coefficient: (4.2)${f}_{w}={r}_{{p}_{t},W{S}_{w}},$ if its significance level is less than γ,^{3} and (4.3)${f}_{w}=0,$ otherwise. Since each topic T_{j} is, in each time unit t_{i}, a probability distribution over a dictionary V: ${T}_{j,i}=\left({w}_{1},{\varphi}_{{w}_{1}}^{j,i}\right),\dots ,\left({w}_{\left|V\right|},{\varphi}_{{w}_{\left|V\right|}}^{j,i}\right)$, we define the topic tone as a function of the word probabilities in the topic ${\varphi}_{W}^{j,i}$ and the corresponding word tones Ω_{W} at each time point t_{i}: (5.1)${\psi}_{j,i}={f}_{{T}_{j,i}}\left({\varphi}_{W}^{j,i},{\Omega}_{W}\right).$ The overall tonality of topic T_{j} is defined as the set of its tones in a given time period: (5.2)${\Psi}_{{T}_{j}}={\psi}_{j,0},\dots ,{\psi}_{j,N}.$ We implement the function f_{T_{j,i}} as follows:

The number of the most probable words in the topic T_{j} at time point t_{i} (i.e., the words for which the tone will be calculated) is selected so that the sum of their probabilities does not exceed the specified threshold C_{j}:^{4} (5.3)$\sum _{w\ \text{sorted by probability}}{\varphi}_{w}^{j,i}\le {C}_{j},$

The variables pProb and nProb are calculated. The calculations involve only the words selected in the previous step: (5.4)$\text{pProb}=\frac{\sum _{{\Omega}_{w}\ge 0}{\Omega}_{w}{\varphi}_{w}^{j,i}}{\sum \left|{\Omega}_{w}{\varphi}_{w}^{j,i}\right|},\quad \text{nProb}=\frac{\sum _{{\Omega}_{w}<0}\left|{\Omega}_{w}{\varphi}_{w}^{j,i}\right|}{\sum \left|{\Omega}_{w}{\varphi}_{w}^{j,i}\right|}.$ If the significance level of ${r}_{{p}_{t},W{S}_{w}}$ is higher than γ for all the selected words, then we define: (5.5)$\text{pProb}=0,\quad \text{nProb}=0.$ The final topic tonality is expressed as the difference between pProb and nProb: (5.6)${f}_{{T}_{j,i}}=\text{pProb}-\text{nProb}.$
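The word-tone definition (Eqs. 4.1–4.3) and the two-step topic-tone computation (Eqs. 5.3–5.6) can be sketched as below. We use `scipy.stats.pearsonr` for the correlation and its p-value; the default γ = 0.05 and C = 0.8 are hypothetical, as the paper tunes these per time series:

```python
import numpy as np
from scipy.stats import pearsonr

def word_tone(price_series, word_stream, gamma=0.05):
    """Eqs. (4.1)-(4.3): Pearson r between the word stream and the target
    series, zeroed when the correlation is not significant at level gamma."""
    r, p_value = pearsonr(price_series, word_stream)
    return r if p_value < gamma else 0.0

def topic_tone(probs, tones, c_threshold=0.8):
    """Eqs. (5.3)-(5.6) for one topic at one time point.

    probs: word probabilities phi in the topic; tones: word tones Omega.
    Selects top-probability words while their cumulative mass stays within
    c_threshold, then returns pProb - nProb.
    """
    order = np.argsort(probs)[::-1]        # words sorted by probability
    cum, chosen = 0.0, []
    for idx in order:
        if cum + probs[idx] > c_threshold:  # Eq. (5.3): respect the mass cap
            break
        chosen.append(idx)
        cum += probs[idx]
    weights = np.array([abs(tones[i]) * probs[i] for i in chosen])
    total = weights.sum()
    if total == 0.0:                        # Eq. (5.5): no significant words
        return 0.0
    signs = np.array([np.sign(tones[i]) for i in chosen])
    p_prob = weights[signs >= 0].sum() / total
    n_prob = weights[signs < 0].sum() / total
    return p_prob - n_prob                  # Eq. (5.6)
```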
Tonal topic stream
At the next step, we calculate the tonal topic stream (TTS) as the product of topic tonalities and topic streams at each time point: (6)$TTS=\left[\begin{array}{cccc}{\Theta}_{0}^{0}\cdot {\psi}_{0,0} & {\Theta}_{1}^{0}\cdot {\psi}_{0,1} & \dots & {\Theta}_{N}^{0}\cdot {\psi}_{0,N}\\ \vdots & & \ddots & \vdots \\ {\Theta}_{0}^{n}\cdot {\psi}_{n,0} & {\Theta}_{1}^{n}\cdot {\psi}_{n,1} & \dots & {\Theta}_{N}^{n}\cdot {\psi}_{n,N}\end{array}\right]$ Thus, TTS is a matrix whose rows correspond to topics and whose columns correspond to time points. Each element of this matrix, TTS_{j,i}, is the value of the tonal topic stream for the topic T_{j} at time point t_{i}.
STTM index
Finally, the STTM index, the overall tonality of all topics over a given period of time, is defined as a time aggregate of the TTS matrix: (7)$\text{STTM}={\text{aggregate}}_{t}\left(\text{TTS}\right).$ The aggregation function can be a simple or weighted mean, a median, or a sum. We use the simple sum by default. The tonality of each specific news item d is, analogously, the aggregate over the products of the topic probabilities of the news item θ^{(d)} and its topic tonalities ψ^{(d)} at time point t_{d}. As noted above, for comparability purposes we normalize the STTM index to the range [0, 1] using a sigmoid regressor.
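Equations (6) and (7) reduce to an elementwise product followed by an aggregation. A minimal sketch, where a plain sigmoid stands in for the calibrated sigmoid regressor mentioned above:

```python
import numpy as np

def sttm_index(topic_streams, topic_tones):
    """Eq. (6): elementwise product of topic streams Theta and tonalities psi,
    then Eq. (7): aggregate over time with the default simple sum."""
    tts = np.asarray(topic_streams) * np.asarray(topic_tones)
    return tts.sum()

def normalize(raw_index):
    """Squash the raw index into [0, 1]; an uncalibrated sigmoid is used here
    purely for illustration."""
    return 1.0 / (1.0 + np.exp(-raw_index))
```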
Figure 2 shows a general scheme for predicting directions of financial market movements using the STTM approach.
Datasets and Preprocessing
In this section, we describe in detail our datasets. We collect two types of data: financial time series data and textual data stream. We consider stocks included in the MOEX Russia Index as the timeseries data and Russianlanguage news from the largest and most influential economic media as a textual data stream. In addition, we describe the required raw data preprocessing.
MOEX Russia Index
The MOEX (Moscow Exchange) Russia Index (see Fig. 3) is a capitalization-weighted composite index serving as the primary ruble-denominated benchmark of the Russian stock market. It is calculated as the sum of the prices of the 39 most liquid Russian stocks, weighted by expert assessments of their impact on the Russian economy. These stocks are preselected by experts from among the largest and most dynamically developing Russian issuers with economic activities in the leading sectors of the Russian economy. The MOEX Russia Index is used as one of the baseline investment portfolios in this research. The 2013–2021 time series of the shares included in the MOEX Russia Index listing (available at: https://www.moex.com/en/index/IMOEX) constitute the time series dataset analyzed in this article.
There are 39 main time series with the following tickers:
SBER, SBERP, GAZP, LKOH, YNDX, GMKN, NVTK, SNGS, SNGSP,
TATN, TATNP, ROSN, POLY, MGNT, MTSS, FIVE, MOEX, IRAO,
PLZL, NLMK, ALRS, CHMF, VTBR, RTKM, PHOR, TRNFP, RUAL,
AFKS, MAGN, DSKY, PIKK, HYDR, FEES, QIWI, AFLT, CBOM,
LSRG, RSTI, UPRO.
Each time series of the mentioned shares is converted to weekly returns r_{t}:
${r}_{t}\equiv \frac{{p}_{t}-{p}_{t-1}}{{p}_{t-1}}$,
where p_{t} is the closing price of the share's time series and t is the weekly timestamp.
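The conversion to weekly returns can be sketched with pandas; taking the last close of each calendar week is our illustrative choice of weekly aggregation:

```python
import pandas as pd

def weekly_returns(prices):
    """Convert a date-indexed price series into weekly returns
    r_t = (p_t - p_{t-1}) / p_{t-1}, using the last close of each week."""
    weekly_close = prices.resample('W').last()
    return weekly_close.pct_change().dropna()
```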
Economic news
Our text dataset of daily news includes three Russian national media sources: Kommersant, RIA Novosti, and Vedomosti. Kommersant (The Businessman, available at https://www.kommersant.ru) is a nationally distributed daily newspaper devoted to politics and business. RIA Novosti (Russian Information Agency, available at https://ria.ru) is one of the principal state-owned news agencies, publishing news and opinions on social, political, economic, scientific, and financial subjects. Vedomosti (The News, available at https://www.vedomosti.ru) is a national daily newspaper specializing in business. In each media outlet, we consider texts from the economy section only. This section nevertheless contains a significant number of editorials, analytical reviews, and expert opinions, which affect the estimated textual data flow. Table 1 shows the amount of collected data and the corresponding date intervals. Figure 4 shows the main statistics for the Kommersant newspaper. The top panel of the figure plots the histogram of the number of articles (ordinary news and analytical reviews) vs. the number of characters. The bottom panel shows the distribution of the weekly number of articles over the eight considered years. The drops in the number of news items correspond to public holidays and weekends. The graphs for the other daily sources can be found in Figs. B1 and B2.
News preprocessing pipeline
We use a common natural language preprocessing pipeline. We begin with tokenization,^{5} breaking each text into sentences and then into word tokens. Next, we normalize all tokens in each article to lower case, remove punctuation, non-alphabetic, and non-Cyrillic symbols, and perform lemmatization with Yandex MyStem, an instrument developed specifically for the Russian language and based on extensive morphological analysis.^{6} A lemmatizer was preferred to stemmers because it avoids aggressive suffix-stripping, an approach that would not be applicable to highly inflected languages, such as those of the Slavic family, in which suffixes are heavily used for word formation and can entirely change word meaning. In terms of recognizing lemmas from word forms, Yandex MyStem has an error rate of only about 2–3% and consistently outperforms other existing models on different Russian-language corpora (Kotelnikov, Razova & Fishcheva, 2018). Next, we remove stop words, such as prepositions, participles, interjections, and numbers,^{7} as well as the tokens in the upper and lower α-level quantiles of the idf parameter. We use the standard 95% quantile as the default α-level. Finally, we convert each news item into a vector of word counts.
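The pipeline above can be sketched with the standard library only. Yandex MyStem is an external tool, so lemmatization is stubbed here as an identity function, and the stopword list is a tiny illustrative subset:

```python
import re
from collections import Counter

STOPWORDS = {'и', 'в', 'на', 'по', 'не'}  # tiny illustrative subset

def preprocess(text, lemmatize=lambda token: token):
    """Lowercase, keep Cyrillic-only tokens, drop stopwords, lemmatize,
    and return a bag-of-words Counter. The real pipeline calls Yandex
    MyStem for lemmatization; here it is an identity stub."""
    tokens = re.findall(r'[а-яё]+', text.lower())
    lemmas = [lemmatize(t) for t in tokens if t not in STOPWORDS]
    return Counter(lemmas)
```

The regex keeps only Cyrillic letter runs, so punctuation, digits, and non-Cyrillic symbols are removed in the same pass as tokenization.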
News agency name | Number of articles | Dates
Kommersant | 26,132 | 01.01.2013–31.12.2021
RIA Novosti | 168,285 | 01.01.2013–31.12.2021
Vedomosti | 3,261 | 01.02.2015–31.12.2021
Metrics
We consider two approaches to assessing the obtained results. First, we explore the relationship between stock market movement and the news flow impact estimated by the proposed tonal topic modeling procedure using the Granger causality test (Deveikyte et al., 2020). Second, we introduce a simple trading strategy of calibrating investment portfolios based on the predictions of the model and evaluate the performance of this strategy with the Sharpe ratio and the annual return of the portfolio (Ke, Kelly & Xiu, 2020). Economic metrics for evaluating the success of a calibrated portfolio appear to be more suitable for the stock trend prediction problem than standard classification metrics, such as accuracy, the Receiver Operating Characteristic (ROC) curve, and the area under the curve (AUC). The reason is that standard classification metrics computed for a single time series may be low, whereas the economic metrics aggregate over many time series, so a good financial result can be achieved even when per-series classification quality is modest.
Granger causality test
The Granger causality test^{8} is a statistical hypothesis test for determining whether one time series is useful in forecasting another. Granger causality requires time series stationarity. Let y_{t} and x_{t} be two time series. To see if x_{t} 'Granger-causes' y_{t} with a maximum time lag of q, the following regression is performed: (8)${y}_{t}={a}_{0}+{a}_{1}{y}_{t-1}+\dots+{a}_{q}{y}_{t-q}+{b}_{1}{x}_{t-1}+\dots+{b}_{q}{x}_{t-q}.$ Then, F-tests are used to evaluate the significance of the lagged x terms. The coefficients of the lagged x terms estimate the impact of x on y. The null hypothesis that x does not Granger-cause y is accepted if and only if no lagged values of x are retained in the regression. Since in our case the time series showed non-stationary behavior, as determined by the Dickey–Fuller unit root test,^{9} first-order and, where necessary, second-order differencing was performed to achieve stationarity.
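The F-test behind Eq. (8) compares a restricted autoregression of y on its own lags with a full regression that also includes lags of x. A self-contained numpy sketch (in practice one would likely use `statsmodels.tsa.stattools.grangercausalitytests` instead):

```python
import numpy as np

def granger_f_test(y, x, q=1):
    """F statistic for Eq. (8): does adding q lags of x to an AR(q) model of y
    significantly reduce the residual sum of squares? Compare the result
    against an F(q, n - 2q - 1) distribution. Both series are assumed to be
    already stationary, e.g., after differencing."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(y) - q                     # usable observations
    target = y[q:]
    # lag matrices: column k holds the series shifted by k+1 steps
    y_lags = np.column_stack([y[q - k:len(y) - k] for k in range(1, q + 1)])
    x_lags = np.column_stack([x[q - k:len(x) - k] for k in range(1, q + 1)])
    ones = np.ones((n, 1))

    def rss(design):
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        return resid @ resid

    rss_restricted = rss(np.hstack([ones, y_lags]))    # y lags only
    rss_full = rss(np.hstack([ones, y_lags, x_lags]))  # plus x lags
    return ((rss_restricted - rss_full) / q) / (rss_full / (n - 2 * q - 1))
```

A large F value means the lagged x terms explain variance in y beyond y's own history, i.e., evidence that x Granger-causes y.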
Sharpe ratio and annual return
The Sharpe ratio^{10} measures the performance of an investment, such as a share or a portfolio of shares, compared to a risk-free asset, after adjusting for its risk. It is defined as the difference between the returns of the investment and the risk-free return, divided by the standard deviation of the investment returns. It represents the additional amount of return that an investor receives per unit of increase in risk. More formally, (9)$\text{Sharpe ratio}=\frac{{R}_{p}-{R}_{f}}{{\sigma}_{p}},$ where R_{p} is the return of the portfolio, R_{f} is the risk-free rate, and σ_{p} is the standard deviation of the portfolio's excess return. As the risk-free asset for the Russian market, we use government bond yields (available at: https://www.cbr.ru/eng/hd_base/zcyc_params/). A Sharpe ratio of less than one is usually considered unacceptable or bad: it means that the risk the portfolio encounters is not being offset well enough by its return.
Annual return, or Compound Annual Growth Rate (CAGR),^{11} is the average annual growth rate calculated from the values observed at the start and end of a given time span, assuming that all dividends are reinvested at the end of each year. More formally, (10)$\text{Annual return}={\left(1+\frac{EV-SV}{SV}\right)}^{\frac{1}{n}}-1,$ where EV is the ending value, SV is the starting value, and n is the number of years.
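Both metrics, Eqs. (9) and (10), are one-liners; a minimal sketch with sample (ddof = 1) standard deviation as an illustrative choice:

```python
import numpy as np

def sharpe_ratio(portfolio_returns, risk_free_returns):
    """Eq. (9): mean excess return over the risk-free rate divided by the
    standard deviation of the excess returns (sample std, ddof=1)."""
    excess = np.asarray(portfolio_returns) - np.asarray(risk_free_returns)
    return excess.mean() / excess.std(ddof=1)

def annual_return(start_value, end_value, n_years):
    """Eq. (10): compound annual growth rate over n_years."""
    return (1.0 + (end_value - start_value) / start_value) ** (1.0 / n_years) - 1.0
```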
Experiments
In this section, we describe our experimental procedures. The 'STTM' subsection describes how to build and evaluate the quality of models based on the proposed framework. The 'SESTM' subsection describes how to build the topic model that, according to its authors, outperforms RavenPack, the industry-leading commercial vendor of financial news sentiment. The 'Shallow feature-based methods of text processing' subsection describes a scheme for building models based on news embeddings. The 'Endogenous models' subsection describes a way to build simple endogenous models on time series lags. The 'Evaluation procedure' subsection describes the scheme for splitting the initial data into training and test samples to evaluate the quality metrics.
STTM
As mentioned before, our Stock Tonal Topic Modeling (STTM) approach can have any basic topic model as its core. In this article, we implement two models, LDA^{12} and DTM,^{13} with the Python wrapper from the gensim package: https://radimrehurek.com/gensim/. Qualitative analysis of these models is presented in the section ‘Qualitative Analysis of Russian Economic News Topic Modeling’ in Appendix A. DTM extends LDA by allowing word probabilities in a topic to change over time; we update them in increments of one month. Further, we select the number of topics so as to avoid solutions with either a large number of too granular topics or a small number of too broad topics. To do so, we optimize topic coherence with Röder’s C_{v} metric (Röder, Both & Hinneburg, 2015). Although the dynamics of coherence change with the number of topics differ somewhat between news sources (see Figs. B4, B5 and B6), the overall optimum appears at n = 20, where the C_{v} curve reaches its maximum before flattening out for all national media sources. This optimum is the same for both LDA and DTM. We run a topic model on the training dataset and apply it to the test dataset. The datasets are constructed according to the procedure described in the ‘Evaluation procedure’ section below. Topic tonality based on word-level tone is calculated from the solution obtained on the training set. The topic stream is calculated from the test set, and the topic tonality based on word-level tone is applied to it. STTM hyperparameters are selected by optimizing the ROC-AUC metric via grid search on the training set for each time series independently. Examples of the STTM index, the stock time series, news, topics, words and their tonalities and tones, as well as an example of a tonal topic stream, are presented in Figs. B9, B10, B11 and B12. It can be seen that the results of the proposed procedure are highly interpretable.
Since topic modeling possesses a certain level of instability leading to fluctuations in the word probabilities, we repeat all calculations for trading strategy performance at least ten times. After that we estimate the mean and variance of each of the considered economic metrics.
SESTM
We implemented the Sentiment Extraction via Screening and Topic Modeling (SESTM) procedure (Ke, Kelly & Xiu, 2020) as a baseline topic model. We apply it to each time series from the MOEX Russia Index for each of the three national news sources. The SESTM approach infers only two topics, containing words that have either a negative or a positive effect on the target time series, with the composition of these topics optimized iteratively based on an association metric. The process consists of three steps: (1) isolating a list of sentiment terms via predictive screening; (2) assigning sentiment weights to these words via topic modeling; (3) aggregating terms into an article-level sentiment score via penalized likelihood. The main assumption of the model is that the news is generated from the following mixture multinomial distribution: (11)${d}_{i,\left[S\right]}\sim \text{Multinomial}\left({s}_{i},{p}_{i}{O}_{+}+\left(1-{p}_{i}\right){O}_{-}\right),$ where s_{i} is the total count of sentiment-charged words in article d_{i}, p_{i} is the sentiment score, and O_{+} and O_{−} are the positive and negative topics, respectively, which are probability distributions over words. The objective of the SESTM procedure is to learn the model parameters O_{+}, O_{−} and p_{i}. The SESTM algorithm has three hyperparameters for screening sentiment-charged words and one regularization hyperparameter for the learning-optimization problem. All hyperparameters are tuned through a temporal cross-validation procedure with the l^{1}-norm of the differences between estimated article sentiment scores and the corresponding standardized return ranks as a loss function. Figures B13 and B14 in Appendix B contain examples of the most common words with cumulative positive (negative) tones, and the most cumulatively positive (negative) tonal words for VTB (VTBR) share prices. It can be seen that the contents of these topics are mixed and broadly uninterpretable.
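Step 3 can be illustrated with a small grid-search sketch: given learned topics O_+ and O_−, the article score p_i maximizes the penalized multinomial log-likelihood implied by Eq. (11). This is our simplified illustration, not Ke, Kelly & Xiu's exact estimator; the names, the grid search, and the penalty weight are assumptions:

```python
import numpy as np

def article_sentiment(word_counts, o_pos, o_neg, lam=0.1, grid_size=101):
    """Score p_i in (0, 1) maximizing sum_j d_ij * log(p*O+_j + (1-p)*O-_j)
    plus a penalty lam * log(p * (1 - p)) that shrinks scores toward 1/2."""
    counts = np.asarray(word_counts, dtype=float)
    o_pos, o_neg = np.asarray(o_pos), np.asarray(o_neg)
    best_p, best_ll = 0.5, -np.inf
    for p in np.linspace(1e-3, 1 - 1e-3, grid_size):
        mix = p * o_pos + (1 - p) * o_neg  # Eq. (11) word distribution
        ll = np.sum(counts * np.log(mix)) + lam * np.log(p * (1 - p))
        if ll > best_ll:
            best_p, best_ll = p, ll
    return best_p
```

An article dominated by words that are frequent in O_+ receives a score close to one, and vice versa.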
Shallow feature based methods of text processing
To compare our approach to models with shallow feature based methods of text preprocessing, we apply the pipeline shown in Fig. 5. For each economic news item from the considered news agencies (Kommersant, Vedomosti, RIA Novosti), various textual components are extracted (full text, first paragraphs, titles). After that, each component is preprocessed as described in section ‘Datasets and preprocessing’. Then, various techniques for obtaining embeddings are applied to each news item: Word2Vec, Navec, Doc2Vec, and FastText^{14} (Navec implementation from the natasha Python package: https://github.com/natasha/navec; FastText implementation as a Python package from https://fasttext.cc/); all embedding models are trained on the first two years of texts for each news outlet separately. The resulting news embeddings (where all the news for the same week are treated as one document) are fed as input features to the following machine learning models: Random Forest (RF), Logistic Regression (LR), Gradient Boosting Machine (GBM), Support Vector Machine (SVM), and a three-layer Neural Network (NN).^{15} The target variable for all models is the sign of returns for each ticker included in the MOEX Russia Index, which equals 1 for a growing time series and 0 otherwise.
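The aggregation step, averaging word vectors over all news of a given week, can be sketched as follows (illustrative names; any of the embedding models above can supply the `embeddings` lookup):

```python
import numpy as np

def weekly_features(weekly_docs, embeddings, dim):
    """One feature vector per week: the mean embedding of all tokens
    published that week (all news of a week is a single document).
    Weeks with no known tokens get a zero vector."""
    rows = []
    for tokens in weekly_docs:
        vecs = [embeddings[t] for t in tokens if t in embeddings]
        rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    return np.vstack(rows)
```

The resulting matrix is then fed to RF, LR, GBM, SVM or the neural network, with the binary direction-of-return target described above.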
Endogenous models
In addition, we compare the proposed framework with simple endogenous models: random forest (RF), logistic regression (LR), gradient boosting machine (GBM), support vector machine (SVM), and a three-layer neural network (NN), where weekly return lags are used as features and the target variable is the same as in the shallow feature based models. Figure 6 shows the pipeline we used for constructing such endogenous models.
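The lag-feature construction for these endogenous models can be sketched as follows (the function name and the number of lags are our assumptions):

```python
import numpy as np

def lag_features(weekly_returns, n_lags=4):
    """Build (X, y): each row of X holds the n_lags most recent weekly
    returns; y is 1 if the following week's return is positive, else 0."""
    r = np.asarray(weekly_returns, dtype=float)
    X = np.array([r[i:i + n_lags] for i in range(len(r) - n_lags)])
    y = (r[n_lags:] > 0).astype(int)
    return X, y
```

The same classifiers as above (RF, LR, GBM, SVM, NN) are then fit on (X, y).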
Evaluation procedure
Since train and test datasets for time series data have to cover uninterrupted periods of time, we use an iterative expanding cross-validation scheme (visualised in Fig. 7) in which we expand the time window of the training set by one year at each iteration, starting from two years and ending with six years, while the test window is kept at a fixed length of one year across all iterations. The topic model is retrained on the training set at each iteration and then applied to the test set. We estimate our models on all economic news between the stock market opening time each Monday and its closing time each Friday. The target variable is the movement direction sign(r_{t}) between the closing price on Friday and the opening price on the preceding Monday for each share in the MOEX Russia Index.
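The split scheme can be sketched as follows (the start year reflects our dataset; window lengths are as described above, and names are illustrative):

```python
def expanding_splits(first_year=2013, min_train=2, max_train=6):
    """Yield (train_years, test_year) pairs: the training window grows
    by one year per iteration, while the test window is always one year."""
    for k in range(min_train, max_train + 1):
        yield list(range(first_year, first_year + k)), first_year + k
```

This produces five train/test splits, from (2013–2014, test 2015) up to (2013–2018, test 2019).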
For the proposed STTM approach, we compute a Granger causality test between the weekly value of the STTM index and the weekly stock price of each ticker included in the MOEX Russia Index for each test interval separately. For each ticker and for different topic models (LDA and DTM), including STTM, we evaluate the weekly trading strategy performance. We use a straightforward long strategy: each Friday, at the closure of the stock market, we select the top 20 percent of stocks with the highest value of the model prediction for the current week. These stocks are the most likely to demonstrate price gains in the upcoming week. Next, we buy these 20% of stocks at the Friday prices. This procedure is repeated each Friday, thus recomposing the portfolio weekly. We then evaluate this strategy with the annual return and Sharpe ratio metrics introduced before. In doing so, we account for neither broker commissions nor transaction costs (see Limitations section), because our goal is to evaluate the predictive power of our method as compared with other methods, rather than to calculate the exact amount of the final return it allows to gain.
Numerical Results
Granger causality tests
In this section, we provide numerical results for the Granger causality test between the STTM index and Friday’s prices for each of the 39 tickers included in the MOEX Russia Index; this is done for the two topic models employed (LDA and DTM) and for the three news sources. Calculation details can be found in Tables C1–C6. Figure 8 shows the proportion of the assets in our sample for which the STTM index has significant Granger predictive power in each of the studied years.
Weekly trading strategy performance
In total, we compare the performance of 28 different portfolios: five endogenous model based portfolios, 20 portfolios using shallow feature based methods of text processing, two portfolios based on the proposed STTM approach, and one based on SESTM. Each of these portfolios is constructed for the three news sources independently. Given that we have 39 tickers in our analysis, in total we obtain 2,886 different models, each validated on six train/test splits. We also compare all these models to two baselines: the MOEX Russia Index, a broad stock market index introduced above, and an Equal Weight Index (EWI) based on MOEX’s tickers, representing a buy-and-hold strategy. While the former exemplifies a capitalization-weighted index, the latter gives equal weight to all stocks, including small-cap stocks that are generally considered higher-risk, higher-potential-return investments compared to large-caps. In theory, giving greater weight to the smaller names of the MOEX Russia Index in an equal-weight portfolio should increase the return potential of the portfolio, so EWI may be expected to perform better than the MOEX Russia Index.
Table 2 contains information about performance of the baselines. Table 3 demonstrates the results for topic modeling based approaches, Table 4 contains the results for shallow feature based methods of text processing and Table 5 presents the results for the endogenous models.
Figure 9 shows how weekly returns accumulate depending on the model used for forming an investment strategy and compares them to the MOEX Russia Index (IMOEX) and the Equal Weight Index (EWI) as baseline strategies. Each strategy uses a one-week-ahead approach and sorts portfolios by the score obtained from the chosen model. Specifically, Figs. 9A–9C plot portfolio return accumulation graphs for strategies based on one of eight models (the STTM (LDA), STTM (DTM) and SESTM approaches, plus the five best-returning strategies built on shallow feature based methods of text processing). Each of facets A–C presents strategies using data from only one information source: Kommersant, RIA Novosti and Vedomosti, respectively. Figure 9D contains the best models, those with a Sharpe ratio greater than one. We discuss it in detail below.
Table 2: Performance of the economic baselines.

Economic baseline                      Sharpe ratio   Annual return
Equal Weight Index (MOEX)              0.7311         0.1965
MOEX Russia Index                      0.4377         0.1851
Equal Weight Index (MOEX) from 2017    0.2626         0.1038
MOEX Russia Index from 2017            0.2391         0.1032
Discussion
In this section we interpret the obtained results.
Granger causality tests
Figure 8 shows that, as time passes and the volume of the training data increases, the proportion of tickers for which our STTM approach turns out to be Granger-causal tends to increase as well. An exception is a sharp fall in predictive power for the majority of models in 2018. This fall is most probably explained by a number of international macroeconomic events in the second half of 2018 (including the USA–China trade wars, the US Federal Funds Rate hike, and the collapse of a large number of global financial indices). These events were poorly covered in the Russian media, which focused on the domestic agenda, such as the controversial raising of the retirement age. From Fig. 8 it can be seen that, according to the Granger causality test, the proposed STTM index can be significant for as many as 70% of the stock quotes listed in the MOEX Russia Index if the data is sufficient to calibrate the model.
Table 3: Weekly trading strategy performance of the topic modeling based portfolios, by news source.

Source        Model        Sharpe ratio       Annual return
Kommersant    STTM (LDA)   1.3706 ± 0.0907    0.3612 ± 0.0212
Kommersant    STTM (DTM)   1.0798 ± 0.0653    0.2845 ± 0.0201
Kommersant    SESTM        0.4830             0.1598
Vedomosti     STTM (LDA)   1.1059 ± 0.0665    0.2840 ± 0.01853
Vedomosti     STTM (DTM)   0.7941 ± 0.0452    0.2095 ± 0.0319
Vedomosti     SESTM        0.3370             0.1296
RIA Novosti   STTM (LDA)   1.0140 ± 0.0572    0.2755 ± 0.0157
RIA Novosti   STTM (DTM)   0.9691 ± 0.0570    0.2675 ± 0.0183
RIA Novosti   SESTM        0.6433             0.2050
Table 4: Weekly trading strategy performance of the shallow feature based text models.

Kommersant  

Scheme  Text  Title  First paragraph  
Sharpe ratio  Annual return  Sharpe ratio  Annual return  Sharpe ratio  Annual return  
GBM + Doc2Vec  0.3221  0.1301  0.9141  0.2614  0.3199  0.1294 
GBM + FastText  0.7707  0.2173  0.5149  0.1675  0.3854  0.1419 
GBM + Navec  0.471  0.1612  0.6731  0.1994  0.4041  0.1457 
GBM + Word2Vec  0.465  0.1616  0.5807  0.1806  0.1637  0.0964 
LR + Doc2Vec  0.516  0.1669  0.6622  0.1991  1.0056  0.2746 
LR + FastText  0.7131  0.2059  0.9137  0.2529  0.8933  0.2507 
LR + Navec  0.4284  0.1514  0.4553  0.1564  0.5462  0.1785 
LR + Word2Vec  0.5052  0.1644  0.8237  0.2329  0.586  0.1867 
NN + Doc2Vec  0.5267  0.1731  0.2439  0.1128  0.8192  0.236 
NN + FastText  0.7926  0.2278  0.401  0.1449  0.6687  0.2052 
NN + Navec  0.551  0.1767  0.5506  0.1719  0.44  0.156 
NN + Word2Vec  0.6792  0.1994  0.3565  0.1356  0.401  0.1449 
RF + Doc2Vec  0.0502  0.0747  0.6528  0.2056  0.6221  0.1894 
RF + FastText  0.6838  0.2022  0.4175  0.1493  0.499  0.1641 
RF + Navec  0.7199  0.2137  0.1372  0.0919  0.3583  0.1355 
RF + Word2Vec  0.3072  0.1262  0.4723  0.1578  0.203  0.1047 
SVM + Doc2Vec  0.5142  0.1742  0.6416  0.1998  0.6  0.1895 
SVM + FastText  0.1371  0.0911  0.551  0.1815  0.2911  0.1246 
SVM + Navec  0.5364  0.175  0.618  0.1933  0.3577  0.1385 
Vedomosti  

Scheme  Text  Title  First paragraph  
Sharpe ratio  Annual return  Sharpe ratio  Annual return  Sharpe ratio  Annual return  
GBM + Doc2Vec  0.1629  0.0825  −0.0552  0.0375  0.7428  0.2093 
GBM + FastText  −0.0522  0.0397  0.2586  0.1025  0.2487  0.1004 
GBM + Navec  0.7793  0.2143  0.216  0.0941  0.7344  0.2096 
GBM + Word2Vec  0.0747  0.064  −0.3653  −0.0164  0.3274  0.1161 
LR + Doc2Vec  0.3098  0.1149  0.564  0.1672  0.1519  0.0806 
LR + FastText  0.1202  0.0709  0.1435  0.0794  0.2406  0.1001 
LR + Navec  0.5791  0.1789  0.3249  0.1198  0.3092  0.1145 
LR + Word2Vec  0.2298  0.0974  0.3287  0.1168  0.5277  0.166 
NN + Doc2Vec  0.2259  0.0954  0.8721  0.2384  0.2487  0.1018 
NN + FastText  0.2525  0.1024  0.3038  0.1133  0.1523  0.0798 
NN + Navec  0.7775  0.2111  0.5409  0.1661  0.3324  0.1207 
NN + Word2Vec  0.1523  0.0798  0.3735  0.1266  0.1523  0.0798 
RF + Doc2Vec  0.0259  0.054  0.5677  0.1704  0.6729  0.1908 
RF + FastText  −0.209  0.0048  0.1558  0.0815  −0.2493  0.0052 
RF + Navec  0.449  0.1437  0.345  0.1264  0.3384  0.1191 
RF + Word2Vec  0.0329  0.0535  0.2996  0.1114  0.1973  0.0898 
SVM + Doc2Vec  0.4038  0.1384  0.1014  0.0681  0.0764  0.0654 
SVM + FastText  0.3401  0.1186  0.5436  0.1643  0.2105  0.0929 
SVM + Navec  0.0825  0.0668  −0.0008  0.0475  0.3267  0.1183 
RIA Novosti  

Scheme  Text  Title  First paragraph  
Sharpe ratio  Annual return  Sharpe ratio  Annual return  Sharpe ratio  Annual return  
GBM + Doc2Vec  0.8347  0.2414  0.9204  0.2653  0.6241  0.1953 
GBM + FastText  0.8641  0.2454  0.8971  0.2624  0.3022  0.1264 
GBM + Navec  0.3561  0.1382  0.3561  0.1382  1.1917  0.3284 
GBM + Word2Vec  0.71  0.2172  0.71  0.2172  0.602  0.1912 
LR + Doc2Vec  0.2994  0.126  0.2994  0.126  0.8791  0.2538 
LR + FastText  0.4763  0.1633  0.4763  0.1633  0.8888  0.2655 
LR + Navec  0.401  0.1485  0.401  0.1485  0.7374  0.2227 
LR + Word2Vec  0.4652  0.1612  0.4652  0.1612  0.5586  0.183 
NN + Doc2Vec  0.2146  0.1074  0.2034  0.1051  0.4429  0.1541 
NN + FastText  0.4065  0.149  0.4065  0.149  0.5885  0.1857 
NN + Navec  0.427  0.1518  0.427  0.1518  0.5871  0.19 
NN + Word2Vec  0.3289  0.1317  0.3289  0.1317  0.5885  0.1857 
RF + Doc2Vec  0.5385  0.1806  0.8189  0.2462  0.7877  0.2307 
RF + FastText  0.3982  0.1449  0.3982  0.1449  0.9746  0.2831 
RF + Navec  0.3774  0.1409  0.3774  0.1409  0.5307  0.1798 
RF + Word2Vec  0.2479  0.1144  0.2479  0.1144  0.2759  0.1205 
SVM + Doc2Vec  0.5084  0.1704  0.5055  0.1698  0.7373  0.2166 
SVM + FastText  0.3764  0.1434  0.3764  0.1434  0.611  0.1944 
SVM + Navec  0.431  0.1535  0.431  0.1535  0.4775  0.1594 
Table 5: Weekly trading strategy performance of the endogenous models.

Endogenous model  Sharpe ratio  Annual return 

GBM  0.7125  0.2135 
LR  0.6010  0.1851 
NN  0.6347  0.1952 
RF  0.4693  0.1617 
SVM  0.6213  0.1904 
Weekly trading strategy performance
Facet D of Fig. 9 illustrates the one-week-ahead performance of the most economically successful portfolios (Sharpe ratio greater than one), as well as the MOEX Russia Index. The top models in terms of the success of the investment portfolios built on them are as follows: STTM (LDA) on Kommersant with a mean Sharpe ratio of 1.3706 (annual return 36.12%), GBM + Navec on RIA Novosti first paragraphs with a Sharpe ratio of 1.1917 (annual return 32.84%), STTM (LDA) on Vedomosti with a mean Sharpe ratio of 1.1059 (annual return 28.40%), STTM (DTM) on Kommersant with a mean Sharpe ratio of 1.0798 (annual return 28.45%), STTM (LDA) on RIA Novosti with a mean Sharpe ratio of 1.0140 (annual return 27.55%), and LR + Doc2Vec on Kommersant’s titles with a Sharpe ratio of 1.0056 (annual return 27.46%). Portfolios built on these models are also the most profitable. Thus, out of 69 different approaches to stock trend prediction, only six turned out to be economically viable. Among them, four are built on our novel proposed approach, STTM, and only two are derived using shallow feature based text processing methods. Three of the six are based on Kommersant data, two on RIA Novosti data, and one on Vedomosti.
Now let us take a closer look at the best models for each individual news agency. Facets A, B and C of Fig. 9 display performance for Kommersant, RIA Novosti and Vedomosti, respectively. Each facet includes all topic modeling based approaches and the five most successful models among the remaining approaches (shallow feature based methods of text processing and endogenous models). For Kommersant, the STTM (LDA) model takes first place (Sharpe ratio 1.3706, annual return 36.12%), followed by STTM (DTM) (Sharpe ratio 1.0798, annual return 28.45%), both in terms of Sharpe ratio and annual return; next come five models with shallow feature based methods of text processing. For RIA Novosti, STTM (LDA) and STTM (DTM) take second (Sharpe ratio 1.0140, annual return 27.55%) and fourth (Sharpe ratio 0.9691, annual return 26.75%) places, respectively; the rest of the best models are shallow feature based methods of text processing. For Vedomosti, STTM (LDA) and STTM (DTM) take first (Sharpe ratio 1.1059, annual return 28.40%) and third (Sharpe ratio 0.7941, annual return 20.95%) places, respectively; the remaining best models are again shallow feature based methods of text processing.
Our experiments show that for all news sources the proposed STTM approach is among the best models, while maintaining the interpretability of results (see Figs. B9, B10, B11 and B12 in Appendix B). It is worth noting that the endogenous models do not make it to the top of the list, which in turn indicates that more useful economic information can be obtained from external data sources than is contained in the time series themselves. SESTM never makes any list of the best strategies, possibly due to the smaller size of our dataset compared to the dataset used by the SESTM developers. However, we note that SESTM still outperforms the general economic baseline, the MOEX Russia Index, both in terms of Sharpe ratio and annual return.
Conclusion
In this article, we have proposed a new approach, STTM, for evaluating the impact of the news stream on the stock market trend, which is novel in several aspects. First, it does not use domain-specific dictionaries or any other manual markup. Next, unlike many commercial solutions, such as Reuters and Bloomberg, which produce general impact coefficients for the entire market, our algorithm can be fine-tuned for any individual issuer. At the same time, our analytical pipeline remains transparent and interpretable for an investor or a risk manager. It clusters news streams via topic modeling, finds the most influential terms among the most probable words of each topic with a tone assessment procedure, and assesses the overall tone of each topic through a trade-off between positive and negative terms and their probabilities, as well as tone aggregation across the entire news stream. Topic tone reflects the strength and the direction of its potential impact on stock prices. Our procedure can be combined with various topic modeling techniques and time series proximity measures. It can also be generalized to other domains and used to assess the impact of text data on various time series, in both predictive and explanatory tasks.
To illustrate the usefulness of the proposed method, we have carried out a large number of experiments on the prediction of the Russian stock market with texts from the economic sections of the most significant Russian-language news outlets. We investigated Granger causality between the output of the proposed STTM approach and each of the 39 tickers included in the MOEX Russia Index for six years and for two different topic modeling algorithms (LDA and DTM). The model shows significant causality across multiple tickers and can Granger-cause more than 70% of them if the training data is large enough. We compared 28 different models by assessing their performance in terms of the efficiency of a simple long trading strategy. For that, we created portfolios based on the predictions from each of these models and from each of our three news sources independently: 20 portfolios used shallow feature based methods of text processing, one was based on SESTM, five on endogenous models, and two on our approach (STTM). This corresponds to the construction of 2,886 different model variations, as each portfolio creation method was applied to each of the 39 tickers and validated on six train/test splits. The quality of the resulting portfolios was evaluated by two metrics: Sharpe ratio and annual return.
Of all the multitude of model variations, only six turned out to be economically viable with a Sharpe ratio of more than one. Of these, as many as four were based on STTM, and the remaining two were shallow feature based text processing methods that were initially represented by a much larger number of model variations than STTM. The STTM-based models ranked at the top of the list for their respective news publications, consistently outperforming the MOEX Russia Index baseline, the endogenous models, and the SESTM-based topic model. Thus, our work shows that the proposed framework is promising in explaining and predicting financial time series based on the textual data flow. The universal applicability of topic modeling to all European languages, as well as to some other languages, suggests that this framework has good prospects of being usable far beyond the Russian stock market.
The novelty of STTM, as compared to other approaches that make use of topic modeling, namely SESTM (Ke, Kelly & Xiu, 2020) and ITMTF (Kim et al., 2013), is twofold. First, STTM allows direct optimization of the efficiency of investment portfolios (a task that ITMTF does not address) and does it better than SESTM. Second, both SESTM and ITMTF work to homogenize the generated topics by the direction of their effect on the target variable, either negative or positive. For this purpose, SESTM reduces the number of topics to two only, which renders them uninterpretable (and, as we have shown, less predictive than our approach). ITMTF’s approach is more nuanced: while optimizing both the topics’ predictive power and their purity in terms of the effect’s direction, it yields genuinely interpretable topics. However, it does not evaluate the overall effect of the entire news stream of a given time period on the share prices, which, ultimately, is the main practical goal of using news in such models. Additionally, it is not obvious that the overall predictive power of purified topics is higher than that of naturally occurring topics. Thus, adaptation of the ITMTF purification logic to the goal of direct trading strategy optimization, and comparison of the resulting pipeline to STTM, is an interesting task for future research.
Our approach has several practical implications. First, its ability to create impact indices of a news stream, or a stream of textual data from social media, for an individual issuer should be of higher practical value for traders than overall market indices. Issuer-specific indices can be used directly in trading strategies or as a factor in more complex models. Second, the transparency and interpretability of our approach should make it attractive to the mass-market investment applications that are appearing in large numbers. Our approach can make the decision advice rendered by such apps more understandable for lay investors and thus increase customer trust and loyalty to such apps. Finally, professional risk analysts can benefit from in-depth analysis of the rich information provided by our approach: they can numerically analyze the behavior of their companies in the past for better risk management in the future.
Limitations
Like all approaches involving topic modeling, ours is sensitive to duplicate news. Although a large number of duplicates may indicate a topic’s importance, duplicate-based topics tend to be artificially separated from similar, but not identical, texts. The effect of this phenomenon on model performance needs to be studied experimentally. Likewise, coverage of economic events may be heavily skewed by editorial choices that, as in 2018, may hinder the model’s predictive power. This effect might be mitigated by broader samples of media outlets. Finally, as mentioned above, in this article we ignore brokers’ commissions and transaction costs when evaluating the performance of our strategy. Although our goal here is to find return-predictive signals, models aiming at exact calculation of return amounts should account for these additional costs.