Authors:
(1) Deborah Miori, Mathematical Institute, University of Oxford, Oxford, UK and Oxford-Man Institute of Quantitative Finance, Oxford, UK (Corresponding author: Deborah Miori, deborah.miori@maths.ox.ac.uk);
(2) Constantin Petrov, Fidelity Investments, London, UK.
Starting from a corpus of economic articles from The Wall Street Journal, we present a novel systematic way to analyse news content that evolves over time. We leverage state-of-the-art natural language processing techniques (i.e. GPT3.5) to extract the most important entities of each available article, and aggregate the co-occurrence of entities in a related graph at the weekly level. Network analysis techniques and fuzzy community detection are tested on the proposed set of graphs, and a framework is introduced that allows systematic but interpretable detection of topics and narratives. In parallel, we propose to consider the sentiment around the main entities of an article as a more accurate proxy for the overall sentiment of that piece of text, and describe a case study to motivate this choice. Finally, we design features that characterise the type and structure of news within each week, and map them to moments of financial market dislocations. The latter are identified as dates with unusually high volatility across asset classes, and we find quantitative evidence that they relate to instances of high entropy in the high-dimensional space of interconnected news. This result further motivates our efforts to provide a novel framework for the systematic analysis of narratives within news.
In today’s fast-paced digital age, the world is inundated with an unprecedented volume of information, particularly from news sources. The sheer volume of data generated daily makes it increasingly challenging for individuals to parse and process this information accurately through human capabilities alone. News flows constantly from countless channels and platforms, and the need for advanced tools and technologies to sift through this vast sea of data has never been more apparent. In the realm of financial markets, the potential gain from efficiently handling such an enormous amount of news data, and extracting quantitative signals from it, is even more pronounced. Financial markets are highly sensitive to information, and characterising narratives within news is one task that can enhance our knowledge of the impact of news on asset prices, trading strategies, and investor sentiment.
Some examples of interesting research on the topic follow. In [1], the authors investigate the high-frequency interdependent relationships between the stock market and statistics on economic news in the US context. Then, [2] investigates how news affects the trading behaviour of different categories of investors in a financial market, while [3] finds evidence that market makers demand higher expected returns prior to earnings announcements, because of increased inventory risks that stem from holding net positions through the release of anticipated earnings news. In [4], the authors measure the correlation between the returns of publicly traded companies and news about them as collected from Yahoo Financial News. Then, [5] tries to quantify how topics discussed within news influence the stock market. The authors of this paper apply a topic modeling technique called Latent Dirichlet Allocation (LDA) [6], in order to extract the keywords of information (i.e. “topics”) that synchronise well with trading activity, measured by the daily trading volume. However, LDA is a primitive technique that cannot deliver a systematic and predictive advantage. Relatedly, the very recent research in [7] investigates the field of topic modeling in the context of finance-related news impact analysis, and further stresses how limited the existing literature is. The authors compare three state-of-the-art topic models, namely LDA, Top2Vec [8], and BERTopic [9], and show that the latter performs best. The framework focuses on extracting topics and related sentiment for specific stocks, and investigates them in connection with the time series of returns of those stocks. However, the actual identification and interpretation of topics remains of questionable quality, and the framework limits itself to facilitating efficient news selection based on topics, but does not properly analyse the latter.
On the other hand, the recent emergence of novel Large Language Models (LLMs), such as OpenAI’s Generative Pre-trained Transformer (GPT) models, clearly marks a significant departure from earlier language processing techniques. These models leverage the power of deep learning and vast amounts of text data to achieve unprecedented levels of language understanding and generation, delivering outstanding improvements in Natural Language Processing (NLP) tasks and opening countless novel paths of research. The paper in [10] presents a study on harnessing LLMs’ outstanding knowledge and reasoning abilities for explainable financial time series forecasting of NASDAQ-100 stocks. Then, [11] leverages GPT parsing of companies’ annual reports to get suggestions on investment targets. Finally, [12] shows the potential improvement of the GPT4 LLM in comparison to BERT for modeling same-day daily stock price movements of Apple and Tesla in 2017, based on sentiment analysis of microblogging messages.
There is a clearly increasing interest in understanding how to optimally leverage the capabilities of GPT models, since the focus has so far been on their qualitative responses. For now, quantitative applications have been very limited in breadth, and we are also not aware of any existing research that tries to leverage GPTs for topic extraction and narrative detection within news. Thus, we believe that our research provides a strongly innovative approach to the solution of such tasks, by leveraging both GPT models and network analysis techniques.
Main contributions. Our research introduces a novel framework for topic identification within economic news. Specifically, we leverage GPT3.5 to extract the main entities (both concrete and abstract) of reference for each article in an available corpus, and aggregate the co-occurrence of entities among articles in weekly graphs. Studying the resultant set of graphs allows us to identify clusters of entities that can be mapped back to interpretable non-trivial topics. Furthermore, basic metrics such as the degree and eigenvector centrality of nodes are shown to characterise the evolution over time of central themes of discussion. Beyond that, we propose to consider the sentiment around the main entities of an article as a more accurate proxy for the overall sentiment of that piece of text, and describe a case study to motivate this choice. One final contribution of our study is the investigation of news features in relation to financial market dislocations. We design attributes that characterise both the local and global structure of news, as well as their sentiment and interconnections. This allows us to find quantitative evidence of high entropy in the high-dimensional space of interconnected news, when the latter are associated with moments of unusually high volatility across asset classes.
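As a rough illustration of the graph construction and centrality metrics mentioned above, the following minimal Python sketch builds one weekly co-occurrence graph and ranks its entities. It is not the authors' implementation: the entity lists stand in for hypothetical GPT3.5 output, the function names are our own, and the choice to weight each edge by the number of articles in which two entities co-occur is an assumption made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence_graph(articles):
    """Count, for each entity pair, the number of articles mentioning both."""
    edges = Counter()
    for entities in articles:
        # Deduplicate so that a pair is counted at most once per article.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


def weighted_degree(edges):
    """Sum of incident edge weights for every node."""
    degree = defaultdict(int)
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return dict(degree)


def eigenvector_centrality(edges, iterations=200, tol=1e-10):
    """Power iteration on the weighted adjacency matrix, L2-normalised."""
    neighbours = defaultdict(dict)
    for (a, b), weight in edges.items():
        neighbours[a][b] = weight
        neighbours[b][a] = weight
    nodes = sorted(neighbours)
    if not nodes:
        return {}
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Iterate with A + I: same eigenvectors as A, but no oscillation
        # on bipartite graphs.
        new = {
            n: score[n] + sum(w * score[m] for m, w in neighbours[n].items())
            for n in nodes
        }
        norm = sum(v * v for v in new.values()) ** 0.5
        new = {n: v / norm for n, v in new.items()}
        converged = max(abs(new[n] - score[n]) for n in nodes) < tol
        score = new
        if converged:
            break
    return score


# Hypothetical entities for three articles published in the same week.
week_articles = [
    ["Federal Reserve", "inflation", "interest rates"],
    ["Federal Reserve", "inflation"],
    ["inflation", "oil prices"],
]

edges = build_cooccurrence_graph(week_articles)
degree = weighted_degree(edges)
centrality = eigenvector_centrality(edges)

for node in sorted(centrality, key=centrality.get, reverse=True):
    print(f"{node}: degree={degree[node]}, eigenvector={centrality[node]:.3f}")
```

Repeating this for each week yields a sequence of graphs whose node rankings can be compared over time. A graph library would serve equally well; the standard-library version is used here only to keep the sketch self-contained.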
Structure of the paper. Section 2 introduces the data we collect, and some initial related processing. Section 3 clarifies the theoretical knowledge that is necessary to understand the approach taken, which mainly comprises notions from both NLP and graph theory, and related embedding techniques. Then, Section 4 highlights our analyses on narratives and market dislocations, and describes the results achieved. Finally, we conclude this work with some last remarks in Section 5.
This paper is available on arXiv under the CC0 1.0 DEED license.