What Can (and Does) Go Wrong with Data
Adapted from “Real-Time Risk: What Investors Should Know About Fintech, High-Frequency Trading and Flash Crashes” (with Steve Krawciw, Wiley, 2017)
Big data is all the rage, but not without pitfalls. In fact, data analysis is subject to risks that may lead to poor inferences and, in turn, to bad decisions.
The process of analyzing data, regardless of complexity, can go off the rails on several fronts. A small data sample may pick up a pattern that does not recur over a sufficiently long timeline, misleading researchers about the pattern’s power and predictability.
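The small-sample trap is easy to reproduce. The sketch below (plain NumPy, with made-up parameters) simulates 50 hypothetical "signals" that are pure noise; ranking them on a short 30-day window makes one look predictive, yet the apparent edge evaporates over the rest of the sample.

```python
import numpy as np

rng = np.random.default_rng(42)

# 250 trading days of pure noise for 50 hypothetical signals:
# by construction, none has any true predictive power.
signals = rng.normal(0.0, 0.01, size=(50, 250))

# In-sample: rank the signals on their first 30 days only.
in_sample = signals[:, :30].mean(axis=1)
best = int(np.argmax(in_sample))

# Out-of-sample: the "winner" reverts toward its true mean of zero.
print(f"best signal, in-sample daily mean: {in_sample[best]:+.4f}")
print(f"same signal, out-of-sample mean:   {signals[best, 30:].mean():+.4f}")
```

The in-sample winner is an artifact of selection on a short window; on a sufficiently long timeline the pattern does not recur, exactly as the paragraph above warns.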
Oversampling of data occurs when researchers torture the same sample of data over and over, hoping it will tell them something useful about the markets. Often, the only outcome of such analysis is a misleading forecast.
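The data-torture effect can be shown with a short simulation (a hedged sketch with illustrative numbers, not anyone's actual research process): testing 100 worthless trading rules against the same noise sample at the conventional 5% significance level produces several "discoveries" purely by chance.

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.01, 1000)  # one fixed sample of noise returns

n_tests = 100
false_positives = 0
for _ in range(n_tests):
    # Each "hypothesis" is a random long/short rule applied to the SAME data.
    rule = rng.choice([-1.0, 1.0], size=returns.size)
    pnl = rule * returns
    t = pnl.mean() / (pnl.std(ddof=1) / np.sqrt(pnl.size))
    if abs(t) > 1.96:  # "significant" at the 5% level
        false_positives += 1

print(f"{false_positives} of {n_tests} worthless rules look significant")
```

On average about five of the hundred rules clear the significance bar despite having no substance, which is why repeated tests on one sample so often end in a misleading forecast.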
Overreliance on machine learning is another issue plaguing data scientists. Machine learning means different things to different people, but it usually refers to algorithmic factorization of data and iterative refinement of models based on their realized predictive power. While it is tempting to entrust computer scientists and machines to sift through mountains of data in search of a gold nugget of predictability, the reality is that markets are driven by economic models that require deep understanding of not just mathematics and computer science, but also market participant behavior and existing economic models. Understanding the often nonlinear economics underlying the markets can shorten subsequent machine learning by weeks, if not months or years. How is this possible? Pure machine learning often begins with a so-called spaghetti principle, as in, “Let’s throw the spaghetti (market data) against the wall (past market data and other data), and see what sticks.” Thorough understanding of economics considerably reduces the amount of wall space needed for these experiments, a.k.a. the data drivers, saving time and labor for the data science crew.
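The “wall space” saved by economic screening is easy to quantify: a brute-force search over subsets of candidate data drivers grows as 2^n, so shortening the driver list up front (the counts below are purely illustrative) shrinks the search space by many orders of magnitude.

```python
# Hypothetical driver counts for illustration only.
raw_drivers = 50        # everything the data feed offers
screened_drivers = 10   # only the drivers with an economic rationale

# Number of candidate driver subsets a brute-force search would face.
print(f"unscreened subsets: {2**raw_drivers:,}")
print(f"screened subsets:   {2**screened_drivers:,}")
print(f"reduction factor:   {2**raw_drivers // 2**screened_drivers:,}")
```

Screening 50 candidate drivers down to 10 cuts the subset space from roughly 10^15 combinations to about a thousand, which is the time-and-labor saving the paragraph above describes.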
Duplication of models is a serious problem in financial circles. A blogger recently posted that the career trajectory of financial data modelers follows a pattern: Year 1: Glory, Year 2: Sweat and Tears, and Year 3: a Wild Card. In the first year at a new employer, data modelers bring over a proven successful model from their previous place of employment, or deploy a model that had long been in development, implement it profitably, and earn a bonus. In the second year, the employer’s expectations are high, with hopes of a repeat performance from a second model. Developing this new model requires very hard work—something only very few people can do—resulting in sweat and tears. In the third year, the workers reap the results of the previous year’s labor: their new models either work, or the workers are sent out to pasture, which most often means to the next fund, where they start by implementing the model that was successful in year one at their previous job. In the end, models tend to circulate among financial shops several times over, diluting their quality and creating systemic risks. Suppose a given model has an Achilles’ heel that is activated under certain rare market conditions. Because large amounts of money are invested in the same working models across a wide range of financial institutions, the impact of that Achilles’ heel may be greatly amplified, resulting in a major market crash or other severe destruction of wealth across the financial markets. And if the money used to prop up the strategies is borrowed, as is customary with hedge funds, the effect of just one flaw in a single model can be disastrous for the economy as a whole.
Does this sound like an exaggeration? Think back to August 2007, when hundreds of Wall Street firms, including proprietary trading desks at the investment banks, were running the same automated medium-term statistical-arbitrage (stat-arb) strategies popularized by an overzealous group of quants. That August, in the midst of the quietest two weeks of the year, when most people manage to leave for vacation, these models broke down overnight, resulting in billion-dollar losses across many financial institutions. Rumors circulated that some firm had figured out how to destroy the delicate equilibrium of stat-arb strategies: it ran the models backward with a huge amount of capital to confuse the thinly staffed markets, only to suddenly reverse course and capitalize dramatically on everyone else’s failures. Most of the trading firms were trading on heavily borrowed money; it was in vogue at the time to trade on capital levered 200 times the actual cash. The impact was likely a first step toward the financial crisis of 2008—debt obligations went unmet, valuations were destroyed, and panic and confusion were seeded in the hearts of previously invincible quant traders.
Finally, to err is human, and it is humans who tell computers how to analyze data and learn from it. As a result, errors creep into models, and they can be very difficult and expensive to catch. One solution deployed in banks and other large organizations is to keep model-validation teams on staff whose sole job is to make sure the original models are sound. The problem with this approach? Besides the outrageous expense, validation team members have every incentive to leave for a competitor as soon as they learn a valuable model they can deploy elsewhere.
Still, data analysis is here to stay. Being aware of the pitfalls is the first step to data-driven success.
Irene Aldridge is a co-author of “Real-Time Risk: What Investors Should Know About Fintech, High-Frequency Trading and Flash Crashes” (Wiley, 2017). She is also a speaker at the upcoming 5th Annual Big Data Finance Conference to take place in New York City on May 19, 2017.