When we started FinBrain, we built our first price forecasting engine on a neural network — a NARX network (MATLAB's narxnet), a shallow recurrent architecture popular for time series at the time. It was the “AI” era, and neural networks felt like the obvious choice for anything involving patterns in data. We expected the network to find non-linear structure in price series that classical statistical methods couldn’t capture.
What we got instead was a model that looked brilliant on training data and fell apart out-of-sample.
We rebuilt the entire forecasting engine around ARIMA — a classical statistical method from the 1970s. A decade later, we’re still on ARIMA. This post explains why.
What Went Wrong With Deep Learning
The core problem with deep learning for price forecasting isn’t the architecture. It’s the data.
Financial time series have three properties that make them adversarial to deep neural networks:
Low signal-to-noise ratio. Daily stock returns are dominated by noise. The true predictable component — if it exists at all — is small relative to random fluctuations. Deep networks have enormous expressive capacity, and with enormous capacity comes the ability to memorize noise. The model learns to fit the idiosyncratic movements of the specific training window rather than extracting any generalizable structure.
Non-stationarity. The statistical properties of returns change over time. Volatility clusters. Correlations shift. Market regimes rotate. A model trained on 2015-2018 data learns patterns that may not exist in 2020-2023. Deep networks that capture “subtle patterns” in the training set often capture patterns that were specific to that era, not the underlying market dynamics.
Limited data. Compared to image or language tasks where deep learning excels, price forecasting has fundamentally limited data. Even with 20 years of daily data, you have ~5,000 observations per ticker. Vision models train on millions of images; language models train on trillions of tokens. Deep networks need enough data to generalize, and financial time series simply don’t provide it at the daily frequency most traders care about.
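The signal-to-noise point can be made concrete with a synthetic sketch (toy numbers, not market data): if returns carry a tiny predictable component buried in realistic daily noise, even a perfect model is capped by the signal's share of total variance.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000  # roughly 20 years of daily observations

signal = 0.0005 * np.sin(np.arange(n) / 10.0)  # tiny predictable component
noise = rng.normal(0.0, 0.01, n)               # ~1% daily volatility, pure noise
returns = signal + noise

# Even an oracle that knows `signal` exactly cannot exceed this R^2:
ceiling = signal.var() / returns.var()
print(f"R^2 ceiling for a perfect model: {ceiling:.4f}")
```

With these (assumed) magnitudes the ceiling lands well under 1%: a model reporting much higher in-sample R² on daily returns is almost certainly fitting noise.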
The Overfitting Signature
When a deep network overfits financial time series, the failure mode is characteristic:
- Training error drops smoothly over epochs
- Validation error looks reasonable if validation is drawn from the same era
- Out-of-sample performance, especially across different market regimes, is inconsistent and often worse than a coin flip
The tell is that the model’s “edge” disappears whenever market conditions change. The network wasn’t learning anything real — it was curve-fitting the specific noise realization of its training window.
We saw this clearly in our NARXNET results. Pretty forecasts in-sample. Unreliable forecasts out-of-sample. Worse, the model was a black box: when the forecasts were wrong, we couldn’t tell why.
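The signature is easy to reproduce on synthetic data (a toy stand-in, not our NARXNET setup): fit a high-capacity model to a pure random walk, and the in-sample fit looks great while out-of-sample error explodes the moment you step past the training window.

```python
import numpy as np

rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(0.0, 1.0, 120))  # pure random walk: zero signal

x = np.arange(120) / 100.0  # scaled time index for numerical stability
train_x, train_p = x[:100], prices[:100]
test_x, test_p = x[100:], prices[100:]

# High-capacity model: degree-15 polynomial as a stand-in for an
# overparameterized network fit to the training window
coeffs = np.polyfit(train_x, train_p, deg=15)

train_mse = np.mean((np.polyval(coeffs, train_x) - train_p) ** 2)
test_mse = np.mean((np.polyval(coeffs, test_x) - test_p) ** 2)

print(f"train MSE: {train_mse:.2f}, test MSE: {test_mse:.2f}")
```

There was nothing to learn: the divergence between train and test error is the curve-fitting signature described above.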
Why Statistical Models Work
We replaced the neural network with ARIMA (AutoRegressive Integrated Moving Average), a model family that dates back to Box and Jenkins in 1970. ARIMA is transparent, well-understood, and has decades of academic validation specifically in financial time series.
The advantages over deep learning for this problem:
Parsimony. An ARIMA model typically has fewer than ten parameters. A deep network has millions. With noisy data and limited observations, fewer parameters mean less opportunity to memorize noise. The bias-variance tradeoff genuinely matters here: a low-capacity model trained on noisy data generalizes better than a high-capacity one.
Interpretability. Every ARIMA parameter has a clear statistical meaning: autoregressive coefficients capture momentum, moving-average coefficients capture shock persistence. When a forecast looks unreasonable, we can trace the cause. A neural network doesn’t give us that diagnostic.
Calibrated uncertainty. ARIMA produces proper prediction intervals grounded in the residual distribution. Deep networks can produce “confidence intervals” through techniques like dropout or ensembling, but calibration is notoriously weak. If an ARIMA model says there’s a 95% chance the price falls in a range, that statement is backed by statistical theory. With deep learning, we’d need extensive calibration work to say the same thing.
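Here is a minimal sketch of how a residual-grounded interval works, using a hand-rolled AR(1) on simulated returns (not FinBrain's actual pipeline; a real ARIMA implementation such as statsmodels does this more rigorously, with order selection and proper multi-step variances):

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulate an AR(1) return series: r_t = phi * r_{t-1} + eps_t
phi_true, sigma = 0.3, 0.01
r = np.zeros(1_000)
for t in range(1, len(r)):
    r[t] = phi_true * r[t - 1] + rng.normal(0.0, sigma)

# Fit AR(1) by ordinary least squares: regress r_t on r_{t-1}
x, y = r[:-1], r[1:]
phi_hat = (x @ y) / (x @ x)

# The residual standard deviation is what grounds the interval
resid = y - phi_hat * x
s = resid.std(ddof=1)

# One-step-ahead 95% prediction interval for the next return
point = phi_hat * r[-1]
lower, upper = point - 1.96 * s, point + 1.96 * s
print(f"phi_hat={phi_hat:.3f}, 95% interval: [{lower:.4f}, {upper:.4f}]")
```

Because the interval width comes directly from the fitted residuals, its coverage can be checked against held-out data — exactly the kind of audit that is hard to run on a dropout-based "confidence interval."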
Scale economics. Fitting ARIMA to a single ticker is fast — milliseconds on a modern CPU. Training a deep network per ticker is slow and expensive. We forecast across 28,000+ tickers daily. That scale is trivial with ARIMA and prohibitive with deep learning unless we share models across tickers, which creates a different set of problems.
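To make the scale point concrete, here is a toy benchmark using the same kind of hand-rolled AR(1) fit (real ARIMA order selection costs more per ticker, but the order of magnitude is the point):

```python
import time
import numpy as np

rng = np.random.default_rng(1)
n_tickers, n_obs = 1_000, 2_500  # ~10 years of daily data per ticker
panel = rng.normal(0.0, 0.01, (n_tickers, n_obs))  # synthetic return panel

start = time.perf_counter()
# OLS AR(1) fit per ticker, vectorized across the whole panel at once
x, y = panel[:, :-1], panel[:, 1:]
phi = (x * y).sum(axis=1) / (x * x).sum(axis=1)
elapsed = time.perf_counter() - start

print(f"fit {n_tickers} tickers in {elapsed * 1e3:.1f} ms")
```

A thousand univariate fits complete in well under a second on a laptop CPU; training a thousand separate neural networks is a GPU-cluster job.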
Where Deep Learning Still Makes Sense
This isn’t an argument against deep learning in finance generally. Neural networks are genuinely useful for:
- High-frequency data, where signal-to-noise improves and data volume grows
- Alternative data with rich structure (text, images, audio) where classical statistics has no tools
- Cross-sectional problems where the training set spans thousands of assets simultaneously
- Feature engineering tasks like sentiment extraction where language models genuinely outperform rule-based approaches
But for the specific task of generating a price forecast for a single ticker from that ticker’s own historical prices, statistical time-series models win. Not because they are more sophisticated, but because their capacity is matched to the data.
What This Means for Users
If you’re evaluating price forecasting vendors, ask them about their methodology. The ones marketing “AI-powered” forecasts without disclosing the underlying model are either using vanilla statistical methods and calling it AI for marketing, or using deep learning and hoping their overfitting doesn’t show up in your backtests.
FinBrain’s Price Forecasts are generated from ARIMA models with calibrated confidence intervals. We think this is the right tool for the problem, and we’re transparent about it because the methodology is a feature, not a liability.
The next post explains how to read those forecasts in practice — what the mid, lower, and upper bounds mean, and how to use confidence intervals to size risk.