LSTMs

learning super tricky models


Spoiler Alert

In honor of an upcoming lecture I'm giving for the Cal Poly Quantitative Finance Club, I've condensed the most vital information into this medium-length article. Consequently, this article is less article and more guidebook, with examples and equations sprinkled in to make understanding this complex behemoth easier.

Prediction

Humanity has long sought to predict the future¹. From ancient oracles to modern black-box models, the desire to peer into the unknown remains unchanged. Today, in quantitative finance, this pursuit takes the form of Long Short-Term Memory networks (LSTMs), a class of neural networks designed to make sense of sequential data.

Traditional financial models like ARIMA and GARCH once ruled time series forecasting, but their Achilles' heel has always been long-term dependencies. Recurrent Neural Networks (RNNs) attempted to address this but suffered from the vanishing gradient problem, where crucial past information was lost in the depths of backpropagation². Enter LSTMs.
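To make the vanishing gradient concrete, here is a toy calculation (my own illustration, not from the lecture; the weight value is arbitrary) showing how a gradient shrinks geometrically when backpropagated through many sigmoid-activated time steps:

```python
# Toy illustration: backpropagating through T steps multiplies the
# gradient by the recurrent Jacobian at each step. With sigmoid
# activations (derivative at most 0.25) and a unit recurrent weight,
# the gradient shrinks geometrically with sequence length.
T = 50
w = 1.0                    # hypothetical recurrent weight
sigmoid_deriv_max = 0.25   # maximum slope of the sigmoid
grad = 1.0
for _ in range(T):
    grad *= w * sigmoid_deriv_max

print(grad)  # ~7.9e-31 -- effectively zero after 50 steps
```

After only 50 steps the gradient is numerically indistinguishable from zero, which is why a plain RNN "forgets" anything beyond the recent past.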

I. The Core of Memory: How LSTMs Work

The LSTM network is designed to address the vanishing gradient problem by introducing a memory cell that can maintain long-term dependencies. At the core of the LSTM is its sophisticated gating mechanism, which enables it to retain, modify, and output relevant information over long sequences.

Let's zoom into this memory cell:

1. Forget Gate

The forget gate is responsible for determining which pieces of information from the cell’s memory should be discarded, essentially deciding what is irrelevant or outdated. This gate plays a pivotal role in controlling the memory's longevity.

How It Works: The forget gate takes the current input and the previous hidden state as inputs, processes them through a sigmoid activation function, and produces a value between 0 and 1. The closer the value is to 0, the more of the previous memory will be discarded. A value closer to 1 indicates that the previous memory will be retained.

The Math:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)

Here W_f is the weight matrix and b_f is the bias term; h_{t-1} is the previous hidden state and x_t is the current input at time step t.

Example: In financial forecasting, if an LSTM is modeling stock price movement, the forget gate might discard outdated information about market conditions from several time steps ago that are no longer relevant.
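As a concrete sketch of the equation above (my own NumPy illustration; the dimensions, hidden size 4 and input size 3, are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical dimensions: hidden size 4, input size 3.
rng = np.random.default_rng(0)
W_f = rng.normal(size=(4, 4 + 3))   # weight matrix W_f
b_f = np.zeros(4)                   # bias term b_f
h_prev = rng.normal(size=4)         # previous hidden state h_{t-1}
x_t = rng.normal(size=3)            # current input x_t

# f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f)
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
print(f_t)  # each entry lies in (0, 1): a per-dimension "keep" fraction
```

Each entry of `f_t` acts as a dial between 0 (forget this memory dimension) and 1 (keep it fully).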

2. Input Gate

The input gate controls the flow of new information that will update the cell state. It decides what new data should be stored in memory, effectively allowing the LSTM to "learn" from the current time step and adjust its memory accordingly.

How It Works: The input gate involves two key components:

The first part is the sigmoid activation function, which determines which parts of the new information will be updated. The sigmoid function produces values between 0 and 1, similar to the forget gate, but here it’s for adding information rather than removing it.

The second part is the tanh activation function, which generates the new candidate values that could be added to the memory cell. These candidate values are scaled and then combined with the output of the sigmoid gate to update the memory.

The Math:

\begin{aligned} i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\ \tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \end{aligned}

Where i_t is the input gate and \tilde{C}_t is the candidate cell state generated by the tanh function. Together with the forget gate, these update the cell state via C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t, where the products are element-wise.

Example: In stock price forecasting, the input gate would consider new data points such as recent market news, which might affect stock prices, and decide how much of that data should be incorporated into the cell state.

3. Output Gate

The output gate controls what information from the cell’s memory should be output at each time step. It determines how much of the internal state should influence the final prediction or the hidden state, which is used for the next time step or passed to the next layer in the network.

How It Works: The output gate uses the current input and the previous hidden state to produce a value between 0 and 1 via a sigmoid function. The updated cell state (after modification by the forget and input gates) is passed through the tanh function to scale its values, and the result is then multiplied element-wise by the gate's output. This scaled result is the hidden state, the final output of the LSTM unit for that time step.

The Math:

\begin{aligned} o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\ h_t &= o_t \cdot \tanh(C_t) \end{aligned}

Where o_t is the output gate and C_t is the current cell state.

Example: For forecasting stock prices, the output gate might decide that only certain aspects of the cell state (like long-term trends) are relevant for making the next prediction, while discarding less important data (like short-term trends) for that particular time step.

The Overall Process

In an LSTM, these gates work together in a cycle to allow the network to make informed decisions about memory and output. Here’s how it all syncs up:

  1. Forget: Discards irrelevant information from the memory cell.
  2. Input: Updates the memory cell with new relevant information.
  3. Output: Decides what to output from the memory cell, which influences the prediction at that time step.
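The three-step cycle above can be sketched as a single step function (a minimal NumPy illustration of my own, with hypothetical sizes; production implementations fuse the four matrix multiplies for speed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step; params holds a (W, b) pair per gate."""
    z = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # 1. forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # 2. input gate
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])  #    candidate values
    C_t = f_t * C_prev + i_t * C_tilde                    #    memory update
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # 3. output gate
    h_t = o_t * np.tanh(C_t)                              #    new hidden state
    return h_t, C_t

# Tiny usage example with hypothetical sizes (hidden=4, input=3):
rng = np.random.default_rng(42)
H, X = 4, 3
params = {f"W_{g}": rng.normal(size=(H, H + X)) for g in "fiCo"}
params.update({f"b_{g}": np.zeros(H) for g in "fiCo"})
h, C = np.zeros(H), np.zeros(H)
for t in range(5):                     # run five time steps
    h, C = lstm_step(rng.normal(size=X), h, C, params)
print(h.shape, C.shape)  # (4,) (4,)
```

The hidden state `h` and cell state `C` are threaded from one call to the next, which is exactly how memory persists across time steps.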

Putting it All Together

LSTMs operate as a network of these memory cells, each processing a single time step. The hidden state carries memory across these steps, and the final cell outputs the next predicted value—whether it be stock prices, volatility measures, or broader economic trends. Training involves minimizing error via Mean Squared Error (MSE), with Backpropagation Through Time (BPTT) fine-tuning weights.
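As a hedged sketch of the training setup just described (assuming TensorFlow/Keras is available; the layer sizes are arbitrary and the sine-plus-noise series is a synthetic stand-in for real price data):

```python
import numpy as np
from tensorflow import keras

# Minimal sketch: fit an LSTM to predict the next value of a
# univariate series from the previous `window` values.
rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20, 500)) + 0.1 * rng.normal(size=500)

window = 20
X = np.stack([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., None]                      # shape: (samples, window, 1 feature)

model = keras.Sequential([
    keras.layers.Input(shape=(window, 1)),
    keras.layers.LSTM(32),            # 32 memory cells
    keras.layers.Dense(1),            # next-step prediction
])
model.compile(optimizer="adam", loss="mse")   # MSE, minimized via BPTT
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

pred = model.predict(X[-1:], verbose=0)       # forecast the next value
```

Keras handles Backpropagation Through Time automatically when it sees a recurrent layer; all we specify is the loss (MSE) and the optimizer.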

For their prowess, LSTMs have found a home in numerous financial applications:

  • Stock Price Prediction: Capturing long-term dependencies in stock movements.
  • Options Pricing Models: Estimating volatility and pricing derivatives more effectively.
  • Portfolio Optimization: Learning patterns in asset allocation strategies.
  • Market Anomaly Detection: Identifying regime shifts and potential arbitrage opportunities.

A definite milestone in financial modeling³, but research continues to evolve. GRUs (Gated Recurrent Units) offer a streamlined alternative, while hybrid models blend LSTMs with reinforcement learning for adaptive trading strategies. The question is no longer whether machines can predict the future, but how well they can do it.

In the unpredictable world of finance, mastering time is power. LSTMs are the latest tool in humanity’s eternal quest to make sense of chaos.

II. The Brain

"You know, huh, memory cells and the whole concept of LSTMs sound a lot like what we do every day with our heads." Yes. The parallels are numerous, so let's investigate them all.

Hippocampus --> Memory Cell

For us humans, the hippocampus is central in encoding, storing, and retrieving memories. It processes sensory information from the environment, stores it in long-term memory, and this information influences our future decision-making and behavior. The memory cells in an LSTM do basically the same thing, maintaining the network’s memory over time.

Both biological memory systems and LSTMs are also dynamic in nature. Just as memories are constantly updated based on new experiences and environmental factors, LSTM memory cells are updated with new inputs over time, adjusting to new data while retaining useful historical patterns.

Neuroplasticity --> Backpropagation

Neuroplasticity is the brain's ability to reorganize itself by forming new connections throughout life. This allows the brain to adapt to the novel: new information, experiences, and environments. An LSTM "learns" analogously, adjusting its internal weights and biases during training and fine-tuning itself through exposure to different datasets. The similarities run deep.

Synaptic Weights & Hebbian Learning: In biological neural networks (us), the strength of connections between neurons determines how effectively information is passed through the network. These weights are adjusted during learning through what's known as the Hebbian learning rule: "neurons that fire together wire together."

LSTM Training & Backpropagation: LSTMs rely on Backpropagation Through Time (BPTT) and various error functions to fine-tune weights and biases. Adjusting through repeated exposure to sequential data, they are able to make more accurate predictions.

In both cases, the more a pathway is used, the stronger it becomes, enhancing our skills over time and the model's performance.
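To make the contrast concrete, here is a toy comparison of the two update rules (purely illustrative, with arbitrary constants; real neurons and real LSTMs are far more complicated):

```python
# Hebbian rule: "neurons that fire together wire together" -- the
# weight grows with the product of pre- and post-synaptic activity.
pre, post, w, eta = 0.9, 0.8, 0.1, 0.5
for _ in range(10):
    w += eta * pre * post           # repeated co-activation strengthens w

# Gradient descent (what BPTT does): the weight moves against the
# error gradient instead of with raw activity.
w2, target = 0.1, 0.72
for _ in range(10):
    y = w2 * pre                    # simple linear "neuron"
    w2 -= eta * (y - target) * pre  # d(MSE)/dw for 0.5*(y - target)^2

print(round(w, 2), round(w2, 2))    # → 3.7 0.8
```

Note the qualitative difference: the Hebbian weight grows without bound under repeated co-activation, while gradient descent converges toward the weight that minimizes the error (here, target / pre = 0.8).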

Prefrontal Cortex --> Output Gate

The prefrontal cortex handles the higher-level cognitive functions we love: decision-making, planning, and problem-solving. We evaluate multiple inputs and choose the best course of action. These decisions are often based on prior experience too, drawn from the memories our hippocampus has encoded and stored for us.

The output gate of an LSTM performs a similar function: by deciding which pieces of information from the memory cell should be output at each time step, it ensures that only the most relevant data contributes to the final prediction or decision.

What's Lacking?

Emotion.

As much as we can pride ourselves on calm, cool, rational thinking, we are impulsive and we are irrational. But, that's what makes us human.

III. Wrapping it up

So, what have we learned? Well, besides the fact that your brain is essentially running a state-of-the-art deep learning model for free⁴, we've seen just how much artificial intelligence borrows from biological intelligence.

LSTMs, like us, remember what matters, forget the noise, and make decisions based on past experiences. They fine-tune themselves over time, just like we do when learning a new skill, driving in a new city, or—let’s be honest—convincing ourselves that we’ll start that new habit tomorrow.

But for all their strengths, LSTMs lack what makes us us—emotion, instinct, irrationality, and the occasional inexplicable urge to stay up until 3 AM reading about something completely irrelevant. Until neural networks start making questionable life choices, we still have the edge.


Sources

  • Long Short-Term Memory, Sepp Hochreiter & Jürgen Schmidhuber (1997)
  • Dive into Deep Learning (A helpful website)
  • The Colab From My Lecture

Footnotes

  1. Or at least, humanity has tried. From the Delphic Oracle’s cryptic mutterings to Wall Street’s algorithmic crystal balls, our track record for accuracy is, well, mixed. The Oracle of Delphi famously told Croesus that if he attacked Persia, he would destroy a great empire. He did—it was just his own. So, while LSTMs might be an upgrade over goat entrails, let’s not get too cocky.

  2. As gradients shrink during backpropagation, earlier time steps in the sequence vanish into the mathematical void, leaving RNNs clueless about anything beyond the immediate present.

  3. With most big quant firms keeping models proprietary, it’s hard to say whether we’re on the cutting edge or just watching some carefully leaked research from 20 years ago.

  4. H100s not included*

© 2025 Mark McGuire. All rights reserved.