PatchTST: A Breakthrough in Time Collection Forecasting | by Marco Peixeiro | Jun, 2023

From idea to observe, perceive the PatchTST algorithm and apply it in Python alongside N-BEATS and N-HiTS

Photograph by Ray Hennessy on Unsplash

Transformer-based fashions have been efficiently utilized in lots of fields like pure language processing (assume BERT or GPT fashions) and laptop imaginative and prescient to call just a few.

Nevertheless, with regards to time collection, state-of-the-art outcomes have principally been achieved by MLP fashions (multilayer perceptron) comparable to N-BEATS and N-HiTS. A latest paper even reveals that easy linear fashions outperform advanced transformer-based forecasting fashions on many benchmark datasets (see Zheng et al., 2022).

Nonetheless, a brand new transformer-based mannequin has been proposed that achieves state-of-the-art outcomes for long-term forecasting duties: PatchTST.

PatchTST stands for patch time collection transformer, and it was first proposed in March 2023 by Nie, Nguyen et al of their paper: A Time Series is Worth 64 Words: Long-Term Forecasting with Transformers. Their proposed methodology achieved state-of-the-art outcomes when in comparison with different transformer-based fashions.

On this article, we first discover the internal workings of PatchTST, utilizing instinct and no equations. Then, we apply the mannequin in a forecasting challenge and examine its efficiency to MLP fashions, like N-BEATS and N-HiTS, and assess its efficiency.

After all, for extra particulars about PatchTST, be sure that to seek advice from the original paper.

Be taught the newest time collection evaluation strategies with my free time series cheat sheet in Python! Get the implementation of statistical and deep studying strategies, all in Python and TensorFlow!

Let’s get began!

As talked about, PatchTST stands for patch time collection transformer.

Because the title suggests, it makes use of patching and of the transformer structure. It additionally consists of channel-independence to deal with multivariate time collection. The final structure is proven beneath.

The PatchTST mannequin structure. We see that the mannequin makes use of channel-independence to deal with multivariate time collection. Within the transformer spine, we additionally see using patching (illustrated by the rectangles). Plus, there are two variations to the mannequin: supervised and self-supervised. Picture by Nie Y., Nguyen N., Sinthong P., Kalagnanam J. from A Time Series is Worth 64 Words: Long-Term Forecasting with Transformers.

There may be quite a lot of info to assemble from the determine above. Right here, the important thing components are that PatchTST makes use of channel-independence to forecast multivariate time collection. Then, in its transformer spine, the mannequin makes use of patching, that are illustrated by the small vertical rectangles. Additionally, the mannequin is available in two variations: supervised and self-supervised.

Let’s discover in additional element the structure and internal workings of PatchTST.


Right here, a multivariate time collection is taken into account as a multi-channel sign. Every time collection is mainly a channel containing a sign.

Overview of the PatchTST mannequin. Right here, we actually emphasize on the implementation of channel-independence, the place every enter token to the Transformer spine comprises info from just one channel, or one time collection. Picture by Nie Y., Nguyen N., Sinthong P., Kalagnanam J. from A Time Series is Worth 64 Words: Long-Term Forecasting with Transformers.

Within the determine above, we see how a multivariate time collection is separated into particular person collection, and every is fed to the Transformer spine as an enter token. Then, predictions are made for every collection and the outcomes are concatenated for the ultimate predictions.


Most work on Transformer-based forecasting fashions centered on constructing new mechanisms to simplify the unique consideration mechanism. Nevertheless, they nonetheless relied on point-wise consideration, which isn’t best with regards to time collection.

In time collection forecasting, we wish to extract relationships between previous time steps and future time steps to make predictions. With point-wise consideration, we try to retrieve info from a single time step, with out taking a look at what surrounds that time. In different phrases, we isolate a time step, and don’t have a look at factors earlier than or after.

That is like making an attempt to know the that means of a phrase with out wanting on the phrases round it in a sentence.

Subsequently, PatchTST makes use of patching to extract native semantic info in time collection.

How patching works

Every enter collection is split into patches, that are merely shorter collection coming from the unique one.

The Transformer spine of PatchTST. Right here, we see that the enter time collection (on the backside of the determine) goes by means of patching, leading to a number of patches (the vertical rectangles) that are then despatched to the Transformer encoder. Picture by Nie Y., Nguyen N., Sinthong P., Kalagnanam J. from A Time Series is Worth 64 Words: Long-Term Forecasting with Transformers.

Right here, the patch may be overlapping or non-overlapping. The variety of patches is dependent upon the size of the patch P and the stride S. Right here, the stride is like in convolution, it’s merely what number of timesteps separate the start of consecutive patches.

Visualizing patching. Right here, we’ve a sequence of 15 timesteps, with a patch size of 5 and a stride of 5 as properly, leading to three patches. Picture by the creator.

Within the determine above, we will visualize the results of patching. Right here, we’ve a sequence size (L) of 15 time steps, with a patch size (P) of 5 and a stride (S) of 5. The result’s the collection being separated into 3 patches.

Benefits of patching

With patching, the mannequin can extract native semantic that means by taking a look at teams of time steps, as an alternative of taking a look at a single time step.

It additionally has the additional benefit of drastically decreasing the variety of token being fed to the transformer encoder. Right here, every patch turns into an enter token to be enter to the Transformer. That method, we will cut back the variety of token from L to roughly L/S.

That method, we drastically cut back the house and time complexity of the mannequin. This in flip implies that we will feed the mannequin an extended enter sequence to extract significant temporal relationships.

Subsequently, with patching, the mannequin is quicker, lighter, and might deal with an extended enter sequence, that means that it may well probably study extra in regards to the collection and make higher forecasts.

Transformer encoder

As soon as the collection is patched, it’s then fed to the transformer encoder. That is the classical transformer structure. Nothing was modified.

Then, the output is fed to linear layer, and predictions are made.

Bettering PatchTST with illustration studying

The authors of the paper prompt one other enchancment to the mannequin by utilizing illustration studying.

Visualizing self-supervised illustration studying in PatchTST. Right here, the mannequin will randomly masks patches and study to reconstruct them. Picture by Nie Y., Nguyen N., Sinthong P., Kalagnanam J. from A Time Series is Worth 64 Words: Long-Term Forecasting with Transformers.

From the determine above, we will see that PatchTST can use self-supervised illustration studying to seize summary representations of the info. This could result in potential enhancements in forecasting efficiency.

Right here, the method is pretty easy, as random patches will probably be masked, that means that they are going to be set to 0. That is proven, within the determine above, by the clean vertical rectangles. Then, the mannequin is educated to recreate the unique patches, which is what’s output on the prime of the determine, because the gray vertical rectangles.

Now that we’ve an excellent understanding of how PatchTST works, let’s check it towards different fashions and see the way it performs.

Within the paper, PatchTST is in contrast with different Transformer-based fashions. Nevertheless, latest MLP-based fashions have been revealed, like N-BEATS and N-HiTS, and have additionally demonstrated state-of-the-art efficiency on lengthy horizon forecasting duties.

The whole supply code for this part is offered on GitHub.

Right here, let’s apply PatchTST, together with N-BEATS and N-HiTS and consider its efficiency towards these two MLP-based fashions.

For this train, we use the Trade dataset, which is a standard benchmark dataset for long-term forecasting in analysis. The dataset comprises each day trade charges of eight international locations relative to the US greenback, from 1990 to 2016. The dataset is made obtainable by means of the MIT License.

Preliminary setup

Let’s begin by importing the required libraries. Right here, we’ll work with neuralforecast, as they’ve an out-of-the-box implementation of PatchTST. For the dataset, we use the datasetsforecast library, which incorporates all fashionable datasets for evaluating forecasting algorithms.

import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from neuralforecast.core import NeuralForecast
from neuralforecast.fashions import NHITS, NBEATS, PatchTST

from neuralforecast.losses.pytorch import MAE
from neuralforecast.losses.numpy import mae, mse

from datasetsforecast.long_horizon import LongHorizon

You probably have CUDA put in, then neuralforecast will robotically leverage your GPU to coach the fashions. On my finish, I would not have it put in, which is why I’m not doing in depth hyperparameter tuning, or coaching on very giant datasets.

As soon as that’s performed, let’s obtain the Trade dataset.

Y_df, X_df, S_df = LongHorizon.load(listing="./knowledge", group="Trade")

Right here, we see that we get three DataFrames. The primary one comprises the each day trade charges for every nation. The second comprises exogenous time collection. The third one, comprises static exogenous variables (like day, month, yr, hour, or any future info that we all know).

For this train, we solely work with Y_df.

Then, let’s guarantee that the dates have the correct kind.

Y_df['ds'] = pd.to_datetime(Y_df['ds'])


First 5 rows of the Trade dataset. Picture by the creator.

Within the determine above, we see that we’ve three columns. The primary column is a novel identifier and it’s essential to have an id column when working with neuralforecast. Then, the ds column has the date, and the y column has the trade fee.

Exhibiting the variety of observations per distinctive id. Picture by the creator.

From the image above, we will see that every distinctive id corresponds to a rustic, and that we’ve 7588 observations per nation.

Now, we outline the sizes of our validation and check units. Right here, I selected 760 time steps for validation, and 1517 for the check set, as specified by the datasets library.

val_size = 760
test_size = 1517

print(n_time, val_size, test_size)

Then, let’s plot one of many collection, to see what we’re working with. Right here, I made a decision to plot the collection for the primary nation (unique_id = 0), however be happy to plot one other collection.

u_id = '0'

x_plot = pd.to_datetime(Y_df[Y_df.unique_id==u_id].ds)
y_plot = Y_df[Y_df.unique_id==u_id].y.values


x_val = x_plot[n_time - val_size - test_size]
x_test = x_plot[n_time - test_size]

fig, ax = plt.subplots(figsize=(12,8))
ax.plot(x_plot, y_plot)
ax.set_ylabel('Exhange fee')
ax.axvline(x_val, coloration='black', linestyle='--')
ax.axvline(x_test, coloration='black', linestyle='--')

plt.textual content(x_val, -2, 'Validation', fontsize=12)
plt.textual content(x_test,-2, 'Take a look at', fontsize=12)


Each day trade fee for the primary nation, from 1990 to 2016. Picture by the creator.

From the determine above, we see that we’ve pretty noisy knowledge with no clear seasonality.


Having explored the info, let’s get began on modelling with neuralforecast.

First, we have to set the horizon. On this case, I exploit 96 time steps, as this horizon can also be used within the PatchTST paper.

Then, to have a good analysis of every mannequin, I made a decision to set the enter measurement to twice the horizon (so 192 time steps), and set the utmost variety of epochs to 50. All different hyperparameters are saved to their default values.

horizon = 96

fashions = [NHITS(h=horizon,

Then, we initialize the NeuralForecastobject, by specifying the fashions we wish to use and the frequency of the forecast, which in that is case is each day.

nf = NeuralForecast(fashions=fashions, freq='D')

We at the moment are able to make predictions.


To generate predictions, we use the cross_validation methodology to utilize the validation and check units. It can return a DataFrame with predictions from all fashions and the related true worth.

preds_df = nf.cross_validation(df=Y_df, val_size=val_size, test_size=test_size, n_windows=None)
First 5 rows of the predictions DataFrame. Picture by the creator.

As you may see, for every id, we’ve the predictions from every mannequin in addition to the true worth within the y column.

Now, to guage the fashions, we’ve to reshape the arrays of precise and predicted values to have the form (variety of collection, variety of home windows, forecast horizon).

y_true = preds_df['y'].values
y_pred_nhits = preds_df['NHITS'].values
y_pred_nbeats = preds_df['NBEATS'].values
y_pred_patchtst = preds_df['PatchTST'].values

n_series = len(Y_df['unique_id'].distinctive())

y_true = y_true.reshape(n_series, -1, horizon)
y_pred_nhits = y_pred_nhits.reshape(n_series, -1, horizon)
y_pred_nbeats = y_pred_nbeats.reshape(n_series, -1, horizon)
y_pred_patchtst = y_pred_patchtst.reshape(n_series, -1, horizon)

With that performed, we will optionally plot the predictions of our fashions. Right here, we plot the predictions within the first window of the primary collection.

fig, ax = plt.subplots(figsize=(12,8))

ax.plot(y_true[0, 0, :], label='True')
ax.plot(y_pred_nhits[0, 0, :], label='N-HiTS', ls='--')
ax.plot(y_pred_nbeats[0, 0, :], label='N-BEATS', ls=':')
ax.plot(y_pred_patchtst[0, 0, :], label='PatchTST', ls='-.')
ax.set_ylabel('Trade fee')
ax.set_xlabel('Forecast horizon')


Predictions of the each day trade fee for the primary collection, within the first window. Picture by the creator.

This determine is a bit underwhelming, as N-BEATS and N-HiTS appear to have predictions which are very off from the precise values. Nevertheless, PatchTST, whereas additionally off, appears to be the closest to the precise values.

After all, we should takes this with a grain of salt, as a result of we’re solely visualizing the prediction for one collection, in a single prediction window.


So, let’s consider the efficiency of every mannequin. To duplicate the methodology from the paper, we use each the MAE and MSE as efficiency metrics.

knowledge = {'N-HiTS': [mae(y_pred_nhits, y_true), mse(y_pred_nhits, y_true)],
'N-BEATS': [mae(y_pred_nbeats, y_true), mse(y_pred_nbeats, y_true)],
'PatchTST': [mae(y_pred_patchtst, y_true), mse(y_pred_patchtst, y_true)]}

metrics_df = pd.DataFrame(knowledge=knowledge)
metrics_df.index = ['mae', 'mse']

metrics_df.type.highlight_min(coloration='lightgreen', axis=1)

Efficiency of all fashions. Right here, PatchTST achieves the bottom MAE and MSE. Picture by the creator.

Within the desk above, we see that PatchTST is the champion mannequin because it achieves the bottom MAE and MSE.

After all, this was not essentially the most thorough experiment, as we solely used one dataset and one forecast horizon. Nonetheless, it’s attention-grabbing to see {that a} Transformer-based mannequin can compete with state-of-the-art MLP fashions.

PatchTST is a Transformer-based fashions that makes use of patching to extract native semantic that means in time collection knowledge. This enables the mannequin to be sooner to coach and to have an extended enter window.

It has achieved state-of-the-art performances when in comparison with different Transformer-based fashions. In our little train, we noticed that it additionally achieved higher performances than N-BEATS and N-HiTS.

Whereas this doesn’t imply that it’s higher than N-HiTS or N-BEATS, it stays an attention-grabbing choice when forecasting on a protracted horizon.

Thanks for studying! I hope that you just loved it and that you just discovered one thing new!

Cheers 🍻

A Time Series is Worth 64 Words: Long-Term Forecasting with Transformers by Nie Y., Nguyen N. et al.

Neuralforecast by Olivares Okay., Challu C., Garza F., Canseco M., Dubrawski A.

3 Methods to Embed a Matplotlib Chart into an HTML Web page | by Angelica Lo Duca | Jun, 2023

The way to Create Lovely Age Distribution Graphs With Seaborn and Matplotlib (Together with Animation) | by Oscar Leo | Jun, 2023