In this article, I attempt to compare the results of the auto arima function with the ARIMA model we developed in the article Forecasting Time Series with ARIMA (https://www.alldatascience.com/time-series/forecasting-time-series-with-arima/).
I made this attempt to see how it works and what the differences are.The parameters selected by auto-arima are slightly different than the ones selected by me in the other article.
Auto arima has the advantage of attempting to find the best ARIMA parameters by comparing the AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) of the tested models, but as demonstrated in this test, it is not always able to do so; the results, as described later, are very similar to those obtained by running the ARIMA algorithm and estimating the parameters «manually,» if not slightly worse.
The first steps in the notebook (data acquisition and cleaning) are the same as in the original article.
Let's import the necessary libraries
import pandas as pd
import glob
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf, plot_acf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import pacf
Mount google drive.
from google.colab import drive
drive.mount('/content/drive')
Let's import the data, before doing it I've had to remove the double quotes from the csv file.
data = pd.read_csv('/content/drive/MyDrive/DP_LIVE_10082021082010744.csv', parse_dates=['TIME'], index_col='TIME', header=0,sep=",")
data.head()
Let's remove the unnecessary columns, we are only interested in TIME and Value
data.drop(['LOCATION','INDICATOR','SUBJECT','MEASURE','FREQUENCY','Flag Codes'],axis=1,inplace=True)
Let's plot the data
data.plot(figsize=(15,5))
#In google colab we should install pmdarima in order to use it.
!pip install pmdarima
Let's try it. First we should split the data into train and test sets.
TEST_SIZE = 36
train, test = data.iloc[:-TEST_SIZE], data.iloc[-TEST_SIZE:]
x_train, x_test = np.array(range(train.shape[0])), np.array(range(train.shape[0], data.shape[0]))
train.shape, x_train.shape, test.shape, x_test.shape
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
ax.plot(x_train, train)
ax.plot(x_test, test)
from pmdarima.arima import auto_arima
Let's fit the auto_arima model.
model = auto_arima(train, start_p=1, start_q=1,
test='adf',
max_p=5, max_q=5,
m=1,
d=1,
seasonal=False,
start_P=0,
D=None,
trace=True,
error_action='ignore',
suppress_warnings=True,
stepwise=True)
According to the algorithm the best ARIMA model is ARIMA (2,1,1)
model.summary()
According to the model summary, the model meets the condition of independence in the residuals (no correlation) because the p-value of the Ljung-Box test (Prob(Q)) is greater than 0.05, so we cannot reject the null hypothesis of independence, but we cannot say that the residual distribution is homoscedastic (constant variance) because the p-value of the Heteroskedasticity test (Prob(H)) is smaller than 0.05.
Let's make the prediction
# Forecast
prediction, confint = model.predict(n_periods=TEST_SIZE, return_conf_int=True)
prediction
cf= pd.DataFrame(confint)
#Mostramos la gráfica con la predicción de los 2 últimos años en naranja
#sobre la serie real
prediction_series = pd.Series(prediction,index=test.index)
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
ax.plot(data.Value)
ax.plot(prediction_series)
ax.fill_between(prediction_series.index,
cf[0],
cf[1],color='grey',alpha=.3)
model.plot_diagnostics(figsize=(14,10))
plt.show()
We can see from the model plots that the Correlogram does not show any significant correlation in the residuals. The standardized residual plot depicts variance change, while the normal Q-Q plot demonstrates that the residuals do not follow a normal distribution (but this is not a strict requirement to validate the model).
Let's calculate the SMAPE (Symmetric Mean Absolute Error)
def calcsmape(actual, forecast):
return 1/len(actual) * np.sum(2 * np.abs(forecast-actual) / (np.abs(actual) + np.abs(forecast)))
smape=calcsmape(test.Value,prediction)
smape
Now with 1 year prediction
TEST_SIZE = 12
train, test = data.iloc[:-TEST_SIZE], data.iloc[-TEST_SIZE:]
x_train, x_test = np.array(range(train.shape[0])), np.array(range(train.shape[0], data.shape[0]))
train.shape, x_train.shape, test.shape, x_test.shape
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
ax.plot(x_train, train)
ax.plot(x_test, test)
Let's fit the new model.
model = auto_arima(train, start_p=1, start_q=1,
test='adf',
max_p=5, max_q=5,
m=1,
d=1,
seasonal=False,
start_P=0,
D=None,
trace=True,
error_action='ignore',
suppress_warnings=True,
stepwise=True)
model.summary()
This time, the p-value of the Heteroskedasticity test is greater than 0.05, indicating that the model explains better the variance in the data. This model outperforms the previous one in terms of dependability. This demonstrates that ARIMA is superior for short-term forecasting.
Let's predict!
prediction, confint = model.predict(n_periods=TEST_SIZE, return_conf_int=True)
prediction
cf= pd.DataFrame(confint)
prediction_series = pd.Series(prediction,index=test.index)
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
ax.plot(data.Value)
ax.plot(prediction_series)
ax.fill_between(prediction_series.index,
cf[0],
cf[1],color='grey',alpha=.3)
model.plot_diagnostics(figsize=(14,10))
plt.show()
Again, the plots show that the residuals are not correlated. The standardized residuals model seems similar to the previous one but as we have seen in the model summary this model is more reliable because we can't reject the null hypothesis of homoscedasticity (constant variance).
smape=calcsmape(test.Value,prediction)
smape
Conclusion
As we have seen in the notebook the results are slightly different than in the original attempt. The parameters selAs we can see from the notebook, the results differ slightly from the first attempt. The parameters chosen by auto-arima differ. The results are very similar, though slightly worse with the auto-arima model, as evidenced by the SMAPE metric.
Again, the 3-year prediction produces a higher SMAPE value, but the heteroscedasticity test reveals that this model does not account for all of the variance in the data. The 1-year prediction is more reliable because it meets all of the conditions in a valid ARIMA model (non-correlation and constant variance in the residuals).
This has been a really interesting project that has helped me better understand how ARIMA works and how to handle real-world data in order to make a prediction.