A trip down memory lane: from Box-Jenkins to InstantML
It’s been 50 years since ARIMA models were introduced to the public. It’s been 50 years since George Box and Gwilym Jenkins contributed to a great leap forward in the field of time series analysis. What has happened since then? Have we moved on from their findings, or has time seemingly been standing still? How much do the current cutting-edge algorithms for time series analysis – think InstantML – still rely on their findings? As a birthday celebration, let’s take a trip down memory lane.
The Box-Jenkins method
In 1970, George Box and Gwilym Jenkins published “Time Series Analysis: Forecasting and Control”, considered to be a fundamental contribution to the field of time series analysis. Gregory Reinsel and Greta Ljung joined as co-authors in later editions. The Box-Jenkins method has since become a canonical technique for identifying and estimating the parameters of an ARIMA model, which can then be used to forecast the next values of many different kinds of time series. (Nielsen, 2019, pp. 1–3; Scott, 2019; Wilson, 2016, p. 709)
1970 might sound like a long time ago, but the methodology they came up with is still taught in universities today, and is still applied across a wide range of use cases, which speaks to the significance of this contribution. Of course, today’s computational power has allowed the calculation to be more automated and much faster than was possible at the time of publishing, but the underlying mathematics are still the same.
Quick overview: ARIMA
It’s clear that ARIMA models are of great importance, but what are those models, really? ARIMA stands for Auto-Regressive Integrated Moving Average; we will take a look at each of those components separately:
- Integrated: The methodology requires the data to be stationary. Loosely speaking, this means that statistical properties such as the mean and variance stay constant over time, so there is no (upward or downward) trend present in the data. To accomplish this, the time series may be differenced (each value is replaced by its change from the previous value) as many times as necessary. Since the differenced series has to be summed up again, or ‘integrated’, to get back to the original series, the model is called ‘integrated’. The so-called order of integration expresses the number of times the data is differenced before becoming stationary. (Nau, 2014; Wikipedia contributors, 2021)
- Auto-regressive: The auto-regressive component looks at lags (i.e. past values) of the data itself. As a concrete example, imagine the temperature is measured each day. To forecast tomorrow’s temperature, one might look at the temperature that was measured today. Today’s temperature is one ‘sample’ or one observation before tomorrow’s temperature, and thus we are looking at a lag of one of the temperature. The order of the auto-regressive component expresses the number of time lags that are taken into account. (Nau, 2014; Wikipedia contributors, 2021)
- Moving Average: The moving average component looks at lags of the forecast errors. Here too, the order expresses the number of time lags that are taken into account. (Nau, 2014; Wikipedia contributors, 2021)
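To make the first two components a bit more tangible, here is a minimal Python sketch under stated assumptions. The data is made up, and the helper names `difference` and `fit_ar1` are purely illustrative and not taken from any library. It differences a trending series once and then fits an auto-regressive component of order 1 by ordinary least squares. The moving average part is left out, since estimating it requires an iterative fit on the forecast errors.

```python
import random


def difference(series, order=1):
    """Difference the series `order` times (the 'I' part of ARIMA)."""
    for _ in range(order):
        series = [curr - prev for prev, curr in zip(series, series[1:])]
    return series


def fit_ar1(series):
    """Least-squares estimate of c and phi in x[t] = c + phi * x[t-1] + error."""
    prev, curr = series[:-1], series[1:]
    mean_prev = sum(prev) / len(prev)
    mean_curr = sum(curr) / len(curr)
    cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var
    return mean_curr - phi * mean_prev, phi


# Made-up data: the step-to-step changes follow an AR(1) process with a
# positive drift, and the observed series is their running total, so it trends.
random.seed(0)
changes = [0.0]
for _ in range(499):
    changes.append(0.5 + 0.6 * changes[-1] + random.gauss(0, 1))
observed = [0.0]
for change in changes:
    observed.append(observed[-1] + change)

stationary = difference(observed, order=1)  # order of integration: 1
c_hat, phi_hat = fit_ar1(stationary)        # auto-regressive order: 1

# Forecast the next change, then undo the differencing to get the next level.
next_change = c_hat + phi_hat * stationary[-1]
next_value = observed[-1] + next_change
print(f"estimated phi: {phi_hat:.2f}, next value: {next_value:.1f}")
```

The estimated coefficient should land close to the 0.6 used to generate the data. In practice one would rely on an established statistics package rather than hand-rolled helpers like these.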
ARIMA models are often extended by adding seasonality to them, to help capture periodic (seasonal) patterns in the underlying data.
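One common ingredient of such seasonal extensions is seasonal differencing: subtracting the value observed one full season earlier. The sketch below illustrates the idea on a made-up weekly pattern. The function name `seasonal_difference` and the numbers are assumptions for illustration only.

```python
def seasonal_difference(series, period):
    """Subtract the value observed one full season (`period` steps) earlier."""
    return [series[t] - series[t - period] for t in range(period, len(series))]


# Made-up daily data: a repeating weekly profile on top of a slow upward slope.
weekly_profile = [5, 3, 3, 4, 6, 9, 10]
series = [weekly_profile[t % 7] + 0.1 * t for t in range(28)]

deseasonalised = seasonal_difference(series, period=7)
# The weekly profile cancels out; what remains is the slope per week (about 0.7).
print(deseasonalised[:3])
```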
Limitations
While the ARIMA methodology has been of great importance to the field, it also has its limitations.
ARIMA models focus on capturing seasonality and trends in time series data, and for that very reason they have a hard time accurately forecasting outliers (Grogan, 2020). Depending on the use case, these exceptional values can be a very important part of the data.
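A small sketch can illustrate the outlier problem. It assumes an AR(1) model with a coefficient of 0.6 has already been fitted, and it uses made-up data with a single injected spike. All names and numbers are hypothetical. The model cannot see the spike coming, and the spike then drags the next forecast away from what actually happens.

```python
import random

PHI = 0.6        # assumed, already-fitted AR(1) coefficient (hypothetical)
SPIKE_AT = 100   # position of the injected outlier

# Made-up data that follows the assumed model exactly, apart from one spike.
random.seed(1)
series = [0.0]
for _ in range(199):
    series.append(PHI * series[-1] + random.gauss(0, 1))
series[SPIKE_AT] += 10.0

# One-step-ahead errors: actual value minus the forecast PHI * previous value.
# errors[i] belongs to time step i + 1.
errors = [series[t] - PHI * series[t - 1] for t in range(1, len(series))]

spike_error = errors[SPIKE_AT - 1]  # the spike itself is missed
after_error = errors[SPIKE_AT]      # the next forecast overshoots
print(f"error at spike: {spike_error:.1f}, error right after: {after_error:.1f}")
```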
Moreover, Box-Jenkins has emerged as a top choice specifically for datasets with low volatility (Scott, 2019). Not all datasets meet this requirement, so additional measures need to be taken to better capture volatility in time series data.
The ARIMA methodology as proposed by Box and Jenkins is a univariate modelling technique. The model is therefore limited to taking into account only the past observations of one variable: the one to be modelled. In reality, many different variables influence each other, so it’s intuitive that including additional variables in the model could improve the model’s results. (Iwok & Okpe, 2016, p. 211)
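The intuition can be sketched with a toy comparison, assuming made-up data in which the target depends on its own previous value and on the previous value of an external `driver` variable. Both models are fitted by plain least squares, without an intercept to keep things short, and are compared on the same data they were fitted on. A proper evaluation would use held-out data. All variable names are illustrative.

```python
import random

random.seed(3)
n = 400
# Made-up data: `driver` is an external variable (say, temperature), and
# `target` depends on its own previous value and the previous driver value.
driver = [random.gauss(0, 1) for _ in range(n)]
target = [0.0]
for t in range(1, n):
    target.append(0.5 * target[t - 1] + 2.0 * driver[t - 1] + random.gauss(0, 0.5))

y = target[1:]       # values to predict
y_lag = target[:-1]  # the target's own lag (all a univariate model sees)
x_lag = driver[:-1]  # lag of the additional variable


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Univariate model: y[t] ~ a * y[t-1]
a_uni = dot(y_lag, y) / dot(y_lag, y_lag)
mse_uni = mse(y, [a_uni * v for v in y_lag])

# Model with the extra variable: y[t] ~ a * y[t-1] + b * x[t-1],
# solved through the 2x2 normal equations.
s_yy, s_xx, s_yx = dot(y_lag, y_lag), dot(x_lag, x_lag), dot(y_lag, x_lag)
det = s_yy * s_xx - s_yx ** 2
a_multi = (dot(y_lag, y) * s_xx - dot(x_lag, y) * s_yx) / det
b_multi = (dot(x_lag, y) * s_yy - dot(y_lag, y) * s_yx) / det
mse_multi = mse(y, [a_multi * v + b_multi * w for v, w in zip(y_lag, x_lag)])

print(f"MSE univariate: {mse_uni:.2f}, MSE with driver: {mse_multi:.2f}")
```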
Lastly, estimation of the parameters of a Box-Jenkins model can be very complicated. The process takes time, and requires a certain level of understanding from the person doing the modelling to avoid overfitting or mis-identification of parameters. The final model can differ (to an extent) based on the intuition and decisions of the person who made it. This prevents a high level of scalability, which many time series forecasting challenges would greatly benefit from. (Nau, 2014; Scott, 2019)
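Part of this identification work can be scripted. The sketch below is one possible approach, not the Box-Jenkins procedure itself and not how any particular product works: it fits pure auto-regressive models of increasing order with the Yule-Walker equations (solved by the Levinson-Durbin recursion) and keeps the order with the lowest Akaike information criterion (AIC). The function names and the made-up AR(2) data are assumptions for illustration. Choosing the differencing and moving average orders would need more than this.

```python
import math
import random


def autocovariances(x, max_lag):
    """Sample autocovariances of x for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    return [sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / n
            for k in range(max_lag + 1)]


def levinson_durbin(r, max_order):
    """Yule-Walker AR fits for orders 0..max_order as (coefficients, noise variance)."""
    phi, var = [], r[0]
    fits = [(phi, var)]
    for k in range(1, max_order + 1):
        kappa = (r[k] - sum(phi[j] * r[k - 1 - j] for j in range(k - 1))) / var
        phi = [phi[j] - kappa * phi[k - 2 - j] for j in range(k - 1)] + [kappa]
        var *= 1 - kappa ** 2
        fits.append((phi, var))
    return fits


def select_ar_order(x, max_order=6):
    """Return the AR order with the lowest AIC, together with its coefficients."""
    fits = levinson_durbin(autocovariances(x, max_order), max_order)
    aic = [len(x) * math.log(var) + 2 * p for p, (_, var) in enumerate(fits)]
    best = min(range(len(aic)), key=aic.__getitem__)
    return best, fits[best][0]


# Made-up stationary data from an AR(2) process with coefficients 0.5 and -0.4.
random.seed(42)
x = [0.0, 0.0]
for _ in range(1000):
    x.append(0.5 * x[-1] - 0.4 * x[-2] + random.gauss(0, 1))
x = x[200:]  # drop the start-up phase

best_order, coefficients = select_ar_order(x)
print(best_order, [round(c, 2) for c in coefficients])
```

AIC is known to favour slightly larger models at times, so the selected order can come out above the true order of 2.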
Cherry-picking
Do the limitations mentioned above give reason to abolish ARIMA models completely? No! We find ourselves in a time with more computing power and more available data than at the time this methodology emerged. This brings many advantages, and allows us to cherry-pick aspects we like from established and well-researched techniques.
With its InstantML technology, TIM (Tangent Works’ Tangent Information Modeller) incorporates different components that are used in ARIMA models into its own models, such as the AR and MA components. TIM doesn’t stop there, but generates many other feature transformations too. On top of that, InstantML is a multivariate modelling technique, and thus builds on more data than just the variable to be forecasted. These changes result in the ability to build better models and handle an even wider range of use cases than before. Thanks to today’s computing power and InstantML, the modelling process can be automated completely. The result is a fast and scalable algorithm building the cutting-edge time series models of today.
As George Box famously said, “All models are wrong, but some are useful.” (Nielsen, 2019, pp. 1–3). Models offer a representation of reality after all, and simplifying the complexity of reality while still capturing the essence of the aspect of interest is exactly what a model should do. The evolution of technology continuously shifts the barrier of the level of complexity that we are able to handle. So who knows what tomorrow will bring?
About the author
Elke Van Santvliet is a Product Manager at Tangent Works. She focuses on bringing TIM’s capabilities to business users, by exposing the underlying functionality through various platforms and tools. This includes Tangent Works’ own web interface TIM Studio, as well as a range of data-related products such as Alteryx, Power BI and Qlik Sense.
Elke is passionate about data in all its aspects, and is always open to discussing the newest trends in AI or diving deep into a specific data science use case.
References
Grogan, M. (2020, September 3). Limitations of ARIMA: Dealing with outliers. Towards Data Science. https://towardsdatascience.com/limitations-of-arima-dealing-with-outliers-30cc0c6ddf33
Iwok, I. A., & Okpe, A. S. (2016). A comparative study between univariate and multivariate linear stationary time series models. American Journal of Mathematics and Statistics, 6(5), 203–212. https://doi.org/10.5923/j.ajms.20160605.02
Nau, R. (2014). Lecture notes on forecasting. Duke Fuqua School of Business. https://people.duke.edu/~rnau/Slides_on_ARIMA_models--Robert_Nau.pdf
Nielsen, A. (2019). Practical time series analysis: Prediction with statistics and machine learning (1st ed.). O’Reilly Media.
Scott, G. (2019, June 27). Box-Jenkins model. Investopedia. https://www.investopedia.com/terms/b/box-jenkins-model.asp#:%7e:text=the%20box%2djenkins%20model%20is,data%20points%20to%20determine%20outcomes.
Wikipedia contributors. (2021, February 19). Autoregressive integrated moving average. Wikipedia. https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average
Wilson, G. T. (2016). Time series analysis: Forecasting and control, 5th edition, by George E. P. Box, Gwilym M. Jenkins, Gregory C. Reinsel and Greta M. Ljung, 2015. Published by John Wiley and Sons Inc., Hoboken, New Jersey, pp. 712. ISBN: 978-1-118-67502-1. Journal of Time Series Analysis, 37(5), 709–711. https://doi.org/10.1111/jtsa.12194