Hello everybody, this is my first post!
What picture do you imagine when you read yet another article about sales forecasting? Something like this, with stable seasonality, a clear trend, and 10 years of history available?
With SPSS Modeler, it is easy to make a very accurate forecast for such a series: just use the "Time Series" node with "Expert Modeler" enabled.
This article is different. Look at this diagram.
This diagram shows the primary sales amount from a manufacturing company to a regional distributor. In this article, I will show you how to build a forecast model for such a time series using Planning Analytics (PA), SPSS Modeler, and Python.
Planning Analytics part
… is very easy.
We have a "Sales" cube with the dimensions Product, Store, Week, and Measures.
There are two measures:
- "Sales", holding actual sales;
- "Sales forecast", to which we will export the forecast from SPSS.
The server data can be downloaded here: https://github.com/Dvoynev/Blog/tree/main/Iterative%20forecast%20model/PA
SPSS Modeler part
Let's look closer at figure 2. We can just barely recognize a weekly seasonality within each year, and the peaks spoil it.
If we select the year 2017 and look closer at these peaks, we will see that there is no seasonality in them: the distances between them vary a lot.
These peaks are wholesale orders, and we can assume that a wholesale order appears when the distributor's stock gets close to the irreducible stock balance. So, in addition to the obvious "from_date" features (week number in month, month number and others), we will try "history" features (sales N months ago) and "moving sum" features.
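To make the "history" and "moving sum" features concrete, here is a minimal pure-Python sketch of the kind of columns the stream computes (the function name, lag list and window size are my own illustrative choices, not taken from the stream):

```python
def make_features(sales, lags=(1, 2, 52), window=4):
    """Build "history" (lag) and "moving sum" features for a weekly series.

    sales: list of weekly sales values, oldest first.
    Returns one dict of features per week; a feature is None until
    enough history exists to compute it.
    """
    rows = []
    for t, value in enumerate(sales):
        row = {"sales": value}
        # "history" features: sales 1, 2, ... weeks ago
        for lag in lags:
            row["sales_lag_%d" % lag] = sales[t - lag] if t >= lag else None
        # "moving sum" over the previous `window` weeks, deliberately
        # excluding week t itself (including it would leak the target)
        row["moving_sum_%d" % window] = sum(sales[t - window:t]) if t >= window else None
        rows.append(row)
    return rows
```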
A simple SPSS Modeler stream with predictors described can be downloaded here: https://github.com/Dvoynev/Blog/tree/main/Iterative%20forecast%20model/SPSS
It performs the following tasks:
- Load actual sales from PA;
- Split it into "train" and "test" partitions (the validation start and test length are set in stream parameters);
- Perform feature engineering:
  - Create "date" features: week, month and so on;
  - Create "history" features: previous weeks' values and moving sums;
- Make a forecast with XGBoost.
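The partition split driven by those stream parameters might look like this in plain Python (the exact train/test/validation layout is my assumption, not something spelled out in the stream):

```python
def assign_partition(week_index, validation_start, test_length):
    """Label a week as train / test / validation.

    Assumed layout: train weeks first, then `test_length` test weeks,
    then the validation (future forecast) weeks from `validation_start` on.
    """
    if week_index >= validation_start:
        return "validation"
    if week_index >= validation_start - test_length:
        return "test"
    return "train"
```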
Let's look at the resulting forecast and brace ourselves:
Accuracy is rather good in the test partition, but it degrades dramatically in the validation partition. This is because of our "history" features:
they stop working in the validation partition, where there are no actual sales yet. We need to use previous weeks' forecasts (instead of actual sales) to compute the "history" features in the validation partition.
To do this, we could create 52 model nuggets, one for each week of the year, like this:
Stop, no! This looks ugly. We need something more elegant. We need the…
We will use:
- Python to loop through all forecast weeks and make a 1-week-forward forecast for each of them;
- Planning Analytics as a buffer to store the forecasts.
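The core idea of the loop, stripped of all Modeler and PA specifics, fits in a few lines of Python (`predict_one` is a hypothetical stand-in for the trained XGBoost nugget):

```python
def iterative_forecast(history, n_weeks, predict_one):
    """Roll the forecast forward one week at a time.

    history: actual weekly sales, oldest first.
    predict_one: callable mapping the series so far to next week's value.
    Each forecast is appended to the series, so the "history" features
    for later weeks are computed from forecasts instead of missing actuals.
    """
    series = list(history)
    forecasts = []
    for _ in range(n_weeks):
        next_value = predict_one(series)   # 1-week-forward forecast
        forecasts.append(next_value)
        series.append(next_value)          # the forecast becomes "history"
    return forecasts

# Toy model that just repeats the last observed value:
print(iterative_forecast([10, 20, 30], 2, lambda s: s[-1]))  # [30, 30]
```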
First, we build a stream that forecasts a single week. The stream can be downloaded here: https://github.com/Dvoynev/Blog/tree/main/Iterative%20forecast%20model/SPSS
We just add three improvements to the previous stream:
- Before building the "history" features, add a "tm1 import" node to import previous weeks' forecasts and replace zero actual sales with those forecasts;
- Keep only one forecast week (set in a parameter) before the model nugget;
- Create a view with all the forecast periods in Planning Analytics. It will be used to loop through the weeks.
The resulting stream will look like this:
Finally, the Python part begins. In short, the script will:
- Train XGBoost model;
- Load forecast periods to list;
- Loop through periods to forecast each week.
- First, we define the stream variable and get the nodes by ID:
There is a very useful button to get all the needed node IDs in SPSS Modeler:
- Then we run the table node to list all forecast periods
The result will be as follows:
- The "ForecPeriod" parameter is used to calculate the partitions. We set the first period listed in the table node as the first forecast period, and then train the model once.
- Finally, we loop through all forecast periods, putting each one into the "ForecPeriod" parameter, and export the forecasts to TM1.
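Putting the steps above together, the script's skeleton might look like this. This is a sketch, not the stream's actual script: it only runs inside Modeler's embedded Python scripting engine, the node IDs are placeholders you would replace with your own, and the model-training step is only hinted at in a comment.

```python
import modeler.api

stream = modeler.script.stream()
table_node = stream.findByID("id_TABLE_NODE")    # placeholder: table of forecast periods
export_node = stream.findByID("id_TM1_EXPORT")   # placeholder: TM1 export terminal node

# Run the table node and collect the forecast periods from its output.
results = []
table_node.run(results)
rowset = results[0].getRowSet()
periods = [rowset.getValueAt(row, 0) for row in range(rowset.getRowCount())]

# Set the first forecast period so the partitions are computed correctly,
# then train the model once (by running the model-building node, not shown).
stream.setParameterValue("ForecPeriod", periods[0])

# Loop: each pass forecasts one week and exports it to TM1, where the
# "tm1 import" node picks it up as "history" on the next pass.
for period in periods:
    stream.setParameterValue("ForecPeriod", period)
    export_node.run([])
```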
Do not forget to check "Run this script" on stream execution.
Finally, we have an accurate primary sales forecast:
- A History node was used here for simplicity. If you have data for several customers, products and so on, you should not use the "History" node: it will pull in other products'/clients' data. Use several Derive nodes with @OFFSET instead (checking that the client/product is the same at that offset).
- The "@SUM" nodes use the "Sales__1" field (sales with offset 1) because the @SUM function also includes the current period in the sum. Including the current period's value in a feature is called "data leakage" and leads to overfitting.
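A tiny numeric illustration of that leakage (the numbers are made up):

```python
# Hypothetical 4-week series; we build a 2-week moving-sum feature for week t = 2.
sales = [10, 20, 30, 40]
t = 2

leaky = sum(sales[t - 1:t + 1])  # 20 + 30 = 50: includes week t's own sales, the target
safe = sum(sales[t - 2:t])       # 10 + 20 = 30: lagged values only, like using Sales__1

print(leaky, safe)  # 50 30
```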
- I do not describe time series forecasting theory and practice here; this is just an example of how to use PA + SPSS + Python for the task.
A lot more is needed for a good time series forecast: feature selection, data preparation, the use of promotion data, external data (like weather) and many other things.
This duplicates the functionality of the "split" role, but it is the only way to make forecasts with the "Time Series" node when there are many splits: when the TS node encounters a time series of poor quality, the whole Time Series node fails, for all the split models.
There is an RFE for this, please vote: https://ibm-data-and-ai.ideas.aha.io/ideas/MDLR-I-349