A time series is a sequence of values of the same statistical indicator arranged in the chronological order in which they occur. The main purpose of time series analysis is to predict the future based on existing historical data. An overview of IBM SPSS time series algorithms can be found in the blog "Guidance for IBM SPSS Time Series Analysis Methods".
The Autoregressive Integrated Moving Average (ARIMA) model is a typical time series model that can be used to forecast future values, and it provides more sophisticated methods for modeling trend and seasonal components than exponential smoothing models do. The family includes the autoregressive (AR), moving average (MA), and autoregressive moving average (ARMA) models.
In practical terms, ARIMA models are most useful if you want to include predictors that might help to explain the behavior of the series that is being forecast, such as the number of catalogs that are mailed or the number of hits to a company web page.
User Scenario
A catalog company is interested in forecasting next year's monthly sales based on its sales data from the last 10 years. It also plans to allocate resources where they do the most good, but it doesn't know which areas impact sales the most. The catalog_seasfac.csv file contains the number of catalogs mailed, the number of pages in the catalog, the number of phone lines open for ordering, the amount spent on print advertising, and the number of customer service representatives. These five features impact monthly sales of the company's men's clothing line, and the company wants to identify the two features with the biggest impact on sales.
Tom is a data analyst at this company, and he plans to analyze the data and identify the top two features. Because forecasting is time related, a time series algorithm is his first choice. Tom is familiar with Python and notebooks, so he uses an IBM Watson Studio notebook with a Python 3.5 kernel and Spark 2.1.
Here the input data is catalog_seasfac.csv. (The corresponding catalog_seasfac.sav file is also available from the Demos directory of any IBM® SPSS® Modeler installation, and the catalog_forecast.str file is in the streams directory.)
Data Preparation
To prepare the data:
- From the Watson Studio notebook, click Find and add data to upload input data catalog_seasfac.csv.
- Click catalog_seasfac.csv and select Insert SparkSession DataFrame from Insert to Code. The notebook inserts the code to create the Spark DataFrame automatically, and Tom then updates the code as follows to specify the input data schema:
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder.getOrCreate()

df_schema = StructType([
    StructField("date", TimestampType(), True),
    StructField("men", DoubleType(), True),
    StructField("women", DoubleType(), True),
    StructField("jewel", DoubleType(), True),
    StructField("mail", DoubleType(), True),
    StructField("page", DoubleType(), True),
    StructField("phone", DoubleType(), True),
    StructField("print", DoubleType(), True),
    StructField("service", DoubleType(), True),
    StructField("YEAR_", DoubleType(), True),
    StructField("MONTH_", DoubleType(), True),
    StructField("DATE_", StringType(), True),
    StructField("Seasonal_Err_Men", DoubleType(), True),
    StructField("Seasonal_AdjSer_Men", DoubleType(), True),
    StructField("Seasonal_Factors_Men", DoubleType(), True),
    StructField("Seasonal_TrendCycle_Men", DoubleType(), True)])

df_data_1 = spark.read\
    .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
    .option('header', 'true')\
    .schema(df_schema)\
    .load(cos.url('catalog_seasfac.csv', '*****************************'))

print(df_data_1.schema)
df_data_1.show(5)
```
Data Structure
There are 16 fields in the input data catalog_seasfac.csv. Tom is interested in the following fields:
- men: the sales of the men's clothing line
- mail: the number of catalogs mailed
- page: the number of pages in the catalog
- phone: the number of phone lines open for ordering
- print: the amount spent on print advertising
- service: the number of customer service representatives
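Independently of Spark, a quick way to sanity-check that a CSV exposes the expected columns is the standard library's csv module. A minimal sketch; the two data rows below are invented for illustration, not taken from catalog_seasfac.csv:

```python
import csv
import io

# Invented two-row sample standing in for catalog_seasfac.csv.
sample = io.StringIO(
    "date,men,mail,page,phone,print,service\n"
    "1989-01-01,4091.1,7469.0,84.0,28.0,16123.0,26.0\n"
    "1989-02-01,3459.0,8116.0,80.0,26.0,13390.0,24.0\n"
)

reader = csv.DictReader(sample)
wanted = ["men", "mail", "page", "phone", "print", "service"]

# Keep only the columns of interest, converted to numbers;
# a KeyError or ValueError here would flag a malformed file.
rows = [{k: float(r[k]) for k in wanted} for r in reader]
print(len(rows), sorted(rows[0]))
```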
Tom also prints the top five rows of the Spark DataFrame from input data to check whether the data is loaded correctly:
![InputSparkDataFrame InputSparkDataFrame](https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/0f31ecee-4c5b-4931-8106-31135c9f2469.jpg)
Time Series Plot
Tom plots the time series first and he wants to know:
- Does the series have an overall trend? If so, does the trend appear constant or does it appear to be dying out with time?
- Does the series show seasonality? If so, do the seasonal fluctuations seem to grow with time or do they appear constant over successive periods?
The following figure shows the time series plot of actual sales of men's clothing in the past 10 years:
![TimeSeriesPlot TimeSeriesPlot](https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/f0b2e094-a03b-4fe5-9a7e-124ad7e2ebe1.jpg)
From the figure above, the time series shows a general upward trend; that is, the series values tend to increase over time. The upward trend is seemingly constant, which indicates a linear trend.
The series also has a distinct seasonal pattern with annual highs in December, as indicated by the vertical lines on the graph. The seasonal variations appear to grow with the upward series trend, which suggests multiplicative rather than additive seasonality.
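These visual checks can also be approximated numerically. Below is a small, library-free sketch using an invented monthly series with a linear trend and December peaks proportional to the level; it illustrates the two checks, and is not the catalog data:

```python
# Invented 5 years of monthly sales: linear trend plus a December
# peak proportional to the current level (multiplicative seasonality).
sales = []
for year in range(5):
    for month in range(12):
        base = 100.0 + 10.0 * (12 * year + month)   # upward linear trend
        peak = 0.5 * base if month == 11 else 0.0   # December spike
        sales.append(base + peak)

# Trend check: the average month-to-month change is positive.
diffs = [b - a for a, b in zip(sales, sales[1:])]
avg_change = sum(diffs) / len(diffs)

# Seasonality check: the December jump grows in absolute terms ...
dec_spikes = [sales[12 * y + 11] - sales[12 * y + 10] for y in range(5)]
# ... but stays roughly constant relative to the level, which is the
# signature of multiplicative rather than additive seasonality.
dec_ratios = [sales[12 * y + 11] / sales[12 * y + 10] for y in range(5)]

print(round(avg_change, 1), [round(s) for s in dec_spikes])
```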
From the time series plot, the series could be modeled with an ARIMA model, an exponential smoothing (ES) model, or other models. Because Tom also needs to know the impact of the predictors, he builds the model with ARIMA.
Build Model
In the current design of the SPSS algorithms, raw input data must first be processed by Time Series Data Preparation (TSDP) into a format that the ARIMA model and other SPSS time series models can consume. Reverse Time Series Data Preparation (RTSDP) converts the output of ARIMA and the other SPSS time series models back into a readable format like the raw data.
Unlike in Watson Studio, in SPSS Modeler and SPSS Statistics TSDP and RTSDP are embedded in the products and are transparent to users.
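TSDP and RTSDP themselves are part of the SPSS library, but the idea is easy to picture: TSDP aligns the raw rows onto a regular time grid so that the model sees exactly one value per interval, and RTSDP maps the model output back to row form. A toy standard-library analogy (the data and field layout are invented):

```python
from datetime import date

# Irregular raw rows: one month (March) is missing.
raw = [(date(1998, 1, 1), 5327.0),
       (date(1998, 2, 1), 4654.0),
       (date(1998, 4, 1), 4719.0)]

# "TSDP": place values on a fixed monthly grid, padding gaps with None.
start, months = date(1998, 1, 1), 4
grid = {(start.year, start.month + m): None for m in range(months)}
for d, v in raw:
    grid[(d.year, d.month)] = v
series = [grid[k] for k in sorted(grid)]

# "RTSDP": map the regular series back to (timestamp, value) rows.
rows = [(date(1998, m + 1, 1), v) for m, v in enumerate(series)]
print(series)
```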
So Tom uses TSDP to convert the raw input data first. In the code below, tsdpOutput is the transformed DataFrame:
```python
lcm = LocalContainerManager()
tsdp = TimeSeriesDataPreparation(lcm). \
    setMetricFieldList(["men", "mail", "page", "phone", "print", "service"]). \
    setDateTimeField("date"). \
    setEncodeSeriesID(False). \
    setInputTimeInterval("MONTH"). \
    setOutTimeInterval("MONTH")
tsdpOutput = tsdp.transform(df_data_1)
```
Based on his experience, Tom selects mail, page, phone, print, and service as predictors, which he believes impact the target men. He also calls setCalcPI(True) to check which predictor impacts the target the most. He tries (p=1, d=0, q=1) for the non-seasonal orders and (seasonalP=1, seasonalD=0, seasonalQ=1) for the seasonal orders, because he is not sure which values are best. To compare the original values with the predicted values in the prediction output, he calls setOutInputData(True). He builds the ARIMA model with the following settings:
```python
arima = TimeSeriesForecastingArima(lcm). \
    setInputContainerKeys([tsdp.uid]). \
    setTargetPredictorList([Predictor(
        targetList=[["men"]],
        predictorIncludeList=[["mail"], ["page"], ["phone"], ["print"], ["service"]])]). \
    setTargetOrderList([TargetOrderList(
        targetList=[["men"]],
        nonSeasonal=[1, 0, 1],
        seasonal=[1, 0, 1],
        transType="none")]). \
    setBuildScoringModelOnly(False). \
    setOutInputData(True). \
    setCalcPI(True)
arima_model = arima.fit(tsdpOutput)
```
The following list defines and explains p, d and q:
- Autoregressive (p): The number of autoregressive orders in the model. Autoregressive orders specify which previous values from the series are used to predict current values. 1 specifies that the value of the series one time period in the past is used to predict the current value.
- Difference (d): Specifies the order of differencing applied to the series before estimating models. Differencing is necessary when trends are present and is used to remove their effect. 0 specifies no differencing is considered here.
- Moving Average (q): The number of moving average orders in the model. Moving average orders specify how deviations from the series mean for previous values are used to predict current values. 1 specifies that the deviation from the series mean one time period in the past is considered when predicting the current value.
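The effect of d is the easiest of the three to see in isolation. The following library-free sketch (not the SPSS implementation) shows that first differencing removes a linear trend, after which a one-step AR-style forecast of the differences can be mapped back to the original scale; the coefficient values are invented:

```python
# A series with a pure linear trend: y_t = 3 + 2t.
y = [3.0 + 2.0 * t for t in range(10)]

# d = 1: replace the series with its first differences.
d1 = [b - a for a, b in zip(y, y[1:])]

# The trend is gone: every differenced value equals the slope, 2.0.
print(d1)

# On the stationary differenced series, an AR(1) forecast (p = 1)
# is "constant + coefficient * previous value"; phi is invented here.
phi, const = 0.0, 2.0
forecast_diff = const + phi * d1[-1]      # next difference
forecast_level = y[-1] + forecast_diff    # undo the differencing
print(forecast_level)
```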
Analysis of the Model
After building the ARIMA model, Tom analyzes it. He runs the following code:
![ModelOutput ModelOutput](https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/35a7c906-db55-4071-94d8-2748d95e6805.jpg)
This shows that a total of 10 outputs are available from the model he built. Tom runs the following code to print all of them:
```python
containers = arima_container.containers()
print("containers[0] 0.xml:\n" + containers[0].containerEntry('0.xml').stringContent() + "\n")
print("containers[0] TSDPOutput.json:\n" + containers[0].containerEntry('TSDPOutput.json').stringContent() + "\n")
print("containers[0] datamodel.xml:\n" + containers[0].containerEntry('datamodel.xml').stringContent() + "\n")
print("containers[0] 1.es.xml:\n" + containers[0].containerEntry('1.es.xml').stringContent() + "\n")
print("containers[0] 2.es.xml:\n" + containers[0].containerEntry('2.es.xml').stringContent() + "\n")
print("containers[0] 3.es.xml:\n" + containers[0].containerEntry('3.es.xml').stringContent() + "\n")
print("containers[0] 4.es.xml:\n" + containers[0].containerEntry('4.es.xml').stringContent() + "\n")
print("containers[0] 5.es.xml:\n" + containers[0].containerEntry('5.es.xml').stringContent() + "\n")
print("containers[1] StatXML.xml:\n" + containers[1].containerEntry('StatXML.xml').stringContent() + "\n")
print("containers[1] 0.xml:\n" + containers[1].containerEntry('0.xml').stringContent() + "\n")
```
To inspect the printed output more easily, Tom copies its contents and opens them in a text editor. He then checks the PMML file 0.xml, which is used for scoring, from the first output:
![PMML_0_XML PMML_0_XML](https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/e0ef4efb-a445-4d20-815c-cc5f0f511cff.jpg)
From the following DataModel, Tom learns that "men" corresponds to encoding "0", "mail" to encoding "1", and "service" to encoding "5".
![DataModel DataModel](https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/9545052d-80b9-4a86-81ba-dc406c6cd754.jpg)
In the MiningSchema section of the PMML file 0.xml above, Tom finds that the predictor "mail" has the biggest impact on the target, because its importance value 0.4520397919959903 is the largest of the five predictors, and that "phone" is the second-biggest contributor to the target "men". See the following predictor importance plot.
The predictor importance chart helps you focus your modeling efforts on the predictor fields that matter most. It reports the relative importance of each predictor estimated while building the model; consider dropping or ignoring the predictor fields that matter least.
![PredictorImportance PredictorImportance](https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/ec3f2a78-b4bc-4609-ae5a-55b7dc3a7da4.jpg)
Tom checks the model building settings in the StatXML.xml file. He set only some of the settings explicitly; all the others use their default values:
![StatXML StatXML](https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/467fe175-76a1-4555-80ad-fee4ca61bbdc.jpg)
Tom checks the StatXML file 0.xml to evaluate the model. He first plots the autocorrelation (ACF) and partial autocorrelation (PACF) values to check the orders of the ARIMA model. The ACF and PACF values come from the MultivariateData section of the 0.xml file, and the plots show that the current orders are acceptable.
ACF and PACF are measures of association between current and past series values. They indicate which past series values are most useful in predicting future values.
![ACP_Plot ACP_Plot](https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/0de36fdc-ddf2-4254-a584-99e2a6c09cb1.jpg)
![PACF_Plot PACF_Plot](https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/9114a05e-014c-4227-9b71-d5898bbf68bb.jpg)
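For reference, the lag-k sample autocorrelation can be computed by hand. The sketch below shows the usual textbook definition; it is not necessarily how SPSS computes the values in 0.xml, and PACF, which removes the influence of intermediate lags (for example via the Durbin-Levinson recursion), is omitted for brevity:

```python
def acf(x, k):
    """Sample autocorrelation of series x at lag k."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k))
    return cov / var

# A strongly alternating series has a large negative lag-1 ACF
# and a large positive lag-2 ACF.
x = [1.0, -1.0] * 6
print(round(acf(x, 1), 3), round(acf(x, 2), 3))
```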
In the ModelDiagnostics section, Tom checks the statistical information of the model. The RSquared value is 0.8333228451390295, which is close to 1. Other statistics such as MAPE and MaxAPE also look good, so Tom considers the model acceptable and uses it to predict the sales.
![StatXML_0_XML StatXML_0_XML](https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/ce4113d7-15e6-45b5-b402-3a345af98610.jpg)
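The fit statistics read from ModelDiagnostics are straightforward to reproduce from actual and fitted values. The sketch below uses the common textbook definitions of R-squared, MAPE, and MaxAPE on an invented series; SPSS may apply adjustments, so treat it as an approximation:

```python
def r_squared(actual, fitted):
    # 1 minus (residual sum of squares / total sum of squares).
    mean = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def ape(actual, fitted):
    # Absolute percentage error at each point, in percent.
    return [abs(a - f) / abs(a) * 100.0 for a, f in zip(actual, fitted)]

# Invented actual and fitted values for illustration.
actual = [100.0, 120.0, 140.0, 160.0]
fitted = [102.0, 118.0, 143.0, 156.0]

errors = ape(actual, fitted)
mape = sum(errors) / len(errors)   # mean APE
maxape = max(errors)               # worst-case APE
print(round(r_squared(actual, fitted), 3), round(mape, 2), round(maxape, 2))
```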
Prediction
Tom wants to know how the sales of men's clothing will change over the next year, so he specifies forecastSpan=12. He also wants the Lower Confidence Interval (LCI) and the Upper Confidence Interval (UCI) of each predicted value, which give its lower and upper boundaries. He sets outCI=True and runs the following code to predict the sales:
```python
transformer = arima_model.setCILevel(0.95). \
    setTargets(ScorePredictor(targetIncludedList=[["men"]])). \
    setForecast(ForecastEs(outForecast=True, forecastSpan=12, outCI=True))
predictions = transformer.transform(tsdpOutput)

rtsdp = ReverseTimeSeriesDataPreparation(lcm). \
    setInputContainerKeys([tsdp.uid]). \
    setDeriveFutureIndicatorField(True). \
    setDerivePartOfTimeField(True)
score = rtsdp.transform(predictions)
score.show(score.count())
```
The predictions are displayed below. Because Tom specified setDeriveFutureIndicatorField(True) in RTSDP, the output contains a flag field $FutureFlag whose value 1 marks rows that lie in the future; there are 12 such rows, one for each month of the forecast year. The $TS-men field holds the predicted sales of men's clothing.
![Score_Head Score_Head](https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/6ce966a2-c265-4a8c-bac0-74fea989925c.jpg)
……
![Score_Tail Score_Tail](https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/92b83542-94a4-4879-9d7e-ad4b3a93f265.jpg)
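Independent of Spark, picking out the forecast rows from such output is just a filter on $FutureFlag. A plain-Python sketch over rows represented as dicts (all values invented; the real output is a Spark DataFrame):

```python
# Scored output rows: historical rows carry $FutureFlag = 0,
# forecast rows carry $FutureFlag = 1 (values invented).
score_rows = [
    {"date": "1998-11-01", "$FutureFlag": 0, "$TS-men": 25715.0},
    {"date": "1998-12-01", "$FutureFlag": 0, "$TS-men": 55334.0},
    {"date": "1999-01-01", "$FutureFlag": 1, "$TS-men": 21158.0},
    {"date": "1999-02-01", "$FutureFlag": 1, "$TS-men": 23048.0},
]

# Keep only the future (forecast) rows.
forecast = [r for r in score_rows if r["$FutureFlag"] == 1]
print(len(forecast), [r["date"] for r in forecast])
```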
The following plot compares the actual sales with the predicted values in $TS-men. The forecast for 1999 looks good; as expected, after the December peak in 1998 there is a return to normal sales levels in the first half of 1999 and a steady upward trend in the second half of the year. The total sales in 1999 are larger than in the previous year.
![Prediction_Plot Prediction_Plot](https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/cc91a986-a980-4572-917c-1f675608306f.jpg)
The sample notebook is available in ARIMA_Notebook.
Details of the ARIMA Python API can be found at the following link:
Conclusion
Tom completes his forecasting of the sales of men’s clothing for the next year by using the ARIMA model. He can predict that sales will increase in the following year. The main contributors to the increase in sales (“men”) are the catalogs that the company mails (“mail”) and the phone lines open for completing orders (“phone”). This information helps his company prioritize resources and get the maximum benefits.
Locating IBM SPSS ARIMA
You can also find SPSS ARIMA in SPSS Modeler or SPSS Statistics.
SPSS Modeler
To locate SPSS ARIMA:
- In SPSS Modeler, select the Time Series node.
- From the Build Options tab, specify ARIMA.
![Modeler_ARIMA Modeler_ARIMA](https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/f3fe89b9-20f4-4c69-b57b-9225487915ea.jpg)
Detailed information can be found in the IBM SPSS Modeler online help.
SPSS Statistics
To locate SPSS ARIMA:
- In SPSS Statistics, select Analyze > Forecasting > Create Traditional Models.
- Select ARIMA.
![Statistics_ARIMA Statistics_ARIMA](https://dw1.s81c.com/IMWUC/MessageImages/TinyMce/40742b18-3530-4ea0-947b-d22c09afcf0c.jpg)
Detailed information can be found in the IBM SPSS Statistics online help.