Survival Analysis Part 2 - An Example: How to Predict Future Repair Amount?

By Yushu(Jade) Zhou posted Tue July 02, 2019 04:33 PM

  

Survival Analysis
Part 2 - An Example: How to Predict Future Repair Amount?

Introduction

The idea of survival analysis comes from a businessman, John Graunt. He built a life table with three columns (Age, Died, Survived) to analyze mortality statistics in London. Thirty years after Graunt published his book, Edmund Halley started to apply mathematical representations to the life table, and survival analysis gradually became a mature research field.

Survival analysis is a branch of statistics for analyzing the expected duration of time until one or more events happen, such as the death of a patient or a failure in a mechanical system. It can answer questions like: after a certain time, what percentage of patients will still be alive? However, we don’t need to limit its usage to literal survival problems. In this blog, we will show you how to reshape a repair amount prediction question, usually considered a time series forecasting problem, into a survival analysis structure, and how to fit survival models to get predictions. The data sets come from Kaggle - ASUS Malfunctional Components Prediction. The goal is to predict the future repair amount for each module-component pair at a monthly level.

To familiarize yourself with survival analysis, I recommend learning the basic concepts first. You can refer to my colleague Kunal's blog.

We use the lifelines package for fitting survival models.

 

What data do we have?

Basically, we have three tables:
1) SaleTrain.csv
It contains the date of sale and number of sales for each module-component pair. We need to sum the number of sales grouped by module, component and date of sale.
2) RepairTrain.csv
It contains the date of sale, date of repair and number of repairs for each pair. We need to sum the number of repairs grouped by module, component, date of sale and date of repair. Then we merge the “SaleTrain” and “RepairTrain” tables to get one input table.

Fig 1. Input Table after Preprocessing


3) Output_TargetID_Mapping.csv

Fig 2. Output Table after Preprocessing

As shown in the pictures above, we defined some new columns, including “sale_date” and “repair_date”, that convert the actual dates into a month index. I defined the earliest month in the data sets as month 1 and then calculated the difference in months between each sale/repair date and month 1. This step simplifies the later modeling and predicting process.
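A minimal sketch of that month-index conversion, assuming the raw dates parse with pandas (the function name and sample dates here are illustrative, not from the original notebook):

```python
import pandas as pd

def to_month_index(dates, origin):
    # Convert datetime-like values to a 1-based month index,
    # where `origin` (the earliest month in the data) is month 1.
    p = pd.to_datetime(dates).to_period('M')
    o = pd.to_datetime(origin).to_period('M')
    return (p.year - o.year) * 12 + (p.month - o.month) + 1

# With 2005-01 as month 1, 2005-03 becomes month 3.
idx = to_month_index(['2005-01-15', '2005-03-01'], '2005-01-01')
```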

 

How do we reshape the problem into a survival analysis structure?

It’s natural to consider our example a time series forecasting problem, so how do we bridge the gap between time series and survival analysis? In every survival analysis problem, we have key concepts, like right-censoring, waiting to be defined. Following this structure, we start to fill in the blanks.

The event is whether the component was sent back for repair, so we need to add one column called “event” whose values are all 1. The duration, called “months_after_sold” in Fig 1, is how long after the sale the repair of a specific component was requested. As for right-censoring, it means we didn’t observe the event during the experiment period, so we cannot rule out the possibility of the event happening beyond the time frame. A common misunderstanding is that the target for a survival model is event or not, when it is actually right-censored or not. Therefore, it’s really important to find our right-censored data, which requires a trick. For each module-component pair, we already know the total number of sales and how many repairs were requested during the observed period, so their difference gives how many components are still “alive” after the observed period. Therefore, we define the right-censored data to be the number of components which didn’t require repair during the observed period, and the corresponding value in the “event” column is 0. Finally, we define the number of events happening in one month as the frequency.

import numpy as np
import pandas as pd

# end_date / start_date: bounds of the observed period (defined earlier in the notebook).
final_repair_date = round((end_date - start_date) / np.timedelta64(1, 'M'))

def add_censor(x):
    # Copy the last row of the group as a template for the censored row.
    new_row = pd.DataFrame(x.iloc[-1, :]).transpose()
    new_row['repair_date'] = final_repair_date
    new_row['months_after_sold'] = new_row['repair_date'] - new_row['sale_date']
    new_row['event'] = 0  # censored: no repair observed
    # Units sold but never repaired during the observed period.
    new_row['freq'] = new_row['number_sale'] - sum(x['freq'])
    return pd.concat([x, new_row])  # DataFrame.append is deprecated in recent pandas

df_input = df_input.groupby(['module_category', 'component_category', 'sale_date']).apply(add_censor).reset_index(drop = True)


Fig 3. Input Table after Adding Right-Censored Data

Next, we need to figure out what predictions the model should output in our example. We want to predict the future repair amount, which is how many components will be “dead” in the future, so the hazard ratio seems to be the quantity we want to estimate. Usually, calculating the hazard ratio at a specific time is hard with the lifelines package; you need to manually specify a bandwidth to tell the program how to estimate it. However, in our case the problem is easier than you might think. Since we are predicting amounts at a monthly level, we can use the monthly cumulative hazard ratio instead of the hazard ratio, and that is not hard to calculate.
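For instance, the monthly increment can be recovered from a cumulative hazard series with a simple difference (the numbers below are illustrative, not from the real data):

```python
import pandas as pd

# Hypothetical cumulative hazard estimates at months 1-4.
cum_hazard = pd.Series([0.10, 0.18, 0.24, 0.28], index=[1, 2, 3, 4])

# The month-over-month difference is the hazard accumulated within each month;
# the first month's increment is the cumulative value itself.
monthly_hazard = cum_hazard.diff().fillna(cum_hazard.iloc[0])
```

The increments sum back to the cumulative hazard at the last month, which is why working at a monthly level sidesteps bandwidth selection.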

To conclude, let’s review the key concepts in our case:

  • Event: repair or not
  • Duration: number of months after being sold
  • Frequency: how many events happened during a specific period
  • Right-censoring: components that didn’t require repair during the observed period
  • Estimation goal: monthly cumulative hazard ratio

 

What models will we use?

In general, there are two kinds of models for survival analysis. The first type is univariate models, which take only information about time, right-censoring and frequency. The second is regression models, which can also incorporate the influence of exogenous variables. In this blog, I’ll start with a very simple non-parametric model, the Nelson-Aalen fitter, and go through the steps in detail; then I’ll cover Cox regression more briefly, because you can follow similar steps as for the Nelson-Aalen fitter.

 

Modeling and Predicting – Nelson-Aalen Fitter

  • Modeling: Component Level Fitter

Since missing data is everywhere, we run into the problem that there are not enough observations for specific component-module pairs to fit a model. Therefore, we fit the data at the component level first, meaning one model per component, and then use the predicted repair amounts for the corresponding component to fill in the missing pieces.

import matplotlib.pyplot as plt
import pandas as pd
from lifelines import NelsonAalenFitter

naf = NelsonAalenFitter(nelson_aalen_smoothing = False)
comp_hazard = pd.DataFrame(columns = ['component_category', 'months_after_sold', 'hazard'])
for component in df_input['component_category'].unique():
    print('Current component %s' % component)
    cur_df = pd.DataFrame(columns = ['component_category', 'months_after_sold', 'hazard'])
    df_input_specific = df_input[df_input['component_category'] == component]
    naf.fit(durations = df_input_specific['months_after_sold'],
            event_observed = df_input_specific['event'],
            weights = df_input_specific['freq'])
    cur_df['months_after_sold'] = naf.cumulative_hazard_.index
    # Differencing the cumulative hazard gives the monthly increment.
    cur_df['hazard'] = naf.cumulative_hazard_.diff().fillna(0).values
    cur_df['component_category'] = component
    comp_hazard = pd.concat([comp_hazard, cur_df])
    naf.plot_hazard(bandwidth = 3)
plt.show()

From the code above, you can see that fitting a survival model is pretty simple: you just need to specify the duration, event and weight. That’s why we spent so much time explaining how to find the corresponding pieces in the previous section.

Fig 4. Estimated Monthly Cumulative Hazard Ratio at Component Level

 

To give you a rough image of a specific hazard ratio, I set bandwidth = 3 for visualization:

Fig 5. Estimated Hazard Ratio for Component P04

In the figure above, the unit of the X axis is months and the Y axis represents the hazard ratio. The picture indicates a pattern: the risk of repair gets higher and higher at the beginning, then fluctuates within a small range until 24 months after the sale; after that the risk decreases sharply and finally stabilizes around 0. I found similar patterns for most components. Some people in the Kaggle community suggested the reason might be that the warranty period is 2 years, so after 24 months repairs cost money and people stop requesting them.

 
  • Modeling: Module-Component Level Fitter

Run similar code as above, but at the module-component level.

Fig 6. Estimated Monthly Cumulative Hazard Ratio at Module-Component Level
 
  • Predicting

Once we have the two tables above, we first merge the output table with the hazard ratio table for module-component pairs, and then merge the results from the component-level model with the output table to fill in missing values in the hazard column.

df_final_output = df_output.merge(
    mod_comp_hazard[['module_category', 'component_category', 'months_after_sold', 'hazard']],
    on = ['module_category', 'component_category', 'months_after_sold'],
    how = 'left')

df_final_output_nona = df_final_output[~df_final_output['hazard'].isnull()]
df_final_output_na = df_final_output[df_final_output['hazard'].isnull()].drop('hazard', axis = 1)
df_final_output_na = df_final_output_na.merge(
    comp_hazard,
    on = ['component_category', 'months_after_sold'],
    how = 'left')
df_final_output_na.isnull().sum()
Fig 7. Number of Missing Values for Each Column

However, there are still lots of missing values, so we need to look into them and find ways of dealing with them. After some exploration, we can identify patterns among the module-component pairs or components that have missing values:

  • Duration is beyond the range of fitted models
  • There are not enough observations to fit a model whose time frame is long enough
  • Missing monthly cumulative hazard ratio for specific months

To finish the predictions simply, we propose preliminary ways to deal with the above cases correspondingly:

  • Replace missing values with 0, because the estimated hazard ratio plots show the ratio stabilizes around 0 in the far future for these components.
  • Fit linear models on the monthly cumulative hazard ratio based on historical data.
  • Directly interpolate the hazard ratios.
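The first and third fixes might be sketched like this (illustrative only: the tiny frame and its numbers are invented stand-ins for the merged output; the linear-model fix for the second case is omitted):

```python
import numpy as np
import pandas as pd

# Hypothetical merged output with gaps in the hazard column
# (column names follow the merge step above).
df = pd.DataFrame({
    'component_category': ['P04'] * 5,
    'months_after_sold':  [1, 2, 3, 4, 60],
    'hazard':             [0.05, np.nan, 0.07, 0.08, np.nan],
})

# Case 3: interpolate interior gaps within each component.
df['hazard'] = (df.sort_values('months_after_sold')
                  .groupby('component_category')['hazard']
                  .transform(lambda s: s.interpolate(limit_area='inside')))

# Case 1: durations far beyond the fitted range -> the plots suggest ~0 hazard.
df.loc[df['hazard'].isnull(), 'hazard'] = 0.0
```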

Lastly, we multiply the number of sales by the monthly cumulative hazard ratio and sum the result by module, component and repair date to get the final predictions.

df_final_output['pred_repair'] = df_final_output['number_sale'] * df_final_output['hazard']
df_final_output = df_final_output.groupby(['module_category', 'component_category', 'repair_date'])['pred_repair'].sum().reset_index()
df_final_output.head()

Fig 8. Final Predictions

  

Beginning of Cox Regression

After playing with univariate models, we start to think about incorporating other information, like how many sales were made at a specific time, into our model. Now it's time to introduce the Cox proportional hazards model. To put it simply, this model assumes the log-hazard of an individual is a linear function of their static covariates plus a population-level baseline hazard that changes over time. You can refer to this page for a better understanding.
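In symbols, using the standard Cox form, the hazard for an individual with covariates x1, …, xn is

h(t | x) = b0(t) · exp(b1·x1 + … + bn·xn)

where b0(t) is the time-varying baseline hazard shared by the whole population and the bi are the fitted coefficients; the exponential term scales the baseline up or down for each individual.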

We can use “LabelEncoder” from the sklearn package to convert categorical variables like “module_category” into integers, and then fit one model first.

from sklearn import preprocessing

le_comp = preprocessing.LabelEncoder()
le_mod = preprocessing.LabelEncoder()

# .copy() avoids SettingWithCopyWarning when we overwrite columns below.
df_cox_input = df_input[['module_category', 'component_category', 'number_sale',
                         'sale_date', 'months_after_sold', 'event', 'freq']].copy()

df_cox_input['component_category'] = le_comp.fit_transform(df_cox_input['component_category'])
df_cox_input['module_category'] = le_mod.fit_transform(df_cox_input['module_category'])
Fig 9. Input Table for Cox Regression

 

One thing worth mentioning is that we need to check whether the proportional hazards assumption holds for every Cox regression model. There is a function called check_assumptions in lifelines to check this.

Fig 10. Results from check_assumptions

Following the results above, we have to go back and fit component-level models again. Now it’s your turn to finish the whole modeling and predicting flow!


Summary

Congrats! Now you know how to reshape time series problems into a survival analysis structure, and the corresponding code to fit models and make predictions. To conclude, there are many models for time series forecasting, but the best model type always depends on the case. If you are stuck on a time series approach, changing your way of thinking and applying survival analysis to your case might be helpful. I hope this series of blogs about survival analysis provides you with a new view when dealing with forecasting problems.



The notebook and data that form the basis for this blog are available on GitHub.

Click here to learn more about the IBM Data Science Elite.



#GlobalAIandDataScience
#GlobalDataScience
#Hands-on