IBM and a Department Store Build a Recommender
-----------------------
Intro: Why this Article?
Recommendation systems are an especially important tool for businesses today selling a multitude of products. They allow those businesses to infer for their customers their potential interest in products they haven't yet consumed; and possibly have never seen. This additive feature to a customer's experience also helps the business by increasing their customers' purchases and share of basket. IBM's Data Science Elite (DSE) worked with a department store on an engagement to build out a recommendation system that fit within the department store's architecture and marketing mechanisms.
As an interlude, for those unfamiliar, the IBM DSE group is an expert resource within IBM that travels to clients and assists them with engagements to up-level our IBM clients in data science. It is an investment made by IBM in our clients. We have certain preconditions for clients to participate, and one of those is that we require clients bring their own data science resources so that the lessons persists even after IBM leaves. We want to help build the next set of AI-enabled organizations and fundamentally change the way our clients work.
In this article, we'll walk you through the use-case of a recommendation system, the theoretical underpinnings of a sample of these systems, and the specifics of this business case where possible. Keep in mind that to respect the privacy of the IBM Client, their name and information that would identify them (specific features) may be omitted. You'll walk away with an intuition for the goals of these applications, an overview of the frameworks, and a walkthrough of the algorithm ultimately used in the IBM client engagement.
What is a recommender system?
A recommender system is an application that looks at the purchases or product ratings from existing clients and identifies additional products they may enjoy based on that historical behavior. There are multiple types of recommenders within the broader category of these systems that differ primarily on the data available to them.
Data may, or may not, be available on the opinions of your customers, hence some recommenders are based on explicit opinions while others are based on implicit opinions. Explicit opinions mean that the consumer has intentionally provided feedback on the good being sold —I rated these shoes four stars out of five on my stores website. Implicit recommendations rely on proxy information to indicate interest in the good being sold —I clicked on that product's listing on my department store's website.
Data also may or may not be available to describe your products, hence some recommenders are designed to leverage descriptive product characteristics (content-based filtering) while others are based primarily on finding customers who have similar behavior patterns (collaborative filtering). When data is available to describe and characterize your product, you can leverage it for the sake of finding items with similar characteristics to those previously purchased. Those data are essential for content-based filtering recommenders (CBF), and often take the form of metadata i.e. a given movie is in the romance genre. When data that describe products aren't as readily available, or is deemed not to be instructive, recommenders compare the behavior patterns between users in an attempt to find similar profiles. In such an example in our department store setting, user A buys hoodies and shorts from a specific brand, and user B buys hoodies, shorts, and sneakers from that same specific brand. User A may be interested to hear about those sneaker products.
Each of the above recommendation engine designs is significant in its tradeoffs given to achieve the ultimate end goal of recommending a product to a user who has not tried it before. When it comes to processing implicit or explicit opinions, explicit opinions require surveying your consumers (along with the required infrastructure to gather and persist these opinions) while implicit opinions demand you make an assumption about the meaning of certain behaviors. With either, you have to answer a variety of questions. Where and how do you persist your opinion data? Do you model how opinions of goods (tastes) change over time? If your consumer has never clicked on a product, is that because they were not interested or because did they just not know it existed? Your interpretation will determine how you treat your data.
Client Data
Hopefully what is clear from the above is that the data available drives the choices IBM and our client ultimately had to make as it pertains to architectures of the recommender system; a principle that would apply to anyone else building a recommender system as well. The IBM client data looks as follows:
Customer
|
|
|
|
-
Date of Birth
-
Gender
-
Marital Status
-
State (City ID)
-
Residence Type
-
Residence # Dependents
-
Residence # Children
|
-
Education Level
-
Company Type
-
Salary
-
Credit Limit
|
From the data schemes, we see that we lack information that compares the products to each other (product class and family aren't useful for this purpose). As a result, the most feasible recommender systems would compare the behavior of like-consumers. Additionally, because our consumers had not explicitly rated their consumed goods, implicit ratings are the only way to understand their behavior by aggregating transactional information. Those are the constraints and assumptions we have to make based on the data available.
As a note of instruction, in order to aggregate transactional information to build the training set from our sample of transactional data, the Pandas API is perfectly sufficient:
[#]: dataframe.groupby (["USERID","ITEMID"])["FREQUENCY"].sum()
We leverage that each record in Customer Transactions is a product of the user purchasing a single item. We can use those individual records, and we can produce a user-item interaction matrix that treats the sums totals as the frequency of purchase. How do we make use of that data?
Models Tested
Together with our client, we tested a few of the above approaches (including content-based filtering algorithms such as K-NearestNeighbors) in the first couple of weeks of our engagement, and agreed the approaches with a higher probability for success were systems such as Alternating Least Squares using transactional information; or some derivative hybrid system.
K-Nearest Neighbors (KNN)
In this section we'll explain the intuition and implications of the KNN algorithm. In the nearest neighbors algorithm, which is useful for content based filtering, you choose classifications based on the majority voting of the samples closest to your point of interest; closeness in this case is measured in Euclidean distance. In other words, when shown an unlabeled sample point, your model will look at some K number of points (the 'K' in K-NN) and take a poll between them. If you've chosen 'K' to be 3, it will ask the 3 closest neighbors to your unlabeled sample for their label and tally the count. In a two-class example, whichever label tallies 2 will be the label we assign to our unlabeled sampled.
Figure 1. GIF of KNN
I want to rehash the point about 'closeness' as measured by Euclidean distance and explain its intuition as well as drawbacks. In the above GIF you see 6 training points (crosses and dots) whose X, Y coordinates are plotted as measures of two features. As a result, those features become the determinants for how we classify the unlabeled point of data. For items in our department store, as an example, they could be an item's volume or count in an SKU (12 fluid oz, 6 bottles) plotted against its price. The intuition, therefore, is that we use the information about the products to identify similar products others might like to buy.
There are, however, a few drawbacks to using KNN as your algorithm. First, the more dimensions you add for product comparison, the more training data you'll need for having neighboring products. The intuition there is complex, but the boiled down version is that the more dimensions there are to your data (features), the worse off you are using Euclidean distance to differentiate two data points (see Curse of Dimensionality). The second drawback is that this algorithm requires you calculate the distance of every single point of training data from your point in question which becomes problematic as your volume of data grows. It's computationally inefficient to have to calculate and compare that space to find the shortest distances for potentially millions of data points.
While this algorithm was reviewed as a potential candidate, the transactional/product information was insufficient for item:item comparisons using an algorithm like KNN.
Alternating least squares
In the section we'll walk through the intuition and implications of alternating least squares (ALS), as well as the success rates for the model.
ALS is an algorithm that approximates missing data in a matrix of information in a stepwise fashion, and allows us to build a recommender by filtering out users with similar behavior. Again, my broadly defined goal is to discern that if I know two consumers have equal interest in items A and B, but only one consumes item C, I can recommend that unconsumed good C to the second person. That is the basis for the idea behind collaborative filtering.
When working with our IBM client, we consider that we're using implicit ratings information gathered from their customer transactional data; that's in comparison with the canonical example for ALS which is explicit ratings of products such as movies where users have indicated how much they enjoyed viewing that movie. In our case the sum of transactions by our user serve as our implied rating. If someone has repeatedly purchased a product, they're indicating strong preference for it. Even with transactional data, it is likely that our User x Items matrix (R) is sparsely populated. We'll depend upon the principles of matrix factorization in order to fill our matrix R.
Harkening back to some of the linear algebra you may have forgotten, if our Matrix R is composed of rows of users (matrix U) and columns of the products they've rated (matrix V), then matrix U * matrix V = matrix R where U & V share some dimension K. We need to factorize R into its component user matrix (U) with the latent factors for users, and the component item matrix (V). Alternating least squares solves these matrices by randomly initializing each matrix and then solving for its counterpart while minimizing a loss function.
Figure 2. GIF of ALS
As each matrix is held constant, and its counterpart is solved for, there is the opportunity to use a distributed/parallel computing architecture. As our data scales, so does the reward for discovering any chance to parallelize our compute workloads. Transactional information can be especially large when it concerns large retailers and many customers, so alternating least squares is a very appealing option.
Neural Networks
In this section, because there are already many exceptional, and thorough, explainers on the topic, I will offer only a cursory explanation of the intuition behind neural networks - and the success of the net in modeling recommendations for our client. Unlike in other modeling exercises described above we refactored the dataset for the neural net to include: UserId, ItemID, Gender, MaritalStatus, Age, Location, Purchase Frequency, with a label Bought. The basis of the dataset, therefore, was primarily demographic and summary purchase information. How do neural nets make use of that information?
Neural networks translate vectors of input data into our ultimate classification of whether or not someone purchased our department stores product. How they do so requires a few steps, but consider the process by first thinking about a general neural network with a generic template of two hidden layers of nodes - the middle two fully connected layers in the figure 3. We want to teach the system of connected neurons to give us back opinions on how much each record corresponds to a likelihood to purchase a product.
Figure 3. Neural Network
How does that system translate data into opinions? Each input data feature is fed into its own input neuron in our input layer (our first column of circles), and it will spark a set of activations in the subsequent layer. In this case, activations are most easily understood as turning "on" a node to a value between 0 to 1.
In fact the process is similar across the full network (visualized in steps in figure 3). Some values (activations) in some specific nodes/neurons (our rainbow circles) will combine with weights and biases across the network to activate other nodes in the next layer. There is simplicity and beauty in the idea: it's the previous nodes activations that determine subsequent neuron/node activations. Keep in mind that as the weights and biases are initialized randomly, the first weights and biases we pass through the network are random values, and so consequently the predictions will be wrong. Some questions to keep in mind: how many layers do you need in this architecture? What effect do the activations of a previous layer have on the activations of the next? What to do about the wrong guesses?
Figure 4. Node Activation Process
Again, we begin in the input layer with the input features, a set of randomly initialized weights and biases, and the labels for the data. Those input features are our first activations. As an example of this process, frame 3 of figure 4 shows the start where we calculate our weighted sum of the previous layers activations and weights. That weighted sum and a bias term are given to the next node where we want to use some function to compress the weighted sums and biases to a value between 0 and 1. (Note that figure 4 shows that function that compresses our weighted sum between 0 and 1 to be the sigmoid function because the function is very good with that responsibility. It is however more typical today to see neural nets train using an activation function called ReLU. The principle of constraining the values between two boundaries are the same, but I'll let you follow other more in depth resources to understand the differences).
How do we actually train the network to predict to our data and address incorrect guesses? In figure 3, the connecting lines represent differing weights across all connections, and those are an essential component in our tuning and training process. Those weights also combine with biases. Biases are just scalar values added to the weighted sum to help determine whether the network feels a value should be higher or lower to activate that node (frame 4, figure 4). As those are initialized randomly, they're also our tuning knobs when we get to the final prediction and want to penalize the model for its guessing choice. It is those weights and biases that the model changes in hopes to minimize how frequently it predicts the label incorrectly - minimizing its loss function.
The loss minimization process typically depends on Stochastic Gradient Descent. There are also many excellent explainers out there for SGD, but there's an easy way to visualize the process. Imagine you're standing at the top of a hilly area blindfolded and want to get to the bottom. You probably would take small steps all around and feel with your feet for the "downhill" direction and move that direction cautiously. Eventually you'll get to some low point from the peak and will have minimized your loss. Those are the theory fundamentals you need in order to understand what a neural network is, and how it is trained. It's a short explanation so that we can talk about how our IBM client utilized the trained neural network in order to make recommendations.
Once we had trained our neural network to predict product purchases based on demographic and purchase frequency data, we feed all of the items back into the neural network and use the top 5-10 items predicted probabilities as the items we recommend. Our client uses these recommendations on a monthly basis to send out emails to try and help clients find products they might enjoy.
Frameworks
We tested multiple frameworks in Watson Studio (also see free trial below) when it came to implementing the recommendation engine for our IBM client - LightFM, TensorRec, and PyTorch. When trying to evaluate the merits of LightFM and TensorRec, consider TensorRec's author's own description:
"LightFM and TensorRec are very similar -- LightFM was the inspiration for TensorRec and I've used LightFM extensively. The difference comes from the ability to build and customize your representation, prediction, and loss functions in TensorRec. If you build a TensorRec model with linear embedding representations and dot-product prediction, they're equivalent algorithms. In fact, if you're doing so, I'd advise you to work with LightFM as it is a more mature package. If you need to customize that system, though, that is where TensorRec comes in."
Finally, the IBM DSE team chose to use PyTorch for our neural net implementation over TensorFlow. We encourage experimentation and exploration as a part of the team's ethos, and PyTorch is a new framework from the DSE team's perspective. We hope more teams executing neural net implementations will consider its use, as we valued its intuitive and pythonic API design. There are multitude resources [1][2] on the differences between what you can accomplish with each framework, but one of the primary points of comparison is how each handles their computational graphs - TensorFlow creates a static graph, and PyTorch creates a dynamic graph.
Each project has great documentation and examples to get you started in building your own recommendation engines.
Summary
This recommender project was a great exercise for the IBM Data Science Elite to propose efficient systems for improving their client's shopping experience. On the client engagement, there were three different data scientists and each proposed/modeled their own technique. That meant that each model had an advocate and sponsor to speak to the benefits and tradeoffs in each architecture decision (Note: one of the models they were advocates for isn't explained in this article, and will have to wait for another one).
In many ways the tradeoffs we discussed from this exercise are the same faced with any modeling exercise. There are multiple ways to tackle building recommendation engines, and in some ways the choice is between power or parsimony; a parsimonious model is easy to explain. Sometimes a simpler model is better due to its simplicity! To make that comparison, however, let's consider the results of testing the multiple approaches:
Evaluation:
Success was measured by taking each models recommendations, and sending out emails with the recommendations contained in them. We could then analyze how efficient each recommendation engine, derived from each model, behaved in terms of predicted purchases!
There are many applications for recommendation engines, and implementing them alone is never easy. If you have a use case for building a recommendation engine, and want to talk to IBM's experts, consider contacting the IBM DSE or writing a article of your own! We would love to hear from you.
If you're curious about the tools we used for this engagement, consider a Watson Studio Trial.
Thanks
A special thanks to @Dheeraj Arremsetty of the IBM Data Science Elite (DSE) for the resources used as the basis of this blog post.
References
[1] https://medium.com/@UdacityINDIA/tensorflow-or-pytorch-the-force-is-strong-with-which-one-68226bb7dab4
[2] https://towardsdatascience.com/pytorch-vs-tensorflow-spotting-the-difference-25c75777377b
Resources
http://stanford.edu/~rezab/classes/cme323/S15/notes/lec14.pdf
https://towardsdatascience.com/getting-started-with-recommender-systems-and-tensorrec-8f50a9943eef
https://towardsdatascience.com/solving-business-usecases-by-recommender-system-using-lightfm-4ba7b3ac8e62
https://medium.com/radon-dev/implicit-bayesian-personalized-ranking-in-tensorflow-b4dfa733c478
https://arxiv.org/ftp/arxiv/papers/1205/1205.2618.pdf
https://arxiv.org/pdf/1207.0580.pdf
https://www.utc.fr/~bordesan/dokuwiki/_media/en/glorot10nipsworkshop.pdf
https://medium.com/@gabrieltseng/intro-to-warp-loss-automatic-differentiation-and-pytorch-b6aa5083187a
http://lyst.github.io/lightfm/docs/home.html
http://yifanhu.net/PUB/cf.pdf
https://hackernoon.com/introduction-to-recommender-system-part-1-collaborative-filtering-singular-value-decomposition-44c9659c5e75
https://www.youtube.com/watch?v=9gBC9R-msAk
https://www.youtube.com/watch?v=4-f77HjB_CI
https://www.youtube.com/watch?v=E8aMcwmqsTg
https://www.youtube.com/watch?v=GGWBMg0i9d4
https://www.youtube.com/watch?v=HY3Csl52PfE
https://www.youtube.com/watch?v=58OjaDH2FI0