Recommender Systems and Link Prediction
Introduction
Recommender systems play an integral part in our digital lives. Whether you’re discovering a new movie on your favorite streaming platform or browsing Amazon, recommender systems work behind the scenes, surfacing the most relevant content so you aren’t left searching for a needle in a haystack.
Traditionally, recommender systems have been developed using various approaches, such as content-based filtering, user-based collaborative filtering, and item-based collaborative filtering, among others. In user-based collaborative filtering, recommendations for a user are made based on the preferences of similar users. Item-based collaborative filtering, by contrast, suggests items based on the similarity between the items themselves. Collaborative filtering is a popular method due to its robustness in handling large datasets and its ease of implementation.
However, widespread as they are, traditional collaborative filtering techniques can be enriched by considering an alternative perspective: reframing the problem as a network link prediction problem. This viewpoint not only broadens our understanding but also opens up new avenues for applying network theory to recommender systems. We will focus on item-based collaborative filtering, but the ideas presented in this post generalize.
Item-Based Collaborative Filtering
Let’s begin our discussion with a crash course on item-based collaborative filtering (IBCF). In IBCF, the basic idea is to recommend items (such as movies, music, books, etc.) to users based on their similarity to other items users have interacted with.
There is nothing better than working with a concrete example, so we start by loading some data. Throughout this post, we will be working with the MovieLens dataset (small).
import pandas as pd
# read in ratings and movie data
ratings = pd.read_csv('ml-latest-small/ratings.csv')
movies = pd.read_csv('ml-latest-small/movies.csv')
The `movies` dataframe contains movie IDs, their titles, and genres. The `ratings` dataframe contains the rating each user ID gave to each movie ID, along with a timestamp (which we will not make use of). We will combine these two dataframes so that we have all the data in one place:
movie_user = movies.merge(ratings, on = "movieId")
This `movie_user` dataframe has \(100,836\) rows, representing the total number of interactions between users and movies.
Mean-Centering Ratings
Next, we will perform one common preprocessing step: mean-centering the ratings to adjust for user bias — some users rate more generously while others are more conservative.
To illustrate why we do this, consider the rough analogy of weighing flour with a kitchen scale. Everyone uses a different bowl, so the reading on the scale will be off by a different amount for each person. To get an accurate measurement, you would first tare the scale, setting it to zero with the empty bowl on it. Mean-centering is akin to taring your scale before weighing the flour. By doing this, we adjust for individual biases in rating styles, leading to more robust recommendations. Here’s how you can mean-center the ratings:
# calculate mean centered ratings
user_mean = movie_user.groupby('userId')['rating'].mean()
# Add a column for mean-centered ratings
movie_user['rating_mean_centered'] = movie_user['rating'] - movie_user['userId'].map(user_mean)
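As a quick sanity check of the centering step (sketched here on a tiny hand-made frame rather than the MovieLens data), each user's centered ratings should now average to zero — every scale reads zero with its empty bowl on it:

```python
import pandas as pd

# A tiny stand-in for the movie_user dataframe (hypothetical data)
toy = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2],
    'movieId': [10, 20, 30, 10, 20],
    'rating':  [5.0, 3.0, 4.0, 2.0, 4.0],
})

# Same mean-centering recipe as above
toy_user_mean = toy.groupby('userId')['rating'].mean()
toy['rating_mean_centered'] = toy['rating'] - toy['userId'].map(toy_user_mean)

# After centering, each user's ratings average out to (numerically) zero
centered_means = toy.groupby('userId')['rating_mean_centered'].mean()
print(centered_means.abs().max())  # ~0.0 up to floating point
```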
To make this example easier to run on a laptop, we will reduce the size of the dataset by considering only movies with at least 100 ratings.
movie_counts = movie_user.groupby('movieId')['rating'].count()
popular_movies = movie_counts[movie_counts >= 100].index
movie_user_pop = movie_user[movie_user['movieId'].isin(popular_movies)]
The resulting dataframe should have \(138\) unique movies, \(597\) unique users, and \(20,188\) ratings.
Utility Matrix
A utility matrix is a fundamental concept in collaborative filtering. It is a matrix where rows represent users, columns represent items (in this case, movies), and the values are the mean-centered ratings given by users to items. This matrix forms the basis for calculating item similarities and generating recommendations.
To construct the utility matrix from our preprocessed `movie_user_pop` dataframe, we use the `pivot_table` function:
# Create the utility matrix
user_movie_mat = movie_user_pop.pivot_table(values='rating_mean_centered', index='userId', columns='movieId')
user_movie_mat.head()
| userId \ movieId | 1 | 2 | … | 68954 | 79132 |
|---|---|---|---|---|---|
| 1 | -0.366379 | NaN | … | NaN | NaN |
| 2 | NaN | NaN | … | NaN | 0.051724 |
| 3 | NaN | NaN | … | NaN | NaN |
| 4 | NaN | NaN | … | NaN | NaN |
| 5 | 0.363636 | NaN | … | NaN | NaN |
This `user_movie_mat` matrix now has users as rows, movies as columns, and the mean-centered ratings as values. Any missing value in the matrix represents a situation where a user hasn’t rated a particular movie. Ultimately, the goal is to predict these missing values.
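Most of the matrix is missing, and that sparsity is easy to quantify. A minimal sketch, shown on a small synthetic matrix rather than the real one:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for user_movie_mat: 3 users x 4 movies, mostly unrated
mat = pd.DataFrame(
    [[0.5,    np.nan, np.nan, np.nan],
     [np.nan, -0.2,   np.nan, 0.1],
     [np.nan, np.nan, np.nan, np.nan]],
    index=[1, 2, 3], columns=[10, 20, 30, 40],
)

# Fraction of (user, movie) pairs with no rating
sparsity = mat.isna().to_numpy().mean()
print(f"{sparsity:.0%} of entries are missing")  # 75% for this toy matrix
```

The same two lines applied to the real matrix give a comparable picture: with \(597\) users, \(138\) movies, and \(20,188\) ratings, about \(75\%\) of the entries are missing (\(1 - 20,188 / (597 \times 138) \approx 0.755\)).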
Cosine Similarity
Now that we have our utility matrix, we can calculate the similarity between items. Cosine similarity is a measure that calculates the cosine of the angle between two vectors, providing a similarity score between -1 and 1. It is widely used in item-based collaborative filtering because it effectively captures the degree of similarity between items based on user ratings. Moreover, doing this calculation on the mean-centered ratings is closely related to the Pearson correlation between the items.
Here’s how we can compute the cosine similarity between the items (movies) in our utility matrix:
from sklearn.metrics.pairwise import cosine_similarity
item_similarity_cosine = cosine_similarity(user_movie_mat.fillna(0).T)
movie_sim_df = pd.DataFrame(item_similarity_cosine, index=user_movie_mat.columns, columns=user_movie_mat.columns)
movie_sim_df.head()
| movieId | 1 | 2 | … | 68954 | 79132 |
|---|---|---|---|---|---|
| 1 | 1.000000 | 0.061511 | … | 0.184418 | 0.042688 |
| 2 | 0.061511 | 1.000000 | … | -0.027452 | -0.053825 |
| 6 | 0.089142 | -0.031387 | … | 0.025957 | 0.105738 |
| 10 | -0.053247 | -0.131687 | … | -0.069963 | 0.034767 |
| 32 | 0.122093 | -0.103307 | … | 0.139529 | 0.117363 |
The resulting `movie_sim_df` dataframe contains a similarity score for each pair of movies. These scores form the basis of our item-based collaborative filtering recommendations, as we can use them to identify movies that are similar to ones a user has already rated highly.
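To connect the `cosine_similarity` call back to the definition, here is a minimal check (on a small synthetic matrix, not the MovieLens data) that it is just the normalized dot product of item columns:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Tiny zero-filled utility matrix: rows are users, columns are items
mat = np.array([
    [1.0, -1.0,  0.0],
    [0.5,  0.0,  0.5],
    [0.0,  1.0, -1.0],
])

# sklearn: pairwise similarity between item (column) vectors
sk_sim = cosine_similarity(mat.T)

# by hand: dot product of two item columns over the product of their norms
a, b = mat[:, 0], mat[:, 1]
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.isclose(sk_sim[0, 1], manual))  # True
```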
Making Recommendations
One way to make predictions (recommendations) is to aggregate similarity scores between movies the user has not watched and the movies they have watched. Thus, the predicted rating user \(u\) might give to an unrated movie \(m\) can be expressed as follows:
$$ \begin{align*} \hat{r}_{u,m} &= \bar{r}_u + \frac{\sum\limits_{m’ \in W_u} s_{m,m’}\cdot \widetilde{r}_{u, m’}}{\sum\limits_{m’ \in W_u} \lvert s_{m,m’}\rvert} \end{align*} \tag{$1$} $$
Where \(\bar{r}_u\) is the mean rating user \(u\) gives, \(W_u\) is the set of movies that user \(u\) has previously rated, \(s_{m,m’}\) is the similarity between an unrated movie \(m\) and a previously rated movie \(m’\), and \(\widetilde{r}_{u, m’}\) is the mean-centered rating given to movie \(m’\) by user \(u\). This formula effectively computes a weighted average of the ratings user \(u\) gives to movies similar to \(m\).
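The actual implementation lives in the repo; below is one possible sketch of Equation (1) under the signature used later in this post (the function body here is my own and may differ from the repo's version):

```python
import pandas as pd

def recommend_movies(user_id, user_mean, user_movie_mat, movie_user_pop, movie_sim_df):
    """Predict ratings for movies the user has not rated, per Equation (1)."""
    user_row = user_movie_mat.loc[user_id]      # mean-centered ratings; NaN = unrated
    rated = user_row.dropna()                   # W_u: movies the user has rated
    preds = {}
    for m in user_row.index[user_row.isna()]:   # candidate (unrated) movies
        sims = movie_sim_df.loc[m, rated.index] # s_{m,m'} for every m' in W_u
        denom = sims.abs().sum()
        if denom > 0:
            # r_hat = r_bar_u + sum(s * r_tilde) / sum(|s|)
            preds[m] = user_mean[user_id] + (sims * rated).sum() / denom
    titles = movie_user_pop[['movieId', 'title']].drop_duplicates()
    out = pd.DataFrame({'movieId': list(preds.keys()),
                        'predicted_rating': list(preds.values())})
    return (out.merge(titles, on='movieId')
               .sort_values('predicted_rating', ascending=False)
               .reset_index(drop=True))
```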
We can use this approach to make recommendations for user with ID 1:
recommended_movies_df = recommend_movies(1, user_mean, user_movie_mat, movie_user_pop, movie_sim_df)
recommended_movies_df.head(5)
| | movieId | title | predicted_rating |
|---|---|---|---|
| 0 | 7153 | Lord of the Rings: The Return of the King, The… | 4.771283 |
| 1 | 3996 | Crouching Tiger, Hidden Dragon (Wo hu cang lon… | 4.732515 |
| 2 | 7361 | Eternal Sunshine of the Spotless Mind (2004) | 4.718561 |
| 3 | 4993 | Lord of the Rings: The Fellowship of the Ring,… | 4.715945 |
| 4 | 1193 | One Flew Over the Cuckoo’s Nest (1975) | 4.714788 |
I’ve left the implementation out of the post, but it can be found in the GitHub repo here.
The Network Approach
Networks are a very versatile representation of data that lends itself particularly well to systems with sparse pairwise interactions. In the context of recommender systems, sparsity refers to the observation that most users will only interact with a small subset of items (in this case, movies). We don’t expect users to rate or even watch most of the available movies. This sparse nature of user-item interactions is a fundamental assumption in recommender systems: since users can’t possibly watch every movie, our goal is to make the best possible recommendations based on the limited interactions they have.
We will create a bipartite graph where users and movies are nodes, and the edges between them represent the ratings users have given to movies. Making a recommendation is akin to predicting an edge (and, in this case, its weight) between a user node and a movie node. In network theory, this is known as link prediction.
Constructing the Network
Let’s construct the bipartite graph from the `movie_user_pop` dataframe. A bipartite graph is a graph whose nodes can be divided into two disjoint sets, such that each edge connects a node from one set to a node in the other (never two nodes of the same set). In our case, one set of nodes represents users and the other represents movies, and the edges between them are weighted by the ratings users have given to the movies.
We will use the graph-tool library to create this graph, where each user and each movie will be a node, and an edge between them will represent a rating, with the rating as the edge weight.
Here is an example of how we can do this:
import graph_tool.all as gt
# Initialize an undirected graph
g = gt.Graph(directed=False)
# add a user node
u1 = g.add_vertex()
# add a movie node
m1 = g.add_vertex()
# add an edge between user and movie
e = g.add_edge(u1, m1)
# We can add metadata to the network using property maps
nodeID = g.new_vertex_property("string") # user or movie ID
mtitle = g.new_vertex_property("string") # movie title
node_type = g.new_vertex_property("string")
eweight = g.new_edge_property("double") # the mean-centered rating
nodeID[u1] = 'u' + str(movie_user_pop['userId'].iloc[0])
nodeID[m1] = 'm' + str(movie_user_pop['movieId'].iloc[0])
node_type[u1] = 'user'
node_type[m1] = 'movie'
mtitle[m1] = movie_user_pop['title'].iloc[0]
eweight[e] = movie_user_pop['rating_mean_centered'].iloc[0]
Doing this for every pair of nodes one-by-one is a bit of an undertaking, so we will automate it using a function that takes in a pandas dataframe and outputs a graph-tool network, the implementation of which we have left in the repo.
mlens_net = construct_network(movie_user_pop)
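The repo’s `construct_network` uses graph-tool as above. Purely as an illustration of the same construction, here is a sketch using networkx instead (a deliberate library swap to keep the example short and self-contained; the function name is my own, while the `'u'`/`'m'` node prefixes mirror the property-map example above):

```python
import networkx as nx
import pandas as pd

def construct_network_nx(df):
    """Build a weighted bipartite user-movie graph from a ratings dataframe.

    Each user and each movie becomes a node; each rating becomes an edge
    weighted by the mean-centered rating, as in the graph-tool example.
    """
    g = nx.Graph()
    for row in df.itertuples(index=False):
        u = 'u' + str(row.userId)
        m = 'm' + str(row.movieId)
        g.add_node(u, node_type='user')
        g.add_node(m, node_type='movie', title=row.title)
        g.add_edge(u, m, weight=row.rating_mean_centered)
    return g
```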
Cosine Similarity Revisited
In network theory, a commonly studied concept is node similarity. In the simplest sense, two nodes are similar if they share many of the same neighbors. How does this intuition hold up with the network we have constructed? Take an arbitrary user node \(u\) and movie node \(m\): any neighbor of a user node is a movie node, and any neighbor of a movie node is a user node, so their similarity must necessarily be 0. This makes sense: users and movies are two different kinds of entities in our model. We can, however, have non-zero similarity between nodes of the same type: users \(u_1\) and \(u_2\) might rate many of the same movies highly, making them similar. So it is natural to expect that cosine similarity can be equivalently reformulated in the network context. As before, we can use the utility matrix \(B\) (usually called an incidence matrix in the context of networks) to calculate the cosine similarity between two nodes.
$$ \begin{align*} s_{m,m’} = \frac{\sum\limits_{u \in \Gamma\left(m\right) \cap \Gamma\left(m’\right)} B_{u,m}B_{u,m’}}{\sqrt{\sum\limits_{u \in \Gamma\left(m\right)} B_{u,m}^2}\sqrt{\sum\limits_{u \in \Gamma\left(m’\right)} B_{u,m’}^2}} \tag{$2$} \end{align*} $$
Where \(\Gamma\left(m\right)\) is the set of neighbors of movie \(m\). This expression is in every way equivalent to the previously discussed formulation; it is still fundamentally a dot product. But by writing it as sums over neighbors, we make the network structure of the data explicit.
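Equation (2) translates almost line-for-line into code. A rough sketch (names my own), representing each movie by a dictionary mapping its user neighbors to the corresponding edge weights:

```python
from math import sqrt

def cosine_sim_neighbors(nbrs_m, nbrs_m2):
    """Cosine similarity between two movies per Equation (2).

    Each argument maps a user u in Gamma(m) to the edge weight B[u, m]
    (the mean-centered rating that user gave the movie).
    """
    common = set(nbrs_m) & set(nbrs_m2)                    # Gamma(m) ∩ Gamma(m')
    num = sum(nbrs_m[u] * nbrs_m2[u] for u in common)      # sum of B[u,m] * B[u,m']
    den = (sqrt(sum(w * w for w in nbrs_m.values()))
           * sqrt(sum(w * w for w in nbrs_m2.values())))
    return num / den if den else 0.0
```

On the zero-filled utility matrix, this reduces to the same dot product computed by `cosine_similarity` earlier; the neighbor dictionaries simply skip the zeros.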
A quick-and-dirty implementation of the above formula can be found in the GitHub repo. As a sanity check, let’s see whether the similarity calculation produces sensible results. Here are the top \(10\) movies most similar to The Lion King (movieId: 364):
| rank | movieId | title | similarity_score |
|---|---|---|---|
| 1 | 588 | Aladdin (1992) | 0.356733 |
| 2 | 595 | Beauty and the Beast (1991) | 0.293266 |
| 3 | 4995 | Beautiful Mind, A (2001) | 0.266820 |
| 4 | 356 | Forrest Gump (1994) | 0.256900 |
| 5 | 4306 | Shrek (2001) | 0.246165 |
| 6 | 2571 | Matrix, The (1999) | 0.241855 |
| 7 | 6377 | Finding Nemo (2003) | 0.236003 |
| 8 | 1 | Toy Story (1995) | 0.235402 |
| 9 | 457 | Fugitive, The (1993) | 0.219770 |
| 10 | 480 | Jurassic Park (1993) | 0.214132 |
From the table, you can see that many of the movies most similar to The Lion King are other children’s movies (in the accompanying hairball visualization, the node corresponding to The Lion King is shown in red). However, we also see some more adult movies in the mix that do not necessarily align with our human intuition about similarity. Why do you think this is? Is this simply a poor metric for similarity, or are we seeing some of the underlying social dynamics of movie watching? Do children normally have their own accounts, or do they generally use the accounts of the adults in the family? These questions demonstrate some of the complexity in real-world data and highlight the challenge of building good recommender systems.
Network Based Recommendations
The last step is to make the actual recommendations. This can be done in exactly the same way using Equation (1); applying it to the user with ID \(1\) gives the following top-5 recommendations with predicted ratings:
Conclusions
Traditional collaborative filtering techniques have been honed over years and are supported by highly optimized libraries that are often best-suited for straightforward recommendation tasks. If your goal is to do traditional collaborative filtering, then you should just do traditional collaborative filtering. However, by viewing our data through the lens of networks, we gain access to a broader array of methodologies that can yield deeper insights and potentially unveil complex patterns that might remain hidden otherwise.
While reframing collaborative filtering as a network link prediction problem may not necessarily offer superior performance over traditional methods, it serves as a compelling motivation for incorporating network-based approaches into machine learning. The exploration of this perspective is a simple yet illustrative example of how considering data as a network can enrich our analytical toolkit. For a non-trivial application of link prediction that also ties into recommender systems, take a look at our paper on link prediction and drug repurposing here, where we discuss and compare various link prediction methods, evaluating their performance on the important problem of drug repurposing.