Companies all over the world increasingly rely on recommender systems. Online stores, streaming services, and social networks use these algorithms to recommend items to users based on their previous behavior (items they have consumed or searched for).
There are several approaches to developing recommendation systems. We can build a recommender system based on the content of the item so that the system recommends similar items to the ones the user usually likes (Content-Based recommender systems), or we can use user similarity to recommend items that other users have rated highly (Collaborative-filtering recommender systems).
In this post we will create a really simple recommender system using the Surprise package, relying on its standard functions to build a collaborative filtering recommender system based on user ratings. The dataset I chose for this exercise is Recipes from Food.com, which is available on Kaggle and contains over 180K recipes and 700K recipe reviews. It’s a massive dataset that’s ideal for experimenting with recommender systems.
The dataset consists of several files containing both the raw data and the processed data, which is great for our purpose (thanks to the authors of the paper; you can find the citation at the end of the post).
I also want to experiment with different evaluation metrics for the recommender system. Evaluating a recommender system is difficult because user behavior changes over time, but we have to work with the metrics available. I’ll experiment with MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error). Both metrics aggregate the errors (the differences between the predicted and the actual ratings), but RMSE gives more weight to large errors: if our predictions contain a few large errors, the RMSE will be noticeably higher than the MAE.
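As a quick illustration of the difference between the two metrics, here is a minimal sketch with made-up ratings (the values are invented for this example only):

import numpy as np
actual = np.array([5.0, 4.0, 5.0, 3.0])      # hypothetical true ratings
predicted = np.array([4.5, 4.0, 2.0, 3.5])   # hypothetical predictions
errors = predicted - actual
mae = np.mean(np.abs(errors))                # 1.0
rmse = np.sqrt(np.mean(errors ** 2))         # ~1.54: the single large error (3.0) dominates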
This notebook was created in the DeepNote environment, following the tutorials in the scikit-surprise documentation.
Let’s start!
The first step is importing the necessary libraries.
import pandas as pd
import difflib
import numpy as np
import pickle
Loading Data
The dataset is divided into several files covering recipes, users, and interactions. Some of them contain the RAW data, while the others contain processed data. We will use the processed user data and the raw recipe data for this recommender system. It simply works best for our needs.
Let's load and check the data:
recipe_data = pd.read_csv('/work/RAW_recipes.csv',header=0,sep=",")
recipe_data.head()
user_data = pd.read_csv('/work/PP_users.csv',header=0,sep=",")
user_data.head()
Okay, we can see the data on each file. The column names are self-explanatory, so we can get started.
Data Preparation and Exploration
To build this simple recommender system, we must first prepare the data in a Surprise-compatible dataset. We're only interested in user ratings, so we'll extract them from the processed user data.
The first step is to write a function that reads the items (recipes) and user ratings.
def getRecipeRatings(idx):
    # the 'items' and 'ratings' columns are stored as strings like '[1, 2, 3]',
    # so we strip the brackets and commas and parse the individual values
    user_items = [int(s) for s in user_data.loc[idx]['items'].replace('[','').replace(']','').replace(',','').split()]
    user_ratings = [float(s) for s in user_data.loc[idx]['ratings'].replace('[','').replace(']','').replace(',','').split()]
    # build one row per (User, Item, Rating) triple
    df = pd.DataFrame(list(zip(user_items, user_ratings)), columns=['Item', 'Rating'])
    df.insert(loc=0, column='User', value=user_data.loc[idx].u)
    return df
Then, create a dataset with one row for each User, Item, and Rating.
# build the full ratings table; this takes a while, so we only run it once
# recipe_ratings = pd.concat(
#     [getRecipeRatings(row['u']) for _, row in user_data.iterrows()],
#     ignore_index=True)
Because the dataset is large and the previous code takes time to execute, we only want to run it once, so we can use pickle to save the result to disk and read it back whenever we need it. This saves us a significant amount of time.
#recipe_ratings.to_pickle('/work/recipe_ratings.pkl')
recipe_ratings = pd.read_pickle('/work/recipe_ratings.pkl')
It's a good idea to do some data exploration, so let's get started. We know this is high-quality data, so we'll just make a bar chart to see how the ratings are distributed.
import seaborn as sns
sns.barplot(x=recipe_ratings.Rating.value_counts().index, y=recipe_ratings.Rating.value_counts())
Good, we see that the majority of the ratings are 5.0, indicating that most users are satisfied with the recipes.
Because the dataset is large, we will reduce it to save time and avoid running out of memory. Let's only keep the recipes with the most ratings, and get rid of those with 30 or fewer ratings.
recipe_counts = recipe_ratings.groupby(['Item']).size()
filtered_recipes = recipe_counts[recipe_counts>30]
filtered_recipes_list = filtered_recipes.index.tolist()
len(filtered_recipes_list)
recipe_ratings = recipe_ratings[recipe_ratings['Item'].isin(filtered_recipes_list)]
recipe_ratings.count()
Let's take a look at the new rating distribution. As we can see, it is similar to the distribution of the entire dataset.
sns.barplot(x=recipe_ratings.Rating.value_counts().index, y=recipe_ratings.Rating.value_counts())
Okay, we now have a dataset with over 300,000 ratings and approximately 11,000 recipes. That's enough for our purposes and manageable in a hosted notebook environment. Let's get to work on the model!
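These figures are easy to double-check directly from the filtered DataFrame:

len(recipe_ratings)               # total number of ratings kept
recipe_ratings['Item'].nunique()  # number of distinct recipes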
Model Creation
Before we can start working with Surprise, we have to install the package in our notebook environment.
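If it isn't already available, the package can be installed with pip (in a notebook cell, prefix the command with !):

!pip install scikit-surprise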
The package surprise includes a number of prediction algorithms that will assist us in developing the recommendation system and selecting a number of recipes that a given user might enjoy. We have the option of using basic collaborative filtering algorithms (KNN) or Matrix Factorization algorithms such as SVD or SVDpp.
KNN-based algorithms choose user or item neighbors based on similarity (some variants also take into account the mean or z-score normalization of each user's or item's ratings). We can specify whether we want to run the user-based or item-based algorithm using the user_based parameter.
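For instance, a user-based configuration could look like this sketch (later in the post we'll use the item-based variant):

from surprise import KNNBasic
sim_options = {'name': 'pearson',   # similarity measure: 'cosine', 'msd', 'pearson', ...
               'user_based': True}  # compute similarities between users
algo = KNNBasic(k=40, sim_options=sim_options)  # k nearest neighbors (40 is the default)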
Matrix Factorization algorithms translate the user-item matrix into a lower-dimensional space and predict ratings from there.
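For example, Surprise's SVD algorithm estimates each rating as r̂_ui = μ + b_u + b_i + q_i·p_u, where μ is the global mean, b_u and b_i are the user and item biases, and p_u and q_i are the latent factor vectors learned for the user and the item during training.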
More information on the definition and behavior of the algorithms can be found on the surprise documentation site.
We'll run some of them through cross-validation to compare the metrics (RMSE) and (MAE) and see how they work with this dataset.
As a baseline, let's run the most basic algorithm (NormalPredictor), which makes random predictions based on the distribution of the training ratings, and then see how the other algorithms improve the evaluation metrics.
from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise import SVD
from surprise import SVDpp
from surprise import KNNBasic
from surprise.model_selection import cross_validate
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(recipe_ratings[['User', 'Item', 'Rating']], reader)
trainSet = data.build_full_trainset()
algo = NormalPredictor()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
Let's see the predictions this algorithm yields for a given user. We need to fit the algorithm on the whole trainset and then make predictions on a test set containing the user-item pairs that do not exist in the training set. Such a test set can easily be built with the function build_anti_testset(), but in this case, to save resources and time, we are going to build a test set for just one user. We iterate over all the items in the trainSet and select those the user has not rated. We also need to fill in a rating value for those (user, item) pairs, so we use the trainSet global mean (which is the default value used by Surprise).
anti_testset_user = []
targetUser = 0  # inner_id of the target user
fillValue = trainSet.global_mean  # placeholder rating for unseen items
user_item_ratings = trainSet.ur[targetUser]  # (item_inner_id, rating) pairs
user_items = [item for (item, _) in user_item_ratings]
user_items
for iid in trainSet.all_items():
    if iid not in user_items:
        # store raw ids, since algo.test() expects raw user/item ids
        anti_testset_user.append((trainSet.to_raw_uid(targetUser), trainSet.to_raw_iid(iid), fillValue))
algo.fit(trainSet)  # fit on the full trainset before predicting
predictions = algo.test(anti_testset_user)
predictions[0]
Let's see the 10 recipes with the highest estimated ratings for this user. I like to convert the predictions object into a DataFrame so that I can work with it more easily.
pred = pd.DataFrame(predictions)
pred.sort_values(by=['est'],inplace=True,ascending = False)
recipe_list = pred.head(10)['iid'].to_list()
recipe_data.loc[recipe_list]
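Since we will repeat these prediction steps for every algorithm we try, a small helper like this sketch could wrap them up (a hypothetical convenience; the cells below repeat the steps explicitly to keep each one self-contained):

def top_n_recipes(algo, testset, n=10):
    # predict ratings for the unseen (user, item) pairs and return
    # the n recipes with the highest estimated rating
    preds = pd.DataFrame(algo.test(testset))
    preds = preds.sort_values(by='est', ascending=False)
    return recipe_data.loc[preds.head(n)['iid'].to_list()]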
OK, with that baseline in place, let's check whether other algorithms can improve the metrics. Let's try a neighbourhood-based algorithm (KNNBasic), computing similarities between items.
sim_options = {'name': 'cosine',
'user_based': False # compute similarities between items
}
algo = KNNBasic(sim_options=sim_options)
# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
This algorithm clearly outperforms our baseline. As can be seen, the MAE and RMSE means are better (lower) than those of NormalPredictor. Let's see which recipes this algorithm recommends.
algo.fit(trainSet)  # again, fit on the full trainset before predicting
predictions = algo.test(anti_testset_user)
pred = pd.DataFrame(predictions)
pred.sort_values(by=['est'],inplace=True,ascending = False)
recipe_list = pred.head(10)['iid'].to_list()
recipe_data.loc[recipe_list]
Let's take a look at a Matrix Factorization algorithm now.
algo = SVD()
# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
We appear to have improved slightly on the KNNBasic algorithm. The mean MAE is similar, but the RMSE improved, meaning fewer large errors in our rating predictions.
algo.fit(trainSet)
predictions = algo.test(anti_testset_user)
pred = pd.DataFrame(predictions)
pred.sort_values(by=['est'],inplace=True,ascending = False)
recipe_list = pred.head(10)['iid'].to_list()
recipe_data.loc[recipe_list]
Parameter tuning with GridSearchCV
Scikit-surprise also allows us to tune the algorithms through GridSearchCV, which executes the algorithm repeatedly over a predefined grid of parameter values and returns the best set of parameters according to the chosen error metrics.
from surprise.model_selection import GridSearchCV
param_grid = {'n_factors': [100,150],
'n_epochs': [20,25,30],
'lr_all':[0.005,0.01,0.1],
'reg_all':[0.02,0.05,0.1]}
grid_search = GridSearchCV(SVD, param_grid, measures=['rmse','mae'], cv=3)
grid_search.fit(data)
Let's see the scores for the best parameters found.
print(grid_search.best_score['rmse'])
print(grid_search.best_score['mae'])
Because the model takes time to run, it's a good idea to save it to disk so we can reuse it and save time.
# save the model to disk
pickle.dump(grid_search, open('/work/surprise_grid_search_svd.sav', 'wb'))
#Load the model from disk
grid_search = pickle.load(open('/work/surprise_grid_search_svd.sav', 'rb'))
Let's take a look at the best parameters found by GridSearchCV.
print(grid_search.best_params['rmse'])
We can now repeat the cross-validation with the best parameters and compare the results.
algo = grid_search.best_estimator['rmse']
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
By tuning the model's parameters, we were able to slightly outperform the previous SVD results.
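As a final step, we could fit the tuned model on the full trainset and look at its recommendations for our target user, mirroring the earlier cells:

algo.fit(trainSet)
predictions = algo.test(anti_testset_user)
pred = pd.DataFrame(predictions)
pred.sort_values(by=['est'], inplace=True, ascending=False)
recipe_data.loc[pred.head(10)['iid'].to_list()]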
Conclusion
With scikit-surprise, we learned how to create a simple Recommender System model.
The only data required to build a recommender system are a list of items and a list of the ratings users gave to those items. For this, we downloaded an appropriate dataset.
We learned how to prepare data and generate a dataset suitable for scikit-surprise in order to compute user or item similarity, estimate user ratings for items, and build recommendations from there.
We’ve also experimented with several algorithms to see what metrics they provide and how to fine-tune the algorithms’ settings to improve our metrics.
More information on how to customize Surprise algorithms to create more reliable recommender systems can be found in our next post.
The data for this project was obtained via Kaggle. Please see the following citation:
Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni, Julian McAuley. Generating Personalized Recipes from Historical User Preferences. EMNLP, 2019.