In the article Using Scikit-Surprise to Create a Simple Recipe Collaborative Filtering Recommender System, we built the simplest possible recommender system with the scikit-surprise package and saw how to use its built-in algorithms, such as KNN or SVD.
I’d like to take my recommender systems practice a step further and attempt to create my own prediction algorithm. Surprise allows you to override its core classes and methods in order to tailor your own algorithm and try to improve the recommender system’s outcomes, or at the very least get it closer to what you want from your own recommender system. It’s important to remember that recommender systems aren’t only about accuracy; they’re also about knowing the recommendations you want to make to your clients, which can differ from one company to the next.
In the end, the only reliable metric for a recommender system is testing it with real users and seeing how they react to your recommendations. So in this post I'll focus on building my own recommender system that recommends recipes similar in content to the ones a user has rated previously (a Content-Based recommender system).
We'll use the content of the recipe collection to determine the degree of similarity. We could assess similarity in a variety of ways, but I'd like to use some NLP methods here, so we'll base our algorithm on the similarity of the recipe text, which includes the name (title), steps, and ingredients.
The first step is to use WordNet to tokenize and lemmatize the words in the recipes; then we'll use TfidfVectorizer to build a vector from the lemmatized vocabulary and calculate the recipes' cosine similarity. Finally, we'll customize our Surprise algorithm to find the recipes most similar to a given one and make recommendations based on them.
The first two sections (data loading and preparation) are identical to those described in our prior post. The model creation section contains the new content.
The first step is importing the necessary libraries.
import pandas as pd
import difflib
import numpy as np
import pickle
Loading Data
Let's load the two datasets we need again.
recipe_data = pd.read_csv('/work/RAW_recipes.csv',header=0,sep=",")
recipe_data.head()
user_data = pd.read_csv('/work/PP_users.csv',header=0,sep=",")
user_data.head()
Okay, we can see the data on each file. The column names are self-explanatory, so we can get started.
Data Preparation and Exploration
We must first prepare the data in a dataset that is compatible with Surprise. The Surprise algorithm will use this dataset to read the users, items, and recipe ratings. The ratings are required by the dataset, but we won't actually use them; I'll explain why later in this post.
The first step is to write a function that reads the items (recipes) and user ratings.
def getRecipeRatings(idx):
    # Parse the stringified item and rating lists stored for this user
    user_items = [int(s) for s in user_data.loc[idx]['items'].replace('[','').replace(']','').replace(',','').split()]
    user_ratings = [float(s) for s in user_data.loc[idx]['ratings'].replace('[','').replace(']','').replace(',','').split()]
    df = pd.DataFrame(list(zip(user_items, user_ratings)), columns=['Item','Rating'])
    df.insert(loc=0, column='User', value=user_data.loc[idx].u)
    return df
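As a quick sanity check (this call is just illustrative, not part of the original pipeline), we can look at the parsed ratings of the first user:
# Quick sanity check: parsed items and ratings for the first user
getRecipeRatings(0).head()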
In this step we build a dataset with one row per User, Item, and Rating combination. We only need to run this piece of code once, so it is commented out; for subsequent runs the saved dataset is read back with pickle.
#recipe_ratings = pd.DataFrame(columns = ['User','Item','Rating'])
#for idx,row in user_data.iterrows():
# recipe_ratings = recipe_ratings.append(getRecipeRatings(row['u']),ignore_index=True)
Pickle saves the dataset to disk (first run), then reads it for subsequent runs.
#recipe_ratings.to_pickle('/work/recipe_ratings.pkl')
recipe_ratings = pd.read_pickle('/work/recipe_ratings.pkl')
Let's check the rating distribution.
import seaborn as sns
sns.barplot(x=recipe_ratings.Rating.value_counts().index, y=recipe_ratings.Rating.value_counts())
Good, we see that the majority of the ratings are 5.0, indicating that many users are satisfied with the recipes.
To reduce the dataset size and save time, we keep only the recipes with more than 30 ratings.
recipe_counts = recipe_ratings.groupby(['Item']).size()
filtered_recipes = recipe_counts[recipe_counts>30]
filtered_recipes_list = filtered_recipes.index.tolist()
len(filtered_recipes_list)
recipe_ratings = recipe_ratings[recipe_ratings['Item'].isin(filtered_recipes_list)]
recipe_ratings.count()
The ratings distribution in the filtered dataset is similar to the distribution in the entire dataset.
sns.barplot(x=recipe_ratings.Rating.value_counts().index, y=recipe_ratings.Rating.value_counts())
Identifying the Similarity Between Recipes
Let's create our custom model with scikit-surprise. The first step is creating a dataset with the filtered recipes.
recipe_filtered = recipe_data.loc[filtered_recipes_list]
recipe_filtered
len(recipe_filtered)
Let's import the nltk libraries and download the necessary packages.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('words')
nltk.download('omw-1.4')
Let's load the words corpus from nltk, a collection of known English words, so we can later filter out non-English and odd words from our recipe text.
words = set(nltk.corpus.words.words())
Let's start by creating the functions that extract the terms we're looking for. These functions do the following:
The first stage is tokenizing the sentences: we extract the words using RegexpTokenizer and then remove the stopwords (words that are very common but carry little meaning, such as conjunctions or prepositions).
The second stage is lemmatizing the sentence, which reduces each word form to its base word; in this case we keep only verbs and nouns (using the nltk POS tagger).
lemmatizer = WordNetLemmatizer()
def nltk_pos_tagger(nltk_tag):
    # Map an nltk POS tag to the corresponding WordNet tag (verbs and nouns only)
    if nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    else:
        return None
def tokenize_sentence(sentence):
    # Extract alphabetic tokens and drop English stopwords
    tokenizer = nltk.RegexpTokenizer(r"[^\d\W]+")
    tokenized = tokenizer.tokenize(sentence)
    stopwords = nltk.corpus.stopwords.words('english')
    finalsentence = [word for word in tokenized if word not in stopwords]
    return finalsentence
def lemmatize_sentence(sentence):
    # POS-tag the tokens and lemmatize, keeping only verbs and nouns
    nltk_tagged = nltk.pos_tag(sentence)
    wordnet_tagged = map(lambda x: (x[0], nltk_pos_tagger(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is not None:
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return lemmatized_sentence
def tokenize_lemmatize(sentence):
    # Tokenize, lemmatize, keep only known English words, and deduplicate
    tokenized = tokenize_sentence(sentence)
    lemmatized = lemmatize_sentence(tokenized)
    selectedwords = [word for word in lemmatized if word in words]
    final = list(dict.fromkeys(selectedwords))
    return final
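As a quick illustration of what the pipeline produces (the example sentence is mine, and the exact output depends on the nltk data you downloaded):
# Only nouns and verbs survive, lemmatized, filtered to known English words, and deduplicated
tokenize_lemmatize("preheat the oven and bake the chicken breasts until golden")
# expected output is something like ['oven', 'bake', 'chicken', 'breast']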
Let's use the previous functions to construct a new column containing the lemmatized words from the recipe name, steps, and ingredients.
recipe_filtered['recsys'] = recipe_filtered.apply(lambda row: tokenize_lemmatize(row['name']+row['steps']+row['ingredients']),axis=1)
The next step is to calculate a similarity score between the recipes. To do so, we vectorize the words obtained in the previous steps, that is, we assign each recipe a numeric vector based on the words it contains. We can accomplish this with scikit-learn's TfidfVectorizer. The result is a matrix with one row (vector) per recipe.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
recipe_filtered['recsys'] = recipe_filtered['recsys'].fillna('')
tfidf_matrix = tfidf.fit_transform(recipe_filtered['recsys'].astype(str))
tfidf_matrix.shape
We can check the words from the recipes that we've vectorized.
tfidf.get_feature_names_out()[1:100]
We can now use the linear_kernel function, which is equivalent to cosine similarity in this case, to calculate the similarity between the recipes. The cosine similarity between two vectors is the cosine of the angle between them. Visit the sklearn metrics page for further information.
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel
recipe_cs = linear_kernel(tfidf_matrix, tfidf_matrix)
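Because TfidfVectorizer L2-normalizes its rows by default, the plain dot product computed by linear_kernel is exactly the cosine similarity, just cheaper to compute. We can verify this if we want (optional check, not part of the original pipeline):
# Optional check: for L2-normalized TF-IDF vectors, linear_kernel equals cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity
np.allclose(recipe_cs, cosine_similarity(tfidf_matrix, tfidf_matrix))  # should evaluate to True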
We can now see which recipes are the most similar to a particular one by sorting one of the rows of our matrix.
idx = (-recipe_cs[4]).argsort()[:10]
idx
As can be seen below, the recipes obtained appear to be similar. There's a lot of cheesecake!
recipe_filtered.iloc[idx]
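If you want to explore other recipes, the same argsort logic can be wrapped in a small helper; similar_recipes below is a hypothetical convenience function (the name and signature are mine), which also skips the recipe itself:
def similar_recipes(pos, n=10):
    # pos is the positional index of a recipe in recipe_filtered
    idx = (-recipe_cs[pos]).argsort()[1:n+1]  # position 0 is the recipe itself (similarity 1.0)
    return recipe_filtered.iloc[idx][['name']]
similar_recipes(4)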
Customizing our Surprise Algorithm
Let's make some changes to the Surprise base algorithm. The first step is to load the required libraries.
from surprise import Dataset
from surprise import Reader
from surprise import PredictionImpossible
from surprise import AlgoBase
To override the base algorithm, we first create a class that inherits from AlgoBase and then override its init, fit, and estimate methods.
The init method simply calls the base init method.
The fit method handles the calculation of similarities. We must call the base fit method before working with the similarity matrix. Since we have already precalculated the similarity matrix, we simply assign it to a class attribute. (We could also compute the similarities inside this method.)
Finally, the estimate method must return an estimated rating for a given user-item (recipe) pair. To accomplish this, we compute the total similarity of the given item to the other items rated by the user.
Note: The proper way to do this would be to compute an average of the similarities weighted by the user's ratings. The problem with our dataset is that the vast majority of ratings are 4 or 5, which produces very high estimated ratings (lots of estimated 5s) and prevents us from building a properly ordered list of recommended recipes. So our estimated rating will simply be the total similarity of the given recipe to the recipes the user has rated. The resulting estimates are extremely low, but the value itself is unimportant; we only need a "ranking" of the best recipes for this user, and that is enough for it to work here.
class recipeAlgo(AlgoBase):

    def __init__(self):
        # Always call base method before doing anything.
        AlgoBase.__init__(self)

    def fit(self, trainset):
        # Here again: call base method before doing anything.
        AlgoBase.fit(self, trainset)
        # Assign the precalculated content-based similarity matrix
        self.similarities = recipe_cs
        return self

    def estimate(self, u, i):
        if not (self.trainset.knows_user(u) and self.trainset.knows_item(i)):
            raise PredictionImpossible('User and/or item is unknown.')
        sim_recipes = []
        # Look up the similarity between the input item (i)
        # and the recipes the user (u) has rated
        item_idx = recipe_filtered.index.get_loc(self.trainset.to_raw_iid(i))
        for rating in self.trainset.ur[u]:
            rating_idx = recipe_filtered.index.get_loc(self.trainset.to_raw_iid(rating[0]))
            recipeSimilarity = self.similarities[item_idx, rating_idx]
            sim_recipes.append((recipeSimilarity, rating[1]))
        # Sort in descending order so the most similar recipes come first
        highest_sims = sorted(sim_recipes, key=lambda x: x[0], reverse=True)
        totalSimilarity = 0
        # Now we use the similarities to predict a rating
        for (similarity, rating) in highest_sims[:10]:
            totalSimilarity += similarity
        return totalSimilarity
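For reference, the similarity-weighted average mentioned in the note above could look roughly like the sketch below (we don't use it here because the 4-5 star skew flattens the ranking; the function name and signature are mine):
# Sketch of the alternative estimate: average of the user's ratings weighted by similarity
def weighted_estimate(highest_sims, k=10):
    top = highest_sims[:k]
    total_sim = sum(sim for sim, _ in top)
    if total_sim == 0:
        raise PredictionImpossible('No similar items rated by this user.')
    return sum(sim * rating for sim, rating in top) / total_sim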
OK, let's fit our algorithm. All we need to do is create our data object, build a Surprise trainset from it, and fit the algorithm.
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(recipe_ratings[['User', 'Item', 'Rating']], reader)
trainSet = data.build_full_trainset()
algo = recipeAlgo()
algo.fit(trainSet)
We have successfully fitted our algorithm. Let's now create a test set with only one user to see how our recommender system performs. Our test set will include all of the recipes that the user hasn't rated yet, allowing our recommender system to assign a predicted rating to each of them and sort the best recipes for this user.
anti_testset_user = []
targetUser = 0 #inner_id of the target user
fillValue = trainSet.global_mean
user_item_ratings = trainSet.ur[targetUser]
user_items = [item for (item,_) in (user_item_ratings)]
user_items
ratings = trainSet.all_ratings()
for iid in trainSet.all_items():
    if iid not in user_items:
        anti_testset_user.append((trainSet.to_raw_uid(targetUser), trainSet.to_raw_iid(iid), fillValue))
len(anti_testset_user)
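As a side note, if we needed an anti test set for every user instead of a single one, Surprise already provides a built-in helper (commented out here because the full anti test set can be very large):
#anti_testset_all = trainSet.build_anti_testset()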
Let's call the test method of our recommender system.
predictions = algo.test(anti_testset_user)
As you can see below, the estimated ratings are extremely low; our algorithm computes the estimated rating by adding up the similarities between each candidate recipe and the most similar recipes the user has already rated. Even so, we get a good ranking of the best recipes for our test user based on content similarity.
pred = pd.DataFrame(predictions)
pred.sort_values(by=['est'],inplace=True,ascending = False)
pred
recipe_list = pred.head(10)['iid'].to_list()
recipe_data.loc[recipe_list]
Above this text are the suggested recipes, and below are the actual recipes our user has rated. Just by looking at the titles we can see similarities: the user appears to enjoy braised meat, garlic, and chicken, and these ingredients appear in the recommended recipes.
recipe_data.loc[recipe_ratings[recipe_ratings['User']==0]['Item']]
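If we want to reuse this flow for other users, everything above can be packaged into a small helper; recommend_for_user below is a hypothetical convenience function (name and signature are mine) built only from the objects already defined in this post:
def recommend_for_user(inner_uid, n=10):
    # Build the anti test set for one user, predict, and return the top-n recipes
    rated = {item for (item, _) in trainSet.ur[inner_uid]}
    fill = trainSet.global_mean
    testset = [(trainSet.to_raw_uid(inner_uid), trainSet.to_raw_iid(iid), fill)
               for iid in trainSet.all_items() if iid not in rated]
    preds = pd.DataFrame(algo.test(testset)).sort_values(by='est', ascending=False)
    return recipe_data.loc[preds.head(n)['iid'].to_list()]
recommend_for_user(0)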
Conclusion
We've seen how to use the Surprise package to create a recommender system. We concentrated on the data content and used NLP techniques to determine the degree of similarity between our documents (recipes).
We learned how to easily customize the base surprise algorithm to help us get recommendations for users and build a recommender system.
As we've seen, all we need to do is return an estimated rating for the items the user hasn't rated. This can be accomplished in a variety of ways; we chose to calculate it from the similarity between items. The rating does not have to be a "real" rating; we simply need a list sorted by rating in order to generate recommendations that meet our criteria.
The data for this project was obtained via Kaggle. Please see the following citation:
Generating Personalized Recipes from Historical User Preferences
Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni, Julian McAuley
EMNLP, 2019