Using NLP to Create a Recommender System


In the article Using Scikit-Surprise to Create a Simple Recipe Collaborative Filtering Recommender System we developed the simplest recommender system using the scikit-surprise package and saw how to use the built-in algorithms it contains, such as KNN or SVD.

I’d like to take my recommender systems practice a step further and attempt to create my own prediction algorithm. Surprise allows you to override its core classes and methods in order to tailor your own algorithm and try to improve the recommender system’s outcomes, or at the very least get it closer to what you want from your own recommender system. It’s important to remember that recommender systems aren’t only about accuracy; they’re also about knowing the recommendations you want to make to your clients, which can differ from one company to the next.

The only truly good metric for a recommender system is a user test that shows how people react to your recommendations. For that reason, in this post I'll focus on building my own recommender system that recommends recipes similar in content to the ones the users have rated previously (a content-based recommender system).

We'll use the content of the recipe collection to determine the degree of similarity. Similarity can be assessed in a variety of ways, but I'd like to use some NLP methods here, so we'll base our algorithm on the similarity of the recipe text, which includes the title, steps, and ingredients.

The first step is to tokenize the words in the recipes and lemmatize them with NLTK's WordNet lemmatizer. Then we'll use TfidfVectorizer to generate a vector for each recipe from the lemmatized vocabulary and calculate the cosine similarity between the recipes. Finally, we'll tweak our Surprise algorithm to find the most similar recipes to a given one and provide recommendations based on them.

The first two sections (data loading and preparation) are identical to those described in our prior post. The model creation section contains the new content.


The first step is importing the necessary libraries.

In [ ]:
import pandas as pd
import difflib
import numpy as np
import pickle

Loading Data

Let's load the two datasets we need again.

In [ ]:
recipe_data = pd.read_csv('/work/RAW_recipes.csv',header=0,sep=",")
recipe_data.head()
Out[ ]:
name id minutes contributor_id submitted tags nutrition n_steps steps description ingredients n_ingredients
0 arriba baked winter squash mexican style 137739 55 47892 2005-09-16 ['60-minutes-or-less', 'time-to-make', 'course... [51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0] 11 ['make a choice and proceed with recipe', 'dep... autumn is my favorite time of year to cook! th... ['winter squash', 'mexican seasoning', 'mixed ... 7
1 a bit different breakfast pizza 31490 30 26278 2002-06-17 ['30-minutes-or-less', 'time-to-make', 'course... [173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0] 9 ['preheat oven to 425 degrees f', 'press dough... this recipe calls for the crust to be prebaked... ['prepared pizza crust', 'sausage patty', 'egg... 6
2 all in the kitchen chili 112140 130 196586 2005-02-25 ['time-to-make', 'course', 'preparation', 'mai... [269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0] 6 ['brown ground beef in large pot', 'add choppe... this modified version of 'mom's' chili was a h... ['ground beef', 'yellow onions', 'diced tomato... 13
3 alouette potatoes 59389 45 68585 2003-04-14 ['60-minutes-or-less', 'time-to-make', 'course... [368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0] 11 ['place potatoes in a large pot of lightly sal... this is a super easy, great tasting, make ahea... ['spreadable cheese with garlic and herbs', 'n... 11
4 amish tomato ketchup for canning 44061 190 41706 2002-10-25 ['weeknight', 'time-to-make', 'course', 'main-... [352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0] 5 ['mix all ingredients& boil for 2 1 / 2 hours ... my dh's amish mother raised him on this recipe... ['tomato juice', 'apple cider vinegar', 'sugar... 8
In [ ]:
user_data = pd.read_csv('/work/PP_users.csv',header=0,sep=",")
user_data.head()
Out[ ]:
u techniques items n_items ratings n_ratings
0 0 [8, 0, 0, 5, 6, 0, 0, 1, 0, 9, 1, 0, 0, 0, 1, ... [1118, 27680, 32541, 137353, 16428, 28815, 658... 31 [5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, ... 31
1 1 [11, 0, 0, 2, 12, 0, 0, 0, 0, 14, 5, 0, 0, 0, ... [122140, 77036, 156817, 76957, 68818, 155600, ... 39 [5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, ... 39
2 2 [13, 0, 0, 7, 5, 0, 1, 2, 1, 11, 0, 1, 0, 0, 1... [168054, 87218, 35731, 1, 20475, 9039, 124834,... 27 [3.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, ... 27
3 3 [498, 13, 4, 218, 376, 3, 2, 33, 16, 591, 10, ... [163193, 156352, 102888, 19914, 169438, 55772,... 1513 [5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 5.0, 5.0, 5.0, ... 1513
4 4 [161, 1, 1, 86, 93, 0, 0, 11, 2, 141, 0, 16, 0... [72857, 38652, 160427, 55772, 119999, 141777, ... 376 [5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 5.0, 4.0, 5.0, ... 376

Okay, we can see the data on each file. The column names are self-explanatory, so we can get started.

Data Preparation and Exploration

We must first prepare the data as a dataset that is compatible with Surprise. The Surprise algorithm will use this dataset to read the items, users, and recipe ratings. The ratings are required for the dataset, but we will not use them; I'll explain why later in this post.

The first step is to write a function that reads the items (recipes) and user ratings.

In [ ]:
def getRecipeRatings(idx):
  user_items = [int(s) for s in user_data.loc[idx]['items'].replace('[','').replace(']','').replace(',','').split()]
  user_ratings = [float(s) for s in user_data.loc[idx]['ratings'].replace('[','').replace(']','').replace(',','').split()]
  df = pd.DataFrame(list(zip(user_items,user_ratings)),columns = ['Item','Rating'])
  df.insert(loc=0,column='User',value = user_data.loc[idx].u)
  return df
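
The chained replace calls work because the items and ratings columns store Python-style list literals as strings. As an alternative sketch (not the code used in this post), the standard library's ast.literal_eval can parse such strings directly. The cell values below are made up for illustration.

```python
import ast

# Hypothetical cell values, shaped like the 'items' and 'ratings' columns
items_cell = "[1118, 27680, 32541]"
ratings_cell = "[5.0, 5.0, 4.0]"

# literal_eval safely evaluates a string that contains a Python literal
user_items = [int(i) for i in ast.literal_eval(items_cell)]
user_ratings = [float(r) for r in ast.literal_eval(ratings_cell)]

# Pair each item with its rating, as getRecipeRatings does with zip
pairs = list(zip(user_items, user_ratings))
print(pairs)
```

This avoids depending on the exact punctuation of the stored strings, at the cost of one extra import.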

In this step we'll build a dataset with one row for each User, Item, and Rating. We only run this piece of code once, so it is commented out. For later runs, the saved dataset is read back with pickle.

In [ ]:
# DataFrame.append was removed in pandas 2.0, so the per-user frames are combined with pd.concat
#recipe_ratings = pd.concat(
#  [getRecipeRatings(idx) for idx in user_data.index], ignore_index=True)

Pickle saves the dataset to disk (first run), then reads it for subsequent runs.

In [ ]:
#recipe_ratings.to_pickle('/work/recipe_ratings.pkl')
recipe_ratings = pd.read_pickle('/work/recipe_ratings.pkl')

Let's check the rating distribution.

In [ ]:
import seaborn as sns
sns.barplot(x=recipe_ratings.Rating.value_counts().index, y=recipe_ratings.Rating.value_counts())
Out[ ]:

Good, we see that the majority of the ratings are 5.0, indicating that many users are satisfied with the recipes.

Only the recipes with more than 30 ratings are selected to reduce the dataset size and save time.

In [ ]:
recipe_counts = recipe_ratings.groupby(['Item']).size()
filtered_recipes = recipe_counts[recipe_counts>30]
filtered_recipes_list = filtered_recipes.index.tolist()
len(filtered_recipes_list)
Out[ ]:
2349
In [ ]:
recipe_ratings = recipe_ratings[recipe_ratings['Item'].isin(filtered_recipes_list)]
In [ ]:
recipe_ratings.count()
Out[ ]:
User      174359
Item      174359
Rating    174359
dtype: int64
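
To make the filtering rule explicit, here is a minimal sketch of the same idea in plain Python. The rows and the threshold of 2 are made up for the toy data; the post itself uses a threshold of 30 on the real ratings.

```python
from collections import Counter

# Made-up (user, item, rating) rows
ratings = [
    (0, "a", 5.0), (1, "a", 4.0), (2, "a", 5.0),
    (0, "b", 3.0), (1, "b", 5.0),
    (2, "c", 4.0),
]
min_ratings = 2  # toy threshold, assumed for this example only

# Count how many ratings each item received
counts = Counter(item for _, item, _ in ratings)

# Keep items with strictly more ratings than the threshold,
# mirroring the recipe_counts > 30 comparison
kept_items = {item for item, n in counts.items() if n > min_ratings}
filtered = [row for row in ratings if row[1] in kept_items]
print(sorted(kept_items), len(filtered))
```

Note that the comparison is strict, so an item with exactly the threshold number of ratings is dropped.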

The ratings distribution in the filtered dataset is similar to the distribution in the entire dataset.

In [ ]:
sns.barplot(x=recipe_ratings.Rating.value_counts().index, y=recipe_ratings.Rating.value_counts())
Out[ ]:

Identifying the Similarity Between Recipes

Let's create our custom model with scikit-surprise. The first step is creating a dataset with the filtered recipes.

In [ ]:
recipe_filtered = recipe_data.loc[filtered_recipes_list]
In [ ]:
recipe_filtered
Out[ ]:
name id minutes contributor_id submitted tags nutrition n_steps steps description ingredients n_ingredients
66 my muffuletta sandwich 78655 20 12875 2003-12-12 ['30-minutes-or-less', 'time-to-make', 'course... [181.1, 26.0, 6.0, 17.0, 2.0, 11.0, 2.0] 3 ['mix everything in food processor', 'chop fin... watched a documentary about the ['ciabatta', 'provolone cheese', 'genoa salami... 17
156 better than cinnabon cinnamon rolls 149887 85 245081 2006-01-01 ['time-to-make', 'course', 'preparation', 'occ... [365.4, 23.0, 105.0, 10.0, 10.0, 46.0, 17.0] 18 ['combine warm water , sugar and yeast', 'set ... i snagged this off of a website and made my ow... ['active dry yeast', 'warm water', 'granulated... 13
193 flipped roast turkey 268169 195 120121 2007-11-27 ['time-to-make', 'course', 'main-ingredient', ... [498.9, 40.0, 3.0, 8.0, 119.0, 41.0, 0.0] 15 ['preheat oven to 450f', 'remove neck and gibl... of all the recipes i've posted on 'zaar, one o... ['whole turkey', 'olive oil', 'butter', 'onion... 7
267 sangria fruit cups non alcoholic 232044 260 327600 2007-06-03 ['course', 'main-ingredient', 'preparation', '... [122.1, 0.0, 104.0, 4.0, 4.0, 0.0, 9.0] 8 ['bring orange juice to a boil', 'add to jelly... a wonderful light dessert recipe from the peop... ['orange juice', 'strawberry gelatin', 'peach ... 9
279 splenda d cheesecake sugar free low carb 185799 80 340556 2006-09-13 ['time-to-make', 'course', 'main-ingredient', ... [566.5, 80.0, 16.0, 14.0, 21.0, 150.0, 4.0] 27 ['grahm cracker crumb crust:', 'mix together: ... this is my own recipe for a yummy, creamy, thi... ['graham cracker crumbs', 'splenda granular', ... 13
... ... ... ... ... ... ... ... ... ... ... ... ...
177548 rotini pasta with broccoli cream sauce 89894 40 24386 2004-04-24 ['60-minutes-or-less', 'time-to-make', 'course... [913.3, 58.0, 36.0, 16.0, 56.0, 110.0, 41.0] 10 ['cook broccoli in boiling salted water until ... a sure to please pasta dish made with a garlic... ['fresh broccoli', 'rotini pasta', 'garlic', '... 7
177592 rotkrautsalat red cabbage salad 54471 40 54716 2003-02-21 ['bacon', '60-minutes-or-less', 'time-to-make'... [245.9, 30.0, 21.0, 23.0, 10.0, 25.0, 3.0] 8 ['fry bacon in medium-size fry pan until crisp... NaN ['bacon', 'vegetable oil', 'sugar', 'salt', 'v... 9
177884 rum raisin muffins 373886 45 65720 2009-05-23 ['60-minutes-or-less', 'time-to-make', 'course... [233.8, 12.0, 75.0, 5.0, 6.0, 23.0, 12.0] 17 ['soak raisins and currants in rum to cover ov... rum-soaked raisins and currants dot these glaz... ['golden raisin', 'dried currant', 'dark rum',... 15
178034 russian stroganoff with bacon 51732 105 67992 2003-01-16 ['weeknight', 'time-to-make', 'course', 'main-... [528.7, 60.0, 12.0, 32.0, 79.0, 85.0, 1.0] 7 ['mix flour , salt and pepper and place the ro... this recipe came from betty feezor's show on w... ['round steaks', 'salt and pepper', 'flour', '... 8
178044 russian tea non tea 102396 25 116469 2004-10-20 ['30-minutes-or-less', 'time-to-make', 'course... [187.6, 0.0, 179.0, 0.0, 3.0, 0.0, 16.0] 8 ['in a large saucepan , bring 2 cups of water ... although this doesn't have tea in it, it is a ... ['water', 'ground cinnamon', 'ground ginger', ... 9

2349 rows × 12 columns

In [ ]:
len(recipe_filtered)
Out[ ]:
2349

Let's import the nltk libraries and download the necessary packages.

In [ ]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('words')
nltk.download('omw-1.4')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.
Out[ ]:
True

Let's load the words corpus from nltk, which is a collection of known English words, so we can filter out non-English and strange words from our recipe text afterwards.

In [ ]:
words = set(nltk.corpus.words.words())

Let's create the functions that extract the terms we're looking for. These functions perform the following steps:

  • The first stage is tokenizing the sentences: we extract the words from the text using RegexpTokenizer, and then we eliminate the stopwords (words that are very common but carry little meaning, such as conjunctions or prepositions).

  • The second stage is lemmatizing the sentence, which means reducing each word form to its base word. In this case we keep only verbs and nouns, identified with the nltk POS tagger.

In [ ]:
lemmatizer = WordNetLemmatizer()

def nltk_pos_tagger(nltk_tag):
    if nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    else:          
        return None

def tokenize_sentence(sentence):
    tokenizer = nltk.RegexpTokenizer(r"[^\d\W]+")
    tokenized = tokenizer.tokenize(sentence)
    stopwords = nltk.corpus.stopwords.words('english')
    finalsentence = [word for word in tokenized if word not in stopwords]
    return(finalsentence)

def lemmatize_sentence(sentence):

    nltk_tagged = nltk.pos_tag(sentence)  
    wordnet_tagged = map(lambda x: (x[0], nltk_pos_tagger(x[1])), nltk_tagged)
    lemmatized_sentence = []
    
    for word, tag in wordnet_tagged:
        if not (tag is None):    
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return (lemmatized_sentence)

def tokenize_lemmatize(sentence):
  tokenized = tokenize_sentence(sentence)
  lemmatized = lemmatize_sentence(tokenized)
  selectedwords = [word for word in lemmatized if word in words]
  final = list(dict.fromkeys(selectedwords))
  return(final)
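
The tokenizing and de-duplication steps can be tried without NLTK. The sketch below is a simplified stand-in, not the functions above: it assumes that RegexpTokenizer with its default settings returns every match of the pattern (which re.findall also does), uses a tiny made-up stopword set instead of the NLTK list, and skips lemmatization entirely.

```python
import re

# Same pattern as in tokenize_sentence: runs of characters that are
# neither digits nor non-word characters
TOKEN_PATTERN = r"[^\d\W]+"

# Tiny illustrative stopword set (assumption; NLTK's English list is much longer)
TOY_STOPWORDS = {"to", "the", "and", "into"}

def tokenize_toy(sentence):
    """Extract word tokens and drop the toy stopwords."""
    tokens = re.findall(TOKEN_PATTERN, sentence)
    return [t for t in tokens if t not in TOY_STOPWORDS]

def unique_in_order(tokens):
    """Remove duplicates while keeping first-seen order, as dict.fromkeys does above."""
    return list(dict.fromkeys(tokens))

text = "preheat oven to 425 degrees f, press dough into the pan and bake the dough"
tokens = tokenize_toy(text)
final = unique_in_order(tokens)
print(final)
```

The number 425 disappears because the pattern excludes digits, and the repeated word "dough" is kept only once.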

Let's use the previous functions to construct a new column containing the lemmatized words from the recipe name, steps, and ingredients.

In [ ]:
recipe_filtered['recsys'] = recipe_filtered.apply(lambda row: tokenize_lemmatize(row['name']+row['steps']+row['ingredients']),axis=1)

The next step is to calculate a similarity score between the recipes. To do so, we must vectorize the words we obtained in the previous steps, which means assigning a numeric vector to each recipe based on the words it contains. We can accomplish this with scikit-learn's TfidfVectorizer class. We get a matrix with one vector for each recipe (the matrix contains one row per recipe and one column per word in the vocabulary).

In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
recipe_filtered['recsys'] = recipe_filtered['recsys'].fillna('')
tfidf_matrix = tfidf.fit_transform(recipe_filtered['recsys'].astype(str))
In [ ]:
tfidf_matrix.shape
Out[ ]:
(2349, 2579)
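
To see what the numbers in such a matrix mean, here is a minimal pure-Python sketch of TF-IDF. It assumes the weighting that scikit-learn documents as its default (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, and L2-normalized rows); the three toy documents are made up, and the helper name tfidf_rows is hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows), one L2-normalized TF-IDF row per document."""
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears
    df = Counter(w for doc in docs for w in set(doc))
    # Smoothed inverse document frequency
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = [
    ["cheesecake", "graham", "cracker", "butter"],
    ["chicken", "garlic", "butter"],
    ["cheesecake", "cream", "cheese"],
]
vocab, rows = tfidf_rows(docs)
print(len(rows), len(vocab))  # one row per document, one column per word
```

Words that appear in fewer documents (such as "chicken") receive a higher weight than words shared by several documents (such as "butter").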

We can check the words from the recipes that we've vectorized.

In [ ]:
tfidf.get_feature_names_out()[1:100]
Out[ ]:
array(['absorb', 'absorption', 'accent', 'accompany', 'accord',
       'accumulate', 'accustom', 'achieve', 'acorn', 'act', 'activate',
       'ad', 'adapt', 'add', 'addition', 'adjust', 'adjustment',
       'advance', 'advise', 'agar', 'age', 'air', 'airtight', 'airy',
       'aka', 'ake', 'al', 'ala', 'alcohol', 'ale', 'allergic', 'allergy',
       'alley', 'alligator', 'allow', 'allspice', 'almond', 'alternate',
       'altitude', 'alum', 'aluminum', 'amaze', 'amber', 'ambrosia',
       'amino', 'anchovy', 'angel', 'angle', 'anise', 'anoint', 'apart',
       'appear', 'appearance', 'appetizer', 'apple', 'applesauce',
       'application', 'apply', 'approximate', 'apricot', 'area', 'arent',
       'armadillo', 'aroma', 'aromatize', 'arrange', 'arrangement',
       'arrow', 'arrowroot', 'artichoke', 'ash', 'aside', 'ask',
       'asparagus', 'assemble', 'assembly', 'assort', 'assure', 'ate',
       'atlas', 'atop', 'attach', 'attachment', 'attain', 'attempt',
       'aunt', 'autumn', 'avocado', 'avoid', 'baby', 'backbone', 'bacon',
       'bag', 'bagel', 'baggie', 'baguette', 'bailey', 'bake', 'baker'],
      dtype=object)

The linear_kernel function can now be used to calculate the similarity between the recipes. TfidfVectorizer normalizes each row to unit length by default, so the dot product computed by linear_kernel is equivalent to the cosine similarity in this case. The cosine similarity between two vectors is the cosine of the angle between them. Visit the sklearn metrics page for further information.

In [ ]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

recipe_cs = linear_kernel(tfidf_matrix, tfidf_matrix)
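
The reason a plain dot product is enough can be checked with a small pure-Python sketch. The two vectors are arbitrary examples: once both are scaled to unit length, their dot product equals their cosine similarity.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

cos_ab = cosine_similarity(a, b)
dot_of_unit_vectors = dot(l2_normalize(a), l2_normalize(b))
print(math.isclose(cos_ab, dot_of_unit_vectors))
```

With rows that are already normalized, skipping the division by the norms saves work on a matrix with thousands of rows.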

We can now see which recipes are the most similar to a particular one by sorting one of the rows of our matrix.

In [ ]:
idx = (-recipe_cs[4]).argsort()[:10]
idx
Out[ ]:
array([   4, 1123,  600,  108,  291,  717, 2289,  111, 1660,  730])
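
Note that the first index returned is 4, the query recipe itself, since every recipe is most similar to itself. A minimal sketch of the same ranking that skips the query is shown below; the similarity row and the helper name most_similar are made up for illustration.

```python
def most_similar(sim_row, query_idx, k=3):
    """Indices of the k highest similarities in sim_row, skipping the query itself."""
    order = sorted(range(len(sim_row)), key=lambda j: sim_row[j], reverse=True)
    return [j for j in order if j != query_idx][:k]

# Made-up similarity row for the item at position 2 (self-similarity is 1.0)
row = [0.10, 0.72, 1.00, 0.05, 0.64]
top = most_similar(row, query_idx=2, k=3)
print(top)
```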

As can be seen below, the recipes obtained appear to be similar. There's a lot of cheesecake!

In [ ]:
recipe_filtered.iloc[idx]
Out[ ]:
name id minutes contributor_id submitted tags nutrition n_steps steps description ingredients n_ingredients recsys
279 splenda d cheesecake sugar free low carb 185799 80 340556 2006-09-13 ['time-to-make', 'course', 'main-ingredient', ... [566.5, 80.0, 16.0, 14.0, 21.0, 150.0, 4.0] 27 ['grahm cracker crumb crust:', 'mix together: ... this is my own recipe for a yummy, creamy, thi... ['graham cracker crumbs', 'splenda granular', ... 13 [cheesecake, cracker, crumb, crust, mix, spice...
80236 elsie s cherry delight dessert 267164 30 274666 2007-11-21 ['30-minutes-or-less', 'time-to-make', 'course... [320.0, 25.0, 50.0, 6.0, 4.0, 49.0, 13.0] 11 ['combine butter and graham crackers', 'mix we... can be made with blueberry pie filling as an a... ['butter', 'graham crackers', 'cherry pie fill... 6 [cherry, delight, combine, butter, graham, cra...
39832 cheesecake boo raj 108865 390 56112 2005-01-20 ['time-to-make', 'course', 'preparation', 'for... [768.0, 92.0, 129.0, 22.0, 25.0, 166.0, 15.0] 16 ['use a 12" spring form pan- i\'ve used a 10" ... my beloved 8-year-old nephew nicknamed himself ['graham crackers', 'pecans', 'brown sugar', '... 12 [cheesecake, boo, raj, use, spring, form, pan,...
6098 apple peach breakfast bake 241280 30 512834 2007-07-18 ['30-minutes-or-less', 'time-to-make', 'course... [397.5, 27.0, 164.0, 9.0, 16.0, 46.0, 17.0] 8 ['preheat the oven to 400f', 'melt the butter ... this makes an easy, delicious, and elegant bre... ['butter', 'brown sugar', 'cinnamon', 'nutmeg'... 13 [apple, peach, breakfast, bake, preheat, f, bu...
18868 basset hound cheesecake 447526 80 870705 2011-01-27 ['time-to-make', 'course', 'main-ingredient', ... [414.9, 49.0, 92.0, 12.0, 13.0, 82.0, 8.0] 9 ['preheat oven to 350', 'mix graham cracker cr... we call this basset hound cheesecake because t... ['graham cracker crumbs', 'nuts', 'butter', 'c... 9 [basset, hound, cheesecake, preheat, mix, grah...
49659 chocolate cheesecake 17910 70 15609 2002-01-24 ['weeknight', 'time-to-make', 'course', 'cuisi... [4873.3, 441.0, 1214.0, 255.0, 204.0, 665.0, 1... 9 ['set aside one cup of dry cake mix', 'mix rem... this is one of the easy recipes i've gotten ov... ["devil's food cake mix", 'oil', 'eggs', 'crea... 9 [chocolate, cheesecake, set, cup, cake, mix, r...
172451 red velvet cheese cake 499710 685 2549237 2013-05-02 ['weeknight', 'course', 'main-ingredient', 'pr... [798.0, 81.0, 291.0, 19.0, 19.0, 150.0, 25.0] 18 ['stir together graham cracker crums , melted ... this is a recipe posted by my sister on facebo... ['chocolate graham cracker crumbs', 'butter', ... 13 [velvet, cheese, cake, graham, cracker, melt, ...
6446 apple cheesecake pie 14218 40 20754 2001-11-13 ['60-minutes-or-less', 'time-to-make', 'course... [368.6, 29.0, 123.0, 10.0, 10.0, 39.0, 15.0] 13 ['preheat oven to 325 degrees', 'combine cream... simply delicious! ['graham cracker pie crusts', 'cream cheese', ... 8 [apple, cheesecake, pie, preheat, degree, comb...
121514 lemon delight 322575 15 942142 2008-09-03 ['15-minutes-or-less', 'time-to-make', 'course... [397.2, 26.0, 165.0, 13.0, 11.0, 49.0, 18.0] 15 ['put 1 can of carnation milk in the freezer f... very light and fluffy, lemon desert. ['graham wafer crumbs', 'butter', 'brown sugar... 7 [delight, put, carnation, milk, freezer, prepa...
51060 chocolate mousse cake 1977 78042 65 106624 2003-12-06 ['weeknight', 'time-to-make', 'course', 'prepa... [385.9, 56.0, 17.0, 2.0, 18.0, 86.0, 5.0] 33 ['butter a 9 inch spring form pan', 'combine h... easy, almost flourlesss chocolate cake. ['hazelnuts', 'butter', 'semisweet chocolate',... 8 [chocolate, mousse, cake, butter, inch, spring...

Customizing Our Surprise Algorithm

Let's make some changes to the surprise base algorithm. The first step is to load the required libraries.

In [ ]:
from surprise import Dataset
from surprise import Reader
from surprise import PredictionImpossible
from surprise import AlgoBase

To override the base algorithm, we must first create a class that inherits from AlgoBase. The __init__, fit, and estimate methods must then be rewritten.

  • The __init__ method simply calls the base __init__ method.

  • The fit method handles the calculation of similarities. We must first invoke the base fit method. Because we have already calculated the similarity matrix, we simply assign it to an attribute of the instance. (We could also call the code that calculates the similarities from this method.)

  • Finally, the estimate method must return an estimated rating for a given user-item (recipe) pair. To accomplish this, we compute the total similarity of the given item to the items rated by the user, using at most 10 of them.

Note: The proper way to do this would be to calculate a similarity-weighted average of the user's ratings. The problem with our dataset is that the majority of the ratings are 4 or 5, so that approach produces very high estimated ratings (a lot of estimates of 5) and prevents us from building a properly ordered list of recommended recipes. Our estimated rating will therefore be just the total similarity of the given recipe with the recipes the user has rated. We will get extremely low estimated ratings, but the rating itself is unimportant here; we simply need a ranking of the best recipes for this user.
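
The difference between the two scoring options can be illustrated with a small sketch. The similarities and ratings are made up, and both helper functions are hypothetical stand-ins for the logic of an estimate method rather than part of Surprise.

```python
def weighted_average(sims, ratings):
    """Average of the ratings, weighted by similarity."""
    total = sum(sims)
    if total == 0:
        return 0.0
    return sum(s * r for s, r in zip(sims, ratings)) / total

def total_similarity(sims, k=10):
    """Sum of the k largest similarities; a ranking score, not a rating."""
    return sum(sorted(sims, reverse=True)[:k])

# A user who rated two recipes, both with 5.0
user_ratings = [5.0, 5.0]
# Similarity of two candidate recipes to those two rated recipes
close_candidate = [0.9, 0.8]
distant_candidate = [0.1, 0.05]

avg_close = weighted_average(close_candidate, user_ratings)
avg_distant = weighted_average(distant_candidate, user_ratings)
sum_close = total_similarity(close_candidate)
sum_distant = total_similarity(distant_candidate)
print(avg_close, avg_distant, sum_close, sum_distant)
```

When all of a user's ratings are equal, the weighted average gives every candidate the same score, while the summed similarity still separates a close candidate from a distant one.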

In [ ]:
class recipeAlgo(AlgoBase):

    def __init__(self):

        # Always call base method before doing anything.
        AlgoBase.__init__(self)

    def fit(self, trainset):

        # Here again: call base method before doing anything.
        AlgoBase.fit(self, trainset)

        self.similarities = recipe_cs

        return self

    def estimate(self, u, i):

        if not (self.trainset.knows_user(u) and self.trainset.knows_item(i)):
            raise PredictionImpossible('User and/or item is unknown.')

        sim_recipes = []
        # Look up the similarity between the input item (i)
        # and each recipe the user (u) has rated
        item_idx=recipe_filtered.index.get_loc(self.trainset.to_raw_iid(i))
        for rating in self.trainset.ur[u]:
            rating_idx = recipe_filtered.index.get_loc(self.trainset.to_raw_iid(rating[0]))
            recipeSimilarity = self.similarities[item_idx,rating_idx]
            sim_recipes.append((recipeSimilarity, rating[1]))

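        # Sort by similarity, highest first, so the slice below keeps
        # the 10 rated recipes most similar to item i
        highest_sims = sorted(sim_recipes, key=lambda x: x[0], reverse=True)

        totalSimilarity = 0

        # The "estimated rating" is the summed similarity
        # (a ranking score, not a real rating)
        for (similarity, rating) in highest_sims[:10]:
            totalSimilarity += similarity

        return totalSimilarity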

OK, let's fit our algorithm. All we need to do is create our data object, build a Surprise trainset from it, and fit the algorithm.

In [ ]:
reader = Reader(rating_scale=(0, 5))

data = Dataset.load_from_df(recipe_ratings[['User', 'Item', 'Rating']], reader)
trainSet = data.build_full_trainset()

algo = recipeAlgo()
algo.fit(trainSet)
Out[ ]:
<__main__.recipeAlgo at 0x7fb1522aba10>

We have successfully fitted our algorithm. Let's now create a test set with only one user to see how our recommender system performs. Our test set will include all of the recipes that the user hasn't rated yet, allowing our recommender system to assign a predicted rating to each of them and sort the best recipes for this user.

In [ ]:
anti_testset_user = []
targetUser = 0  # inner id of the target user
fillValue = trainSet.global_mean
user_item_ratings = trainSet.ur[targetUser]
# Inner ids of the items the target user has already rated
user_items = [item for (item, _) in user_item_ratings]
for iid in trainSet.all_items():
  if iid not in user_items:
    anti_testset_user.append((trainSet.to_raw_uid(targetUser), trainSet.to_raw_iid(iid), fillValue))
In [ ]:
len(anti_testset_user)
Out[ ]:
2344
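
The count matches the 2349 filtered recipes minus the 5 recipes this user has rated. The idea behind this per-user test set can be sketched without Surprise as follows; the item ids, the fill value, and the helper name build_user_anti_testset are made up for illustration.

```python
def build_user_anti_testset(user, rated_by_user, all_items, fill):
    """Return (user, item, fill) tuples for every item the user has NOT rated."""
    rated = set(rated_by_user)
    return [(user, item, fill) for item in all_items if item not in rated]

all_items = [10, 11, 12, 13, 14]
anti = build_user_anti_testset(user=0, rated_by_user=[11, 14],
                               all_items=all_items, fill=4.6)
print(anti)
```

The fill value only occupies the rating slot of each tuple; the predictions do not depend on it. Surprise's Trainset also provides a build_anti_testset method that builds this kind of set for all users at once, which is larger than what we need for a single user.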

Let's call the test method of our recommender system.

In [ ]:
predictions = algo.test(anti_testset_user)

As you can see below, the estimated ratings are extremely low, because our algorithm builds each estimate by adding up the similarities between a candidate recipe and the recipes the user has already rated. However, we get a good ranking of the best recipes for our test user based on content similarity.

In [ ]:
pred = pd.DataFrame(predictions)
pred.sort_values(by=['est'],inplace=True,ascending = False)
pred
Out[ ]:
uid iid r_ui est details
1884 0 44513 4.60216 0.936429 {'was_impossible': False}
56 0 15728 4.60216 0.896340 {'was_impossible': False}
1311 0 80022 4.60216 0.871413 {'was_impossible': False}
1168 0 161791 4.60216 0.871043 {'was_impossible': False}
1631 0 133192 4.60216 0.864810 {'was_impossible': False}
... ... ... ... ... ...
1794 0 135807 4.60216 0.032356 {'was_impossible': False}
966 0 16435 4.60216 0.029600 {'was_impossible': False}
2267 0 1080 4.60216 0.027599 {'was_impossible': False}
555 0 28552 4.60216 0.018871 {'was_impossible': False}
673 0 106202 4.60216 0.000000 {'was_impossible': False}

2344 rows × 5 columns

In [ ]:
pred = pd.DataFrame(predictions)
pred.sort_values(by=['est'],inplace=True,ascending = False)
recipe_list = pred.head(10)['iid'].to_list()
recipe_data.loc[recipe_list]
Out[ ]:
name id minutes contributor_id submitted tags nutrition n_steps steps description ingredients n_ingredients
44513 chicken in stilton 40623 55 31914 2002-09-19 ['60-minutes-or-less', 'time-to-make', 'main-i... [481.2, 43.0, 14.0, 14.0, 13.0, 87.0, 4.0] 8 ['chop onion , crush garlic and fry in butter ... luscious company fare, very rich and tasty. ['mushroom', 'garlic', 'onion', 'chicken', 'wh... 11
15728 balsamic braised chicken 112258 85 17803 2005-02-27 ['time-to-make', 'course', 'main-ingredient', ... [423.7, 35.0, 66.0, 25.0, 64.0, 30.0, 6.0] 10 ['sprinkle chicken pieces evenly with pepper a... when i was given this free recipe in the groce... ['chicken thighs', 'chicken drumsticks', 'pepp... 11
80022 elegant garlic chicken for two 335764 40 871001 2008-11-08 ['60-minutes-or-less', 'time-to-make', 'course... [477.6, 38.0, 14.0, 8.0, 59.0, 76.0, 3.0] 9 ['in a medium frying pan , lightly brown garli... an easy elegant dinner for two. ['boneless skinless chicken breasts', 'butter'... 7
161791 poached salmon with a mustard dill sauce 268514 40 594139 2007-11-28 ['60-minutes-or-less', 'time-to-make', 'course... [517.9, 31.0, 27.0, 17.0, 133.0, 29.0, 2.0] 5 ['in a large pan , add water , wine , lemon ju... here is a light delicious salmon recipe, i mad... ['salmon fillets', 'water', 'lemon juice', 'dr... 13
133192 mediterranean spaghetti with tomatoes and feta 170625 30 38418 2006-05-30 ['30-minutes-or-less', 'time-to-make', 'course... [290.9, 32.0, 35.0, 35.0, 26.0, 59.0, 4.0] 8 ['heat oil in large non-stick skillet over med... you can use grape tomatoes in place of the cho... ['olive oil', 'dried oregano', 'garlic clove',... 10
132677 meatless cassoulet au vin 233218 70 283251 2007-06-07 ['time-to-make', 'course', 'main-ingredient', ... [270.7, 1.0, 20.0, 16.0, 27.0, 0.0, 13.0] 10 ['in a large saucepan , bring navy beans , 6 c... this is from my 365 ways to cook vegetarian co... ['dried navy beans', 'garlic cloves', 'onion',... 13
151059 oven baked chicken with fresh mozzarella tom... 345986 40 883141 2008-12-30 ['60-minutes-or-less', 'time-to-make', 'course... [396.6, 22.0, 14.0, 29.0, 97.0, 31.0, 4.0] 12 ['preheat oven to 400 degrees f', 'coat casser... vine-ripened tomatoes and fresh mozzarella che... ['italian seasoning', 'garlic powder', 'onion ... 15
1022 1950 s hamburger goulash 109232 50 160974 2005-01-24 ['60-minutes-or-less', 'time-to-make', 'main-i... [518.4, 39.0, 29.0, 47.0, 53.0, 48.0, 15.0] 15 ['cook macaroni in boiling salted water until ... we grew up with this casserole that we called ... ['macaroni', 'tomato paste', 'water', 'baking ... 16
26686 boiled bacon with parsley sauce fresh ham 423263 94 865936 2010-05-04 ['ham', 'time-to-make', 'course', 'main-ingred... [428.9, 28.0, 11.0, 146.0, 105.0, 38.0, 3.0] 11 ['place the ham in a large pot or dutch oven a... sounds gross, right? this is a traditional br... ['ham', 'onions', 'carrots', 'celery ribs', 'b... 11
164043 portabella sandwich with garlic and lemon 179067 15 183872 2006-07-24 ['15-minutes-or-less', 'time-to-make', 'course... [208.6, 5.0, 39.0, 13.0, 18.0, 5.0, 12.0] 6 ['melt butter in skillet over medium heat', 'a... i can't remember where i got this recipe, but ... ['butter', 'portabella mushrooms', 'salt', 'ga... 9

Above this text are the suggested recipes, and below are the actual recipes rated by our user. Just by looking at the titles, we can see that the recipes are similar: the user appears to enjoy braised meat, garlic, and chicken, and these ingredients are included in the recommended recipes.

In [ ]:
recipe_data.loc[recipe_ratings[recipe_ratings['User']==0]['Item']]
Out[ ]:
name id minutes contributor_id submitted tags nutrition n_steps steps description ingredients n_ingredients
27680 braised pork chops 92678 65 39835 2004-06-04 ['time-to-make', 'course', 'main-ingredient', ... [743.2, 86.0, 11.0, 8.0, 75.0, 128.0, 7.0] 18 ['preheat oven to 325-degrees f', 'seaon the p... this is the first successful pork chop i made ... ['pork chops', 'salt', 'pepper', 'flour', 'but... 15
90038 garlic bread with sauteed mushrooms 88274 20 101275 2004-04-05 ['30-minutes-or-less', 'time-to-make', 'course... [222.9, 8.0, 4.0, 16.0, 15.0, 4.0, 12.0] 6 ['heat the 1 tblsp olive oil in a non stick sk... i've been on a mediterranean kick in the last ... ['olive oil', 'button mushrooms', 'parsley', '... 7
71578 deviled mushrooms 28367 20 23302 2002-05-13 ['30-minutes-or-less', 'time-to-make', 'course... [209.4, 15.0, 6.0, 9.0, 13.0, 25.0, 8.0] 4 ['melt the butter in a saucepan , and saute th... yum ['unsalted butter', 'mushrooms', 'plain flour'... 9
27749 braised veal shanks 185494 200 305531 2006-09-11 ['time-to-make', 'course', 'main-ingredient', ... [537.7, 25.0, 0.0, 31.0, 177.0, 24.0, 1.0] 19 ['lay veal in a single layer in a 9x13 inch ba... my family loves this dish. i have also made i... ['veal shanks', 'lemon', 'beef broth', 'dried ... 10
89385 funky chicken 265625 45 265694 2007-11-13 ['60-minutes-or-less', 'time-to-make', 'course... [558.7, 33.0, 182.0, 110.0, 74.0, 30.0, 19.0] 6 ['preheat oven to 350f', 'in a baking dish , p... i just threw this together right now--yes, rig... ['chicken thighs', 'chicken drumsticks', 'hone... 9

Conclusion

We’ve seen how to use the surprise package to create a Recommender System. We concentrated on the data content and used NLP techniques to determine the degree of similarity between our documents (recipes).

We learned how to easily customize the base surprise algorithm to help us get recommendations for users and build a recommender system.

As we've seen, all we need to do is return an estimated rating for the items that the user hasn't rated. This can be accomplished in a variety of ways; we've chosen the similarity between items to calculate it. The rating does not have to be a "real" rating; we simply need a list sorted by rating in order to generate recommendations that meet our criteria.

The data for this project was obtained via Kaggle. Please see the following citation:

Generating Personalized Recipes from Historical User Preferences

Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni, Julian McAuley

EMNLP, 2019

https://www.aclweb.org/anthology/D19-1613/