Sunday, January 22, 2023

Building a Simple Recommendation System with Scikit-Learn

A beginner's guide to creating a content-based recommendation system using the popular Python library.

Building a recommendation system can seem like a daunting task, but with the right tools and knowledge, it can be a fun and rewarding experience. In this tutorial, we will be building a simple recommendation system using the popular library scikit-learn.

Before we begin, let's go over the basics of recommendation systems. A recommendation system is a tool that suggests items to users based on their preferences and past interactions. There are several types of recommendation systems, such as content-based, collaborative filtering, and hybrid recommendation systems. In this tutorial, we will be building a simple content-based recommendation system.

A content-based recommendation system suggests items whose attributes resemble those of items the user has already interacted with. For example, if a user has previously watched a lot of action movies, the recommendation system will suggest more action movies to the user. To build a content-based recommendation system, we will need a dataset containing information about the items and the users' interactions with them.

The first step in building a recommendation system is to prepare the data. In this tutorial, we will be using the MovieLens dataset, which contains information about movies and the users' interactions with them. The dataset can be downloaded from the MovieLens website (https://grouplens.org/datasets/movielens/). Once you have the dataset, you'll need to load it into your Python environment.

import pandas as pd

movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')

Next, we will need to clean and preprocess the data. In this tutorial, we will be focusing on the movie dataset and the rating dataset. We will be removing any duplicate rows and any rows that contain missing values.

# Removing duplicate rows
movies.drop_duplicates(inplace=True)
ratings.drop_duplicates(inplace=True)

# Removing missing values
movies.dropna(inplace=True)
ratings.dropna(inplace=True)
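
To see what these two calls actually remove, here is a minimal sketch on a made-up four-row table. The column names follow the MovieLens layout, but the rows are hypothetical and only there for illustration. It also resets the index afterwards, which is an optional extra step (not part of the code above) that keeps row labels equal to row positions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one exact duplicate row and one row with a missing genre
raw = pd.DataFrame({
    "movieId": [1, 1, 2, 3],
    "title": ["Movie A", "Movie A", "Movie B", "Movie C"],
    "genres": ["Action", "Action", np.nan, "Comedy"],
})

# Drop the repeated row, then drop any row that still has a missing value
cleaned = raw.drop_duplicates().dropna()

# Optional: renumber the rows 0..n-1 so labels and positions agree
cleaned = cleaned.reset_index(drop=True)

print(len(raw), "rows before,", len(cleaned), "rows after")
print(cleaned)
```

Only "Movie A" (once) and "Movie C" survive here. On the real dataset it is worth printing the row counts before and after, so you know how much data the cleaning step discarded.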

Now that the data is cleaned and preprocessed, we can start building the recommendation system. One popular Python library we can use for this is scikit-learn. It is a general-purpose machine learning library rather than a dedicated recommendation toolkit, but its preprocessing and nearest-neighbor tools are enough to build a simple content-based recommender.

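To build a content-based recommendation system, we will first need to extract the features from the movie dataset. In this example, we will be using the movie's genres as the features. MovieLens stores the genres of each movie as a single pipe-separated string (for example, "Adventure|Animation|Children"), so a plain one-hot encoding of that column would treat every distinct combination of genres as its own category, and two movies would only count as similar if their genre strings matched exactly. Instead, we will split each string into a list of genres and use the MultiLabelBinarizer class from scikit-learn to convert those lists into a numerical format with one 0/1 column per genre.

from sklearn.preprocessing import MultiLabelBinarizer

# Extracting the genres column and splitting each string into a list of genres
genres = movies['genres'].str.split('|')

# Creating an instance of the MultiLabelBinarizer (sparse output, one column per genre)
encoder = MultiLabelBinarizer(sparse_output=True)

# Fitting and transforming the genre lists: one row per movie
genres_encoded = encoder.fit_transform(genres)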
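
To make the encoded features easier to picture, here is a small sketch that turns pipe-separated genre strings into one 0/1 column per genre. It assumes scikit-learn's MultiLabelBinarizer for the encoding, and the three movies are hypothetical rows invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical three-movie sample in the MovieLens layout (made up for illustration)
sample = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Action|Adventure", "Action|Comedy", "Comedy"],
})

# Split "Action|Adventure" into ["Action", "Adventure"], and so on
genre_lists = sample["genres"].str.split("|")

# One column per distinct genre, 1 where the movie has that genre
mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(genre_lists)

print(list(mlb.classes_))  # column order: ['Action', 'Adventure', 'Comedy']
print(genre_matrix)        # rows: [1 1 0], [1 0 1], [0 0 1]
```

Movies that share a genre now share a 1 in the same column, which is what lets a similarity metric detect partial overlap between them.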

Now that we have extracted and encoded the features, we can start building the recommendation system. In this example, we will be using the NearestNeighbors class from scikit-learn to build the recommendation system. We will be using cosine similarity as the metric for measuring the similarity between the movies.

from sklearn.neighbors import NearestNeighbors

# Creating an instance of the NearestNeighbors class
recommender = NearestNeighbors(metric='cosine')

# Fitting the encoded genres to the recommender
recommender.fit(genres_encoded.toarray())
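
To make the cosine metric concrete, the sketch below computes it by hand for three hypothetical genre vectors. The vectors and the column order are assumptions chosen for illustration. Note that NearestNeighbors with metric='cosine' ranks neighbors by cosine distance, which is 1 minus the cosine similarity, so smaller values mean more similar movies.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical genre vectors over the columns [Action, Adventure, Comedy]
movie_a = np.array([[1, 1, 0]])  # Action, Adventure
movie_b = np.array([[1, 0, 1]])  # Action, Comedy
movie_c = np.array([[0, 0, 1]])  # Comedy

# Similarity = dot product divided by the product of the vector lengths
sim_ab = cosine_similarity(movie_a, movie_b)[0, 0]  # one shared genre: 1 / (sqrt(2) * sqrt(2)) = 0.5
sim_ac = cosine_similarity(movie_a, movie_c)[0, 0]  # no shared genres: 0.0

# Cosine distance is what the nearest-neighbor search minimizes
dist_ab = 1 - sim_ab
dist_ac = 1 - sim_ac

print(sim_ab, dist_ab)
print(sim_ac, dist_ac)
```

Because Movie A is closer to Movie B (distance 0.5) than to Movie C (distance 1.0), a nearest-neighbor search starting from Movie A would rank Movie B first.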

Now that the recommendation system is built, we can start making recommendations to the users. To make a recommendation, we will need to pass in the index (the row position in the movies DataFrame) of a movie that the user has previously watched. The recommendation system will then return the indexes of the most similar movies.

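# Index of the movie the user has previously watched
movie_index = 0

# Number of recommendations to return
num_recommendations = 5

# Getting the nearest neighbors; we ask for one extra because the watched movie
# is at distance 0 from itself and would otherwise take up one of the slots
_, neighbors = recommender.kneighbors(genres_encoded[movie_index].toarray(), n_neighbors=num_recommendations + 1)

# Dropping the watched movie itself and keeping the requested number of recommendations
recommendations = [i for i in neighbors[0] if i != movie_index][:num_recommendations]

# Extracting the movie titles from the recommendations
recommended_movie_titles = movies.iloc[recommendations]['title']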
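
In practice you will usually want to look up a movie by its title rather than by its row index. The following is a minimal end-to-end sketch of that idea on a made-up four-movie table. The helper name recommend_by_title, the toy data, and the use of MultiLabelBinarizer for the features are all assumptions made for illustration, not part of the MovieLens dataset or of scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy catalogue in the MovieLens layout (made up for illustration)
toy_movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "genres": ["Action|Adventure", "Action|Adventure|Sci-Fi", "Comedy", "Comedy|Romance"],
})

# One 0/1 column per genre, then a cosine-distance nearest-neighbor index
features = MultiLabelBinarizer().fit_transform(toy_movies["genres"].str.split("|"))
model = NearestNeighbors(n_neighbors=2, metric="cosine").fit(features)


def recommend_by_title(title, k=2):
    """Return up to k titles most similar to `title`, excluding the movie itself."""
    positions = np.flatnonzero(toy_movies["title"].to_numpy() == title)
    if len(positions) == 0:
        raise KeyError(f"Unknown title: {title}")
    position = int(positions[0])

    # Ask for one extra neighbor, capped at the size of the catalogue
    n_neighbors = min(k + 1, len(toy_movies))
    _, neighbors = model.kneighbors(features[[position]], n_neighbors=n_neighbors)

    # Remove the query movie and keep at most k results
    keep = [int(i) for i in neighbors[0] if i != position][:k]
    return toy_movies.iloc[keep]["title"].tolist()


print(recommend_by_title("Movie A", k=1))  # ['Movie B']
print(recommend_by_title("Movie C", k=1))  # ['Movie D']
```

Filtering out the query position, rather than always discarding the first neighbor, keeps the helper correct even when several movies have identical genres and therefore tie at distance 0.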

And that's it! We have successfully built a simple content-based recommendation system using scikit-learn. You can experiment with different features and metrics to see how they affect the recommendations.
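
As one example of such an experiment, the sketch below compares the cosine and Euclidean metrics on three hypothetical genre vectors. The vectors are made up to show that the choice of metric can change which movie is ranked first, and the helper name nearest_other is an assumption for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 0/1 genre vectors over six genres (made up for illustration)
features = np.array([
    [1, 1, 1, 0, 0, 0],  # row 0: the query movie, three genres
    [1, 1, 1, 1, 1, 1],  # row 1: shares all three genres, plus three extra ones
    [1, 0, 0, 0, 0, 0],  # row 2: shares only one genre, has no extra ones
])


def nearest_other(metric, query_row=0):
    """Return the row index of the closest movie other than the query itself."""
    model = NearestNeighbors(n_neighbors=3, metric=metric).fit(features)
    _, neighbors = model.kneighbors(features[[query_row]])
    return next(int(i) for i in neighbors[0] if i != query_row)


print(nearest_other("cosine"))     # 1: cosine rewards the large genre overlap
print(nearest_other("euclidean"))  # 2: Euclidean penalizes every extra genre
```

Cosine distance looks only at the angle between the vectors, so a movie with many extra genres can still rank highly, whereas Euclidean distance counts every differing column against it. Which behavior is preferable depends on your data, so it is worth comparing the results on movies you know well.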

Remember that building a recommendation system is an iterative process: you will need to test, evaluate, and improve your model over time. You can use libraries such as TensorFlow or Keras to build more complex recommendation systems.

To conclude, building a recommendation system may seem like a daunting task, but with the right tools and knowledge, it can be a fun and rewarding experience. Scikit-learn is a powerful general-purpose library whose preprocessing and nearest-neighbor tools make it easy to build a simple recommender and to experiment with improvements. By following this tutorial, you should now have a basic understanding of how to build a simple content-based recommendation system.

Happy coding!