A basic strategy for recommendation is, find users similar to the new user, and recommend the new user with what similar users like. Based on this strategy, a technique, collaborative filtering, was invented to making predictions (filtering) about the interests of a user by collecting preferences or taste information from many similar users (collaborating).

Here we implemented a user similarity based collaborating filtering algorithm to predict ratings, and this algorithm was further applied to our shiny app.

Dependencies

library(tidyverse)

Data preparation

path = "./data/small/ratings.csv"

ratings_tidy = read_csv(path) %>% 
  janitor::clean_names() %>% 
  select(-timestamp)

ratings = ratings_tidy %>% 
  nest(movie_id, rating)

In the data preparation process, we read the raw rating file and nested the two column movie_id and rating for following analyses’ convenience.

Algorithm

The user similarity based recommendation algorithm predict ratings with the following procedures:

Find \(K\) most similar users to the given user (Here we set \(K\) to 10)
Find all movies that these similar users have rated and the given user has not watched
Predict the given user’s ratings on these movies based on the similarity measure and ratings of similar user’s

Measure of similarity

We built two functions, calc_pc_sim, calc_cos_sim, to calculate the similarity between users. In subsequent analysis, we will compare the performance of these two measure and choose the better one.

Measure by Pearson correlation

Let’s denote \(w_{uv}\) as the similarity between two users(\(u\), \(v\)), \(r\) as movie ratings, \(m\) as movie, then \(w_{uv}\) can be quantitatively defined by Pearson correlation:

\[ w_{u v}=\frac{\displaystyle \sum_{m \in M}\left(r_{u m}-\bar{r}_{u}\right) \cdot\left(r_{v m}-\bar{r}_{v}\right)}{\displaystyle \sqrt{\sum_{m \in M}\left(r_{u m}-\bar{r}_{u}\right)^{2} \sum_{m \in M}\left(r_{v m}-\bar{r}_{v}\right)^{2}}} \]

calc_pc_sim = function(u_data, v_data){
  
  calc_df = u_data %>% 
    inner_join(v_data, by = "movie_id", suffix = c("_u", "_v"))
  
  # two users should at least watched 3 same movie
  if (nrow(calc_df) < 3) {
    return(-1)
  }
  
  r_mean_u = u_data %>% pull(rating) %>% mean()
  r_mean_v = v_data %>% pull(rating) %>% mean()
  
  calc_df = calc_df %>% 
    mutate(
      r_u_ct = rating_u - r_mean_u,
      r_v_ct = rating_v - r_mean_v,
      r_uv_ct = r_u_ct * r_v_ct,
      r_u_ct_sq = r_u_ct^2,
      r_v_ct_sq = r_v_ct^2
    )
  
  sum_r_uv_ct = calc_df %>% pull(r_uv_ct) %>% sum()
  sum_r_u_ct_sq = calc_df %>% pull(r_u_ct_sq) %>% sum()
  sum_r_v_ct_sq = calc_df %>% pull(r_v_ct_sq) %>% sum()
  
  sim = sum_r_uv_ct / sqrt(sum_r_u_ct_sq * sum_r_v_ct_sq)
  
  return(sim)
}

We built a function, calc_pc_sim, to calculate the pearson correlation according to the above formula.

Measure by cosine similarity

Alternatively, \(w_{uv}\) can also be defined by cosine similarity:

\[ w_{u v}=\frac{\displaystyle \sum_{m \in M} r_{u m} \cdot r_{v m}}{ \displaystyle \sqrt{\sum_{m \in M}r_{u m}^{2} \cdot \sum_{m \in M}r_{v m}^{2}}} \]

calc_cos_sim = function(u_data, v_data){
  
  calc_df = u_data %>% 
    inner_join(v_data, by = "movie_id", suffix = c("_u", "_v"))
  
  # two users should at least watched 3 same movie
  if (nrow(calc_df) < 3) {
    return(-1)
  }
  
  calc_df = calc_df %>% 
    mutate(
      r_uv = rating_u * rating_v,
      r_u_sq = rating_u^2,
      r_v_sq = rating_v^2
    )
  
  sum_r_uv = calc_df %>% pull(r_uv) %>% sum()
  sum_r_u_sq = calc_df %>% pull(r_u_sq) %>% sum()
  sum_r_v_sq = calc_df %>% pull(r_v_sq) %>% sum()
  
  sim = sum_r_uv / sqrt(sum_r_u_sq * sum_r_v_sq)
  
  return(sim)
}

We built the function calc_cos_sim to calculate the cosine similarity according to the above formula.

Identify similar users

acquire_data = function(user){
  return(ratings %>% filter(user_id == user) %>% pull(data))
}

get_similar = function(user, calc_sim){
  
  user_data = acquire_data(user)
  calc_df = ratings %>% 
    filter(user_id != user) %>%
    rename(v_data = data) %>% 
    mutate(u_data = user_data) %>% 
    mutate(
      sim = map2_dbl(u_data, v_data, calc_sim)
    ) %>% 
    select(-u_data, -v_data) %>% 
    arrange(desc(sim)) %>% 
    head(10)
  
  return(calc_df)
}

We calculate the similarity score between the given user and every other users, and choose 10 users with highest similarity as similar users.

Rating prediction

Once we found users similar to the new user, we can predict the new user’s rating towards a movie \(m\) by:

\[ \hat{r}_{u m}=\bar{r}_{u}+\frac{\sum_{v \in S(u, K) \cap N(m)} w_{u v}\left(r_{v m}-\bar{r}_{v}\right)}{\sum_{v \in S(u, K) \cap N(m)}\left|w_{u v}\right|} \]

\(S(u, K)\): the most similar \(K\) users of \(u\)
\(N(m)\): users who have rated on movie \(m\)

predict_rating = function(user, movie, sim_users){
  
  r_means = ratings_tidy %>% 
    group_by(user_id) %>% 
    summarize(r_mean = mean(rating))
  
  calc_df = 
    ratings_tidy %>% 
    filter(movie_id == movie) %>% 
    inner_join(sim_users, by = "user_id") %>% 
    left_join(r_means, by = "user_id") %>% 
    mutate(
      sim_r_diff = sim * (rating - r_mean),
      abs_sim = abs(sim)
    )
  
  sum_sim_r_diff = calc_df %>% pull(sim_r_diff) %>% sum()
  sum_abs_sim = calc_df %>% pull(abs_sim) %>% sum()
  
  r_u_mean = acquire_data(user)[[1]] %>% pull(rating) %>% mean()
  
  rating = r_u_mean + sum_sim_r_diff/sum_abs_sim
  
  return(rating)
}

Test & Comparison

experiment data

set.seed(777)
experiment_data = ratings_tidy %>% 
  sample_frac(0.01, weight = user_id)

Due to limited computing power, we sample 1% of all ratings data as our experiment data.

Measure of performance

To compare the performance of Pearson correlation and cosine similarity, we use RMSE as the measurement:

\[ \operatorname{RMSE}=\frac{\displaystyle\sqrt{\sum_{(u, m) \in Train}\left(r_{u m}-\hat{r}_{u m}\right)^{2}}}{\mid \text { Test } \mid} \]

Experiment design

experiment = function(experiment_data){
  
  test = experiment_data %>% 
    sample_frac(0.2, weight = user_id)

  train = experiment_data %>% 
    anti_join(test)

  ratings_tidy = train

  ratings = ratings_tidy %>% 
    nest(movie_id, rating)
  
  cos_sim_users =
    test %>% 
    select(user_id) %>% 
    distinct() %>% 
    mutate(
      sim_users = map(user_id, ~get_similar(user = .x, calc_sim = calc_cos_sim))
      )
  
  pc_sim_users = 
    test %>% 
    select(user_id) %>% 
    distinct() %>% 
    mutate(
      sim_users = map(user_id, ~get_similar(user = .x, calc_sim = calc_pc_sim))
      )
  
  cos_sum_r_diff_sq_df = 
    test %>%
    left_join(cos_sim_users, by = "user_id") %>% 
    mutate(
      r_pred = pmap_dbl(list(user_id, movie_id, sim_users), predict_rating)
    ) %>% 
    drop_na(r_pred) %>% 
    mutate(
      r_diff_sq = (rating - r_pred)^2
    ) 
  
  cos_sum_r_diff_sq = 
    cos_sum_r_diff_sq_df %>% 
    pull(r_diff_sq) %>% 
    sum(na.rm = TRUE)
  
  pc_sum_r_diff_sq_df = 
    test %>%
    left_join(pc_sim_users, by = "user_id") %>% 
    mutate(
      r_pred = pmap_dbl(list(user_id, movie_id, sim_users), predict_rating)
    ) %>% 
    drop_na(r_pred) %>% 
    mutate(
      r_diff_sq = (rating - r_pred)^2
    ) 
  
  pc_sum_r_diff_sq = 
    pc_sum_r_diff_sq_df %>% 
    pull(r_diff_sq) %>% 
    sum(na.rm = TRUE)
  
  cos_rmse = sqrt(cos_sum_r_diff_sq) / nrow(cos_sum_r_diff_sq_df)
  pc_rmse = sqrt(pc_sum_r_diff_sq) / nrow(pc_sum_r_diff_sq_df)
  
  return(list(cos_rmse, pc_rmse))
}

In each round of experiment, for each user, we randomly extract 20% records of their ratings as the test dataset, and the remained dataset as train dataset. Then, we use the two similarity measures separately to find similar users and subsequently predict ratings on corresponding movies in the test dataset. Finally, we compute the RMSE between predicted ratings and actual ratings in the test data set, and get the two RMSE values.

Perform experiment

cos_rmse_list = list()
pc_rmse_list = list()


for (i in 1:30) {
  exp_res = experiment(experiment_data)

  cos_rmse_list = append(cos_rmse_list, exp_res[[1]])
  pc_rmse_list = append(pc_rmse_list, exp_res[[2]])

}

cos_rmse_df = tibble(rmse = cos_rmse_list) %>% unnest()
pc_rmse_df = tibble(rmse = pc_rmse_list) %>% unnest()

rmse_df = bind_rows(cos_rmse_df, pc_rmse_df, .id = "sim_type") %>% 
  mutate(sim_type = if_else(sim_type == 1, "cosine similarity", "pearson correlation"))

rmse_df %>% write_csv("./data/experiment_rmse.csv")

We performed 30 times of experiment and recorded RMSE values. This process took several hours, so we wrote the result into a document. When knitting this file, we read the result from the document and ignored this chunk.

Compare RMSE

The experiment result reveals that choosing pearson correlation as the measure of similarity could yield lower RMSE values. Therefore, in our shiny app, we choose to use calc_pc_sim.

rmse_df = read_csv("./data/experiment_rmse.csv") 

rmse_df %>% 
  ggplot(aes(x = sim_type, y = rmse)) +
  geom_violin() +
  labs(x = "Similarity Measurement", y = "RMSE of Prediction")

Main function

main = function(user){
  
  sim_users = get_similar(user, calc_pc_sim)
  
  movies = sim_users %>% 
    select(user_id) %>% 
    left_join(ratings_tidy, by = "user_id") %>% 
    distinct(movie_id) %>% 
    pull(movie_id) 
  
  calc_df = tibble(
    user_id = user,
    movie_id = movies
  )
  
  watched = acquire_data(user)[[1]] %>% pull(movie_id) %>% as.vector()
  
  rec_movies = 
    calc_df %>% 
    filter(!movie_id %in% watched) %>% 
    mutate(
      r_pred = map2_dbl(user_id, movie_id, ~predict_rating(user = .x, movie = .y, sim_users = sim_users))
    ) %>% 
    arrange(desc(r_pred)) %>% 
    head(10) %>% 
    select(movie_id, pred_rating = r_pred)
  
  return(rec_movies)
}

We built the main function to implement the whole process of recommendation. For the given user, we will find top 10 users similar to him/her, and predicts his/her ratings on movies similar users have seen yet he/she has not watched. Finally, we select 10 movies with the highest predicted rating to recommend to the user.

Test

Now, we can use the main function to recommend movies for a random user:

set.seed(777)

user = round(runif(1, 1, 610))

movies = read_csv("./data/small/movies.csv") %>% 
  janitor::clean_names() %>% 
  select(-genres)
  
main(user) %>% 
  left_join(movies, by = "movie_id") %>% 
  select(title, pred_rating) %>% 
  kableExtra::kbl(digits = 1) %>% 
  kableExtra::kable_styling(font_size = 12)

title	pred_rating
Godfather: Part II, The (1974)	5.6
American Pie (1999)	5.6
Being John Malkovich (1999)	5.6
Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)	5.2
Snatch (2000)	5.2
Batman (1989)	5.1
Seven (a.k.a. Se7en) (1995)	5.1
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)	4.7
Minority Report (2002)	4.7
Gone Girl (2014)	4.7

User-similarity-based recommendation