Previously, we have implemented several algorithms for recommendation, such as adaptive bootstraping, user-similarity-based recommendation, and latent-factor-based recommendation. The next one we would like to introduce is the tag-based recommendation algorithm. Before introducing it, we need a new dataset to carry on the thought. The “tag.csv” in the Movielens contains the data we are looking for. The dataset includes user id, movie id, tag and timestamp (which represents the timepoint that the rating was generated at). Based on the data of tags, we can employ a strategy to achieve personalized recommendation for a user:

  1. For every user, find the most commonly used tags

  2. For each tag, find the movie that labeled by this tag for the most times.

  3. For the given user, find his most commonly used tags, then recommend to the user the most popular movie labeled by this tag.

To improve the above strategy and utilizing all tags rather than most used/received tags, we can quantitatively measure a user’s interest to a movie based on all tags given by the user and all tags received by the movie. The formula of the user u’s interest to the movie i is as follows:

\(p(u, i)=\sum_{b} \frac{n_{u, b}}{\log \left(1+n_{b}^{(u)}\right)} \frac{n_{b, i}}{\log \left(1+n_{i}^{(u)}\right)}\)

where \(n_{u, b}\) is the number of times that user u has labeled tag b, \(n_{b, i}\) is the number of times that movie i has been labeled tag b, \(n_b^{(u)}\) records how many different users have used tag b, \(n_i^{(u)}\)records how many different users have tagged the movie i. To get the specific value, we should build our function first.

 

Construct interest function

The difficulty of the function is to find the correct sets of tags or users. To achieve this goal, we manipulated this dataset, counted the corresponding numbers we need during each round. Then, we sum all values we got in each round and get the interest value of the user on specific movie.

# Set up interest function
fc.interest = function(u, i) {
B.u = 
  remove.tag %>% 
  filter(user_id == u)

B.u.distinct = 
  B.u %>% 
  distinct(tag)

remove.tag.distinct = 
  remove.tag %>% 
  select(-movie_id, -timestamp) %>% 
  distinct()

B.i = 
  remove.tag %>% 
  filter(movie_id == i)

B.i.distinct = 
  B.i %>% 
  select(-tag, -timestamp) %>% 
  distinct()

n_ub = c()
n_bi = c()
n_bu = c()
lg.value = c()
n_iu = nrow(B.i.distinct)
value = c()

for (b in 1:nrow(B.u.distinct)) {
  n_ub[b] = sum(B.u$tag == as.character(B.u.distinct[b, 1]))
  n_bi[b] = sum(B.i$tag == as.character(B.u.distinct[b, 1]))
  n_bu[b] = sum(remove.tag.distinct$tag == as.character(B.u.distinct[b, 1]))
  lg.value[b] = log(1 + n_bu[b]) * log(1 + n_iu)
  value[b] = n_ub[b] * n_bi[b] / lg.value[b]
}
return(interest = round(sum(value), 2))
}

 

Visualize user’s predicted interests on movies

We created a heat map to visualize the users’ interest values, which is shown below. The horizontal coordinate of the heat map is the movie ID and the vertical coordinate is the user ID, color in each grid represents the value of the user’s interest to the movie.

# construct a new table of the selected id and movie
heat.map = tibble(user_id = rep(active.user.vec, length(popular.movie.vec)), movie_id = rep(popular.movie.vec, length(active.user.vec)))

# add insterest value to the table
interest = c()
for (i in 1:nrow(heat.map)) {
  interest[i] = fc.interest(as.numeric(heat.map[i, 1]), as.numeric(heat.map[i, 2]))
}

heat.map = add_column(heat.map, interest)

# build heat map
heat.map %>% 
  ggplot(aes(y = as.factor(user_id), x = as.factor(movie_id), fill = interest)) +
  geom_tile() +
  geom_raster() +
  labs(
    x = "Movie ID",
    y = "User ID"
  )

 

Conclusion

Tag-based recommendation can help us predict the user’s interest on a certain movie. The algorithm’s validity relies on amount of tags. Therefore, when trained on datasets with ample data of tags, the algorithm will become more accurate.

The algorithm penalizes popular tags and popular movies to generate more personalized recommendation results.