Corey
by Corey
4 min read

Categories

  • articles

Tags

  • Python

In this small article, I want to keep a record for my study experience in recommendation system. There are several parameters can evaluate a recommendation system, which are precision, recall and F-measure. Collaborative filtering (CF) is the most successful recommendation technique to date. There are three main categories of CF techniques: memory-based, model-based and hybrid CF algorithms (that combine CF with other recommendation techniques).

image
Overview of collaborative filtering techniques.

Today, I will mainly focus on Item-based and User-based CF techniques. The basic idea of item-based CF algorithms is to calculate the similarity between different items based on the user-item rating data and then to compute the prediction score for a given item based on the calculated similarity score, while the core of user-based CF algorithms is to provide item recommendation or prediction based on the common opinion of other like-minded users.

image

Evaluation indicators for a recommendation system

Notation Meaning
P is the fraction of retrieved instances that are relevant.
R is the fraction of relevant instances that are retrieved.
F F = (2PR/(P+R)). in fact, F-measure is a measure of test’s accuracy by coordinate Precision with Recall.

Datasets

Here I want to evaluate running time of two algorithms using MovieLens datasets.

u1.base 「Total 943 person, Total 1650 movies」

u1.test 「Total 459 person, Total 1410 movies」

User-based CF & Item-based CF

#! /usr/bin/env python
# -*- coding:utf-8 -*-
# Author: <work#shanguangyu.com>

from __future__ import division
from math import sqrt
import time


class UserBasedCF:

    def __init__(self, train_file, test_file):
        self.train_file = train_file
        self.test_file = test_file
        self.readData()
        self.userSimilarity()

    def readData(self):
        self.train, self.test = {}, {}
        for line in open(self.train_file):
            user, item, score, _ = line.strip().split("\t")
            self.train.setdefault(user, {})
            self.train[user][item] = int(score)
        for line in open(self.test_file):
            user, item, score, _ = line.strip().split("\t")
            self.test.setdefault(user, {})
            self.test[user][item] = int(score)

    def userSimilarity(self):
        # Build invese table for item_users
        self.item_users = {}
        for u, items in self.train.iteritems():
            for i in items.keys():
                if i not in self.item_users:
                    self.item_users[i] = set()
                self.item_users[i].add(u)

        C, N = {}, {}
        for i, users in self.item_users.iteritems():
            for u in users:
                N.setdefault(u, 0)
                N[u] += 1
                C.setdefault(u, {})
                for v in users:
                    if u == v:
                        continue
                    C[u].setdefault(v, 0)
                    C[u][v] += 1

        self.W = {}
        # Calculate finial similarity matrix W
        for u, related_users in C.iteritems():
            self.W.setdefault(u, {})
            for v, cuv in related_users.iteritems():
                self.W[u][v] = cuv / sqrt(N[u] * N[v])

    def recommend(self, user, K=3, N=10):
        rank = {}
        interacted_items = self.train[user]
        for v, wuv in sorted(self.W[user].items(),
                             key=lambda x: x[1], reverse=True)[0:K]:
            for i, rvi in self.train[v].items():
                if i in interacted_items:
                    # filter items user interacted before
                    continue
                rank.setdefault(i, 0)
                rank[i] += wuv * rvi

        return dict(sorted(rank.items(),
                           key=lambda x: x[1], reverse=True)[0:N])

    def evaluate(self, train=None, test=None, K=3, N=10):
        train = self.train
        test = self.test
        hit, recall, precision = 0, 0, 0
        for user in train.keys():
            tu = test.get(user, {})
            rank = self.recommend(user, K=K, N=N)
            for i, _ in rank.items():
                if i in tu:
                    hit += 1
            recall += len(tu)
            precision += N
        recall = hit / recall
        precision = hit / precision
        f = 2 * recall * precision / (precision + recall)
        return (recall, precision, f)


class ItemBasedCF:

    def __init__(self, train_file, test_file):
        self.train_file = train_file
        self.test_file = test_file
        self.readData()
        self.itemSimilarity()

    def readData(self):
        self.train, self.test = {}, {}
        for line in open(self.train_file):
            user, item, score, _ = line.strip().split("\t")
            self.train.setdefault(user, {})
            self.train[user][item] = int(score)
        for line in open(self.test_file):
            user, item, score, _ = line.strip().split("\t")
            self.test.setdefault(user, {})
            self.test[user][item] = int(score)

    def itemSimilarity(self):
        C = dict()  
        N = dict()
        for user, items in self.train.items():
            for i in items.keys():
                N.setdefault(i, 0)
                N[i] += 1
                C.setdefault(i, {})
                for j in items.keys():
                    if i == j:
                        continue
                    C[i].setdefault(j, 0)
                    C[i][j] += 1
        self.W = {}
        for i, related_items in C.items():
            self.W.setdefault(i, {})
            for j, cij in related_items.items():
                self.W[i][j] = cij / (sqrt(N[i] * N[j]))
        return self.W

    def recommend(self, user, K=3, N=10):
        rank = dict()
        interacted_items = self.train[user]
        for item, score in interacted_items.iteritems():
            for j, wj in sorted(self.W[item].iteritems(),
                                key=lambda x: x[1], reverse=True)[0:K]:
                if j in interacted_items.keys():
                    continue
                rank.setdefault(j, 0)
                rank[j] += score * wj
        return dict(sorted(rank.items(),
                           key=lambda x: x[1], reverse=True)[0:N])

    def evaluate(self, train=None, test=None, K=3, N=10):
        train = self.train
        test = self.test
        hit, recall, precision = 0, 0, 0
        for user in train.keys():
            tu = test.get(user, {})
            rank = self.recommend(user, K=K, N=N)
            for i, _ in rank.items():
                if i in tu:
                    hit += 1
            recall += len(tu)
            precision += N
        recall = hit / recall
        precision = hit / precision
        f = 2 * recall * precision / (precision + recall)
        return (recall, precision, f)

if __name__ == '__main__':
    start_time = time.time()

    # choose which method to use
    # model = UserBasedCF('u1.base', 'u1.test')
    model = ItemBasedCF('u1.base', 'u1.test')

    # The performance of model under various K
    klst = [5, 10, 15, 20, 25]
    print("recall", "precision", "f1")
    for k in klst:
        print model.evaluate(train='u1.base', test='u1.test', K=k)
    print("--- %s seconds ---" % (time.time() - start_time))

Results

User-based CF:
('recall', 'precision', 'f1')
(0.09855, 0.20901378579003183, 0.13394495412844035)
(0.1054, 0.22354188759278898, 0.14325518178729188)
(0.10725, 0.22746553552492046, 0.14576962283384304)
(0.10875, 0.23064687168610817, 0.14780835881753313)
(0.1087, 0.2305408271474019, 0.14774040095141014)
--- 13.2764439583 seconds ---

Item-based CF:
('recall', 'precision', 'f1')
(0.0974, 0.20657476139978792, 0.13238192320761127)
(0.09955, 0.21113467656415694, 0.13530411145090046)
(0.0997, 0.2114528101802757, 0.13550798504926945)
(0.09905, 0.21007423117709437, 0.1346245327896704)
(0.0981, 0.2080593849416755, 0.13333333333333336)
--- 387.862884045 seconds ---

Conclusion

In short, user-based CF is appropriate for a situation that items increase fast with real-time demand, such as news recommendation. While in e-commerce, movie recommendation, items usually don’t change very much, so this often can be computed off line by item-based CF.

References

  1. A Programmer’s Guide to Data Mining
  2. Item-based Collaborative Filtering Recommendation Algorithms
  3. 推荐系统实践 (Practice in Recommendation system)
  4. 协同过滤实现 (Implementation of Collaborative Filtering)