1 Introduction

As more and more data become available through the Internet, the average user often struggles to find what he is looking for under a pile of available and often misleading information. This phenomenon is known as the information overload problem, and it motivated the development of an interesting concept in information retrieval: recommender systems.

Information overload is defined as the problem that occurs when the amount of information to which a user is exposed exceeds his processing capacity. Recommender systems attempt to solve the overload problem by making personalized suggestions to the user about items he has not yet considered, based on features of the items themselves, on available information about the preference patterns of the user, or on the patterns of other users (Montaner, 2003).

A recommender system can be thought of as an advisor that helps the user navigate the chaotic information space and eases the difficulty of choosing, among all the available options, the items most appropriate for him.

1.1 Motivation

My motivation for working in the field of recommender systems was the Netflix Prize competition (https://www.netflixprize.com/). Netflix is an online movie rental store, and in 2006 it announced a competition in which developers were invited to build a recommender system that would improve the accuracy of its existing system by 10% or more. Although I did not have the chance to take an active part in the competition, it became the seed that led to the idea of this project.

Witnessing the interest of the research community in the Netflix Prize, and the constructive competition that followed, inspired me to undertake this project in order to investigate this challenging field further and put my own ideas to the test.

1.2 Project aim and objectives

The aim of this project is to build a recommender system and make it competitive with existing approaches by incorporating appropriate techniques to improve its performance. To achieve this aim, the following objectives have to be accomplished:

· Investigate, through the literature review, the different methodologies suggested in the literature to enhance the performance of recommender systems.

· Identify techniques that, when applied to the developed system, can lead to improved performance.

· Evaluate the effectiveness of these techniques by designing an experimental setup that meets the standards set in the literature and testing the system's performance using appropriate metrics.

1.3 Document structure

The rest of this document is organized as follows:

Chapter 2, Literature Review, provides background information on recommender systems. It discusses how implementations differ according to their context, the common challenges faced, and the methods used to evaluate recommender systems. Finally, in order to demonstrate the variety of techniques used, some of the most recent proposed approaches are presented.

Chapter 3, Experimental Design and Implementation, describes the proposed approach to the recommendation problem. The starting point is a base model that uses classification algorithms to predict the ratings over the entire item space. By identifying the weaknesses of the base model, we suggest an alternative approach. The chapter also presents the details of the experimental procedure followed to test the performance of both the base and the proposed systems, along with the experimental results for the base system, which set the benchmark for comparison.

Chapter 4, Results and Evaluation, describes in detail the implementation decisions taken during the development of our approach and the way they affect the performance of the system. The results of the experiments for the proposed system are presented, followed by a discussion of their interpretation. Finally, the proposed approach is compared with the base system in order to evaluate the effectiveness of our technique.

Chapter 5, Conclusions, concludes the project by discussing what was accomplished in the research conducted and by describing the focus of future work.

2 Literature Review

This chapter provides background information on key points pertaining to recommender systems. It discusses how the implementation context dictates variations in approach, which common challenges arise regardless of the implementation details, and which metrics are used to measure the performance of recommender systems. Finally, some of the most recent proposed approaches are presented in order to demonstrate the variety of techniques used.

2.1 Context of implementation

Recommender systems have been developed for, and work under, very different contexts of implementation, from recommending music to suggesting banking products. While the main idea behind each of those systems remains the same, namely to filter the information space and present to the user the items that most closely satisfy his needs and taste, the different implementation contexts lead to a number of variations in the techniques used.

2.1.1 Recommending in a community of users

Recommending content within a group of users who share the same information space is the kind of recommender system considered the most traditional. A number of users gather to form a virtual community that is involved in a common subject, and they express their opinions on different items in their interest space through explicit ratings. Representatives of this category are recommender systems for movies such as MovieLens and IMDB, for music such as Last.fm and Pandora, or for books such as WhatShouldIReadNext.

The main characteristic of implementations in this area is that they take advantage of the explicit ratings that users give to items. Based on these ratings, the system can form an opinion about the preferences of each individual and base its recommendations on a user profile built upon this information (Mukherzee, 2003). The recommendation can come from the user's own preferences in the content-based approach (we suggest you will like item A because in the past you liked item B, and items A and B are similar), from the preferences of other users in the collaborative filtering approach (we suggest you will like item A because other users with taste similar to yours liked item A, so we assume you will like it too), or from combinations of these two approaches in hybrid systems (Burke, 2002).

2.1.2 Recommender systems in E-commerce

Recommender systems are widely used today in the field of e-commerce, the best-known representative probably being Amazon.com. The objective of such systems is to recommend to the user, among all the available alternatives, the products that he may be most interested in buying.

A characteristic of e-commerce recommender systems is that there are usually no explicit ratings available from the user. Because of this, such systems either base their recommendations on purely content-based techniques or try to model the user's preferences from implicit information, the latter being the more common approach (Mobasher, 2000). Such implicit information can be the purchase history of the user or his navigation patterns through the shop's site (Cho, 2002).

Another characteristic of e-commerce systems is their often ephemeral relationship with the customer. Unlike in the rating communities discussed before, the user here has no reason to return to the site unless he intends to buy something. That can vary from a few times a week in the case of a supermarket portal, to once a year in the case of a site selling computers, or even more rarely if the product is cars (Felfernig, 2008). This characteristic constrains e-commerce systems to work with very limited information about the user. In such cases neither collaborative filtering nor content-based techniques perform well. To solve this problem, a number of different techniques, such as constraint-based (Felfernig, 2008) and knowledge-based recommendation (Burke, 1999), have been developed.

The techniques for constraint-based and knowledge-based recommendation become even more important in the case of more complex products, such as financial services and tourism packages, for which the use of recommender systems is becoming increasingly popular. Examples of such systems are Triplehop’s TripMatcher and VacationCoach’s Me-Print for tourism (Berka, 2004) and FSAdvisor for financial services (Felfernig, 2005). Here, the only way to suggest a service to the user effectively is through specific expert domain knowledge and the satisfaction of certain needs and restrictions in a stepwise process (Jannach, 2009).

2.1.3 Recommending Web content

Recommender systems are being used as part of the wider effort towards a more personalized web experience. Web pages, news items or articles are suggested to the user according to his preferences. As in the e-commerce area, explicit ratings from the user are usually absent. The user's preferences are modeled by studying his web usage patterns: the web logs are analyzed, and information such as the pages visited, the path followed and the time spent on each page is exploited in order to construct the user profile implicitly (Nasraoui, 2003).

There are two main points that differentiate web content recommendation. The first is the continuously changing item space. In a news recommender system, for instance, the news pages are updated daily or even more often. There is no point in trying to build a preference model for specific content, as it will soon be outdated. The best we can do is to use the usage patterns to extract more general preference patterns for the user and use them in the recommendation procedure. For example, if a user read an article last night about the soccer game between Juventus and Milan, it would be very shortsighted to assume that the user is interested only in matches between those two teams. We could generalize by saying that he is a fan of one of the two clubs, or, generalizing further, that he is interested in the Italian soccer league, in European soccer more widely, or in soccer in general as a sport. Here lies the main challenge of web usage recommender systems: to find the happy medium between overspecialization and a generalization so broad that the recommendations stop being accurate.

The second challenge comes from the fact that web content lacks structured, specific features that can accurately characterize it. In a traditional movie recommendation application, for example, movies have a number of features describing them, such as genre, year of production, director and cast, that help us place them in the user-item preference matrix and on which we can base the recommendation. Web recommendation systems, on the other hand, must extract such features from the content of the pages, and for this reason the use of web mining techniques is also necessary.

2.2 Types of Recommender Systems

Modern recommender systems can be classified into three broad categories: content-based recommender systems, collaborative filtering systems and hybrid systems. The following sections briefly describe these categories, together with some of the most recent representative systems proposed in the literature.

2.2.1 Content based recommender systems

Content-based filtering approaches recommend items for the user based on the descriptions of previously evaluated items. In other words, they recommend items because they are similar to items the user has liked in the past (Montaner, 2003).
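The idea of recommending items that resemble previously liked ones can be illustrated with a minimal sketch. It assumes that each item is described by a numeric feature vector and that cosine similarity is used to compare items; both are illustrative choices rather than something prescribed by the cited work, and the item names and genre vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_content_based(item_features, liked_items, top_n=1):
    """Rank unseen items by their best similarity to any liked item."""
    if not liked_items:
        return []
    scores = {}
    for item, features in item_features.items():
        if item in liked_items:
            continue  # never recommend what the user already knows
        scores[item] = max(
            cosine_similarity(features, item_features[liked])
            for liked in liked_items
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical genre vectors: [action, comedy, drama]
ITEM_FEATURES = {
    "A": [1, 0, 1],
    "B": [1, 0, 0],
    "C": [0, 1, 0],
}

# The user liked item B; item A shares the "action" feature with it.
print(recommend_content_based(ITEM_FEATURES, {"B"}))
```

Scoring each candidate by its single most similar liked item is only one possible design; averaging over all liked items or learning a user profile vector are common alternatives.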

Examples of recent approaches include:

(Zenebe, 2009) Make use of fuzzy modeling techniques in the item feature description, the user feedback and the recommendation algorithm of a content-based recommender system platform.

(Felfernig, 2008) Explore the use of constraint-based recommendation, in which recommendation is viewed as a process of constraint satisfaction. The final recommendation comes from the gradual satisfaction of a given set of requirements.

2.2.2 Collaborative filtering recommender systems

The collaborative filtering technique matches people with similar interests and then makes recommendations on this basis. Recommendations are commonly derived from the statistical analysis of patterns and analogies in data obtained explicitly from the evaluations of items given by different users, or implicitly by monitoring the behavior of the different users in the system (Montaner, 2003).
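As an illustration of matching users with similar interests, the sketch below implements a simple user-based neighborhood predictor. It assumes explicit ratings stored as nested dictionaries, Pearson correlation as the similarity measure, and a similarity-weighted average of mean-centered neighbor ratings as the prediction rule. These are common textbook choices, not the specific method of any system cited here, and the user names and ratings are hypothetical.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated."""
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den_a = math.sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common))
    den_b = math.sqrt(sum((ratings_b[i] - mean_b) ** 2 for i in common))
    if den_a == 0 or den_b == 0:
        return 0.0
    return num / (den_a * den_b)


def predict_rating(ratings, user, item):
    """Similarity-weighted average of neighbors' mean-centered ratings."""
    user_mean = sum(ratings[user].values()) / len(ratings[user])
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = pearson(ratings[user], other_ratings)
        if sim <= 0:
            continue  # keep only positively correlated neighbors
        other_mean = sum(other_ratings.values()) / len(other_ratings)
        num += sim * (other_ratings[item] - other_mean)
        den += sim
    if den == 0:
        return user_mean  # no usable neighbor: fall back to the user's mean
    return user_mean + num / den


# Hypothetical explicit ratings on a 1-5 scale.
RATINGS = {
    "ann": {"m1": 5, "m2": 3, "m3": 4},
    "bob": {"m1": 4, "m2": 2, "m3": 3, "m4": 4},
    "cat": {"m1": 1, "m2": 5, "m4": 2},
}

print(predict_rating(RATINGS, "ann", "m4"))
```

In this toy data "bob" rates consistently one star below "ann", so he is a perfectly correlated neighbor, while "cat" is negatively correlated and is ignored.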

Examples of recent approaches include:

(Acilar, 2009) Propose a collaborative filtering model, constructed based on the Artificial Immune Network Algorithm (aiNet). Through the use of artificial immune network techniques, the system tries to address the data sparsity and scalability problems by describing the data structure, including their spatial distribution and cluster inter-relations.

(Campos, 2008) Make use of fuzzy logic to deal with the ambiguity and vagueness of the ratings, while at the same time using the Bayesian network formalism to model the way the user’s ratings are related.

(Shang, 2009) Use a multi-channel representation in which each object is mapped to several channels linked to certain ratings. The users are then connected to the channels according to the ratings they have given. The similarity measure of user pairs is obtained by applying a diffusion process to the user-channel bipartite graph.

(Chen, 2009) Propose the use of orthogonal nonnegative matrix tri-factorization in order to alleviate the sparsity problem and to solve the scalability problem by simultaneously clustering rows and columns of the user-item matrix.

(Lee, 2009) Combine the two types of collaborative filtering techniques, user-based and item-based. The resulting predictions are then combined by weighted averaging.

(Jeong, 2009) Introduce the user credit as a new way to measure the similarity between users in a memory-based collaborative filtering environment. The user credit is the degree of a user’s rating reliability, measuring how closely the user's ratings of items agree with those of others.

(Yang, 2009) Propose a collaborative filtering approach based on heuristic formulated inferences. The main idea behind this approach is that any two users may have some common interest genres as well as different ones. Based on this the similarity is calculated, by considering users’ preferences and rating patterns.

(Bonnin, 2009) Use Markov models inspired by the ones used in language modeling and integrate skipping techniques to handle noise during navigation. Weighting schemes are used to reduce the importance of distant resources.

(Zhang, 2008) Suggest a Topical PageRank based algorithm, which considers item genre to rank items. An attempt is made to relate ranking algorithms for web search to recommender systems. Specifically, Topical PageRank is leveraged to rank items and then recommend the top-ranked items to users.

(Rendle, 2008) Propose the use of kernel matrix factorization, a generalized form of the regularized matrix factorization. A generic method for learning regularized kernel matrix factorization models is suggested, from which an online update algorithm is derived that allows solving the new-user/new-item problem.

(Umyarov, 2009) Combine external aggregate information with individual ratings in a novel way.

(Takacs, 2009) Focus on the use of different matrix factorization techniques applied to the recommendation problem. They propose the use of an incremental gradient descent method for weight updates, the exploitation of the chronological order of ratings, and the use of a semi-positive version of the matrix factorization algorithm.

(Yildirim, 2008) Propose an item-based algorithm, which first infers transition probabilities between items, based on their similarities and then computes the predictions by modeling finite length random walks on the item space.

(Weimer, 2008) Suggest, as an extension to the maximum margin matrix factorization technique, the use of arbitrary loss functions, together with an algorithm for the optimization of the ordinal ranking loss.

(Koren, 2009) Introduce the tracking of temporal changes in the customer’s preferences in order to improve the quality of the recommendations provided.

(Hijikata, 2009) Propose a discovery-oriented collaborative filtering algorithm that uses not only the profile of preference traditionally used in collaborative filtering approaches but also the so-called profile of acquaintance, which maps the user's knowledge, or lack of it, about items.

(Schclar, 2009) Propose the use of an ensemble regression method in which during iterations, interpolation weights for all nearest neighbors are simultaneously derived by minimizing the root mean squared error.

(Koren, 2010) Introduce a new neighborhood model based on the optimization of a global cost function. A second, factorized version of the neighborhood model is also suggested, aiming to improve the scalability of the algorithm.

(Kwon, 2008) Aim to find new recommendation approaches that can take into account the rating variance of an item in the procedure of selecting recommendations.

(Amatriain, 2009) Try to improve the system's accuracy by reducing the natural noise in the input data via a preprocessing step, based on re-rating the items and calibrating the recommendations accordingly.

(Park, 2008) Propose clustering the items with low popularity together, using the EM algorithm in combination with classification rules, in order to improve the quality of the recommendations for items with few ratings.

(Ma, 2009) Propose a semi-nonnegative matrix factorization method with global statistical consistency, and at the same time suggest a method of imposing consistency between the statistics given by the predicted values and the statistics given by the data.

(Massa, 2009) Propose to replace the step of finding similar users on which the recommendation will be based, with the use of a trust metric, an algorithm able to propagate trust over a network of users in order to find peers that can be trusted by the active user.

(Lakiotaki, 2008) Propose a system that exploits multi-criteria ratings to improve the modeling of the user’s preference behavior and enhance the accuracy of the recommendations.

2.2.3 Hybrid recommender systems

Hybrid recommender systems combine two or more recommendation techniques to achieve better performance and overcome problems faced by their one-sided counterparts. The ways in which recommendation techniques can be combined differ greatly. A good overview is given in (Burke, 2002).

Examples of recent approaches include:

(Albdavi, 2009) Suggest a recommendation technique for online retail stores, called the hybrid recommendation technique based on product category attributes, which extracts user preferences in each product category separately in order to provide more personalized recommendations.

(Porcel, 2009) Propose a fuzzy linguistic recommender system designed using a hybrid approach and assuming a multi-granular fuzzy linguistic modeling.

(Al-Shamri, 2008) Propose a hybrid, fuzzy-genetic approach to recommender systems. In order to improve scalability, the user model is employed to find a set of like-minded users. A memory-based search is then carried out in the resulting reduced set to produce the recommendations.

(Givon, 2009) Propose a method that uses social-tags alone or in combination with collaborative filtering-based methods to improve recommendations and to solve the cold-start problem in recommending books when few to no ratings are available. In their approach tags are automatically generated from the content of the text in the case of a new book and are used to predict the similarity to other books.

(Nam, 2008) Focus on solving the user-side cold-start problem and develop a hybrid model, based on the analysis of two probabilistic aspect models, that combines pure collaborative filtering with users’ information.

(Gunawardana, 2009) Make use of unified Boltzmann machines, as probabilistic models that combine collaborative and content information in a coherent manner.

Table 1 contains a synopsis of the different approaches discussed in this chapter.

| Researcher / Year | Type | Main Techniques Used | Problem Addressed | Data Sets Used |
|---|---|---|---|---|
| Zenebe 2009 | Content Based | Fuzzy Sets | Accuracy | MovieLens |
| Felfernig 2008 | Content Based | Constraint driven recommendation | Use in domains with complex rarely rated items | – |
| Acilar 2009 | Collaborative Filtering | Artificial Immune Networks algorithm | Data sparsity, Scalability | MovieLens |
| Campos 2008 | Collaborative Filtering | Bayesian Networks, Fuzzy Logic | Process the uncertainty involved in the recommendation | MovieLens |
| Shang 2009 | Collaborative Filtering | Multi-channel representation, Diffusion process on the user-channel bipartite graph | Accuracy | Netflix, MovieLens |
| Chen 2009 | Collaborative Filtering | Orthogonal nonnegative matrix tri-factorization | Data sparsity, Scalability | MovieLens |
| Park 2008 | Collaborative Filtering | Combination of EM clustering and classification rules | Cold-start, Accuracy | MovieLens |
| Lee 2009 | Collaborative Filtering | Combination of user-based and item-based CF | Data sparsity, Accuracy | EachMovie, MovieLens |
| Jeong 2009 | Collaborative Filtering | Use of “user credit” as degree of rating reliability | Cold-start, Accuracy | MovieLens |
| Yang 2009 | Collaborative Filtering | Heuristic formulated inferences | Accuracy | EachMovie, MovieLens |
| Bonnin 2009 | Collaborative Filtering | Markov model, Skipping techniques to handle noise | Accuracy | Bank Intranet web logs |
| Zhang 2008 | Collaborative Filtering | Topical PageRank algorithm | Accuracy | MovieLens |
| Rendle 2008 | Collaborative Filtering | Regularized kernel matrix factorization | Cold-start, Scalability | Netflix, MovieLens |
| Umyarov 2009 | Collaborative Filtering | Combination of external aggregate information with user ratings | Accuracy | Netflix, MovieLens |
| Takacs 2009 | Collaborative Filtering | Matrix Factorization | Scalability | Netflix, Jester, MovieLens |
| Yildirim 2008 | Collaborative Filtering | Random walk item-based algorithm | Data sparsity, Scalability | MovieLens |
| Weimer 2008 | Collaborative Filtering | Maximum margin matrix factorization | Data privacy, Cross-domain predictions | WikiLens |
| Koren 2009 | Collaborative Filtering | Tracking of temporal changes in the customer’s preferences | Modeling drifting user preferences | Netflix |
| Hijikata 2009 | Collaborative Filtering | Discovery-oriented CF algorithm | Recommendation diversity | Music ratings dataset built for the experiment |
| Schclar 2009 | Collaborative Filtering | Ensemble regression method | Accuracy | MovieLens |
| Koren 2010 | Collaborative Filtering | Optimization of global cost function | Accuracy, Scalability | Netflix |
| Kwon 2008 | Collaborative Filtering | Rating diversity consideration | Accuracy, Diversity | MovieLens |
| Amatriain 2009 | Collaborative Filtering | Natural data noise reduction | Accuracy | Customized movie rating dataset |
| Massa 2009 | Collaborative Filtering | Use of trust in the neighbor finding | Cold-start, Data sparsity | Dataset from Epinions.com |
| Lakiotaki 2008 | Collaborative Filtering | Multi-criteria ratings | Accuracy | Dataset from Yahoo! movies |
| Ma 2009 | Collaborative Filtering | Semi-nonnegative matrix factorization | Accuracy, Scalability | EachMovie |
| Albdavi 2009 | Hybrid | Hybrid recommendation based on product category attributes | More personalized recommendations | Web logs from online retail store |
| Porcel 2009 | Hybrid | Fuzzy linguistic modeling | Accuracy | Digital Library dataset |
| Al-Shamri 2008 | Hybrid | Fuzzy-genetic | Data sparsity, Scalability | MovieLens |
| Gunawardana 2009 | Hybrid | Boltzmann machines | Cold-start | MovieLens, Ta-Feng supermarket dataset |
| Nam 2008 | Hybrid | Combination of pure CF with users information | Cold-start | MovieLens |
| Givon 2009 | Hybrid | Automatic tag generation from text | Cold-start | Corpus of full text books |

Table 1 Different approaches proposed in the Recommender System literature from 2008 onwards

2.3 Challenges

Recommender systems suffer from some common problems. The most common ones, and those that have drawn most of the researchers' attention, are the cold-start and data sparsity problems, which can lead to poor recommendations. Also, due to the nature of their implementation, recommender systems often face scalability problems. Beyond these, there are a number of smaller problems that can also negatively affect the performance of the system and have motivated some of the more innovative techniques in the recommender systems landscape. Such problems are interest drift, noisy data and lack of diversity.

2.3.1 Cold Start

The cold-start problem occurs when items must be proposed to a new user without previous usage patterns to support these recommendations (new user problem), or when items are newly introduced to the dataset and thus lack ratings from any user (new item problem) (Rashid, 2002). Both faces of the cold-start problem are commonly encountered in the recommender system field and result in poor recommendation quality. Modern commercial systems are constantly expanding: new items are added on a constant basis, and new users join just as often. Collaborative filtering techniques are especially sensitive to the cold-start problem, and for this reason a solution often suggested is to hybridize the system with a content-based method that can be used in the recommendation procedure for new items or users.
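One simple way to picture such a hybridization for the new item case is a "switching" rule: use the collaborative score when an item has enough ratings and fall back to a content-based score otherwise. The sketch below is only an assumption-laden illustration; the function name, the two scoring callables and the `min_ratings` threshold are all hypothetical and are not taken from the cited work.

```python
def hybrid_score(item, cf_score, content_score, rating_counts, min_ratings=5):
    """Switching hybrid: use content-based scoring while ratings are scarce.

    cf_score and content_score are caller-supplied functions mapping an
    item to a predicted score; rating_counts maps items to the number of
    ratings they have received so far.
    """
    if rating_counts.get(item, 0) < min_ratings:
        return content_score(item)  # cold item: too few ratings for CF
    return cf_score(item)


# Hypothetical scorers and rating counts, for illustration only.
counts = {"old_movie": 120, "new_movie": 0}
cf = lambda item: 4.2        # stands in for a collaborative prediction
content = lambda item: 3.5   # stands in for a content-based prediction

print(hybrid_score("old_movie", cf, content, counts))
print(hybrid_score("new_movie", cf, content, counts))
```

Switching is only one of several ways of combining recommenders; weighted or cascaded combinations are equally plausible designs.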

2.3.2 Data Sparsity

In large e-commerce systems there are millions of participating users and equally as many items. Usually even the most active users have purchased or rated only a very small fraction of the whole collection. This leads to a sparse user-item matrix, which affects the ability of the recommender system to form successful neighborhoods. Making recommendations from poorly formed neighborhoods results in poor overall recommendation quality (Acilar, 2009). This problem is known as the data sparsity problem and is one of the most challenging in the recommender system field.
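The degree of sparsity can be quantified as the fraction of empty cells in the user-item matrix. The short sketch below computes it from raw counts; the function name is an illustrative assumption.

```python
def sparsity(n_ratings, n_users, n_items):
    """Fraction of user-item pairs for which no rating is available."""
    cells = n_users * n_items
    if cells == 0:
        raise ValueError("the user-item matrix must not be empty")
    return 1.0 - n_ratings / cells


# A toy matrix: 2 users, 4 items, 2 known ratings.
print(sparsity(2, 2, 4))

# The MovieLens 100K dataset: 943 users, 1682 movies, 100,000 ratings.
print(round(sparsity(100_000, 943, 1682), 3))
```

Even a comparatively dense research dataset such as MovieLens 100K leaves roughly 94% of the matrix empty, which gives a sense of how few co-rated items are typically available for forming neighborhoods.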

2.3.3 Scalability

Modern recommender systems are applied to very large datasets of both users and items. They therefore have to handle very high-dimensional profiles to form the neighborhood, while the computational cost of the algorithms used grows with both the number of items and the number of users (Acilar, 2009). Recommendation tactics that work well on small datasets under lab testing conditions may fail in practice because they cannot be applied effectively in real usage scenarios.

2.3.4 Other Potential Problems

Apart from the three commonly faced challenges mentioned above, researchers have tried to address a number of different problems. In this section we briefly present some of them.

· The interest drift problem. In the recommender systems context, the term interest drift refers to the phenomenon that the taste and interests of users may change over time or under changing circumstances, leading to inaccurate recommendation results (Ma, 2007). A once valid recommendation may no longer be accurate after the user has changed his preference patterns. To counter this, the recommendation model should not be static but must evolve and adapt itself to the changing preference environment in which it is called to work.

· The noisy data problem. In systems where the input data are explicit (e.g. ratings) rather than implicit (e.g. web logs), additional noise comes from the vagueness of the ratings themselves as a product of human perception (Campos, 2008). The given ratings are only an approximation of the user’s approval of the artifact he is rating and are restricted by the granularity of the rating scale. For example, on a five-star scale a user may give a movie three stars, but if he had the opportunity to rate the same movie on a percentage scale he might give something other than 60%. The results may differ even more if the scale on which he was asked to rate the movie was something like “I hated it – it was average – I loved it”.

Moreover, fuzziness in the rating is also introduced by the user himself, whose ratings may differ at another time, in another place or in another emotional condition. It has been reported (Amatriain, 2009) that if users are asked to rate again movies that they have seen and rated in the past, their new ratings differ noticeably from their original ratings. Cases in which users did not even remember having seen a movie they had rated in the past were also not uncommon.

Finally, there are deviations in the ratings that characterize the overall voting trends of either the users or the items. For example, a user may be strict and tend to give lower ratings than the average reviewer; from the item's point of view, there may be deviations that positively or negatively affect the ratings an item receives. For instance, a movie that is considered a “classic” may tend to receive higher ratings than it would without its reputation affecting the audience. These observed trends may or may not be static over time (Koren, 2009). For example, a viewer may become stricter as he grows older, so that his ratings become more biased towards the lower end of the scale compared to his past ratings.

All these factors introduce noise in the data and can have a negative effect on the accuracy of the recommendations.

· The lack of diversity problem. Most research effort is focused on making the recommendations produced by a recommender system more accurate. Lately, though, it has been argued that accurate recommendations are not always what the user expects from a recommender system (Zhang, 2008). To start with, the purpose of such a system is to help the user select items about which he has not yet formed an opinion of his own. If the system keeps suggesting items that are too similar to the ones he is already familiar with, then the system defeats, to a point, its own purpose. We can assume that the user can guess the rating of an item that is very close to one he already knows, without the need for an elaborate recommender system. What the user wants from the system is help in estimating the rating of an item that he could not rate by himself, based solely on his own experience. This problem is also referred to as the over-specialization problem, which describes the situation where the items returned as recommendations are too close to the items already rated by the user (Abbasi, 2009).

2.4 Evaluation Metrics for Recommender Systems

Evaluating a recommender system can be a complex procedure. Many different metrics have been proposed to evaluate the success of recommender systems. The most commonly used ones are presented in the following sections.

2.4.1 Accuracy metrics

Accuracy is the most widely used metric for recommender systems (Burke, 2002). It measures how close the values predicted by the system are to the true values. It can be expressed as in equation (1).

We can formulate equation (1) more formally as in (2),

where P(u, i) is the prediction of the recommender system for a particular user u and item i, p(u, i) is the real preference, and R is the number of recommendations shown to the user. In the accuracy metric, P(u, i) and p(u, i) are considered binary functions, and r(u, i) is 1 if the recommender presented the item to the user and 0 otherwise.
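The displayed forms of equations (1) and (2) are not reproduced in this text. A reconstruction consistent with the symbols defined above, offered as an assumption rather than as the original typesetting, is:

```latex
\begin{align}
\text{accuracy} &= \frac{\text{number of successful recommendations}}
                        {\text{number of recommendations}} \tag{1} \\[4pt]
\text{accuracy} &= \frac{\displaystyle\sum_{(u,i)\,:\,r(u,i)=1}
                         \bigl(1 - \lvert p(u,i) - P(u,i) \rvert\bigr)}{R},
\qquad R = \sum_{(u,i)} r(u,i) \tag{2}
\end{align}
```

Because P(u, i) and p(u, i) are binary, each term of the sum equals 1 when the prediction agrees with the real preference and 0 otherwise, so (2) counts the successful recommendations among the R items shown.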

One common accuracy metric is the mean absolute error (MAE), which is defined by equation (3) and measures the average absolute deviation between each predicted rating P(u, i) and the user's real rating p(u, i). N is the total number of items observed (Breese, 1998).

Variations of MAE include the mean squared error, the root mean squared error and the normalized mean absolute error (Goldberg, 2001).

Of these, the most widely used is the root mean squared error (RMSE), especially after it was chosen as the metric for judging the entries in the Netflix Prize contest. It is defined as in equation (4).
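In the notation used above, the two error measures take their standard form. The summation index is an assumption: it is taken to run over the N observed user-item pairs.

```latex
\begin{align}
\mathrm{MAE}  &= \frac{1}{N} \sum_{u,i} \bigl\lvert P(u,i) - p(u,i) \bigr\rvert \tag{3}\\[4pt]
\mathrm{RMSE} &= \sqrt{\frac{1}{N} \sum_{u,i} \bigl( P(u,i) - p(u,i) \bigr)^{2}} \tag{4}
\end{align}
```

Because the deviations are squared before averaging, RMSE penalizes a few large errors more heavily than MAE does.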

2.4.2 Information Retrieval metrics

Since the logic and techniques of recommender systems are close to the Information Retrieval (IR) discipline, it comes as no surprise that some IR metrics are also used in the recommender systems field. Two of the most widely used are precision and recall (Cleverdon, 1968).

The calculation of precision and recall is based on a table, such as the one below, that holds the possible outcomes of any retrieval decision (Hernandez del Olmo, 2008).

                 Relevant    Non Relevant
Retrieved        a           b
Non Retrieved    c           d

Table 2 Confusion matrix of retrieval decision outcomes

In recommender system terminology, relevant information corresponds to a useful item (one close to the user's taste), while non-relevant information corresponds to an item that does not satisfy the user.

Precision (eq. 5) is defined as the ratio of relevant items selected to the number of items selected.

Precision represents the probability that a selected item is relevant. It determines the capability of the system to present only useful items, excluding the non-relevant ones.

Recall (eq. 6) is defined as the ratio of relevant items selected to the total number of relevant items available. Recall represents the probability that a relevant item will be selected and is an indication of the coverage of useful items that the system can obtain.

The F-measure metrics (eq. 7) follow the same line of thought as precision and recall and attempt to combine the behavior of both metrics in a single equation.

The most commonly used F-measure metric is F1, where β = 1 (Hernandez del Olmo, 2008); it is defined as in eq. 8.
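In terms of the cells a, b, c and d of Table 2, equations (5) to (8) can be written as follows. The form of the general F-measure is the commonly used one and is assumed here to match the cited source.

```latex
\begin{align}
\text{precision} &= \frac{a}{a + b} \tag{5}\\[4pt]
\text{recall}    &= \frac{a}{a + c} \tag{6}\\[4pt]
F_{\beta} &= \frac{(1 + \beta^{2}) \cdot \text{precision} \cdot \text{recall}}
                  {\beta^{2} \cdot \text{precision} + \text{recall}} \tag{7}\\[4pt]
F_{1} &= \frac{2 \cdot \text{precision} \cdot \text{recall}}
              {\text{precision} + \text{recall}} \tag{8}
\end{align}
```

Setting β = 1 in (7) gives (8), the harmonic mean of precision and recall.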

Another metric originating from the information retrieval field and often used in recommender system evaluation is Receiver Operating Characteristic (ROC) analysis (Hanley, 1982). The ROC curve plots the recall against the fallout (eq. 9).

The objective of ROC analysis is to maximize the recall while at the same time minimizing the fallout.
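As an illustration, the IR metrics of this section can be computed directly from the four counts of Table 2. The sketch below is not part of the original work; the function name and the example counts are hypothetical, and fallout is taken in its usual sense as the fraction of non-relevant items that were retrieved, b / (b + d).

```python
def retrieval_metrics(a, b, c, d, beta=1.0):
    """Precision, recall, F-measure and fallout from the counts of Table 2.

    a: relevant and retrieved        b: non-relevant and retrieved
    c: relevant, not retrieved       d: non-relevant, not retrieved
    """
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    fallout = b / (b + d) if (b + d) else 0.0
    denominator = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall,
            "f_beta": f_beta, "fallout": fallout}


# Hypothetical counts: 40 items shown, 30 of them relevant, 20 relevant items missed.
metrics = retrieval_metrics(a=30, b=10, c=20, d=40)
print(metrics)
```

A single point on the ROC curve is the pair (fallout, recall); varying the number of items shown to the user traces out the rest of the curve.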

2.4.3 Rank accuracy metrics

The output of the recommendation is often a list of suggestions presented to the user, ordered from the most relevant to the least relevant. To measure how successful the system is at this ordering, a category of metrics called rank accuracy metrics was introduced. Rank accuracy metrics measure how accurately the recommender system can predict the ranking of a list of items presented to the user.

Two of the most commonly used rank accuracy metrics are the half-life utility metric and the Normalized Distance-based Performance Measure (NDPM) (Herlocker, 2004).

The half-life utility metric is used to evaluate the utility of a ranked list of recommendations, where the utility Ra (eq. 10) is based on the difference between the user's rating for an item and the baseline rating for that item. A half-life parameter describes the strength of the decay in an exponential decay function that models the likelihood that the user will view each successive item in the list. In equation 10, ra,j represents the rating of user a on item j of the ranked list, d is the baseline rating, and α is the half-life. The half-life is the rank of the item on the list such that there is a 50% chance that the user will view this item.
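Equation (10) is usually given in the following form, which is assumed here to be the one intended; j is the rank of the item in the list, starting at 1.

```latex
\begin{equation}
R_{a} = \sum_{j} \frac{\max\bigl(r_{a,j} - d,\; 0\bigr)}{2^{(j-1)/(\alpha-1)}} \tag{10}
\end{equation}
```

The first item is counted at full weight, while the item at rank j = α is weighted by 1/2, which matches the definition of the half-life given above.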

Normalized Distance-based Performance Measure (Eq. (11)) can be used to compare two different weakly ordered rankings (Balabanovic, 1997).

In the above equation C− is the number of contradicting preference ratings between the user and the system recommendation, where the system believes that an item “a” will be preferred over an item “b”, while the true preference of the user is the opposite. Cu is the number of compatible preference relations, where the user rates item “a” higher than item “b”, but the system ranks the two items equally and Ci is the total number of pairs of items rated by the user, for which one is rated higher than the other.
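The counts described above can be obtained by comparing every pair of items. The sketch below assumes the usual form of the measure, NDPM = (2C− + Cu) / (2Ci); the function name and the dictionary-based input format are illustrative choices, not taken from the original work.

```python
from itertools import combinations


def ndpm(user_ratings, system_scores):
    """NDPM between a user's ratings and the system's scores.

    Both arguments map item -> numeric value; only items present in both
    are compared. The result lies in [0, 1], with 0 meaning full agreement.
    """
    items = sorted(set(user_ratings) & set(system_scores))
    contradictory = compatible = strict_pairs = 0
    for a, b in combinations(items, 2):
        user_diff = user_ratings[a] - user_ratings[b]
        if user_diff == 0:
            continue  # the user has no strict preference for this pair
        strict_pairs += 1  # contributes to Ci
        system_diff = system_scores[a] - system_scores[b]
        if system_diff == 0:
            compatible += 1  # Cu: the system ties a pair the user orders
        elif (user_diff > 0) != (system_diff > 0):
            contradictory += 1  # C-: the system orders the pair the other way
    if strict_pairs == 0:
        raise ValueError("the user expresses no strict preference")
    return (2 * contradictory + compatible) / (2 * strict_pairs)


user = {"a": 5, "b": 3, "c": 1}
print(ndpm(user, {"a": 0.9, "b": 0.5, "c": 0.1}))  # same ordering as the user
print(ndpm(user, {"a": 0.1, "b": 0.5, "c": 0.9}))  # fully reversed ordering
```

A contradiction is weighted twice as heavily as a tie, so a system that ranks all items equally scores 0.5.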

2.4.4 Suggesting the non-obvious

While the accuracy metrics provide a good indication of the recommender system's performance, a distinction must be made between accurate results and useful results (Herlocker, 2004). For example, a recommendation algorithm may be adequately accurate by suggesting popular items with high average ratings to the user. But often this is not enough. To some extent predictions of this kind are self-evident and offer no useful information to the user, as they concern the items that the user is least likely to need help discovering by himself.

Coverage can be defined as a measure of the domain of items over which the system can make recommendations (Herlocker, 2004). In its simplest form, coverage is expressed as the percentage of items for which the system can form a prediction, out of the total number of items.

Along the same line of thought, other metrics such as novelty (Konstan, 2006) and serendipity (Murakami, 2008) have been proposed for measuring how effectively the system recommends interesting items that the user might not otherwise come across.

3 Experimental Design and Implementation

This chapter describes the proposed approach to the recommendation problem. Starting with a base model, we identify its weaknesses and try to improve its performance by incorporating our approach. Both the base and the proposed systems are tested in order to compare their relative performance. The chapter presents the details of the experimental procedure followed, along with the experimental results for the base system, which set the benchmark for comparison.

3.1 Recommendation as classification

Recommending items to a user can be seen as a classification task: given a set of known relationships, train a model that will be able to predict the class of unseen instances. If we consider each user-item pair as an instance and the rating as the class, recommendation can be treated as classification, where the system is called to assign the unknown degree of preference of a user towards a given item to one of the possible class values, namely the points of the rating scale. Assuming a two-point scale, we have a binary classification problem and the outcome is that a user will either like or dislike the item, depending on the classification result. On a scale with more points, the result could be that the instance of the user-item pair U-I is assigned to class 4, or in other words that user U is predicted to give item I a rating of 4. If we treat the rating as a continuous value instead of a nominal one, the task becomes a regression problem.

It is from this perspective that the recommendation problem is viewed in the work conducted as part of this dissertation. Prediction models are built on the known instances and then used to predict the values of the unknown ratings.

3.2 Data set

The MovieLens dataset was used for the experiments conducted. The MovieLens dataset was created by the GroupLens Research Project group at the University of Minnesota through the MovieLens web site (movielens.umn.edu) and contains 100,000 ratings on a numeric five-point scale (1-5) for 1682 movies provided by 943 users, with each user having rated at least 20 movies. Simple demographic data consisting of age, gender, occupation and zip code are provided for the users, while the information about the movies consists of title, release date, video release date, IMDb URL and genre.

3.2.1 Data preparation

Training any model only on the original dataset would prove highly ineffective, since the information provided would not be sufficient to produce robust rules. For this reason the data had to be enhanced in a way that would allow the model to obtain as much information as possible during the training step.

Following the methodology of Park (2008), a number of derived variables were produced from the original data and used as independent variables in the models. More specifically the derived variables used are:

1. c_aver_rating: The average rating of the user for the items he has rated in the past.

2. c_quantity: The number of items the user has rated.

3. c_seen_popularity: The average popularity of the items that the user has rated in the past.

4. c_seen_rating: The overall average rating of the items the user has rated before. The overall average rating is the average of all the ratings given to the item by all the users.

5. c_like_popularity: The average popularity of the items that the user rated higher than his average rating.

6. c_like_rating: The overall average rating of the items that the user rated higher than his average rating.

7. c_dislike_popularity: The average popularity of the items that the user rated lower than his average rating.

8. c_dislike_rating: The overall average rating of the items that the user rated lower than his average rating.

9. I_aver_rating: The average rating of the item.

10. I_popularity: The popularity of the item.

Variables 1-8 are the user-related variables, while variables 9 and 10 are the item-related variables.
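To make the definitions concrete, the sketch below derives the ten variables from a list of (user, item, rating) triplets. It is an illustration only: the original work computed these values in an SQL database, the function name is hypothetical, and popularity is assumed to mean the number of ratings an item has received.

```python
from collections import defaultdict
from statistics import mean


def derived_variables(ratings):
    """ratings: iterable of (user_id, item_id, rating) triplets.

    Returns (user_vars, item_vars), two dictionaries keyed by id.
    """
    by_user, by_item = defaultdict(list), defaultdict(list)
    for user, item, rating in ratings:
        by_user[user].append((item, rating))
        by_item[item].append(rating)

    # Item-related variables (9 and 10). Popularity = number of ratings (assumption).
    item_vars = {item: {"I_aver_rating": mean(rs), "I_popularity": len(rs)}
                 for item, rs in by_item.items()}

    def avg(items, key):
        # Average of an item-level variable; None when the subset is empty.
        return mean(item_vars[i][key] for i in items) if items else None

    # User-related variables (1 to 8).
    user_vars = {}
    for user, rated in by_user.items():
        c_aver = mean(r for _, r in rated)
        seen = [i for i, _ in rated]
        liked = [i for i, r in rated if r > c_aver]
        disliked = [i for i, r in rated if r < c_aver]
        user_vars[user] = {
            "c_aver_rating": c_aver,
            "c_quantity": len(rated),
            "c_seen_popularity": avg(seen, "I_popularity"),
            "c_seen_rating": avg(seen, "I_aver_rating"),
            "c_like_popularity": avg(liked, "I_popularity"),
            "c_like_rating": avg(liked, "I_aver_rating"),
            "c_dislike_popularity": avg(disliked, "I_popularity"),
            "c_dislike_rating": avg(disliked, "I_aver_rating"),
        }
    return user_vars, item_vars


sample = [("u1", "m1", 5), ("u1", "m2", 1), ("u2", "m1", 3)]
users, items = derived_variables(sample)
print(users["u1"])
print(items["m1"])
```

Each training instance for a user-item pair would then combine the eight values of the user with the two values of the item.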

It should be noted here that Park (2008) proposes the use of an extra item-related variable, namely I_likability, defined as the difference between the rating of the user and the item's average rating. The problem with this is that, since we treat the rating as the class, there would be no way to know the value of I_likability for a new instance beforehand. Although the value could be calculated for the experimental dataset, this would not be feasible in a real scenario. For this reason the I_likability variable was not included in the variable set.

The original data were loaded into an SQL database from the MovieLens dataset and then the derived values were calculated and stored. The final form of the table, used in the data mining models can be seen in. A snapshot of the actual data in the enhanced data table can be found in Appendix . Even though the variables UserID, MovieID, DateStamp and InstanceID are part of the dataset, they were excluded from the models during the training step, since they provide no useful information and would add noise to the data.

3.3 Building the predictive models

The Weka machine learning toolkit was used to build the predictive models of the experiment. Weka provides a large collection of classification, association and clustering algorithms and can be run from the GUI, from the command line, or called from within a program as an external library. The latter approach was followed since it provides greater flexibility. Java was chosen as the implementation language since Weka itself is written in Java and the routines provided could be used from within the developed program without any intermediate steps. Although Weka can be used with a number of different environments (including .NET), that would have added unnecessary complexity to the project. Eclipse was used as the implementation platform.

One of the early and important decisions that had to be taken during the implementation was whether to treat the rating, which was the class for the models, as a nominal or a numeric value. This choice would dictate the range of models that could be used, as some can work only with nominal classes and others only with numeric ones. While the rating is in fact nominal (a user can rate an item with 3 or 4 but not with 3.5), I chose to treat it as numeric, in the expectation that this would give the system a finer granularity and produce more accurate and useful results. If the predicted value had to be sent back to the user as an actual recommendation, it would be easy to round it to one of the allowed ratings within the rating scale.

Again following the initial methodology of Park (2008), for each of the items in the item list a separate predictive model was built. If we need to predict the rating of item I having 200 ratings for a user U, the model is built using those 200 known instances and is used to predict the unknown rating for U. In this way for a dataset containing n different items, n different models will be built.

Five different types of predictive models were implemented and tested:

Simple Linear Regression (SLR), which “learns a linear regression model based on the single attribute that yield the smallest squared error” (Witten, 2005: 409).

Locally Weighted Learning (LWL), which “assigns weight using an instance-based method. After this the classifier is built from the weighted instances” (Witten, 2005: 414).

RBFNetwork (RBF), which “implements a Gaussian radial basis function network deriving the centers and widths of hidden units using k-means and combining the outputs obtained from the hidden layers using logistic regression if the class is nominal and linear regression in the case of numeric class” (Witten, 2005: 410).

Sequential Minimal Optimization algorithm for support vector regression (SMOreg), which “implements the sequential minimal optimization algorithm (SMO) training a support vector classifier with the use of polynomial or Gaussian kernels” (Witten, 2005: 410). SMOreg is the SMO version for regression problems.

M5Rules, which “obtains regression rules from model trees built using M5′” (Witten, 2005: 409).

These five models were chosen in an attempt to test the effectiveness of our approach using a diverse set of techniques, namely linear regression, radial basis function networks, lazy classifiers, support vector machine classifiers and model trees.

3.4 Base Model evaluation

In order to evaluate the predictive models built, we use two performance measures, Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). As discussed in section 2.4.1, MAE measures the average absolute deviation between each predicted rating P(u, i) and the user's real rating p(u, i) over the number N of items and is given by equation (3), while RMSE is given by equation (4).

Since the ratings are predicted by treating the problem as regression, these two metrics give a good indication of how well the predicted values produced by the algorithms approximate the actual rating values.
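As a minimal illustration of the two measures, the sketch below computes MAE and RMSE for a list of predictions. In the experiments these values were obtained through Weka; the function names and the example values here are hypothetical.

```python
from math import sqrt


def mae(predicted, actual):
    """Mean absolute error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


predicted = [3.5, 4.0, 2.0, 5.0]  # numeric predictions of a regression model
actual = [4, 4, 1, 3]             # ratings actually given by the users
print(mae(predicted, actual), rmse(predicted, actual))
```

In this example the single error of 2 raises the RMSE well above the MAE, which is why the two measures are reported together.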

We applied 10-fold cross validation and calculated the MAE and RMSE. The original dataset was randomly partitioned into 10 independent samples, and each time one of the samples was used as test data while the rest were used as training data for the model. The procedure was repeated 10 times, with each of the subsamples acting as test data exactly once. The resulting MAE and RMSE errors are the average errors over the 10 repetitions of the testing. The number of folds was set to 10 because, according to Witten (2005), 10 has been shown to produce the best estimation of errors among the k-fold alternatives. Moreover, 10-fold cross validation is a commonly used technique for error estimation in the recommender system literature, which makes it easier to compare our results with those of published work.
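The partitioning scheme just described can be sketched as follows. Weka provides its own cross-validation routines, which were used in the experiments; this standalone version, with a hypothetical function name and a fixed seed for repeatability, only illustrates the idea.

```python
import random


def k_fold_splits(instances, k=10, seed=0):
    """Yield (train, test) pairs so that each instance is tested exactly once."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)          # random partitioning
    folds = [shuffled[i::k] for i in range(k)]     # k nearly equal samples
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


splits = list(k_fold_splits(range(25), k=10))
print(len(splits), [len(test) for _, test in splits])
```

The reported error is then the average of the errors measured on the k test samples.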

The errors calculated for each of the movies using the 5 models are presented in the accompanying figures, ordered by the popularity of the movie.

The overall average MAE errors over the whole set of items for each model are summarized in Table 3, along with their standard deviations.

Model      Overall average MAE error
LWL        0.878232 ± 0.2165
M5Rules    0.823094 ± 0.1959
SMOreg     0.810135 ± 0.2269
SLR        0.816061 ± 0.1862
RBF        0.864512 ± 0.1676

Table 3 Overall average MAE for the different predictive models

The overall average RMSE errors over the whole set of items for each model are summarized in Table 4, along with their standard deviations.

Model      Overall average RMSE error
LWL        1.092771 ± 0.2460
M5Rules    1.032141 ± 0.2360
SMOreg     1.020523 ± 0.2554
SLR        1.015515 ± 0.2170
RBF        1.063652 ± 0.1806

Table 4 Overall average RMSE for the different predictive models

The experimental results obtained from this series of tests appear to match the findings reported by Park (2008), as far as can be judged from the volume and format of the results presented in that publication.

Observing the graphs and the overall error tables, we see that the prediction models perform differently, with the differences remaining generally constant across item popularity. SMOreg and SLR are the two best predicting models, with LWL being the worst.

However, the two figures indicate an underlying trend that governs all the models and is more important than the raw performance of each algorithm: the errors increase as the popularity of the items decreases, almost doubling from one end to the other. For example, the MAE for the SMOreg algorithm is 0.666542 at the highest average popularity of 420 and rises to 1.169974 at the lowest popularity. This problem occurs because the prediction models do not have enough data to produce accurate rules for the less popular items.

In order to illustrate the potential impact of this weakness on the recommender's performance, we present a histogram of the item rating frequencies for the MovieLens dataset.

As can be seen in the figure, a very significant number of items have few ratings. If recommendations are produced solely with the per-item approach presented above, the system will often perform poorly. It is therefore important to improve the performance of the models in exactly this problematic area.

3.5 Enhancing the system's efficiency via latent factor modeling

Dimensionality reduction techniques such as Latent Semantic Indexing (LSI) originate from the information retrieval field, where they address the problems of polysemy and synonymy (Deerwester, 1990). By using LSI we attempt to capture latent associations between the users, the items and the given ratings that are not apparent in the initial dataset. The resulting reduced-dimensionality space is less sparse than the original data space and can help us determine clusters of users or items by discovering the underlying relationships between the rating instances (Sarwar, 2000). LSI with Singular Value Decomposition (SVD) as the underlying matrix factorization algorithm is the technique most commonly used in the recommender system literature, and it is the one used in the current experimental setup, with the aim of improving the performance of the prediction models.

3.5.1 Singular Value Decomposition (SVD)

SVD is a matrix factorization technique often used to produce low-rank approximations of matrices. Given an m×n matrix R, SVD factors R into three matrices as R = U·S·Vᵀ,

where U and V are orthogonal matrices of size m×r and n×r respectively, with r being the rank of matrix R. S is a diagonal r×r matrix having all singular values of matrix R as diagonal entries (s1, s2, …, sr), where si > 0 and s1 ≥ s2 ≥ … ≥ sr.

SVD has the very useful ability to provide the best low-rank approximation Rk of matrix R in terms of the Frobenius norm ||R − Rk||F. For some value k < r, Rk is obtained by keeping only the k largest singular values in S, together with the corresponding columns of U and V.
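In symbols, the rank-k approximation can be written as follows, where u_i and v_i denote the i-th columns of U and V (notation introduced here for illustration). The expression for the approximation error is the standard result for truncated SVD.

```latex
\begin{align}
R_{k} &= U_{k} S_{k} V_{k}^{T} = \sum_{i=1}^{k} s_{i}\, u_{i} v_{i}^{T}, \qquad k < r \\[4pt]
\lVert R - R_{k} \rVert_{F} &= \sqrt{\sum_{i=k+1}^{r} s_{i}^{2}}
\end{align}
```

Here U_k is m×k, S_k is k×k and V_k is n×k. The error depends only on the discarded singular values, so little information is lost when the singular values decay quickly.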

The initial step in applying SVD to the recommender system is to produce the matrix R. Matrix R is the user-item matrix, where the users are the rows, the items are the columns and the values are the ratings, so that the value Ri,j represents the rating of user Ui for the item Ij.

Because of the sparse nature of the data, the resulting matrix R is also very sparse. As a pre-processing step, in order to reduce the sparsity, we must fill the empty values of R with some meaningful data. Two available choices are to use the average rating of the user (row average) or the average rating of the item (column average). Following Sarwar (2000) we proceed with the latter, as his findings report that the item average provided better results. The same author suggests normalizing the data of matrix R as the next pre-processing step. This is done by subtracting the user's average from each rating. Let the resulting matrix after the two pre-processing steps be Rnorm. Normalization as described above was used during the implementation.
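The two pre-processing steps can be sketched as below. The original implementation was done in Matlab; this version is an illustration with a hypothetical function name, and it assumes that the user averages are computed from the known ratings only and that every user and item has at least one rating.

```python
def fill_and_normalize(R):
    """Fill missing ratings with item averages, then subtract user averages.

    R is a list of rows (one per user); a missing rating is None.
    Returns (R_norm, user_avg).
    """
    n_items = len(R[0])
    # Row averages over the known ratings of each user.
    user_avg = []
    for row in R:
        known = [v for v in row if v is not None]
        user_avg.append(sum(known) / len(known))
    # Column averages over the known ratings of each item.
    item_avg = []
    for j in range(n_items):
        known = [row[j] for row in R if row[j] is not None]
        item_avg.append(sum(known) / len(known))
    # Step 1: fill gaps with the item average. Step 2: subtract the user average.
    R_norm = [
        [(v if v is not None else item_avg[j]) - user_avg[i]
         for j, v in enumerate(row)]
        for i, row in enumerate(R)
    ]
    return R_norm, user_avg


R = [[5, None],
     [3, 1]]
R_norm, user_avg = fill_and_normalize(R)
print(R_norm, user_avg)
```

The user averages are returned as well because they are needed again when a prediction is mapped back to the rating scale.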

We can now apply SVD to Rnorm and obtain matrices U, S, V and reduce the matrices to dimension k obtaining matrices Uk, Sk and Vk. These matrices can now be used in order to produce the prediction value Pi,j of the ith user for the jth item as:

where r̄i is the average rating of user i.
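Following the formulation of Sarwar (2000), the prediction can be written as below. This is a reconstruction and should be read as an assumption about the form used: the notation (i) selects the i-th row of the first product and (j) the j-th column of the second.

```latex
\begin{equation}
P_{i,j} = \bar{r}_{i} + \Bigl( U_{k} \sqrt{S_{k}}^{\,T} \Bigr)(i) \cdot \Bigl( \sqrt{S_{k}}\; V_{k}^{T} \Bigr)(j)
\end{equation}
```

The dot product recovers entry (i, j) of the rank-k approximation of Rnorm, and adding the user average reverses the normalization step.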

3.5.3 Using SVD to form neighborhoods

Although applying SVD to produce score predictions as discussed above is really useful, SVD is used in a rather different context in the current implementation. What we are really interested in is the ability to form quality neighborhoods of either users or items. The reduced dimensional space produced by the SVD is less sparse than the original space. This advantage can lead to a better performance during the neighbor selection (Sarwar, 2000).

The matrix Uk·√Sk is the k-dimensional representation of the users, while the matrix √Sk·Vkᵀ is the k-dimensional representation of the items. We can calculate the similarity between the observations in either of the two matrices using a measure such as cosine similarity, Pearson's correlation, mean squared distance or Spearman correlation, and produce clusters of users or items respectively. What we evaluate in this implementation is whether forming clusters of items in the reduced-dimension space and combining them with the base model improves its accuracy.

The similarity measure used for building the neighborhood was cosine similarity. The cosine measure calculates the similarity of two vectors A and B as the cosine of the angle between them. It is defined as:
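For two k-dimensional vectors A and B, the standard definition of the cosine measure is:

```latex
\begin{equation}
\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
           = \frac{\sum_{i=1}^{k} A_{i} B_{i}}
                  {\sqrt{\sum_{i=1}^{k} A_{i}^{2}} \; \sqrt{\sum_{i=1}^{k} B_{i}^{2}}}
\end{equation}
```

The value depends only on the direction of the two vectors and not on their length, so two items are considered similar when their reduced representations point the same way.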

3.5.4 Training the models

As before, the rating predictions were produced by building the data mining models on the training set, again created using 10-fold cross validation as in the first group of experiments. The big difference this time was that the models were built not for every item but for each cluster of items.

This means that in order to predict the rating of user Ui for the item Ij, it was first determined which cluster Ij belonged to, and the model was then built using the data from all the items belonging to that cluster.

The expectation is that this approach provides the predictive model with enough information to produce quality rules. This is especially important for the items with few ratings, for which, as shown in section 3.4, the lack of sufficient supporting information led the models to produce inaccurate predictions.

4 Results and Evaluation

This chapter describes the implementation decisions tested during the development of our approach and the way they affect the performance of the system. For each implementation decision detailed results from the experiments performed are presented, followed by a discussion about their interpretation. Finally the proposed approach is compared with the base system in order to evaluate the effectiveness of our technique.

4.1 Application of the technique in the implementation

As described in the previous chapter, the user-item matrix was created in Matlab by importing the user, item, rating triplets from the SQL database containing the original dataset. The matrix was then normalized and SVD was applied. The resulting matrices U, S and V were finally reduced to k dimensions.

The value of k was fixed at 14, based on the findings of both Sarwar (2000) and Gong (2009) for the same dataset.

The next step of the process was the formation of the neighborhoods. The k-means clustering algorithm with cosine distance was applied to the reduced 1682×14 matrix, allocating each of the items to a cluster. The resulting cluster memberships were finally exported back to the database to be used by the system when building the prediction models.
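The clustering step can be illustrated with the sketch below. It is not the Matlab routine used in the experiments; it is a simplified k-means under cosine distance in which vectors are scaled to unit length, each vector is assigned to the centroid with the highest cosine similarity, and centroids are recomputed as normalized means. The function names, the seed and the toy vectors are hypothetical.

```python
import math
import random


def _unit(v):
    # Scale a vector to unit length so that a dot product equals the cosine.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def cosine_kmeans(vectors, k, iterations=50, seed=0):
    """Return the cluster index of each vector."""
    points = [_unit(v) for v in vectors]
    centroids = [list(p) for p in random.Random(seed).sample(points, k)]
    labels = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid = highest cosine similarity.
        new_labels = [
            max(range(k), key=lambda c: sum(x * y for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: normalized mean of the members of each cluster.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = _unit([sum(col) / len(members) for col in zip(*members)])
    return labels


# Toy item vectors in a 2-dimensional reduced space.
items = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 3.0]]
print(cosine_kmeans(items, k=2))
```

In the experiments the input would be the 1682 item vectors of dimension 14, and the returned labels correspond to the cluster memberships stored in the database.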

4.2 Defining the parameters of the system

4.2.1 Choosing the number of clusters

One parameter that had to be considered was the number of clusters used during the neighborhood formation. In order to verify whether, and by how much, the choice of the number of clusters affects the prediction quality, five different numbers of clusters were tested, and the MAE and RMSE errors were calculated for each one using the procedure described in the previous section. In order to get comparable results, the same prediction model was used in all 5 repetitions of the experiment. This model was arbitrarily chosen between the two best performing models and was SMOreg.

The numbers of clusters used in the study were 10, 30, 50, 70 and 100. The MAE and RMSE errors produced with those numbers of clusters, using SMOreg as the model, can be seen in the accompanying figures, labeled MAE_SVD_10 for the MAE errors of the experiment using 10 clusters, MAE_SVD_30 for the MAE errors of the experiment using 30 clusters, and so on. The results were once again averaged and ordered by item popularity as before.

The overall average MAE errors over the whole set of items for each case are summarized in the table below.

Cluster Size    Overall average MAE error
10              0.725248 ± 0.0441
30              0.718343 ± 0.0505
50              0.716717 ± 0.0591
70              0.71663 ± 0.0614
100             0.716177 ± 0.0644

The overall average RMSE errors over the whole set of items for each case are summarized in Table 6.

Cluster Size    Overall average RMSE error
10              0.92667 ± 0.0454
30              0.920781 ± 0.0564
50              0.918162 ± 0.0672
70              0.916922 ± 0.0708
100             0.917614 ± 0.0750

Table 6 Overall RMSE errors for the different number of clusters using SMOreg as predictive model

Although the cluster size is usually identified in the literature as one of the highly influential variables in neighbor-based recommender systems, in the context of the current implementation it produces only small variations in the final error values, with 70 clusters giving the marginally lowest RMSE.

In order to formally determine whether the performance of the model changes significantly across the different numbers of clusters, we perform paired t-tests. For each pair of results we perform a paired t-test on the RMSE errors of each instance at the 95% confidence level. The paired t-tests were conducted using Matlab's function [h,p]=ttest(x,y) with the default significance level alpha=0.05, where x and y are the vectors containing the full set of RMSE errors observed for cluster sizes x and y respectively. The test was repeated for every combination of cluster sizes. The results of the 10 t-tests are summarized in the table below.
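For reference, the statistic behind a paired t-test can be computed as in the sketch below. The tests in this work were run with Matlab's ttest; this version, with a hypothetical function name and made-up example values, only shows the calculation. The tstat and sd columns of the table appear to correspond to the test statistic and to the standard deviation of the paired differences.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(x, y):
    """Return (t, sd, degrees_of_freedom) for two paired samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length, at least 2 pairs")
    diffs = [a - b for a, b in zip(x, y)]
    sd = stdev(diffs)                        # sample standard deviation of differences
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    return t, sd, len(diffs) - 1


# Made-up error values for two configurations measured on the same instances.
errors_x = [1.0, 2.0, 3.0, 4.0]
errors_y = [0.5, 1.0, 2.0, 2.5]
print(paired_t_statistic(errors_x, errors_y))
```

The p-value is then obtained from the Student's t distribution with the returned degrees of freedom. The Python standard library does not provide that distribution, so a statistics package would be needed for this last step.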

Pair of cluster sizes tested (x-y)    P value        tstat      sd        Significantly Different
10-30                                 5.2272e-007    5.0465     0.0396    YES
10-50                                 8.2876e-008    5.3956     0.0535    YES
10-70                                 6.2111e-009    5.8551     0.0565    YES
10-100                                2.5565e-007    5.1845     0.0593    YES
30-50                                 0.0595         1.8865     0.0471    NO
30-70                                 0.0056         2.7740     0.0472    YES
30-100                                0.0330         2.1343     0.0504    YES
50-70                                 0.3624         0.9112     0.0462    NO
50-100                                0.7038         0.3803     0.0489    NO
70-100                                0.5890         -0.5405    0.0434    NO

Summarizing the t-tests, we see that at the 5% significance level the data provide sufficient evidence to conclude that the accuracy of the proposed approach differs between numbers of clusters for the pairs 10-30, 10-50, 10-70, 10-100, 30-70 and 30-100, and do not provide sufficient evidence for the pairs 30-50, 50-70, 50-100 and 70-100. The evidence is strong only for the comparisons involving 10 clusters, reinforcing our initial observation that the cluster size does not critically affect the accuracy unless the number of clusters is very small relative to the item space.

4.2.2 Defining the importance of the predictive model

In section 3.4 we provided the relative performance of the 5 different predictive models applied to the un-clustered data. The question that arises at this point is whether this relative performance will be the same when the models are used in combination with the clusters produced through SVD.

In order to answer this question we conducted a set of experiments in which the 5 models were used, all with the same number of clusters (70). Their performance is visualized in the accompanying figure.

The overall average MAE errors for each predictive model, with the same number of clusters (70), are summarized in the table below.

Model            Overall average MAE error
SVD_70_LWL       0.7711 ± 0.0662
SVD_70_M5Rule    0.7199 ± 0.0585
SVD_70_SMOreg    0.71663 ± 0.0614
SVD_70_SLR       0.7549 ± 0.0637
SVD_70_RBF       0.8251 ± 0.0837

The overall average RMSE errors for each predictive model, with the same number of clusters (70), are summarized in the table below.

Model            Overall average RMSE error
SVD_70_LWL       0.9625 ± 0.0751
SVD_70_M5Rule    0.9085 ± 0.0685
SVD_70_SMOreg    0.916922 ± 0.0708
SVD_70_SLR       0.9456 ± 0.0711
SVD_70_RBF       1.0227 ± 0.0807

From the above diagrams we can see that the relative performance of the models did change. SMOreg and M5Rules are now the two best performing models, while SLR moved to third place (from second best in the initial implementation). Also, while RBF was close to the rest of the models in the un-clustered experiments, it is now clearly the worst performing model.

Another indication from the above diagram is how the cluster quality affects the resulting accuracy. We can observe that all 5 models follow a uniform pattern, with peaks and troughs of errors occurring at the same points. This lets us speculate that, since the only thing shared between the models is the cluster in which they are called to operate, it is the quality of this cluster that affects the output errors.

4.3 Comparison of the clustered with the un-clustered approach

The most important comparison made as part of this work is the one showing the difference in performance between the suggested technique, which uses SVD to form clusters of items, and the original approach, in which a predictive model is built for each item. The accompanying figures show the MAE and RMSE errors for the same model, SMOreg, as it proved to work well in both cases. The number of clusters used in this comparison for the SVD version is set to 70, the best performing size identified.

We can see that the approach that uses clustering through SVD performs consistently better than the original version that builds separate prediction models for each item.

Most importantly, the performance of the improved system is less severely affected by a low number of ratings and remains almost steady across the whole spectrum of rating frequencies.

The accompanying figures present the magnitude of the performance improvement between the two methods for the MAE and RMSE errors respectively, sorted by the average item popularity. The improvement rate is calculated as:

For 46 out of the 48 averaged intervals, the approach using SVD to cluster the data improves the accuracy of the recommender. Once again it can be observed how the difference in the performance of the two methods scales with the number of available ratings per item. The less popular the item, the greater the improvement of the SVD-based technique.

To determine the statistical significance of the improvement in performance of the proposed methodology compared to the initial implementation, which built separate predictive models for each item, we perform paired t-tests on the MAE and RMSE errors of each instance at the 95% confidence level, where for both tests:

H0 : (Base Model mean error) = (Proposed model mean error)

H1 : (Base Model mean error) > (Proposed model mean error)

The error series compared are the results of the five different predictive algorithms applied to the un-clustered data and of the same algorithms applied to the data clustered via Singular Value Decomposition with 70 clusters. The test results can be seen in Tables 10 and 11.

| Predictive models compared (x - y) | P value | t statistic | sd | Significantly different |
|---|---|---|---|---|
| LWL – SVD_LWL | 0.00 | 17.8968 | 0.2032 | YES |
| M5Rule – SVD_M5Rule | 0.00 | 19.4605 | 0.1800 | YES |
| SMOreg – SVD_SMOreg | 0.00 | 14.9327 | 0.2125 | YES |
| SLR – SVD_SLR | 0.00 | 12.1380 | 0.1711 | YES |
| RBF – SVD_RBF | 0.00 | 9.3732 | 0.1428 | YES |

Table 10 Paired T-test results. Comparison of the MAE error rate difference between the 5 predictive models applied on the original data versus using SVD clustering (95% confidence interval)

| Predictive models compared (x - y) | P value | t statistic | sd | Significantly different |
|---|---|---|---|---|
| LWL – SVD_LWL | 0.00 | 19.2274 | 0.2299 | YES |
| M5Rule – SVD_M5Rule | 0.00 | 19.1513 | 0.2191 | YES |
| SMOreg – SVD_SMOreg | 0.00 | 14.7433 | 0.2385 | YES |
| SLR – SVD_SLR | 0.00 | 11.8022 | 0.2011 | YES |
| RBF – SVD_RBF | 0.00 | 8.9711 | 0.1549 | YES |

Table 11 Paired T-test results. Comparison of the RMSE error rate difference between the 5 predictive models applied on the original data versus using SVD clustering (95% confidence interval)

Interpreting the results of the t-tests, we reject the null hypothesis in favor of the alternative for all five predictive models. At the 5% significance level the data provide sufficient evidence to conclude that the mean error of the base model is greater than the mean error of the proposed model, that is, that the proposed approach improves the accuracy of the system.
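As an illustration of the test procedure, the following sketch computes a one-sided paired t statistic from two series of per-instance errors. It is not the tool used to produce the tables above. The error series are synthetic, generated only for the example, and the p-value uses a normal approximation to the t distribution, which is an assumption that is reasonable only when the number of paired instances is large.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def paired_t_statistic(base_errors, proposed_errors):
    """Return (t, sd, n) for the paired differences d_i = base_i - proposed_i."""
    diffs = [b - p for b, p in zip(base_errors, proposed_errors)]
    n = len(diffs)
    sd = stdev(diffs)                      # sample standard deviation of the differences
    t = mean(diffs) / (sd / math.sqrt(n))  # paired t statistic
    return t, sd, n

def one_sided_p_value(t):
    """P(T > t) under H0. ASSUMPTION: n is large, so T is approximately N(0, 1)."""
    return 1.0 - NormalDist().cdf(t)

# Synthetic per-instance errors (hypothetical, not the experimental data).
rng = random.Random(0)
proposed = [abs(rng.gauss(0.7, 0.3)) for _ in range(1000)]
base = [p + rng.gauss(0.1, 0.2) for p in proposed]

t, sd, n = paired_t_statistic(base, proposed)
p_value = one_sided_p_value(t)
reject_h0 = p_value < 0.05  # H1: base model mean error > proposed model mean error

print(f"n={n}, sd={sd:.4f}, t={t:.4f}, p={p_value:.4f}, reject H0: {reject_h0}")
```

Because the alternative hypothesis is one-sided, only large positive values of the t statistic lead to rejection of H0.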

4.3.1 The proposed solution addressing the cold start problem

As discussed earlier, introducing new items to a recommender system can lead to poor performance. By evaluating the base model, we also showed how a low number of ratings negatively affects the performance of the system.

In this set of experiments we showed that the proposed method significantly improves the accuracy of the base model, that its accuracy is less sensitive to item popularity, and that the improvement over the un-clustered model is larger for items with few ratings.

These three attributes of the proposed approach are good steps towards a solution of the cold-start problem. Although the accuracy still drops from the high end of the popularity scale towards the low end, it always remains close to the overall mean error. A newly introduced item cannot by itself provide enough information to support the creation of effective rules by the classification models. By using the information of the items that belong to the same cluster, however, we can improve the accuracy of its recommendations, so such an item no longer receives poor predictions.

4.4 Execution time performance and scalability discussion

Scalability, as discussed earlier, is an important aspect of any recommender system. Recommender systems are typically deployed in large user-item spaces that, depending on the implementation context, can extend to several million transactions in a large-scale e-commerce site. When browsing patterns are used to indicate product preference, these transactions will outnumber the plain combinations of users and items, together with the user-user combinations for user-similarity-based systems or the item-item combinations for item-similarity-based systems such as the approach developed here. At the same time, recommendations must be presented to the end user in a timely manner, and many recommendations per second must be produced for all the active customers of a high-traffic site.

In order to test the potential scalability of the proposed system, we measured the execution time required to produce recommendations over the entire experimental dataset of 100,000 instances.

4.4.1 Execution time performance of the proposed system across different models

We report the average execution time (in seconds) needed to complete a full test run for each of the different predictive models, with the number of clusters set to 70. The times are the average of five repetitions of the experiment on the development machine, an Intel Core 2 Duo T9400 processor (2.53 GHz, 1066 MHz FSB, 6 MB L2 cache) with 4 GB of DDR2 memory running 32-bit Windows Vista. We tried to keep the machine load minimal and uniform across all the experiments.

It should be noted that these times can only be used to compare the relative execution time required by the different predictive algorithms. An unknown, and probably large, percentage of the total execution time was spent by Weka in the cross-validation and attribute filtering steps, so the results are not representative of a real use scenario.
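The measurement protocol described above (five repetitions, averaged) can be sketched as a small timing harness. This is only an illustration in Python, not the Weka-based setup used in the experiments. The function `dummy_test_run` is a hypothetical placeholder standing in for a full test run of one predictive model.

```python
import time
from statistics import mean

def average_runtime(run, repetitions=5):
    """Run `run()` `repetitions` times and return the mean wall-clock time in seconds."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return mean(timings)

def dummy_test_run():
    """Hypothetical placeholder workload standing in for one full test run."""
    sum(i * i for i in range(100_000))

avg_seconds = average_runtime(dummy_test_run)
print(f"average over 5 runs: {avg_seconds:.4f} s")
```

As noted above, a harness of this kind measures the whole run, so any fixed overhead such as cross-validation is included in every timing and only relative comparisons between models are meaningful.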

We can see that the execution times vary greatly depending on the algorithm used, with SMOreg being the slowest algorithm and Simple Linear Regression the fastest. While this was expected and is directly linked to the inner complexity of each algorithm, it was interesting to observe that the assumption "better accuracy means worse time performance" did not hold true for all the models under test. To demonstrate this more clearly, we group the RMSE errors and the execution times of the different models together. For example, we can see that M5Rules produces low error rates while performing reasonably well time-wise.

Ultimately, there is no single safe answer to the question of which predictive model is best. The decision may vary depending on whether we want to improve accuracy or response time.

4.4.2 Execution time performance of the proposed system across different cluster sizes

The number of clusters used during the neighborhood formation had an immediate effect on the time performance of the system. The fewer clusters used (and therefore the more items per cluster), the more time is needed to build the predictive models. We measured the execution time for the different numbers of clusters, with SMOreg as the predictive algorithm.

The results of this experiment lead us to the conclusion that, although having more clusters means that more models have to be trained (one model per cluster), the reduction in the training time of each individual cluster eventually leads to better overall performance.
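One way to see why more clusters can reduce the total training time is a simple cost model. The sketch below assumes, purely for illustration, that training one model on a cluster of s items costs s raised to some exponent greater than 1 (in abstract units). Neither this cost model nor the exponent value comes from the experiments, and the number of items per cluster is used only as a rough proxy for the amount of training data. The item count of 1682 is the number of items in the experimental user-item matrix.

```python
def total_training_cost(n_items, n_clusters, exponent=2.0):
    """ASSUMED cost model: one model on s items costs s**exponent abstract units.

    With equal-sized clusters the total is k * (n / k)**exponent, which
    decreases as k grows whenever exponent > 1.
    """
    items_per_cluster = n_items / n_clusters
    return n_clusters * items_per_cluster ** exponent

n_items = 1682  # items in the experimental user-item matrix
costs = {k: total_training_cost(n_items, k) for k in (10, 40, 70, 100)}

for k, cost in costs.items():
    print(f"{k:>3} clusters -> relative cost {cost / costs[10]:.3f}")
```

Under this assumption the total cost is proportional to 1 / k^(exponent - 1), so it falls as the number of clusters k increases. If the training cost were linear in cluster size (exponent equal to 1), the number of clusters would make no difference to the total.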

Scaling the system to a much bigger real use scenario would require finding the best performing number of clusters for the size of the dataset. Although this appears feasible, from the evidence we have we cannot safely assume that the accuracy of the predictions will follow the same patterns as in the experiment, described earlier, that links accuracy with the number of clusters.

Nevertheless, two observations can be combined. First, we have strong evidence that the worst performing number of clusters in terms of accuracy was the smallest one (10), while among the best performing ones (for example 70 to 100) the changes in accuracy were insignificant. Second, the execution-time tests indicate that the best performing numbers of clusters in terms of time were the larger ones. We can therefore say that the throughput of the system can be improved by increasing the number of clusters without a loss of accuracy.

4.4.3 Singular Value Decomposition related scalability

Singular value decomposition is computationally expensive. For an m x n matrix of m users and n items, SVD requires time in the order of O((m+n)^3) (Deerwester, 1990). During the experimental procedure this was not crucial, since the user-item matrix R had a dimension of only 943 x 1682 (with an average SVD execution time of 13.84 seconds using MATLAB). However, even though SVD can be calculated offline, the cost would prove prohibitive for large-scale datasets containing millions of customers and items (Sarwar, 2000), (Papagelis, 2005). As a result, alternative factorization techniques should be considered, such as those in (Sarwar, 2002) and (Ma, 2009).
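To give a feeling for how quickly the cubic cost grows, the following sketch extrapolates the measured execution time to larger matrices using the O((m+n)^3) bound. This is a rough order-of-magnitude illustration only: it assumes the running time scales exactly with (m+n)^3 and ignores constant factors, memory limits, sparsity and hardware differences. The larger matrix dimensions are hypothetical.

```python
def svd_cost(m, n):
    """Abstract cost of SVD on an m x n matrix under the O((m+n)^3) bound."""
    return (m + n) ** 3

MEASURED_SECONDS = 13.84                  # average SVD time reported for the 943 x 1682 matrix
MEASURED_COST = svd_cost(943, 1682)

def extrapolated_seconds(m, n):
    """ASSUMPTION: time scales exactly with (m+n)^3 relative to the measured case."""
    return MEASURED_SECONDS * svd_cost(m, n) / MEASURED_COST

# Hypothetical matrix sizes, chosen only to illustrate the growth.
for m, n in [(943, 1682), (10_000, 20_000), (1_000_000, 1_000_000)]:
    seconds = extrapolated_seconds(m, n)
    print(f"{m} x {n}: ~{seconds:.3g} s (~{seconds / 86_400:.3g} days)")
```

Doubling both dimensions multiplies the estimated time by eight, which is why a decomposition that takes seconds on the experimental dataset becomes impractical at the scale of millions of users and items.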

5 Conclusions

This chapter concludes the project by discussing what was accomplished by the research conducted and by describing the focus of future work.

5.1 Achievements

By conducting accuracy tests in accordance with the experimental procedure followed in the recommender systems literature, and by assessing the significance of our results with paired t-tests, we concluded that the proposed approach significantly improves the overall accuracy of the recommendations compared with the base system. In the first part of our experiments we had shown that using classification models alone to predict the ratings leads to poor performance, especially for items of low popularity.

The most important achievement of the system is its effectiveness in reducing the negative effects of the item-side cold-start problem.

By using clustering in the reduced dimension space via Singular Value Decomposition, we managed to improve the accuracy of the classification models in this problematic area. The performance of the proposed model shows smaller deviations across all ranges of item popularity than the base model. This means that a newly introduced item will no longer receive poor recommendations compared to items with many ratings, which can inherently provide enough information to support the predictive models.

5.2 Future work

In the future we would like to investigate how effective the formation of clusters of users, instead of items, would be in the reduced dimension space in combination with classification models, and how the size of these clusters would affect the system's performance.

We would also like to test whether treating the class as nominal has an impact on the prediction accuracy compared to the numeric approach followed in this work.

We would further like to investigate how the proposed system performs on a different dataset. The initial idea was to use more than one dataset to evaluate the performance of the method, but it was dropped due to time restrictions.

Finally, since the classification algorithms proved to be very sensitive to the amount of available information, it would be interesting to investigate whether we can achieve better performance by enriching the data with demographic and contextual information.