Bando de Dados Avançados - Recommender Systems

Recommender Systems
Collaborative Filtering & Dimensionality Reduction

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
*Adapted by Gustavo Coutinho

Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org


Page 2: Bando de Dados Avançados - Recommender Systems

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Collaborative Filtering

Harnessing quality judgments of other users

Page 3: Bando de Dados Avançados - Recommender Systems


Previously - Content-Based


Page 4: Bando de Dados Avançados - Recommender Systems


Previously - Content-Based


Page 5: Bando de Dados Avançados - Recommender Systems


Previously - Content-Based


Page 6: Bando de Dados Avançados - Recommender Systems

Utility Matrix

Users have preferences for certain items, and these preferences must be teased out of the data. Let's represent them with a utility matrix! Example:
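The example figure for this slide is not part of the transcript. As a purely hypothetical illustration (the user ids, item ids and ratings below are invented), a utility matrix can be held as a 2-D array with one row per user, one column per item, and NaN for ratings that are unknown:

```python
import numpy as np

# Hypothetical utility matrix: rows are users, columns are items.
# np.nan marks a rating that is unknown (the user has not rated the item).
users = ["u1", "u2", "u3"]
items = ["item_a", "item_b", "item_c", "item_d"]
utility = np.array([
    [4.0, np.nan, 1.0, np.nan],
    [np.nan, 5.0, np.nan, 2.0],
    [3.0, np.nan, np.nan, 5.0],
])

known = ~np.isnan(utility)              # boolean mask of observed ratings
density = known.sum() / utility.size    # fraction of the matrix that is filled in
print(f"{known.sum()} known ratings, density = {density:.2f}")
```

Real utility matrices are far sparser than this toy one, which is why the unknown entries are kept distinct from a rating of 0.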

Page 7: Bando de Dados Avançados - Recommender Systems

Collaborative Filtering

Consider user x
▪ Find set N of other users whose ratings are “similar” to x’s ratings
▪ Estimate x’s ratings based on ratings of users in N

Page 8: Bando de Dados Avançados - Recommender Systems

Collaborative Filtering

Different from Content-Based Filtering
▪ We don’t need to understand the content of a specific item!
▪ Different users share their experiences

Page 9: Bando de Dados Avançados - Recommender Systems

Finding “Similar” Users

Let rx and ry be the rating vectors of users x and y, respectively
Let's try Jaccard similarity as a measure

rx = [*, _, _, *, ***]
ry = [*, _, **, **, _]

Page 10: Bando de Dados Avançados - Recommender Systems

Finding “Similar” Users

Now, rx and ry are considered as sets (of the items each user rated)
Problem: Ignores the value of the rating!

rx = {1, 4, 5}
ry = {1, 3, 4}
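As a small worked check of the Jaccard measure on the two sets above, the following sketch computes the size of the intersection over the size of the union. The function name `jaccard` is only an illustrative choice:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two sets of rated item ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

rx = {1, 4, 5}  # items rated by user x (from the slide)
ry = {1, 3, 4}  # items rated by user y (from the slide)
sim_xy = jaccard(rx, ry)
print(sim_xy)  # 0.5: two shared items (1 and 4) out of four distinct items
```

The result depends only on which items were rated, not on how they were rated, which is the problem the slide points out.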

Page 11: Bando de Dados Avançados - Recommender Systems

Finding “Similar” Users

How do we bring the rating values into the formula? Cosine similarity measure
Now, rx and ry are considered as points
Problem: Treats missing ratings as “negative”!

$$\text{similarity} = \cos(\theta) = \frac{r_x \cdot r_y}{\lVert r_x \rVert \cdot \lVert r_y \rVert}$$

rx = [1, 0, 0, 1, 3]
ry = [1, 0, 2, 2, 0]
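To make the cosine measure concrete, here is a minimal sketch applied to the two vectors above, with missing ratings already filled in as 0 as on the slide. The helper name `cosine` is an illustrative choice:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)

rx = [1, 0, 0, 1, 3]  # user x, missing ratings stored as 0
ry = [1, 0, 2, 2, 0]  # user y, missing ratings stored as 0
sim_xy = cosine(rx, ry)
print(round(sim_xy, 3))  # 0.302, i.e. 3 / (sqrt(11) * 3)
```

Because a missing rating is stored as 0, it pulls the vector toward "dislike", which is the weakness the slide names.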

Page 12: Bando de Dados Avançados - Recommender Systems

Finding “Similar” Users

How do we balance the missing values? Pearson correlation coefficient
Sxy = items rated by both users x and y

$$\mathrm{sim}(x, y) = \frac{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)(r_{ys} - \bar{r}_y)}{\sqrt{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)^2}\,\sqrt{\sum_{s \in S_{xy}} (r_{ys} - \bar{r}_y)^2}}$$

$\bar{r}_x$ and $\bar{r}_y$ = average rating of x and y
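The Pearson formula can be sketched as follows, reusing the star ratings of users x and y from the earlier slide (1, _, _, 1, 3 and 1, _, 2, 2, _). One detail is an assumption: each user's average is taken over all items that user rated, since the slide only says "average rating". `None` is an illustrative marker for an unrated item.

```python
import math

def pearson(rx, ry):
    """Pearson similarity of two rating lists; None marks an unrated item."""
    # Assumption: each mean is over all items the user rated.
    rated_x = [r for r in rx if r is not None]
    rated_y = [r for r in ry if r is not None]
    mean_x = sum(rated_x) / len(rated_x)
    mean_y = sum(rated_y) / len(rated_y)
    # S_xy: mean-centred ratings for the items rated by both users.
    common = [(a - mean_x, b - mean_y)
              for a, b in zip(rx, ry) if a is not None and b is not None]
    num = sum(dx * dy for dx, dy in common)
    den = (math.sqrt(sum(dx * dx for dx, _ in common))
           * math.sqrt(sum(dy * dy for _, dy in common)))
    return num / den if den else 0.0

rx = [1, None, None, 1, 3]
ry = [1, None, 2, 2, None]
sim_xy = pearson(rx, ry)
print(round(sim_xy, 3))  # 0.316
```

Only the two co-rated items (the first and the fourth) enter the sums, so the missing ratings no longer count against either user.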

Page 13: Bando de Dados Avançados - Recommender Systems

Similarity Metric

Let's consider the following utility matrix of users and ratings

Intuitively we want: sim(A,B) > sim(A,C)
Using Jaccard (sim(A,B) vs. sim(A,C)): 1/5 < 2/4
Using Cosine (sim(A,B) vs. sim(A,C)): 0.386 > 0.322

Page 14: Bando de Dados Avançados - Recommender Systems

Similarity Metric

Now, we’re going to use Pearson correlation
▪ Subtract the (row) mean from each rating
Using Pearson: 0.092 > -0.559
Notice that cosine similarity is a correlation when the data is centered at 0
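The closing remark, that cosine similarity becomes a correlation once the data is centered at 0, can be checked numerically. The two rating vectors below are hypothetical and fully rated (no missing values), so that NumPy's `np.corrcoef` applies directly:

```python
import numpy as np

# Hypothetical, fully rated vectors (not taken from the slides).
a = np.array([4.0, 5.0, 1.0, 3.0])
b = np.array([5.0, 3.0, 2.0, 4.0])

def cosine(u, v):
    """Cosine similarity of two NumPy vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

centered_cos = cosine(a - a.mean(), b - b.mean())  # cosine after subtracting each mean
pearson_r = float(np.corrcoef(a, b)[0, 1])         # Pearson correlation coefficient
print(centered_cos, pearson_r)                     # the two values agree
```

With missing ratings the two can differ slightly depending on which items enter the means and sums, so this identity is exact only for the dense case shown here.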

Page 15: Bando de Dados Avançados - Recommender Systems

Rating Predictions

How can we go from similarity metrics to recommendations?
Let rx be the vector of user x’s ratings
Let N be the set of the k users most similar to x who have rated item i
Prediction for item i of user x:

$$r_{xi} = \frac{1}{k} \sum_{y \in N} r_{yi}$$

$$r_{xi} = \frac{\sum_{y \in N} s_{xy} \cdot r_{yi}}{\sum_{y \in N} s_{xy}}$$

where sxy = sim(x, y)
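The two prediction rules (plain average over the neighbours, and similarity-weighted average) can be compared on a small example. The similarities and ratings below are hypothetical, and the function names are illustrative:

```python
def predict_mean(neighbor_ratings):
    """Plain average of the neighbours' ratings for item i."""
    return sum(neighbor_ratings) / len(neighbor_ratings)

def predict_weighted(sims, neighbor_ratings):
    """Average of the neighbours' ratings, weighted by sim(x, y).

    Assumes the similarities sum to a positive value.
    """
    return sum(s * r for s, r in zip(sims, neighbor_ratings)) / sum(sims)

# Hypothetical neighbourhood N of k = 3 users who all rated item i.
sims = [0.9, 0.6, 0.3]   # s_xy for each neighbour y
ratings = [5, 3, 2]      # r_yi for each neighbour y

print(round(predict_mean(ratings), 2))            # 3.33
print(round(predict_weighted(sims, ratings), 2))  # 3.83
```

The weighted estimate sits closer to 5 because the most similar neighbour gave that rating.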

Page 16: Bando de Dados Avançados - Recommender Systems

Item-Item Collaborative Filtering

Until now we have used a user-user approach. What about item-item?
▪ For item i, find other similar items
▪ Estimate the rating for item i based on ratings for similar items
▪ Can use the same similarity metrics and prediction functions as in the user-user model

Page 17: Bando de Dados Avançados - Recommender Systems

Item-Item CF (|N|=2)

Rows: movies. Columns: users.
_ = unknown rating; numbers = ratings between 1 and 5

movie |   1   2   3   4   5   6   7   8   9  10  11  12
    1 |   1   _   3   _   _   5   _   _   5   _   4   _
    2 |   _   _   5   4   _   _   4   _   _   2   1   3
    3 |   2   4   _   1   2   _   3   _   4   3   5   _
    4 |   _   2   4   _   5   _   _   4   _   _   2   _
    5 |   _   _   4   3   4   2   _   _   _   _   2   5
    6 |   1   _   3   _   3   _   _   2   _   _   4   _

Page 18: Bando de Dados Avançados - Recommender Systems

Item-Item CF (|N|=2)

Rows: movies. Columns: users.
? = estimate rating of movie 1 by user 5

movie |   1   2   3   4   5   6   7   8   9  10  11  12
    1 |   1   _   3   _   ?   5   _   _   5   _   4   _
    2 |   _   _   5   4   _   _   4   _   _   2   1   3
    3 |   2   4   _   1   2   _   3   _   4   3   5   _
    4 |   _   2   4   _   5   _   _   4   _   _   2   _
    5 |   _   _   4   3   4   2   _   _   _   _   2   5
    6 |   1   _   3   _   3   _   _   2   _   _   4   _

Page 19: Bando de Dados Avançados - Recommender Systems

Item-Item CF (|N|=2)

Rows: movies. Columns: users. Last column: sim(1,m), the similarity of movie m to movie 1.
? = estimate rating of movie 1 by user 5

movie |   1   2   3   4   5   6   7   8   9  10  11  12 | sim(1,m)
    1 |   1   _   3   _   ?   5   _   _   5   _   4   _ |  1.00
    2 |   _   _   5   4   _   _   4   _   _   2   1   3 | -0.18
    3 |   2   4   _   1   2   _   3   _   4   3   5   _ |  0.41
    4 |   _   2   4   _   5   _   _   4   _   _   2   _ | -0.10
    5 |   _   _   4   3   4   2   _   _   _   _   2   5 | -0.31
    6 |   1   _   3   _   3   _   _   2   _   _   4   _ |  0.59

Page 20: Bando de Dados Avançados - Recommender Systems

Item-Item CF (|N|=2)

Neighbour selection: identify movies similar to movie 1 that were rated by user 5
Here we use Pearson correlation as similarity:
▪ Subtract mean rating mi from each movie i
   m1 = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
▪ Compute cosine similarities between rows
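The mean-centring step for movie 1 can be reproduced with a short sketch. The positions of the five known ratings are read off the centred row quoted on the slide; using NaN for unknown ratings is an implementation choice made here:

```python
import numpy as np

# Row 1 of the utility matrix; np.nan marks an unknown rating.
row1 = np.array([1, np.nan, 3, np.nan, np.nan, 5,
                 np.nan, np.nan, 5, np.nan, 4, np.nan])

m1 = np.nanmean(row1)                                # mean over known ratings only
centered = np.where(np.isnan(row1), 0.0, row1 - m1)  # unknown ratings become 0
print(m1)                     # 3.6
print(np.round(centered, 1))  # matches the centred row on the slide
```

After centring, an unknown rating (0) means "average" rather than "disliked", which is what makes the subsequent cosine comparison behave like a correlation.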

Page 21: Bando de Dados Avançados - Recommender Systems

Item-Item CF (|N|=2)

Compute similarity weights: s1,3 = 0.41, s1,6 = 0.59

Rows: movies. Columns: users. Last column: sim(1,m), the similarity of movie m to movie 1.

movie |   1   2   3   4   5   6   7   8   9  10  11  12 | sim(1,m)
    1 |   1   _   3   _   ?   5   _   _   5   _   4   _ |  1.00
    2 |   _   _   5   4   _   _   4   _   _   2   1   3 | -0.18
    3 |   2   4   _   1   2   _   3   _   4   3   5   _ |  0.41
    4 |   _   2   4   _   5   _   _   4   _   _   2   _ | -0.10
    5 |   _   _   4   3   4   2   _   _   _   _   2   5 | -0.31
    6 |   1   _   3   _   3   _   _   2   _   _   4   _ |  0.59

Page 22: Bando de Dados Avançados - Recommender Systems

Item-Item CF (|N|=2)

Predict by taking the weighted average of user 5's ratings for movies 3 and 6

r1,5 = (0.41*2 + 0.59*3) / (0.41+0.59) = 2.6

Rows: movies. Columns: users.

movie |   1   2   3   4   5   6   7   8   9  10  11  12
    1 |   1   _   3   _ 2.6   5   _   _   5   _   4   _
    2 |   _   _   5   4   _   _   4   _   _   2   1   3
    3 |   2   4   _   1   2   _   3   _   4   3   5   _
    4 |   _   2   4   _   5   _   _   4   _   _   2   _
    5 |   _   _   4   3   4   2   _   _   _   _   2   5
    6 |   1   _   3   _   3   _   _   2   _   _   4   _
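The whole item-item walk-through (centre rows, compute similarities to movie 1, keep the two most similar movies that user 5 rated, take the weighted average) can be sketched end to end. The matrix below is a reconstruction: the column position of each rating is inferred from the centred row 1 and the similarity values quoted on the slides, so treat it as an assumption rather than the authors' data file.

```python
import numpy as np

NA = np.nan  # unknown rating
# Rows are movies 1..6, columns are users 1..12 (reconstructed layout).
R = np.array([
    [1, NA, 3, NA, NA, 5, NA, NA, 5, NA, 4, NA],
    [NA, NA, 5, 4, NA, NA, 4, NA, NA, 2, 1, 3],
    [2, 4, NA, 1, 2, NA, 3, NA, 4, 3, 5, NA],
    [NA, 2, 4, NA, 5, NA, NA, 4, NA, NA, 2, NA],
    [NA, NA, 4, 3, 4, 2, NA, NA, NA, NA, 2, 5],
    [1, NA, 3, NA, 3, NA, NA, 2, NA, NA, 4, NA],
], dtype=float)

# Centre each movie's known ratings on that movie's mean; unknowns become 0.
means = np.nanmean(R, axis=1, keepdims=True)
C = np.where(np.isnan(R), 0.0, R - means)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

movie, user, k = 0, 4, 2  # movie 1 and user 5 as zero-based indices, |N| = 2
sims = np.array([cosine(C[movie], C[m]) for m in range(R.shape[0])])

# Candidates: other movies that the user has rated, ranked by similarity.
rated = [m for m in range(R.shape[0]) if m != movie and not np.isnan(R[m, user])]
neighbours = sorted(rated, key=lambda m: sims[m], reverse=True)[:k]

weights = sims[neighbours]
prediction = float(weights @ R[neighbours, user] / weights.sum())
print(np.round(sims, 2))                                  # similarities to movie 1
print([m + 1 for m in neighbours], round(prediction, 1))  # [6, 3] 2.6
```

With unrounded similarities the estimate is about 2.59, which rounds to the 2.6 shown on the slide.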

Page 23: Bando de Dados Avançados - Recommender Systems

CF: Common Practice

Define similarity sij of items i and j
Select k nearest neighbors N(i; x)
▪ Items most similar to i that were rated by x
Estimate rating rxi as the weighted average:

$$r_{xi} = b_{xi} + \frac{\sum_{j \in N(i;x)} s_{ij} \cdot (r_{xj} - b_{xj})}{\sum_{j \in N(i;x)} s_{ij}}$$

baseline estimate for rxi: bxi = µ + bx + bi
▪ µ = overall mean movie rating
▪ bx = rating deviation of user x = (avg. rating of user x) − µ
▪ bi = rating deviation of movie i
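A minimal sketch of the baseline-corrected estimate follows. It assumes the baseline is the sum b_xi = µ + b_x + b_i of the three quantities listed on the slide; all numbers are hypothetical and the function names are illustrative:

```python
def baseline(mu, b_user, b_item):
    """Baseline estimate b_xi = mu + b_x + b_i (assumed form)."""
    return mu + b_user + b_item

def predict_with_baseline(b_xi, sims, ratings, baselines):
    """b_xi plus the similarity-weighted average of (r_xj - b_xj) over neighbours j."""
    num = sum(s * (r - b) for s, r, b in zip(sims, ratings, baselines))
    return b_xi + num / sum(sims)

# Hypothetical values.
mu = 3.5           # overall mean movie rating
b_x = 0.25         # user x rates 0.25 above the overall mean
b_i = -0.5         # movie i is rated 0.5 below the overall mean
b_j = [0.5, 0.0]   # rating deviations of the two neighbour movies j
sims = [0.75, 0.25]  # s_ij for the two neighbours
r_xj = [5.0, 3.5]    # user x's ratings of the two neighbours

b_xi = baseline(mu, b_x, b_i)               # 3.25
b_xj = [baseline(mu, b_x, b) for b in b_j]  # [4.25, 3.75]
pred = predict_with_baseline(b_xi, sims, r_xj, b_xj)
print(pred)  # 3.75
```

The neighbours contribute only how far the user's ratings deviate from what the baseline already expects, so a generous user or a popular movie does not inflate the estimate twice.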

 

Page 24: Bando de Dados Avançados - Recommender Systems

Item-Item vs. User-User

In practice, it has been observed that item-item often works better than user-user
Why? Items are simpler; users have multiple tastes

Example ratings (the column position of each rating was lost in extraction)
Columns: Avatar, LOTR, Matrix, Pirates
Alice: 1, 0.8
Bob: 0.5, 0.3
Carol: 0.9, 1, 0.8
David: 1, 0.4

Page 25: Bando de Dados Avançados - Recommender Systems

Pros/Cons of Collaborative Filtering

Works for any kind of item
▪ No feature selection needed
Unexpected recommendations
▪ A user may receive recommendations that differ from what they actively search for
Groups with similar ratings
▪ Users may connect with each other and form groups with similar interests

Page 26: Bando de Dados Avançados - Recommender Systems

Pros/Cons of Collaborative Filtering

Cold Start
▪ Need enough users in the system to find a match
Sparsity
▪ The user/ratings matrix is sparse
▪ Hard to find users that have rated the same items
First rater
▪ Cannot recommend an item that has not been previously rated
▪ New items, esoteric items
Popularity bias
▪ Cannot recommend items to someone with unique taste
▪ Tends to recommend popular items

Page 27: Bando de Dados Avançados - Recommender Systems

Hybrid Methods

Implement two or more different recommenders and combine predictions
▪ Perhaps using a linear model
Add content-based methods to collaborative filtering
▪ Item profiles for new item problem
▪ Demographics to deal with new user problem
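One way to read "combine predictions, perhaps using a linear model" is to fit blend weights by least squares on ratings that are already known. The sketch below is only one possible interpretation, with invented numbers and no intercept term, using NumPy's `np.linalg.lstsq`:

```python
import numpy as np

# Hypothetical predictions from two recommenders for the same (user, item) pairs.
cf_pred = np.array([4.0, 2.0, 5.0, 3.0])       # collaborative filtering
content_pred = np.array([3.0, 3.0, 4.0, 1.0])  # content-based
true_rating = np.array([3.7, 2.3, 4.7, 2.4])   # known ratings for those pairs

# Fit weights w so that w[0]*cf_pred + w[1]*content_pred approximates the known ratings.
A = np.column_stack([cf_pred, content_pred])
w, *_ = np.linalg.lstsq(A, true_rating, rcond=None)

blended = A @ w  # combined predictions
print(np.round(w, 2))  # approximately [0.7, 0.3] for this constructed data
```

The toy ratings were built as 0.7 times the first predictor plus 0.3 times the second, so the fit recovers those weights; on real data the weights would have to be validated on ratings not used for fitting.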

Page 28: Bando de Dados Avançados - Recommender Systems

Remarks & Practical Tips
- Evaluation
- Error metrics
- Complexity / Speed

Page 29: Bando de Dados Avançados - Recommender Systems

Evaluation

[Figure: utility matrix of known ratings, with users as rows and movies as columns]

Page 30: Bando de Dados Avançados - Recommender Systems

Evaluation

[Figure: the same users-by-movies utility matrix, with some of the known ratings withheld (shown as "?") to form the Test Data Set]
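The evaluation idea is to withhold some known ratings, predict them, and compare. Comparing the two Evaluation slides, the withheld ratings appear to be 2, 2, 5, 1 and 3; the predictions below are invented. RMSE is used here as one common error metric, which is an assumption, since the transcript does not say which metric the slides use:

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between predicted and withheld true ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

actual = [2, 2, 5, 1, 3]               # withheld test ratings (read off the slides)
predicted = [2.5, 1.5, 4.0, 1.0, 3.5]  # hypothetical recommender output
error = rmse(predicted, actual)
print(round(error, 3))  # 0.592
```

A lower value means the predictions are closer to the withheld ratings on average, with large misses penalised more heavily than small ones.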

Page 31: Bando de Dados Avançados - Recommender Systems

Collaborative Filtering: Complexity

Expensive step is finding the k most similar customers: O(|X|)
▪ Too expensive to do at runtime
▪ Could pre-compute
Naïve pre-computation takes time O(k·|X|)
We already know how to do this!
▪ Near-neighbor search in high dimensions (LSH)
▪ Clustering
▪ Dimensionality reduction

Page 32: Bando de Dados Avançados - Recommender Systems

Tip: Add Data

Leverage all the data
▪ Don’t try to reduce data size in an effort to make fancy algorithms work
▪ Simple methods on large data do best
Add more data
▪ e.g., add IMDB data on genres
More data beats better algorithms
▪ http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

Page 33: Bando de Dados Avançados - Recommender Systems


Questions
