Bando de Dados Avançados - Recommender Systems
Recommender Systems Collaborative Filtering & Dimensionality Reduction
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
*Adapted by Gustavo Coutinho
Note to other teachers and users of these slides: We would be delighted if you found this material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Collaborative Filtering
Harnessing quality judgments of other users
Previously - Content-Based
Utility Matrix
Users have preferences for certain items, and these preferences must be teased out of the data. Let’s represent them with a utility matrix!
[Figure: example utility matrix]
Collaborative Filtering
Consider user x
Find set N of other users whose ratings are “similar” to x’s ratings
Estimate x’s ratings based on ratings of users in N
Collaborative Filtering
Different from Content-Based Filtering
We don’t need to understand the content of a specific item!
Different users share their experiences
Finding “Similar” Users
Let rx and ry be the rating vectors of users x and y, respectively
rx = [*, _, _, *, ***]
ry = [*, _, **, **, _]
Let’s try Jaccard similarity as a measure
Finding “Similar” Users
Now rx and ry are treated as sets of rated items
rx = {1, 4, 5}
ry = {1, 3, 4}
Problem: Ignores the value of the rating!
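The set comparison on this slide can be sketched in a few lines of Python. This is a minimal illustration using the two sets shown above; the function name `jaccard_similarity` and the choice to return 0.0 for two empty sets are assumptions made here, not part of the slides.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(a & b) / len(union)


# Items rated by each user, as listed on the slide.
rx = {1, 4, 5}
ry = {1, 3, 4}

similarity = jaccard_similarity(rx, ry)
print(similarity)  # 2 shared items out of 4 distinct items -> 0.5
```

Only the identity of the rated items enters the computation, which is exactly why the rating values are ignored.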
Finding “Similar” Users
How do we bring the rating values into the formula? Cosine similarity measure
Now rx and ry are treated as points, with missing ratings set to 0
\[ \mathrm{sim}(x, y) = \cos(\theta) = \frac{r_x \cdot r_y}{\lVert r_x \rVert \cdot \lVert r_y \rVert} \]
rx = [1, 0, 0, 1, 3]
ry = [1, 0, 2, 2, 0]
Problem: Treats missing ratings as “negative”!
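As a rough sketch, the cosine measure for the two vectors on this slide can be computed as follows. The helper name `cosine_similarity` and the decision to return 0.0 for an all-zero vector are assumptions of this example.

```python
import math


def cosine_similarity(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)


# Rating vectors from the slide; 0 stands for a missing rating.
rx = [1, 0, 0, 1, 3]
ry = [1, 0, 2, 2, 0]

similarity = cosine_similarity(rx, ry)
print(round(similarity, 3))  # 3 / (sqrt(11) * 3), about 0.302
```

Because a missing rating is stored as 0, it pulls the two vectors apart in the same way a very low rating would.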
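Finding “Similar” Users
How do we balance the missing values? Pearson correlation coefficient
Sxy = items rated by both users x and y
\[ \mathrm{sim}(x, y) = \frac{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)(r_{ys} - \bar{r}_y)}{\sqrt{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)^2} \, \sqrt{\sum_{s \in S_{xy}} (r_{ys} - \bar{r}_y)^2}} \]
\bar{r}_x and \bar{r}_y = average ratings of users x and y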
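A minimal sketch of this formula in Python, assuming ratings are stored as lists with `None` for a missing rating. The example vectors are the star ratings from the earlier “Finding Similar Users” slide read as 1, 2 and 3; the function name and the 0.0 fallbacks are assumptions of this sketch.

```python
import math


def pearson_similarity(rx, ry):
    """Pearson correlation over S_xy, the items rated by both users.

    Each user's mean is taken over all of that user's ratings.
    """
    shared = [i for i, (a, b) in enumerate(zip(rx, ry))
              if a is not None and b is not None]
    if not shared:
        return 0.0  # no co-rated items: convention chosen for this sketch
    rated_x = [a for a in rx if a is not None]
    rated_y = [b for b in ry if b is not None]
    mean_x = sum(rated_x) / len(rated_x)
    mean_y = sum(rated_y) / len(rated_y)
    num = sum((rx[i] - mean_x) * (ry[i] - mean_y) for i in shared)
    den_x = math.sqrt(sum((rx[i] - mean_x) ** 2 for i in shared))
    den_y = math.sqrt(sum((ry[i] - mean_y) ** 2 for i in shared))
    if den_x == 0 or den_y == 0:
        return 0.0
    return num / (den_x * den_y)


rx = [1, None, None, 1, 3]
ry = [1, None, 2, 2, None]

similarity = pearson_similarity(rx, ry)
print(round(similarity, 3))
```

Items rated by only one of the two users never enter the sums, so a missing rating is no longer treated as a low one.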
Similarity Metric
Let’s consider the following utility matrix of users and ratings
[Figure: utility matrix of users and ratings]
Intuitively we want: sim(A,B) > sim(A,C)
Using Jaccard: 1/5 < 2/4
Using Cosine: 0.386 > 0.322
Similarity Metric
Now we use the Pearson correlation
Subtract the (row) mean from each rating
Using Pearson: 0.092 > -0.559
Notice that cosine similarity is a correlation when the data is centered at 0
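The comparison of the three measures can be reproduced with a short script. The matrix image did not survive in this transcript, so the ratings below are an assumption: they follow the utility-matrix example in the Mining of Massive Datasets book, which gives the same Jaccard and Pearson figures as the slides. All function names are illustrative.

```python
import math

# ASSUMPTION: values taken from the MMDS book example, not from this transcript.
# Columns are seven movies; None marks a missing rating.
ratings = {
    "A": [4, None, None, 5, 1, None, None],
    "B": [5, 5, 4, None, None, None, None],
    "C": [None, None, None, 2, 4, 5, None],
}


def jaccard(u, v):
    """Compare only which items were rated, ignoring the values."""
    both = sum(a is not None and b is not None for a, b in zip(u, v))
    either = sum(a is not None or b is not None for a, b in zip(u, v))
    return both / either


def zero_fill(u):
    """Treat missing ratings as 0 (plain cosine)."""
    return [0.0 if a is None else float(a) for a in u]


def center(u):
    """Subtract the row mean from known ratings; missing ratings become 0."""
    rated = [a for a in u if a is not None]
    mean = sum(rated) / len(rated)
    return [0.0 if a is None else a - mean for a in u]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


a, b, c = ratings["A"], ratings["B"], ratings["C"]
jaccard_ab, jaccard_ac = jaccard(a, b), jaccard(a, c)
cosine_ab, cosine_ac = cosine(zero_fill(a), zero_fill(b)), cosine(zero_fill(a), zero_fill(c))
centered_ab, centered_ac = cosine(center(a), center(b)), cosine(center(a), center(c))

print(jaccard_ab, jaccard_ac)
print(round(cosine_ab, 3), round(cosine_ac, 3))
print(round(centered_ab, 3), round(centered_ac, 3))
```

With these assumed values the script gives Jaccard 0.2 versus 0.5, cosine about 0.380 versus 0.322, and centered cosine about 0.092 versus -0.559; the cosine for A and B differs slightly from the 0.386 quoted on the slide. Note that the centered version keeps every rating of each user in the norms, not only the co-rated items.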
Rating Predictions
How can we go from similarity metrics to recommendations?
▪ Let rx be the vector of user x’s ratings
▪ Let N be the set of the k users most similar to x who have rated item i
▪ Prediction for item i of user x:
Simple average:
\[ r_{xi} = \frac{1}{k} \sum_{y \in N} r_{yi} \]
Similarity-weighted average:
\[ r_{xi} = \frac{\sum_{y \in N} s_{xy} \cdot r_{yi}}{\sum_{y \in N} s_{xy}} \]
where sxy = sim(x, y)
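Both prediction rules can be sketched as a single helper. The neighbor list below is hypothetical data made up for illustration, and the function name `predict_rating` and its `weighted` flag are assumptions of this sketch.

```python
def predict_rating(neighbors, weighted=True):
    """Predict r_xi from the k most similar users who rated item i.

    neighbors: list of (s_xy, r_yi) pairs, one per user y in N.
    Returns None when no prediction can be made.
    """
    if not neighbors:
        return None
    if not weighted:
        return sum(r for _, r in neighbors) / len(neighbors)
    total_similarity = sum(s for s, _ in neighbors)
    if total_similarity == 0:
        return None
    return sum(s * r for s, r in neighbors) / total_similarity


# Hypothetical neighbors of user x: (similarity to x, their rating of item i).
neighbors = [(0.9, 4), (0.6, 5), (0.3, 2)]

simple_avg = predict_rating(neighbors, weighted=False)  # (4 + 5 + 2) / 3
weighted_avg = predict_rating(neighbors)                # 7.2 / 1.8
print(simple_avg, weighted_avg)
```

The weighted form lets the closest neighbors dominate, so here the low rating from the least similar user counts for less than in the simple average.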
Item-Item Collaborative Filtering
Until now we have used a user-user approach. What about item-item?
▪ For item i, find other similar items
▪ Estimate rating for item i based on ratings for similar items
▪ Can use the same similarity metrics and prediction functions as in the user-user model
Item-Item CF (|N|=2)
Movies \ Users    1    2    3    4    5    6    7    8    9   10   11   12
1                 1    .    3    .    .    5    .    .    5    .    4    .
2                 .    .    5    4    .    .    4    .    .    2    1    3
3                 2    4    .    1    2    .    3    .    4    3    5    .
4                 .    2    4    .    5    .    .    4    .    .    2    .
5                 .    .    4    3    4    2    .    .    .    .    2    5
6                 1    .    3    .    3    .    .    2    .    .    4    .
. = unknown rating; numbers = rating between 1 and 5
Item-Item CF (|N|=2)
Movies \ Users    1    2    3    4    5    6    7    8    9   10   11   12
1                 1    .    3    .    ?    5    .    .    5    .    4    .
2                 .    .    5    4    .    .    4    .    .    2    1    3
3                 2    4    .    1    2    .    3    .    4    3    5    .
4                 .    2    4    .    5    .    .    4    .    .    2    .
5                 .    .    4    3    4    2    .    .    .    .    2    5
6                 1    .    3    .    3    .    .    2    .    .    4    .
? = estimate rating of movie 1 by user 5
Item-Item CF (|N|=2)
Movies \ Users    1    2    3    4    5    6    7    8    9   10   11   12   sim(1,m)
1                 1    .    3    .    ?    5    .    .    5    .    4    .       1.00
2                 .    .    5    4    .    .    4    .    .    2    1    3      -0.18
3                 2    4    .    1    2    .    3    .    4    3    5    .       0.41
4                 .    2    4    .    5    .    .    4    .    .    2    .      -0.10
5                 .    .    4    3    4    2    .    .    .    .    2    5      -0.31
6                 1    .    3    .    3    .    .    2    .    .    4    .       0.59
? = estimate rating of movie 1 by user 5
Item-Item CF (|N|=2)
Neighbour selection: identify movies similar to movie 1, rated by user 5
Here we use Pearson correlation as similarity:
1) Subtract mean rating mi from each movie i
   m1 = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows
Item-Item CF (|N|=2)
Compute similarity weights: s1,3 = 0.41, s1,6 = 0.59
Item-Item CF (|N|=2)
Predict by taking weighted average
r1,5 = (0.41*2 + 0.59*3) / (0.41+0.59) = 2.6
Movies \ Users    1    2    3    4    5    6    7    8    9   10   11   12
1                 1    .    3    .  2.6    5    .    .    5    .    4    .
2                 .    .    5    4    .    .    4    .    .    2    1    3
3                 2    4    .    1    2    .    3    .    4    3    5    .
4                 .    2    4    .    5    .    .    4    .    .    2    .
5                 .    .    4    3    4    2    .    .    .    .    2    5
6                 1    .    3    .    3    .    .    2    .    .    4    .
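The whole item-item walkthrough (center each movie row, compute cosine similarities to movie 1, keep the two most similar movies rated by user 5, take the weighted average) can be sketched end to end. The positions of the unknown ratings in the matrix below are an assumption, chosen so that the similarities agree with the sim(1,m) values quoted on the slides; function names are illustrative.

```python
import math

NA = None  # unknown rating

# Rows: movies 1..6, columns: users 1..12 (placement of NA is an assumption).
ratings = [
    [1, NA, 3, NA, NA, 5, NA, NA, 5, NA, 4, NA],
    [NA, NA, 5, 4, NA, NA, 4, NA, NA, 2, 1, 3],
    [2, 4, NA, 1, 2, NA, 3, NA, 4, 3, 5, NA],
    [NA, 2, 4, NA, 5, NA, NA, 4, NA, NA, 2, NA],
    [NA, NA, 4, 3, 4, 2, NA, NA, NA, NA, 2, 5],
    [1, NA, 3, NA, 3, NA, NA, 2, NA, NA, 4, NA],
]


def center(row):
    """Subtract the movie's mean rating; unknown ratings become 0."""
    known = [r for r in row if r is not None]
    mean = sum(known) / len(known)
    return [0.0 if r is None else r - mean for r in row]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def predict(ratings, movie, user, k=2):
    """Weighted average over the k movies most similar to `movie` that `user` rated."""
    target = center(ratings[movie])
    candidates = [(cosine(target, center(row)), row[user])
                  for m, row in enumerate(ratings)
                  if m != movie and row[user] is not None]
    neighbors = sorted(candidates, reverse=True)[:k]
    weight_sum = sum(s for s, _ in neighbors)
    if weight_sum == 0:
        return None
    return sum(s * r for s, r in neighbors) / weight_sum


sims = [round(cosine(center(ratings[0]), center(row)), 2) for row in ratings]
prediction = predict(ratings, movie=0, user=4)  # movie 1, user 5 (0-based indices)
print(sims)
print(round(prediction, 1))
```

Sorting by similarity and cutting at k = 2 selects movies 6 and 3 for user 5, the same two neighbors used in the hand calculation.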
CF: Common Practice
▪ Define similarity sij of items i and j
▪ Select k nearest neighbors N(i; x)
  ▪ Items most similar to i, that were rated by x
▪ Estimate rating rxi as the weighted average:
\[ r_{xi} = b_{xi} + \frac{\sum_{j \in N(i;x)} s_{ij} \cdot (r_{xj} - b_{xj})}{\sum_{j \in N(i;x)} s_{ij}} \]
b_{xi} = \mu + b_x + b_i = baseline estimate for rxi
▪ µ = overall mean movie rating
▪ bx = rating deviation of user x = (avg. rating of user x) - µ
▪ bi = rating deviation of movie i
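A minimal sketch of the baseline-corrected estimate, assuming ratings are kept in a dictionary keyed by (user, item). The users, items, ratings and similarity weights below are hypothetical, and bi is computed here as (avg. rating of movie i) - µ by analogy with bx, which is an assumption of this sketch.

```python
def make_baseline(ratings):
    """Return a function b(x, i) = mu + b_x + b_i built from known ratings."""
    mu = sum(ratings.values()) / len(ratings)
    by_user, by_item = {}, {}
    for (x, i), r in ratings.items():
        by_user.setdefault(x, []).append(r)
        by_item.setdefault(i, []).append(r)
    b_user = {x: sum(rs) / len(rs) - mu for x, rs in by_user.items()}
    b_item = {i: sum(rs) / len(rs) - mu for i, rs in by_item.items()}

    def baseline(x, i):
        # Unseen users or items contribute no deviation.
        return mu + b_user.get(x, 0.0) + b_item.get(i, 0.0)

    return baseline


def predict(ratings, baseline, x, i, sims):
    """sims: {j: s_ij} for the neighbors j in N(i; x), all rated by x."""
    weight_sum = sum(sims.values())
    if weight_sum == 0:
        return baseline(x, i)
    deviation = sum(s * (ratings[(x, j)] - baseline(x, j)) for j, s in sims.items())
    return baseline(x, i) + deviation / weight_sum


# Hypothetical data for illustration only.
ratings = {
    ("ana", "m1"): 4, ("ana", "m2"): 5,
    ("bob", "m1"): 2, ("bob", "m2"): 3, ("bob", "m3"): 1,
    ("cat", "m1"): 5, ("cat", "m3"): 3,
}
baseline = make_baseline(ratings)
prediction = predict(ratings, baseline, "ana", "m3", sims={"m1": 0.5, "m2": 0.5})
print(round(prediction, 3))
```

The neighbors contribute only their deviation from what the baseline already expects, so a generous user or a generally well-liked movie does not distort the estimate.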
Item-Item vs. User-User
In practice, it has been observed that item-item often works better than user-user
Why? Items are simpler, users have multiple tastes
Avatar LOTR Matrix Pirates
Alice 1 0.8
Bob 0.5 0.3
Carol 0.9 1 0.8
David 1 0.4
Pros/Cons of Collaborative Filtering
Pros:
▪ Works for any kind of item: no feature selection needed
▪ Unexpected recommendations: a user may receive recommendations that differ from what they actively searched for
▪ Groups with similar ratings: users may connect with each other and form groups with similar interests
Pros/Cons of Collaborative Filtering
Cons:
▪ Cold start: need enough users in the system to find a match
▪ Sparsity: the user/ratings matrix is sparse, so it is hard to find users that have rated the same items
▪ First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)
▪ Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items
Hybrid Methods
▪ Implement two or more different recommenders and combine predictions
  ▪ Perhaps using a linear model
▪ Add content-based methods to collaborative filtering
  ▪ Item profiles for new item problem
  ▪ Demographics to deal with new user problem
Remarks & Practical Tips
▪ Evaluation
▪ Error metrics
▪ Complexity / Speed
Evaluation
[Figure: utility matrix of users (rows) by movies (columns) with the known ratings]
Evaluation
[Figure: the same users-by-movies matrix with some known ratings withheld (shown as ?) to form the Test Data Set]
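The withheld-ratings idea can be sketched as follows: hide some known ratings, predict them, and score the predictions against the hidden values. The transcript does not name a specific error metric, so root-mean-square error (RMSE) is used here as an assumption; the user and movie keys and all rating values are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error over the (user, movie) pairs in `actual`."""
    squared_errors = [(predicted[key] - actual[key]) ** 2 for key in actual]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical test set: ratings withheld from the recommender...
withheld = {("u6", "m3"): 2, ("u6", "m5"): 2, ("u8", "m5"): 1}
# ...and the recommender's predictions for the same cells.
predicted = {("u6", "m3"): 3, ("u6", "m5"): 2, ("u8", "m5"): 3}

error = rmse(predicted, withheld)  # sqrt((1 + 0 + 4) / 3)
print(round(error, 3))
```

Because the withheld ratings were never seen during training, the score reflects how well the method generalizes rather than how well it memorizes.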
Collaborative Filtering: Complexity
▪ Expensive step is finding the k most similar customers: O(|X|)
▪ Too expensive to do at runtime
  ▪ Could pre-compute
▪ Naïve pre-computation takes time O(k·|X|)
▪ We already know how to do this!
  ▪ Near-neighbor search in high dimensions (LSH)
  ▪ Clustering
  ▪ Dimensionality reduction
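The pre-computation idea can be illustrated with a naive offline pass that stores each customer's k nearest neighbors, so that serving a recommendation only needs a table lookup. This sketch compares every pair of customers and does not use LSH, clustering or dimensionality reduction; the customer names and rating vectors are hypothetical.

```python
import heapq
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def precompute_neighbors(vectors, k):
    """Offline step: for every customer, store the k most similar other customers."""
    table = {}
    for x, vx in vectors.items():
        scored = ((cosine(vx, vy), y) for y, vy in vectors.items() if y != x)
        table[x] = [y for _, y in heapq.nlargest(k, scored)]
    return table


# Hypothetical rating vectors (0 = missing rating).
vectors = {
    "a": [5, 4, 0, 0],
    "b": [4, 5, 0, 0],
    "c": [0, 0, 5, 4],
    "d": [0, 1, 4, 5],
}

neighbor_table = precompute_neighbors(vectors, k=1)
print(neighbor_table["a"])  # runtime lookup instead of a fresh similarity search
```

The approximate techniques listed on the slide would replace the all-pairs loop inside `precompute_neighbors` when the number of customers is large.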
Tip: Add Data
▪ Leverage all the data
  ▪ Don’t try to reduce data size in an effort to make fancy algorithms work
  ▪ Simple methods on large data do best
▪ Add more data
  ▪ e.g., add IMDB data on genres
▪ More data beats better algorithms
  http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
Questions