
Ervin Dervishaj, a PhD student at the University of Copenhagen, discusses his research on disentangled representation learning in recommender systems, finding that while disentanglement strongly correlates with interpretability, it doesn't consistently improve recommendation performance. The conversation explores how disentanglement acts as a regularizer that can enhance user trust and interpretability at the potential cost of some accuracy, and touches on the future of large language models in denoising user interaction data.
Welcome to Data Skeptic, a podcast exploring the methods, use cases and consequences of
recommender systems.
Welcome to another installment of Data Skeptic Recommender Systems.
It isn't a strict dichotomy, but oftentimes we think of features in machine learning as being
handcrafted, or as cases where we use representation learning to find those features automatically.
Handcrafted features are really where most things started in machine learning,
and they still play a significant role, and I'm sure will continue to play a significant role in
many ML rollouts, but you can't deny how popular representation learning is getting.
And for good reason, if the model can learn its own features, let it.
But then you're left with this latent space that's quite hard to interpret.
Handcrafted features let you give an insight to a user different from, well, the cosine similarity
was very high between two vectors. So all is not lost; there are many interpretability techniques,
and one idea in particular we're going to delve into today is called disentanglement.
This is the idea that different factors should really be independent of one another,
in that regard perhaps you could move along the axis of one, in sort of a perturbation study.
And the degree to which your model is disentangled gives you a lot of flexibility.
It's really a great tool for interpretability, or at least it would seem so.
Our guest today, Ervin, and his co-author surveyed the literature on disentangled
representation learning in recommender systems, and they noticed most prior work really looked
at disentanglement qualitatively, rather than in a rigorous way. So they set out to go from qualitative
to quantitative and found some really interesting insights along the way.
I'm Ervin Dervishaj. I'm a third-year PhD student at the University of Copenhagen.
I'm studying mainly machine learning and specifically recommender systems,
focusing on representation learning for recommender systems, and more recently large language models
and their applicability in the recommender system space. What first got you interested in
recommender systems? It all sort of started during my masters. I had a course in recommender systems,
and I got hooked on the subject from that moment on. Then, at Politecnico di Milano,
where I was doing my masters, I joined the university team
that took part in the RecSys Challenge in 2018 in Vancouver. After that, I got even more
interested in the subject and decided to do my master thesis in recommender systems.
From that moment on, I decided to pursue a PhD in the topic. Very cool.
For listeners who aren't familiar with the term, could you share your perspective? What is
representation learning? So representation learning is the way that computers take in
huge amounts of data and try to build their own understanding, their own representation of
this data, and then use it to make decisions on top of that representation.
So I'm familiar with the idea of feature engineering. Let's say we're working on fraud,
maybe we'd ask, have you used this credit card from this IP address before and had a successful
transaction? I can handcraft all these features. Is that the same idea? Are you talking about
something different? Yeah, it's somewhat different, because before, as you mentioned, we had this whole
field of feature engineering, where you'd put a lot of effort into trying to build features that
would help the model better handle the task at hand. Whereas with representation learning, you
just input all of this data to the model and hope that, while training, the model is going to
make sense of the data on its own as a way of trying to make better predictions.
Is that through an unsupervised process, or is it a supervised learning technique?
There are two paradigms. Both are used today in machine learning and in
recommender systems. In the unsupervised paradigm, what happens is that you do not have specific
labels for your task at hand, and basically you just leave the model to learn different
representations for the data. Usually that involves some sort of clustering, where the model
looks at the different features and sort of clusters the
data. Whereas in the supervised paradigm, you also have labels associated
with your data. So you build a model that, given a certain input, tries to learn to predict the
correct label for that input and while doing so internally, it has to build representation for
that input. So the unsupervised version is very appealing to me because number one, it's kind of
like I have no work. The algorithm is going to do it for me once I set it up, and maybe the algorithm
can perform as well as or better than I would, and also do it faster. So I'd like to go unsupervised.
Do you think the field is mature enough that that can be a practical thing industry people and
researchers use? I think for certain problems and certain tasks you have to rely on unsupervised
learning. The main reason, as I touched upon a bit, is the fact that you need labels in
order to do supervised learning, and there are a lot of cases where you do not have access to these
labels. You just have a huge amount of data, and then you try to understand what you can use that
data for. There are also cases where you do not even know the task: you just have the data,
and you want to understand what's happening with that data even before you know what you are going
to use it for. So you use unsupervised learning to sort of learn different representations of the
data, and of course that can help you later on once you have a downstream task.
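As a minimal sketch of that unsupervised case: factorizing a toy user-item interaction matrix with truncated SVD yields low-dimensional user and item vectors with no labels involved. The matrix, the dimensionality, and the choice of plain SVD are all illustrative; real recommenders use far richer models.

```python
import numpy as np

# Toy implicit-feedback matrix: rows are users, columns are items,
# nonzero entries mean the user interacted with the item.
interactions = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8],
    [0.0, 0.0, 0.9, 1.0],
])

# Unsupervised representation learning in its simplest form: truncated
# SVD factorizes the matrix into k-dimensional user and item embeddings
# without any labels.
k = 2
U, s, Vt = np.linalg.svd(interactions, full_matrices=False)
user_emb = U[:, :k] * s[:k]   # one k-dimensional vector per user
item_emb = Vt[:k, :].T        # one k-dimensional vector per item

# Users with similar interaction histories end up with similar vectors.
sim_same = user_emb[0] @ user_emb[1]  # users 0 and 1 share taste
sim_diff = user_emb[0] @ user_emb[2]  # users 0 and 2 do not
print(sim_same > sim_diff)  # True
```

The learned dimensions here have no names, which is exactly the interpretability problem discussed next.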
The unsupervised technique then produces a representation, usually a vector of numbers
and those numbers don't necessarily have column labels like my handcrafted features.
Can I interpret the data that's being presented to me? Yeah, when you try to learn
these representations with your models, mostly in the unsupervised paradigm but in some
cases in the supervised paradigm too, it becomes quite difficult to understand what
the representation means. So you have, as you mentioned, sort of a vector that is your
representation for a specific input, and of course when you do unsupervised learning you do not know
what you are looking for when you are learning these representations. So that brings an aspect
of interpretability into the mix, and of course this is a very important topic nowadays in machine
learning. Could you give an example of one or two popular techniques for interpretability and
especially if you find them useful in your work and maybe describe them a little bit? There are a
couple of interpretability methods that are very common and are used both in industry and in
academia. One of them is LIME and the other one is SHAP. Basically, what these two methods do is
build surrogate models that can be used to explain what each individual component within
your embedding means; they try to connect these components with specific features in your
input data. So I think we've got a lot of the key ideas on the table. There's representation
learning to come up with the vector. There's interpretability tools to help me maybe get a sense of
what they are. Perhaps the other key word from your title that we should focus on is disentanglement.
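Before moving on, the local-surrogate idea behind LIME that Ervin just described can be sketched in a few lines. Everything here — the black-box function, the perturbation scale, the Gaussian proximity kernel — is an illustrative stand-in, not the LIME library's actual internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# A black-box "model" we can only query for predictions. It secretly
# depends mostly on feature 0 (this toy function is illustrative).
def black_box(X):
    return 3.0 * X[:, 0] + 0.1 * X[:, 1]

x = np.array([1.0, 1.0, 1.0])  # the instance we want to explain

# LIME-style local surrogate: perturb x, query the black box, weight
# samples by proximity to x, and fit a weighted linear model.
Z = x + rng.normal(scale=0.5, size=(500, 3))
y = black_box(Z)
w = np.exp(-np.sum((Z - x) ** 2, axis=1))   # proximity kernel
A = np.hstack([Z, np.ones((len(Z), 1))])    # add an intercept column
sw = np.sqrt(w)
coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)

# The surrogate's coefficients act as local feature attributions:
# feature 0 should dominate feature 1, which dominates feature 2.
print(abs(coef[0]) > abs(coef[1]) > abs(coef[2]))  # True
```

In practice you would use the `lime` or `shap` packages rather than rolling this by hand; the sketch only shows why a linear surrogate yields per-feature explanations.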
Are representation disentanglement and interpretability linked in recommender models? is the paper
we're going to discuss. What is the disentanglement part? So when you use these machine learning models
or recommender system models, what happens is that when you leave the model to learn whatever
representation of the data it needs in order to accomplish the task at hand as well as it can
(in this case, recommending items to a specific user), it learns representations such that the
different components within that representation are sort of entangled with one another. I usually like
to give the example of trying to buy a t-shirt. When you're trying to recommend a specific item
within the t-shirt section, there are a couple of components, usually attributes, that
every user would look at. For example, you'd have the size of the t-shirt, but another one could
be price. Of course, when doing the recommendation, you want to learn representations such that
these two components are disentangled from one another in the learned representation. By that,
I mean, those are kind of invariant. If you change the size of your t-shirt, it shouldn't affect
also the price. If you want to recommend something that is cheap or expensive, that shouldn't
affect the size of the t-shirt that the user should be interested in based on their previous
purchase history. Getting into your research that we're going to discuss from this paper,
could you outline the goals you had? We scouted the literature for disentangled
representation learning models being used in recommender systems. What we wanted to do with this
study was to try and replicate the results, in the form of a reproducibility study, and also to
investigate whether there is a connection between disentanglement and interpretability, and
between disentanglement and recommendation performance. Let's maybe focus first on
the connection between disentanglement and interpretability. They seem like closely linked
ideas. Could you contrast them? How do they actually differ? The entire premise of learning
disentangled representations is that in the representation space you can
sort of separate these different aspects of a given input. The idea here is that if you are able
to do this separation, then the representations are more interpretable. You can use that later on
in the recommendation phase as a way to show the user why they received a certain
recommendation in the UI or system that they're using. I'm thinking of something like movie
recommendations. Let's say I as a user have contributed a bunch of feedback in some form
and it's clear I like horror movies, but maybe we have a disentangled representation which
also identifies duration. Then I'm thinking, if you perturb duration for a user like me,
it should just recommend shorter or longer horror films. These two factors are independent.
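That perturbation idea can be shown with a deliberately hand-built (not learned) two-dimensional embedding, where dimension 0 is genre and dimension 1 is duration. The items, vectors, and scores below are all made up for illustration.

```python
import numpy as np

# Hypothetical, perfectly disentangled item embeddings: dimension 0
# encodes genre (1.0 = horror), dimension 1 encodes normalized
# duration. These axes are hand-chosen, not learned.
items = {
    "short_horror": np.array([1.0, 0.2]),
    "long_horror":  np.array([1.0, 0.9]),
    "short_comedy": np.array([0.0, 0.2]),
    "long_comedy":  np.array([0.0, 0.9]),
}

def recommend(user_vec):
    # Score every item by dot product and return the best one.
    return max(items, key=lambda name: items[name] @ user_vec)

# A horror fan who prefers long films.
user = np.array([2.0, 1.0])
print(recommend(user))        # long_horror

# Perturbation study: flip only the duration factor. Because the
# dimensions are disentangled, the genre of the top recommendation
# stays fixed -- only the duration moves.
user_short = user.copy()
user_short[1] = -1.0
print(recommend(user_short))  # short_horror
```

With entangled dimensions, the same perturbation could change genre and duration together, which is exactly what disentanglement is meant to prevent.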
Is that kind of the idea? As I said, you try to learn these attributes within the disentangled
representation such that, as you mentioned, if you change only the duration
of the specific movie that you are interested in, the genre shouldn't change. Those would be
totally separate entities within the recommendation model. This feels very intuitive,
but also a little ad hoc or just so story or something like that. How can we study it empirically?
Right. So what we did to study these disentangled representations in recommender systems was to
find papers that were using these models in recommender systems. We tried to
collect the data sets they were using and the different underlying machine learning models used to
learn the disentangled representations. We started by collecting these models and data sets,
and what we found is that they were just focusing on a qualitative evaluation of the disentanglement.
So even though they were presenting work focused on learning disentangled
representations, eventually the evaluation of the disentangled space was mainly done from
a qualitative point of view. What we wanted to do was to try and provide a quantitative
evaluation of the disentanglement, and to do that we focused on existing disentanglement metrics,
namely disentanglement and completeness, and we evaluated these models on those two metrics.
Could we jump into some of your findings? It seems natural that good representations would correlate
highly with interpretability, at least intuitively, is that what you found?
Yeah, so we ran a correlation analysis between the interpretability metrics that we were using
based on top of LIME and SHAP, and the disentanglement and completeness metrics from the
disentanglement literature. The correlation analysis indeed showed that there was a
positive correlation between disentanglement and interpretability. This goes in line with
what our expectation was because, as you mentioned, the entire premise for learning
disentangled representations, or at least one of the benefits, is that you end up with more
interpretable representations. So your work in this paper covers a wide variety of models and data
sets. Did you find that correlation everywhere? Actually, not everywhere. When we accounted for
the different models we were using, we could see that the correlation
would hold, whereas when we accounted for some of the data sets, we found that the connection
holds on the majority of them, but not all. Would you say it holds strongly or is it a slight correlation?
The correlation between the disentanglement and interpretability was quite strong, whereas the
connection between the disentanglement and the effectiveness, the performance of the recommendation
models, there we did not find a consistent correlation, even though some of the models that we
were evaluating presented disentanglement as a key component of their performance.
So if historically that's been a key indicator in these models, is that maybe because prior work
focused on limited data sets, and we could say was a little bit overfit, or how do you interpret
it? Yeah, so what we saw is that for some of these models, just because of the way they were
introduced and presented in their respective papers, some of the details were not given, so we had
some trouble trying to replicate their scores. Also, when you consider disentangled
representation learning, you need some ground truth: once you learn these factors of variation
within the representation, you need to know which specific component or attribute each factor of
variation relates to. Usually in recommender systems, we do not have such ground truth, and that
makes it more difficult to replicate the results of existing work. We tried to follow, as closely
as we could, the experiments of the presented work and the models we were considering,
but eventually we weren't able to reach the same scores that they were reporting.
In terms of performance or reproducibility,
you weren't able to reproduce the exact results of prior studies, is that because those studies
need to be more clear about some initialization, or maybe you just had a different seed than they did,
what can account for the differences? So in our reproducibility study we tried to have multiple
runs for our experiments, all with different seeds, and we eventually reported the average
over those runs. Indeed, for some of the models that we considered,
we did not have all the details. The authors usually provide a specific range for the
hyperparameters that they searched over, but they did not provide the
exact hyperparameters used to get the scores they reported. So in our study we
tried to cover the full ranges of the hyperparameters that they used, but still
we weren't able to get to their scores. So in my survey and people I talked to and other
research I've read, it seems to me that there's a common sense: people believe that disentangled
representations are like a gateway to both better recommendations and better interpretability.
Maybe you found some better interpretability, but it's sounding like better recommendations are
not a guarantee. Do you think the community should be surprised by this result?
Yeah, in some sense the result was also surprising for us, because
when we read this other work, they were pushing for disentanglement as one way to
achieve better recommendation performance, and we did not find that connection in our results.
But I think we should still strive for learning disentangled representations because,
as we discussed, it also brings the aspect of interpretability, which is very important,
especially in recommender systems. In a way, with more interpretable recommendations, you've
also helped build a bit more trust between the recommender system and the user. So
I think it's still important to focus on learning disentangled representations.
Well, I'm curious if you have any thoughts on why it didn't impact accuracy or the performance
of the recommendations. It seems like it should intuitively. Maybe it's the case that the model
already has the accuracy that it needs from the existing features, and there's no benefit,
or maybe there's something else going on here. Do you have a sense of why effectiveness wasn't
improved? Yeah, so when you try to make the model more interpretable,
and in this specific case, when we push the model to learn disentangled
representations, this sort of acts like regularization on the network. So instead of leaving the
network, the recommendation model, free to decide which kind of representation it needs in order
to do better in terms of recommendation performance, with disentangled representations you
sort of regularize the model, enforcing it to build this interpretable
representation, and that sort of penalizes the performance of the model.
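One common way that pressure shows up is a beta-VAE-style objective, where a weighted penalty (typically a KL term) is added on top of the reconstruction or recommendation loss. The function and numbers below are an illustrative sketch, not the setup from the paper.

```python
# Sketch: disentanglement pressure as a regularizer. In beta-VAE-style
# models, the KL term pushes latent dimensions toward independent
# factors; raising beta trades task accuracy for latent structure.
def total_loss(reconstruction_loss, kl_divergence, beta):
    # beta = 0 -> pure recommendation objective, unconstrained latents
    # beta > 1 -> stronger push toward disentangled latents
    return reconstruction_loss + beta * kl_divergence

# Same model state, different regularization strength:
print(total_loss(1.0, 0.5, beta=0.0))  # 1.0
print(total_loss(1.0, 0.5, beta=4.0))  # 3.0
```

The optimizer minimizing the second objective will accept a worse reconstruction term if the penalty drops enough, which is the accuracy cost Ervin describes.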
So should we expect a performance improvement? Or was that an unfair
goal to have? Yeah, in some ways, I think if you consider learning disentangled
representations as a regularizer, then usually what happens is that you sacrifice a
bit of performance and gain a bit more interpretability. So in some sense, yes,
I would consider it expected not to observe the same connection that we saw with interpretability.
Could we talk more about that trade-off? Imagine you're in a new role in some big industry position,
they have a recommender system with a lot of revenue behind it, something like that.
It seems like they would push for highest accuracy, highest performance because they want to make
the most money as a company. What would be the good argument for exploring the trade-off?
I think it's a very important component, because I mean, the main objective of recommender
system is to show the most accurate recommendations to the user. But in the end, you need
to have trust from the user in the system, and I think that is quite important. Even before
they get to the recommendation stage, they need to know that the recommendations they are getting
make sense and are not just some mathematical computation of an underlying model. If
you can show explanations of why the user is getting a specific recommendation,
then they believe more in the system and are also more likely to accept the recommendations
they are getting. So in some sense, it can also help a provider of recommendations
increase retention of their users, because of course, if you increase trust, it also helps.
So I've seen a lot of ideas in the recommendation system literature about including some
metadata with your recommendation. Like, why was this suggested to me? How do you feel about
those approaches? I think they are one of the paradigms within
the recommender system space, actually. In those cases, when we also use metadata or
content information, we call it a content-based recommender system. What it does is look at the
similarity of the metadata between the items in the user's behavior and a list of items
that are potentially recommendable to the user. That also helps make the
recommendations more interpretable, because if in your behavior you have watched movies
of a specific genre, then it's quite easy to show the user why a specific item
from the same genre is recommended again. We also see this on many different platforms, where it says
"because you liked this in the past". And that makes it much easier for the user to understand
why they got a specific recommendation. Along the same lines, do you think disentangled
representations are a tool strictly for like a machine learning engineer or might they propagate
all the way to the user? I think that's a very good question. If you manage to learn
disentangled representations in a way that lets you connect the factors of variation
learned within the representation of a specific user back to
attributes of the different items in your catalog, then you can theoretically allow users to control
those specific parts, those specific factors of variation, within the learned representation.
So going back to the T-shirt example: if you want to change only the color of the T-shirt,
you should be able to see only, say, red T-shirts in your recommendations. So a platform could use
those disentangled representations as a way to give more control to the user, and in a way, users
can control how they want to shape the different recommendations that they are getting.
So in your work, you found a correlation then, and correct me if I'm wrong, but there is a link
in some sense between disentanglement and interpretability, not so much between effectiveness
and disentanglement. With those results in mind, what do you think the future of the idea of
disentanglement should be in recommender systems research? I think we should, of course,
try to have results that are as reproducible as possible in future work,
but also try to provide a quantitative evaluation of learned disentanglement, because
that is crucial. I think the study of disentangled representations is very important from
the interpretability point of view, because we did indeed find this correlation between the two.
And I think if we explore the direction even more, we may even find ways to improve
the performance of the recommender system if we couple disentanglement with other components,
because from our study, we saw that relying only on the disentanglement component did not provide
a connection to the performance of the recommender system. I know this may be out of scope for the particular paper we're
discussing, but in the introduction you mentioned your work also involves large language models,
which are of course the topic everyone is getting into now. Could you talk
about how they're impacting your work? Yeah, so in the very beginning, I was trying to stay away from
the LLM field, but they are everywhere right now, so I sort of shifted a bit from representation
learning in recommender systems to large language models. We recently had a paper accepted on trying
to denoise user profiles with the use of large language models. What happens in a normal
recommender system is that you make use of implicit data instead of explicit data. Implicit
data is just the actions you usually take when you interact with a recommender system, like
specific clicks or watching a movie up to a certain minute. And because this kind of data
is much more abundant, industry ends up using it to train the models. The problem
is that this data is noisy by design. You might watch a specific movie but in the end
give it a dislike, or you purchase something on Amazon and then leave a negative
review. So when you build a recommender system using this noisy data, it might affect
the overall performance of the model. In this latest work, we tried to use an LLM to find items
in the users' previous history of interactions that are noisy, such that if we were to train
the model without those specific items, it would bring better performance to the recommendation model.
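The shape of that pipeline can be sketched as below. The `ask_llm` function is a placeholder for a real LLM call — the actual prompt and model used in the paper are not described here — stubbed so the surrounding filtering logic runs. Item names are invented.

```python
# Illustrative sketch of LLM-based user-profile denoising.
def ask_llm(prompt):
    # Placeholder: a real system would send the prompt to an LLM and
    # parse which history items it flags as noise. This stub just
    # returns a hard-coded outlier so the example is runnable.
    return ["slapstick_comedy"]

def denoise_profile(history):
    prompt = (
        "Given this user's watch history, list items that look "
        "inconsistent with their overall taste: " + ", ".join(history)
    )
    noisy = set(ask_llm(prompt))
    # The recommender would then be trained on the cleaned history.
    return [item for item in history if item not in noisy]

history = ["halloween", "the_shining", "slapstick_comedy", "hereditary"]
print(denoise_profile(history))  # ['halloween', 'the_shining', 'hereditary']
```

As Ervin notes next, the open question is why the model flags the items it does — the LLM step itself remains a black box.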
That's an interesting idea. Do you have any thoughts on how we identify which ones are outliers
in that fashion? Despite the hype, LLMs still act a bit like black boxes. So yes, in this work,
we try to use an LLM and basically ask it to pick which items to remove from the history of the
user. But again, we do not know why it chose those specific items. And we touched on this a little
bit already, but you'd mentioned some of the challenges in reproducing results during your literature
search. I'm sure there are many academics listening now, getting ready to publish their papers and
their work in recommender systems. Could you give any advice for how they could better prepare their
publication, and whatever else they release, to make it more reproducible?
Yeah, I think there are some steps they can take to make their work more reproducible.
One thing which I believe is quite important is to release their code whenever possible
and, very importantly, to give the specific hyperparameters behind their best results.
Usually what happens is that we include a range of hyperparameters for the hyperparameter
tuning part when training the models, but it's very important to also state
which hyperparameters produced the results that end up being reported in the paper.
Another component is releasing the data, and specifically the data splits that were used.
So yeah, overall this allows other researchers to replicate the results.
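A minimal sketch of that advice in practice: alongside reported results, write out the exact winning hyperparameters, the seed, and the split files, not just the search ranges. The file name, fields, and values here are illustrative.

```python
import json

def save_run_metadata(path, best_hparams, seed, split_files):
    # Record exactly what produced the reported numbers, so another
    # researcher does not have to re-run the whole hyperparameter search.
    metadata = {
        "best_hyperparameters": best_hparams,  # exact values, not ranges
        "seed": seed,
        "data_splits": split_files,            # released with the code
    }
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata

meta = save_run_metadata(
    "run_metadata.json",
    best_hparams={"lr": 1e-3, "latent_dim": 64, "beta": 2.0},
    seed=42,
    split_files=["train.csv", "valid.csv", "test.csv"],
)
print(meta["seed"])  # 42
```

Shipping a file like this next to the released code and data splits covers the three gaps Ervin describes: code, exact hyperparameters, and splits.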
Good advice for sure. What's next for you in your career?
Right, so right now I'm doing an internship with Amazon, where I've switched from
recommender systems and am focusing mainly on large language models,
specifically on model merging. And is there anywhere listeners can follow you online?
Sure, they can reach out to me on Twitter or LinkedIn.
Very cool, we'll have links in the show notes for listeners who want to follow up.
Ervin, thank you so much for taking the time to come on and share your work.
Thank you very much for having me.



