
Ervin Dervishaj, a PhD student at the University of Copenhagen, discusses his research on disentangled representation learning in recommender systems, finding that while disentanglement strongly correlates with interpretability, it doesn't consistently improve recommendation performance. The conversation explores how disentanglement acts as a regularizer that can enhance user trust and interpretability at the potential cost of some accuracy, and touches on the future of large language models in denoising user interaction data.
Welcome to Data Skeptic, a podcast exploring the methods, use cases and consequences of
recommender systems.
Welcome to another installment of Data Skeptic Recommender Systems.
It isn't a strict dichotomy, but oftentimes we think of features in machine learning as being
handcrafted, or as cases where we use representation learning to find those features automatically.
Handcrafted features are really where most things started in machine learning,
and they still play a significant role, and I'm sure will continue to play a significant role in
many ML rollouts, but you can't deny how popular representation learning is getting.
And for good reason, if the model can learn its own features, let it.
But then you're left with this latent space that's quite hard to interpret.
Handcrafted features let you give an insight to a user different from, well, the cosine similarity
was very high between two vectors. So all is not lost; there are many interpretability techniques,
and one idea in particular we're going to delve into today is called disentanglement.
This is the idea that different factors should really be independent of one another,
in that regard perhaps you could move along the axis of one, in sort of a perturbation study.
And the degree to which your model is disentangled gives you a lot of flexibility.
It's really a great tool for interpretability, or at least it would seem so.
Our guest today, Ervin, and his co-author surveyed the literature on disentangled
representation learning in recommender systems, and they noticed most prior work really looked
at disentanglement qualitatively, rather than in a rigorous way. So they set out to go from qualitative
to quantitative and found some really interesting insights along the way.
I'm Ervin Dervishaj. I'm a third-year PhD student at the University of Copenhagen.
I'm studying mainly machine learning and specifically recommender systems,
focusing on representation learning for recommender systems, and more recently large language models
and their applicability in the recommender system space. What first got you interested in
recommender systems? It all sort of started during my masters. I had a course in recommender systems,
and I got hooked on the subject from that moment on. Then, at Politecnico di Milano,
where I was doing my masters, I joined the university team
that took part in the RecSys Challenge in 2018 in Vancouver. After that, I got even more
interested in the subject and decided to do my master thesis in recommender systems.
From that moment on, I decided to pursue a PhD in the topic. Very cool.
For listeners who aren't familiar with the term, could you share your perspective? What is
representation learning? So representation learning is the way that computers take in
huge amounts of data and try to build their own understanding, their own representation of
this data, and then use it to make decisions on top of that representation.
So I'm familiar with the idea of feature engineering. Let's say we're working on fraud,
maybe we'd ask, have you used this credit card from this IP address before and had a successful
transaction? I can handcraft all these features. Is that the same idea? Are you talking about
something different? Yeah, it's somewhat different, because before, as you mentioned, we had this whole
field of feature engineering, where you'd put a lot of effort into trying to build features that
would help the model better handle the task at hand. Whereas with representation learning, you
just input all of this data to the model and hope that, while training, the model is going to
make sense of the data on its own as a way of trying to make better predictions.
Is that through an unsupervised process, or is it a supervised learning technique?
There are two paradigms. Both are used today in machine learning and in
recommender systems. In the unsupervised paradigm, what happens is that you do not have specific
labels for your task at hand, and basically you just leave the model to learn different
representations for the data. Usually that involves some sort of clustering, where the model
looks at the different features and sort of clusters the
data. Whereas in the supervised paradigm, you also have labels associated
with your data. So you build a model that, given a certain input, tries to learn to predict the
correct label for that input and while doing so internally, it has to build representation for
that input. So the unsupervised version is very appealing to me because number one, it's kind of
like I have no work. The algorithm is going to do it for me once I set it up, and maybe the algorithm
can perform as well as or better than I would, and also do it faster. So I'd like to go unsupervised.
Do you think the field is mature enough that that can be a practical thing industry people and
researchers use? I think for certain problems and certain tasks you have to rely on unsupervised
learning. The main reason, as I touched upon a bit, is the fact that you need labels in
order to do supervised learning, and there are a lot of cases where you do not have access to these
labels. You just have a huge amount of data, and then you try to understand what you can use that
data for. There are also cases where you do not even know the task: you just have the data,
and you want to understand what's happening with that data even before you know what you are going
to use it for. So you use unsupervised learning to sort of learn different representations of the
data, and of course that can help you later on once you have a downstream task.
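As a minimal sketch of that unsupervised case: factorizing a toy user-item interaction matrix with truncated SVD yields low-dimensional user and item vectors with no labels involved. The matrix, the dimensionality, and the choice of plain SVD are all illustrative; real recommenders use far richer models.

```python
import numpy as np

# Toy implicit-feedback matrix: rows are users, columns are items,
# nonzero entries mean the user interacted with the item.
interactions = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8],
    [0.0, 0.0, 0.9, 1.0],
])

# Unsupervised representation learning in its simplest form: truncated
# SVD factorizes the matrix into k-dimensional user and item embeddings
# without any labels.
k = 2
U, s, Vt = np.linalg.svd(interactions, full_matrices=False)
user_emb = U[:, :k] * s[:k]   # one k-dimensional vector per user
item_emb = Vt[:k, :].T        # one k-dimensional vector per item

# Users with similar interaction histories end up with similar vectors.
sim_same = user_emb[0] @ user_emb[1]  # users 0 and 1 share taste
sim_diff = user_emb[0] @ user_emb[2]  # users 0 and 2 do not
print(sim_same > sim_diff)  # True
```

The learned dimensions here have no names, which is exactly the interpretability problem discussed next.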
The unsupervised technique then produces a representation, usually a vector of numbers
and those numbers don't necessarily have column labels like my handcrafted features.
Can I interpret the data that's being presented to me? Yeah, when you try to learn
these representations with your models, mostly in the unsupervised paradigm but in some
cases in the supervised paradigm too, it becomes quite difficult to understand what
the representation means. So you have, as you mentioned, sort of a vector that is your
representation for a specific input, and of course when you do unsupervised learning you do not know
what you are looking for when you are learning these representations. So that brings an aspect
of interpretability into the mix, and of course this is a very important topic nowadays in machine
learning. Could you give an example of one or two popular techniques for interpretability and
especially if you find them useful in your work and maybe describe them a little bit? There are a
couple of interpretability methods that are very common and are used both in industry and in
academia. One of them is LIME and the other one is SHAP. Basically, what these two methods do is
build surrogate models that can be used to explain what each individual component within
your embedding means; they try to connect these components with specific features in your
input data. So I think we've got a lot of the key ideas on the table. There's representation
learning to come up with the vector. There's interpretability tools to help me maybe get a sense of
what they are. Perhaps the other key word from your title that we should focus on is disentanglement.
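Before moving on, the local-surrogate idea behind LIME that Ervin just described can be sketched in a few lines. Everything here — the black-box function, the perturbation scale, the Gaussian proximity kernel — is an illustrative stand-in, not the LIME library's actual internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# A black-box "model" we can only query for predictions. It secretly
# depends mostly on feature 0 (this toy function is illustrative).
def black_box(X):
    return 3.0 * X[:, 0] + 0.1 * X[:, 1]

x = np.array([1.0, 1.0, 1.0])  # the instance we want to explain

# LIME-style local surrogate: perturb x, query the black box, weight
# samples by proximity to x, and fit a weighted linear model.
Z = x + rng.normal(scale=0.5, size=(500, 3))
y = black_box(Z)
w = np.exp(-np.sum((Z - x) ** 2, axis=1))   # proximity kernel
A = np.hstack([Z, np.ones((len(Z), 1))])    # add an intercept column
sw = np.sqrt(w)
coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)

# The surrogate's coefficients act as local feature attributions:
# feature 0 should dominate feature 1, which dominates feature 2.
print(abs(coef[0]) > abs(coef[1]) > abs(coef[2]))  # True
```

In practice you would use the `lime` or `shap` packages rather than rolling this by hand; the sketch only shows why a linear surrogate yields per-feature explanations.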
Are representation disentanglement and interpretability linked in recommender models? is the paper
we're going to discuss. What is the disentanglement part? So when you use these machine learning models
or recommender system models, what happens is that when you leave the model to learn whatever
representation of the data it needs in order to accomplish the task at hand as well as it can
(in this case, recommending items to a specific user), it learns representations such that the
different components within that representation are sort of entangled with one another. I usually like
to give the example of trying to buy a t-shirt. When you're trying to recommend a specific item
within the t-shirt section, there are a couple of components, usually attributes, that
every user would look at. For example, you'd have the size of the t-shirt, but another one could
be price. Of course, when doing the recommendation, you want to learn representations such that
these two components are disentangled from one another in the learned representation. By that,
I mean, those are kind of invariant. If you change the size of your t-shirt, it shouldn't affect
also the price. If you want to recommend something that is cheap or expensive, that shouldn't
affect the size of the t-shirt that the user should be interested in based on their previous
purchase history. Getting into your research that we're going to discuss from this paper,
could you outline the goals you had? We scouted the literature for disentangled
representation learning models being used in recommender systems. What we wanted to do with this
study was to try and replicate the results, in the form of a reproducibility study, and also to
investigate whether there is a connection between disentanglement and interpretability, and
between disentanglement and recommendation performance. Let's maybe focus first on
the connection between disentanglement and interpretability. They seem like closely linked
ideas. Could you contrast them? How do they actually differ? The entire premise of learning
disentangled representations is that in the representation space you can
sort of separate these different aspects of a given input. The idea here is that if you are able
to do this separation, then the representations are more interpretable. You can use that later on
in the recommendation phase as a way to show the user why they received a certain
recommendation in the UI or system that they're using. I'm thinking of something like movie
recommendations. Let's say I as a user have contributed a bunch of feedback in some form
and it's clear I like horror movies, but maybe we have a disentangled representation which
also identifies duration. Then I'm thinking, if you perturb duration for a user like me,
it should just recommend shorter or longer horror films. These two factors are independent.
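That perturbation idea can be shown with a deliberately hand-built (not learned) two-dimensional embedding, where dimension 0 is genre and dimension 1 is duration. The items, vectors, and scores below are all made up for illustration.

```python
import numpy as np

# Hypothetical, perfectly disentangled item embeddings: dimension 0
# encodes genre (1.0 = horror), dimension 1 encodes normalized
# duration. These axes are hand-chosen, not learned.
items = {
    "short_horror": np.array([1.0, 0.2]),
    "long_horror":  np.array([1.0, 0.9]),
    "short_comedy": np.array([0.0, 0.2]),
    "long_comedy":  np.array([0.0, 0.9]),
}

def recommend(user_vec):
    # Score every item by dot product and return the best one.
    return max(items, key=lambda name: items[name] @ user_vec)

# A horror fan who prefers long films.
user = np.array([2.0, 1.0])
print(recommend(user))        # long_horror

# Perturbation study: flip only the duration factor. Because the
# dimensions are disentangled, the genre of the top recommendation
# stays fixed -- only the duration moves.
user_short = user.copy()
user_short[1] = -1.0
print(recommend(user_short))  # short_horror
```

With entangled dimensions, the same perturbation could change genre and duration together, which is exactly what disentanglement is meant to prevent.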
Is that kind of the idea? As I said, you try to learn these attributes within the disentangled
representation such that, as you mentioned, if you change only the duration
of the specific movie that you are interested in, the genre shouldn't change. Those would be
totally separate entities within the recommendation model. This feels very intuitive,
but also a little ad hoc or just so story or something like that. How can we study it empirically?
Right. So what we did to study these disentangled representations in recommender systems was to
find papers that were using these models in recommender systems. We tried to
collect the data sets they were using and the different underlying machine learning models used to
learn the disentangled representations. We started by collecting these models and data sets,
and what we found is that they were just focusing on a qualitative evaluation of the disentanglement.
So even though they were presenting work focused on learning disentangled
representations, eventually the evaluation of the disentangled space was mainly done from
a qualitative point of view. What we wanted to do was to try and provide a quantitative
evaluation of the disentanglement, and to do that we focused on existing disentanglement metrics,
namely disentanglement and completeness, and we evaluated these models on those two metrics.
Could we jump into some of your findings? It seems natural that good representations would correlate
highly with interpretability, at least intuitively, is that what you found?
Yeah, so we ran a correlation analysis between the interpretability metrics that we were using
based on top of LIME and SHAP, and the disentanglement and completeness metrics from the
disentanglement literature. The correlation analysis indeed showed that there was a
positive correlation between disentanglement and interpretability. This goes in line with
what our expectation was because, as you mentioned, the entire premise for learning
disentangled representations, or at least one of the benefits, is that you end up with more
interpretable representations. So your work in this paper covers a wide variety of models and data
sets. Did you find that correlation everywhere? Actually, not everywhere. When we accounted for
the different models we were using, we could see that the correlation
would hold, whereas when we accounted for some of the data sets, we found that the connection
holds on the majority of them, but not all. Would you say it holds strongly or is it a slight correlation?
The correlation between the disentanglement and interpretability was quite strong, whereas the
connection between the disentanglement and the effectiveness, the performance of the recommendation
models, there we did not find a consistent correlation, even though some of the models that we
were evaluating presented disentanglement as a key component of their performance.
So if historically that's been a key indicator in these models, is that maybe because prior work
focused on limited data sets, and we could say was a little bit overfit, or how do you interpret
it? Yeah, so what we saw is that for some of these models, just because of the way they were
introduced and presented in their respective papers, some of the details were not given, so we had
some trouble trying to replicate their scores. Also, when you consider disentangled
representation learning, you need some ground truth: once you learn these factors of variation
within the representation, you need to know which specific component or attribute each factor of
variation relates to. Usually in recommender systems, we do not have such ground truth, and that
makes it more difficult to replicate the results of existing work. We tried to follow, as closely
as we could, the experiments of the presented work and the models we were considering,
but eventually we weren't able to reach the same scores that they were reporting.
In terms of performance or reproducibility,
you weren't able to reproduce the exact results of prior studies, is that because those studies
need to be more clear about some initialization, or maybe you just had a different seed than they did,
what can account for the differences? So in our reproducibility study we tried to have multiple
runs for our experiments, all with different seeds, and we eventually reported the average
over those runs. Indeed, for some of the models that we considered,
we did not have all the details. The authors usually provide a specific range for the
hyperparameters that they searched over, but they did not provide the
exact hyperparameters used to get the scores they reported. So in our study we
tried to cover the full ranges of the hyperparameters that they used, but still
we weren't able to get to their scores. So in my survey and people I talked to and other
research I've read, it seems to me that there's a common sense: people believe that disentangled
representations are like a gateway to both better recommendations and better interpretability.
Maybe you found some better interpretability, but it's sounding like better recommendations are
not a guarantee. Do you think the community should be surprised by this result?
Yeah, in some sense the result was also surprising for us, because
when we read this other work, they were pushing for disentanglement as one way to
achieve better recommendation performance, and we did not find that connection in our results.
But I think we should still strive for learning disentangled representations because,
as we discussed, it also brings the aspect of interpretability, which is very important,
especially in recommender systems. In a way, with more interpretable recommendations, you've
also helped build a bit more trust between the recommender system and the user. So
I think it's still important to focus on learning disentangled representations.
Well, I'm curious if you have any thoughts on why it didn't impact accuracy or the performance
of the recommendations. It seems like it should intuitively. Maybe it's the case that the model
already has the accuracy that it needs from the existing features, and there's no benefit,
or maybe there's something else going on here. Do you have a sense of why effectiveness wasn't
improved? Yeah, so when you try to make the model more interpretable,
and in this specific case, when we push the model to learn disentangled
representations, this sort of acts like regularization on the network. So instead of leaving the
network, the recommendation model, free to decide which kind of representation it needs in order
to do better in terms of recommendation performance, with disentangled representations you
sort of regularize the model, enforcing it to build this interpretable
representation, and that sort of penalizes the performance of the model.
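One common way that pressure shows up is a beta-VAE-style objective, where a weighted penalty (typically a KL term) is added on top of the reconstruction or recommendation loss. The function and numbers below are an illustrative sketch, not the setup from the paper.

```python
# Sketch: disentanglement pressure as a regularizer. In beta-VAE-style
# models, the KL term pushes latent dimensions toward independent
# factors; raising beta trades task accuracy for latent structure.
def total_loss(reconstruction_loss, kl_divergence, beta):
    # beta = 0 -> pure recommendation objective, unconstrained latents
    # beta > 1 -> stronger push toward disentangled latents
    return reconstruction_loss + beta * kl_divergence

# Same model state, different regularization strength:
print(total_loss(1.0, 0.5, beta=0.0))  # 1.0
print(total_loss(1.0, 0.5, beta=4.0))  # 3.0
```

The optimizer minimizing the second objective will accept a worse reconstruction term if the penalty drops enough, which is the accuracy cost Ervin describes.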
So should we expect a performance improvement? Or was that an unfair
goal to have? Yeah, in some ways, I think if you consider learning disentangled
representations as a regularizer, then usually what happens is that you sacrifice a
bit of performance and gain a bit more interpretability. So in some sense, yes,
I would consider it expected not to observe the same connection that we saw with interpretability.
Could we talk more about that trade-off? Imagine you're in a new role in some big industry position,
they have a recommender system with a lot of revenue behind it, something like that.
It seems like they would push for highest accuracy, highest performance because they want to make
the most money as a company. What would be the good argument for exploring the trade-off?
I think it's a very important component, because I mean, the main objective of recommender
system is to show the most accurate recommendations to the user. But in the end, you need
to have trust from the user in the system, and I think that is quite important. Even before
they get to the recommendation stage, they need to know that the recommendations they are getting
make sense and are not just some mathematical computation of an underlying model. If
you can show explanations of why the user is getting a specific recommendation,
then they believe more in the system and are also more likely to accept the recommendations
they are getting. So in some sense, it can also help a provider of recommendations
increase retention of their users, because of course, if you increase trust, it also helps.
So I've seen a lot of ideas in the recommendation system literature about including some
metadata with your recommendation. Like, why was this suggested to me? How do you feel about
those approaches? I think they are one of the paradigms within
the recommender system space, actually. In those cases, when we also use metadata or
content information, we call it a content-based recommender system. What it does is look at the
similarity of the metadata between the items in the user's behavior and a list of items
that are potentially recommendable to the user. That also helps make the
recommendations more interpretable, because if in your behavior you have watched movies
of a specific genre, then it's quite easy to show the user why a specific item
from the same genre is recommended again. We also see this on many different platforms, where it says
"because you liked this in the past". And that makes it much easier for the user to understand
why they got a specific recommendation. Along the same lines, do you think disentangled
representations are a tool strictly for like a machine learning engineer or might they propagate
all the way to the user? I think that's a very good question. If you manage to learn
disentangled representations in a way that lets you connect the factors of variation
learned within the representation of a specific user back to
attributes of the different items in your catalog, then you can theoretically allow users to control
those specific parts, those specific factors of variation, within the learned representation.
So going back to the T-shirt example: if you want to change only the color of the T-shirt,
you should be able to see only, say, red T-shirts in your recommendations. So a platform could use
those disentangled representations as a way to give more control to the user, and in a way, users
can control how they want to shape the different recommendations that they are getting.
So in your work, you found a correlation then, and correct me if I'm wrong, but there is a link
in some sense between disentanglement and interpretability, not so much between effectiveness
and disentanglement. With those results in mind, what do you think the future of the idea of
disentanglement should be in recommender systems research? I think we should, of course,
try to have results that are as reproducible as possible in future work,
but also try to provide a quantitative evaluation of learned disentanglement, because
that is crucial. I think the study of disentangled representations is very important from
the interpretability point of view, because we did indeed find this correlation between the two.
And I think if we explore the direction even more, we may even find ways to improve
the performance of the recommender system if we couple disentanglement with other components,
because from our study, we saw that relying only on the disentanglement component did not provide
a connection to the performance of the recommender system. I know this may be out of scope for the particular paper we're
discussing, but in the introduction you mentioned your work also involves large language models,
which are of course the topic everyone is getting into now. Could you talk
about how they're impacting your work? Yeah, so in the very beginning, I was trying to stay away from
the LLM field, but they are everywhere right now, so I sort of shifted a bit from representation
learning in recommender systems to large language models. We recently had a paper accepted on trying
to denoise user profiles with the use of large language models. What happens in a normal
recommender system is that you make use of implicit data instead of explicit data. Implicit
data is just the actions you usually take when you interact with a recommender system, like
specific clicks or watching a movie up to a certain minute. And because this kind of data
is much more abundant, industry ends up using it to train the models. The problem
is that this data is noisy by design. You might watch a specific movie but in the end
give it a dislike, or you purchase something on Amazon and then leave a negative
review. So when you build a recommender system using this noisy data, it might affect
the overall performance of the model. In this latest work, we tried to use an LLM to find items
in the users' previous history of interactions that are noisy, such that if we were to train
the model without those specific items, it would bring better performance to the recommendation model.
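The shape of that pipeline can be sketched as below. The `ask_llm` function is a placeholder for a real LLM call — the actual prompt and model used in the paper are not described here — stubbed so the surrounding filtering logic runs. Item names are invented.

```python
# Illustrative sketch of LLM-based user-profile denoising.
def ask_llm(prompt):
    # Placeholder: a real system would send the prompt to an LLM and
    # parse which history items it flags as noise. This stub just
    # returns a hard-coded outlier so the example is runnable.
    return ["slapstick_comedy"]

def denoise_profile(history):
    prompt = (
        "Given this user's watch history, list items that look "
        "inconsistent with their overall taste: " + ", ".join(history)
    )
    noisy = set(ask_llm(prompt))
    # The recommender would then be trained on the cleaned history.
    return [item for item in history if item not in noisy]

history = ["halloween", "the_shining", "slapstick_comedy", "hereditary"]
print(denoise_profile(history))  # ['halloween', 'the_shining', 'hereditary']
```

As Ervin notes next, the open question is why the model flags the items it does — the LLM step itself remains a black box.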
That's an interesting idea. Do you have any thoughts on how we identify which ones are outliers
in that fashion? Despite the hype, LLMs still act a bit like black boxes. So yes, in this work,
we try to use an LLM and basically ask it to pick which items to remove from the history of the
user. But again, we do not know why it chose those specific items. And we touched on this a little
bit already, but you'd mentioned some of the challenges in reproducing results during your literature
search. I'm sure there are many academics listening now, getting ready to publish their papers and
their work in recommender systems. Could you give any advice for how they could better prepare their
publication, and whatever else they release, to make it more reproducible?
Yeah, I think there are some steps they can take to make their work more reproducible.
One thing which I believe is quite important is to release their code whenever possible
and, very importantly, to give the specific hyperparameters behind their best results.
Usually what happens is that we include a range of hyperparameters for the hyperparameter
tuning part when training the models, but it's very important to also state
which hyperparameters produced the results that end up being reported in the paper.
Another component is releasing the data, and specifically the data splits that were used.
So yeah, overall this allows other researchers to replicate the results.
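A minimal sketch of that advice in practice: alongside reported results, write out the exact winning hyperparameters, the seed, and the split files, not just the search ranges. The file name, fields, and values here are illustrative.

```python
import json

def save_run_metadata(path, best_hparams, seed, split_files):
    # Record exactly what produced the reported numbers, so another
    # researcher does not have to re-run the whole hyperparameter search.
    metadata = {
        "best_hyperparameters": best_hparams,  # exact values, not ranges
        "seed": seed,
        "data_splits": split_files,            # released with the code
    }
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata

meta = save_run_metadata(
    "run_metadata.json",
    best_hparams={"lr": 1e-3, "latent_dim": 64, "beta": 2.0},
    seed=42,
    split_files=["train.csv", "valid.csv", "test.csv"],
)
print(meta["seed"])  # 42
```

Shipping a file like this next to the released code and data splits covers the three gaps Ervin describes: code, exact hyperparameters, and splits.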
Good advice for sure. What's next for you in your career?
Right, so right now I'm doing an internship with Amazon, where I've switched from
recommender systems and am focusing mainly on large language models,
specifically on model merging. And is there anywhere listeners can follow you online?
Sure, they can reach out to me on Twitter or LinkedIn.
Very cool, we'll have links in the show notes for listeners who want to follow up.
Ervin, thank you so much for taking the time to come on and share your work.
Thank you very much for having me.



