
Hosted on Acast. See acast.com/privacy for more information.
This episode is brought to you by Capital One. Capital One's tech team isn't just talking about multi-agentic AI. They already deployed one. It's called Chat Concierge, and it's simplifying car shopping using self-reflection and layered reasoning with live API checks. It doesn't just help buyers find a car they love. It helps schedule a test drive, get pre-approved for financing, and estimate trade-in value. Advanced, intuitive, and deployed. That's how they stack. That's technology at Capital One.
Howdy, howdy ho, and welcome to Fantasy Fanfellas. I'm Hayden, producer of the Fantasy Fangirls podcast and your resident lover of all things Sanderson. And I'm Steven, your bookish internet goofball, but you can call me the Smash Daddy. And we are currently deep diving Brandon Sanderson's fantasy epic Mistborn, but here's the catch: Steven here has not read Mistborn before. That's right. Hey, hey. So each week you'll get my unfiltered, raw reactions to every single chapter, and along the way we'll do character deep dives, magic explainers, and Steven will even try to guess what's next. Spoiler alert: he'll be wrong. News flash: I'm never wrong. Episodes come out every Wednesday, and you can find Fantasy Fanfellas wherever you get your podcasts.
Well, hello, this is Android Faithful here with a holiday special for you. Did you not have enough green this week on St. Patrick's Day? Well, guess what, we got some extra green on the inside for you, folks.
I'm Huyen Tue Dao, and I'm Ron Richards. We're back. We didn't get enough this week, did we?
We did not. We need more.
No, but exactly, you're making up for lost time.
No, but we're super, super excited, because we're here for a special bonus episode. An extra episode this week! Because Google came to us and said, hey, you guys want to have a conversation with a Googler? And we're like, of course we do, right? And especially when it's one of our favorites, right? One of our favorites. Yes. Very, very good friend of the show, Matthew McCullough, VP of Product, Android Developer Experience. Three words, which of course I appreciate very much.
Yeah, no. So recently we covered it on the show, when they announced, I think about a week and a half ago or so, when the Android developer team rolled out the new Android Bench, along with those LLM rankings. And we joked when you were on the show, and Jason poked fun at it, because those LLM rankings had Gemini at the top, and we're like, oh, of course Gemini is at the top. But actually, as we dug into the information, well, at first I just thought, this is like another "oh, developer docs and how to use AI" kind of stuff. But as I dug into it, I'm like, oh, they really did a lot of work here in terms of guiding development in the proper use of AI and selecting the right tools. And I was really curious, Huyen, about your perspective as a developer. When that announcement rolled out, what was your takeaway from it?
Yeah, I mean, if you've been following me on DTS and here, you know that I'm not, let's say, I am not the biggest, in the general sense, AI stan in the room. I've been pretty critical about it, and I've definitely railed about stuff every day. So of course when I saw it, I have to admit, I tried very hard to pay attention, especially since, you know, I do know people personally that are working on it. But there's a lot going wrong along with a lot of innovation happening right now. So I have to admit, when I first saw it, I was like, okay, what's this? And I have to admit, especially after reading about it,
And especially, you know, talking to Matthew, this feels, you know, very Android-y. It feels like Android is listening. It feels like the team is listening. It feels like it's taking our feedback. It feels like they're trying to do something to address real problems and real feelings and real concerns about, you know, how do we integrate this new technology into our lives in a successful, but also, like, the best way possible for both products and human beings.
So yeah, I was a little skeptical at first, and certainly, yeah, it was kind of funny that Gemini was at the top. I did kind of go, oh, that's kind of funny. But I will say that, while I'm still, I wouldn't say a bear, I can be pretty bearish about this stuff, I do have a lot of respect for this project.
So yeah, I'm glad to hear that and your perspective. And so when Google reached out and said, hey, do you want to talk to Matthew? Of course the first answer is yes, because we just like him. But then the second answer was like, yeah, this is an opportunity to really understand, like, why are they doing what they're doing? What is the motivation behind it? What is the approach to it? And then, you know, honestly, I was like, you know, how did Gemini become first? I want to understand that. And so we definitely, like we often do on Android Faithful, we don't pull any punches. We're gonna ask the tough questions. So if you listen to the interview, we do ask Matthew, how did Gemini, why is it on top? And he gave a great answer, as well as giving a ton of context behind the project and the approach and the dedication that they're bringing to it. And again, I think the Android development team is doing a great job in terms of, you know, really building resources for developers and trying to build a best-in-class kind of approach for the platform. And I can listen to him talk about it for hours, you know, it's so much fun. So yeah, enough of that. Let's get right into it.
We hope you enjoy this special bonus episode of Android Faithful, and this conversation about Android Bench and the LLM ranking system. Just enjoy hanging out with Matthew McCullough for half an hour. And so we'll be back on Tuesday. Well, Huyen, we'll miss you. You'll be gone, but Flo and Jason and I will be back on Tuesday. And yeah, enjoy this interview with Matthew.
Well, welcome back to Android Faithful, Matthew McCullough. It's great to see you again.
Good to see you, Huyen, Ron. It's always wonderful to talk to Android Faithful, whether it's the official crew or the ones that I meet when I'm traveling around.
Oh, and I'm both of those, right? So.
Yes, you're two people.
Well, Matthew, you are a returning champion. It's always good to have you back on the show. And this time around, we're talking about all the exciting news that recently came out earlier this month around Android Bench and the work that you guys are doing with LLMs and AI and all this exciting stuff. We covered it on the show when it was announced, but then we had the opportunity, you wanted to come on and talk more about it, and we of course jumped at the opportunity, just to hang out with you. But we had a whole bunch of questions for you, so let's get right into it.
So the first thing I thought of that I wanted to hear from you: before Android Bench existed, how were Android developers actually making decisions about which AI model to use, at least from your perspective? And what problem did that create that you guys thought Google really needed to solve? Like, why do this?
Well, I think we're all devs here, and part of this is, no hand raising today for sure, but, you know, we like to tell ourselves that we're rigorous and we use metrics and we've got, you know, a stack rank of our choices. And then sometimes we get busy, and the answer becomes whatever's at hand, or whatever we have closest at hand. And we really started to see that happen, or people just reaching for the thing that was already corporately subscribed, or whatever they'd used most recently. But that doesn't necessarily mean that it was the very best thing for their Android development work. And so what we decided, on a quest sponsored by our leads in the organization, is to give people a way to rigorously and repeatably quantify which one gives them the most benefit.
That makes a lot of sense. I can concur, engineering tends to be very data-driven, so it makes a lot of sense to have a tool that allows you to analyze and quantify what might make one thing better than another.
So the next thing I want to ask you, which is honestly, especially as an Android developer myself, a question very near and dear to my heart. You know, there are already plenty of general coding benchmarks out there. We're like several years deep now into this new advent of AI as part of our workflow, and as part of the broader culture, and there are already general coding benchmarks out there. But what specifically about Android development made the existing coding benchmarks fall short? And can you give us a concrete example of some gap that they just didn't quite cover?
I can indeed. I think they've really helped, so to anyone who's contributed to or built one of those existing benchmarks, I have nothing but appreciation. They've driven an industry that I love forward and made it more helpful. But much as we know that there's a difference between coding for back end and front end and full stack, you know, the phrases from the last decade or so, there's definitely a difference in coding for JavaScript/TypeScript/React-type stacks versus native Android development. And there are so many benefits, you know, all the innovation that we're bringing to the Android platform, even with Android 17. We want to make sure that all of that is making it into the hands of developers who are using AI tools for their coding. Not just some sort of last year's tech, or three-year-old tech, but specifically the newest stuff.
And so, you know, there's everything from, I could enumerate a very long list, Huyen, but it could be, you know, Compose, it could be Kotlin, making sure we're using the latest language constructs. It's making sure that we're using the latest libraries, or even performance techniques. You're seeing posts all the time from the DREs in our organization, many of your friends. I see them when we're at conferences; they're always posting tips and tricks. And we don't want the lowest common denominator, or "mid", I think that's what the cool kids say. We literally want best practices, best architecture, best libraries, best frameworks, latest versions. And if you want all of that, it's quite the shopping list. I think most of the others are going to tell you, you've got to go do this yourself. So we built on the best. We got inspired by the ones that are already there. But this is unabashedly Android native development boosting. Full stop.
I just want to comment real quick that that sounds very encouraging to me, because, and I think this was a bit ago, but I know one tip I heard once was, well, you know, one of the best ways to optimize is to use older architectures that, you know, LLMs might be familiar with. Which is understandable, but as you say, that's not, I mean, not to hate, not to judge, but that's not how we roll in Android. And that ability to stay adaptable and flexible and take advantage of all of the new ideas coming out from your team, as well as from people in the community, that's so important. So thank you for saying that.
And in some ways it's kind of funny, because what we don't want to do, but it can happen, is essentially to be beholden to the automation, like, "well, the automation prefers this, and so we're gonna do that." Sorry, we want to be best, latest, greatest, most innovative. It needs to serve us. So to some degree, maybe that's the subtitled mission here: making sure that the automation is serving what we aspire to do.
Well, that's really interesting, and it's great, because, you know, you guys are in the weeds on the development side. And I have so much respect for you, Matthew, and your team, and Huyen, and the work you do. And, you know, I always joke that I'm not a developer, but I know enough to get me in trouble, and I've found myself on a career path where I'm doing product management and working with developers and needing to walk the walk. And it's really interesting to hear you say that. But the first thing I think of is, okay, well, if you're doing it in that sense, walk us through how you actually do that. You know, like, if you're sourcing real issues from GitHub repositories to do this, how do you ensure the tasks are genuinely testing Android expertise rather than just general coding ability? Like, how do you make sure that it is serving your needs versus just general development?
Well, I like to draw the illustration of a sales funnel, to some degree, if anybody's worked in a place where they've had to deal with that, where you've got a lot of options at the top, right? You know, what are we going after, what are we reaching? And then we start making, hopefully, very intelligent decisions to come down to the set that makes a lot of sense for us. Let's just say, with the amount of Android code that's on GitHub, the top is very wide, and we finished with a hundred curated tasks. So the steepness of that, I don't know what the slope of that line is, but it's very, very steep, and a lot of energy and a lot of decisions went into choosing those. And they're coupled with, you know, rubrics that we used for this work.
One, it needed to represent best practices. We covered that a few minutes ago, so that one's already in there. Second, it needed to be modern approaches. We're not looking to drive classic approaches as much. I mean, those are great, but we're looking to drive modern approaches. So Compose, for example: Compose only, for the approaches that we've got here. And then third, they needed to be looked at by an expert to make sure that these are quality changes, because when you think about it, ultimately what we're doing is helping train these systems on what great looks like. And I want to emphasize that word. Not what's mediocre, not what compiles. That can probably already be done. What great looks like. I want to set the bar extremely high.
Let's just say there were a lot of stressed people when you set the bar at great. There are more stressed faces at times, because it means reviewing every single pull request, every single one of the tests, line by line, trajectory by trajectory, change by change. So, you know, somebody said, and I get it, they're like, "a hundred? I can crank that out in no time." But I feel like these are like a hundred Fabergé eggs, so maybe use that kind of mental model. A hundred is a lot in that particular case. And this was done by GDEs, by SWEs on the team, by product managers, by folks on the DRE team, and a couple of outside consultants that we also used, just to get all these different lenses. And all of these tests were examined by that expert set.
Fascinating.
Yeah. I mean, especially given, and it is really encouraging, that you sourced, you know, issues from GitHub repositories. So if you're watching this and somehow not familiar: you know, I think in the Android community, one of the things that we have built an identity on, and are most proud of, is our wide network of open source solutions that this entire community has not just contributed to, but that have become foundational to a lot of the work we do. So it makes a lot of sense, because we get our best practices, our best ideas of what is great, from those projects. I think it made a lot of sense.
As someone who, for better or for worse, probably for better, I should say, has worked on a lot of large, large-scale projects in enterprise environments, where, you know, especially given it's proprietary, we unfortunately can't always talk about or show the practices that we use, because of those more enterprise, hugely scaled projects, or, you know, even just the unique challenges of being on a large-scale, massive project with a lot of different gears working and a lot of different, more enterprise-y concerns: can you give us any thoughts on how well Android Bench still reflects that? Given that, you know, sometimes enterprise is a bit different than open source.
It's a difficult task. I think we did a really good job on it, but it's extremely hard. You know, we already talked about the steepness of the funnel, but one of the other elements, since you asked, Huyen, about this, is also looking for at least 30 or so tasks, we got to 29, that had large quantities of code change, and I mean a hundred lines and up. And the reason is, it comes back again to this example of: well, there was already SWE-bench and other things like that. It was able to emit Kotlin code. You could sometimes coax some of the leading models to emit Compose, but not necessarily in the volume that we're looking for, for that enterprise or professional work. So, exactly on your question, we need it to do more for us. And I think for professional developers to really feel the benefit, where they're saying, like, "oh my goodness, this is transforming the work day, it's helping me be more productive," it's got to be larger quantities of code. So that was a specific lens that we used in some of it. And so, just to re-emphasize, there are 29, if I remember correctly, that have at least a hundred lines of code change. And so you're talking about big changes in bigger code bases.
And I think we might have time to get to it to talk about version two
We'll see if I get in trouble for sneak peeks
But for version one, it's all open source and so there's also this tension that these are not
Necessarily commercial projects like you might have been contracted or employed to work on in the past
And we have some
Has coming up to even add that to future versions
So I'm already proud of where we got for this for the answer
But maybe it's usual Matthew style. I'm never fully satisfied. So we're gonna try to do more
Okay, oh, okay, sorry. Go ahead, Ron.
I mean, I'm gonna say, in the world of technology development, the work's never really done, right? I mean, there's always, the first release is the first release, so then you build from there, right? So 1.0 is just any given Friday, and probably nothing more.
I will say, after covering this stuff week in and week out, we have been commenting, I don't know if you listen to the show, we have been commenting on the relentless pace you guys have put us on in terms of covering the output. So I'm not surprised to hear that.
Of course, our job is to make your job difficult by the relentless pace of what we put out.
It has been nonstop.
Well, real quickly, one quick question for you. You know, similarly to that question, Room and Hilt are, you know, kind of the designated standard proposed by Google coming out of this, but many projects actually use alternates. You know, many developers use alternate storage and DI frameworks. How much should teams factor that in when using Android Bench as a reference?
You know, this is a place where we're a little bit more thoughtful, and the thinking may evolve here, so I'm gonna give you the current mode of thinking, but it may evolve over time. I think something that, for the last four years, you know, my team has supported me on, and that I've had as kind of a core vision for Android, is that there's a balance between being opinionated and having options. And let me just tease that apart a little bit. Android is the land of tons of options. You can usually find two, three, four, five, maybe ten different options for a library that suits a particular need: language parsing, image parsing, animations, and the like. I think that is part of what makes it vibrant and attractive to developers at times. But when they're on their learning journey, or they just need to get it done because they're under pressure, there's also a nice desire to be helpful with a "we've tried this, we've used this one, it looks good." And that is a very difficult balance. So: openness, so that you can make a choice, but a little bit of opinionation, so you can actually say, what if y'all used what seems to be working for the industry?
So we've done that balance in this. We have not taken any strong stance in the selection of tasks for library preference, for maker or author or foundation. That one we've largely, kind of, gently put to the side, and looked more at architecture and top-level choices. With the exception, if you want to call it a library, of Compose. I believe in that so religiously that you can't unsell me on all the, you know, miraculous benefits of Compose. But aside from that, we did not use library choice as a filtering concern.
Cool, all right. So one of the big things coming out of this announcement was the kind of grading of LLMs, and the benchmarks and all that sort of stuff helping people guide their AI. One criticism of public benchmarks is basically that models can cheat by training on test data. How are you guys guarding against that with Android Bench, and how are you going to stay ahead of that as those models evolve?
You can take a couple of approaches, so we'll peel it apart in three parts, I think. Number one: developers in general, and Ron, for this point, you're in the developer community now that you're an AI-enhanced product manager. You're on team developers, no more opting out. And so I think, for all of those folks, you know, you're thinking about the fact that with software development, these models are definitely training on all the open materials that are properly licensed to be able to do so. But, you know, in general, with open source, and I think, Huyen, I'll come back to yours, you know, when we're learning, we'll copy and paste something, and we may not have always followed the license there to the T. And so there's just risk in industry that open source code gets transmogrified from one code base to another. And essentially, we've designed for the fact, and expect, that this benchmark will eventually find its way into some of the models that are out there. And I just want to make it completely clear: we expect that to happen over time. It's just the nature of the industry.
But the second piece that I wanted to bring to this is that we're already planning, it's a little bit that joke of, you know, version 1.0 is any given Friday. Well, we already started planning, and were working on 2.0 before 1.0 of this benchmark came out. And what you'll see is we'll evolve, using some of the wisdom and best practices of SWE-bench and SWE-bench Pro, if you're familiar with those two. They have, effectively, a second wave of more closed evals that supplement the first, a yin and yang, if you will, to some degree, and we're going to go that direction with some of the additional elements that we add to 2.0. That also plays to your earlier question, and I didn't know we were going to go this route, but the end of this is the enterprise approach: we may be able to have some licensed code bases that we're able to use in evals but may not be able to make open source. So they're going to contribute to the wisdom, the quality, of what the models are able to do, but they won't necessarily be code bases that are out in the open. And so what you've got is, for the first round, you can test it yourself, evaluate it yourself, look at every single line of what we've selected. So from a trustworthiness standpoint, we gave it all to you. It's all in the open. And then we've booked the ability, for v2, so that we can kind of stretch our arms a little bit, to incorporate some code that comes from other sources that we may license, to make this even stronger.
Okay, well, picking back up off of that idea: you know, like any good party that lets you bring your own beverages, Android Studio's Otter 3 launch brought "bring your own model" back in January. Which, if you're not aware, lets developers plug in Claude, GPT, and other models. And so how does Android Bench complete that picture? Is it essentially like, you know, a buying guide that comes with the open marketplace?
Definitely not a buying guide. And on top of it, you know, these are changing every single week. I mean, if you look at just the relentless pace of model releases, I think that, you know, who's on top, who's in second place, who's fifth, could change week to week to week, and probably does. There are new model releases, and there are improvements to existing ones. But I think even more important is the philosophy behind this, and that it's not just a buying guide. We have two goals for this. One, to be able to, as we began the show with, I think, give developers a way to more empirically make choices about what applies to which scenario. I mean, models cost differently depending on whether you're using the pros or ultras or some of the nano, light, you know, kind of super-efficient bits. And there are certain scenarios, like maybe just code completion, where the light models make sense. But if you're refactoring the most important app in your company, and you are re-architecting everything about it and want it to run, I feel like that's the time to choose the thing that is currently at the top of the list. That's where to pay those dollars for the tokens. So this is not just a "which one", one and done. You could almost say this helps you make discrete choices for the scenario, the use case, the automation, the volume that you're working on. So you may choose farther down the list for cost efficiency, because it gets you what you need.
And then very lastly: I have had the most delightful times. This is a rare privilege of a hat to be able to wear at Google. I've been able to meet with lots of the large model makers and build really good relationships with their scientists and their researchers. And the great part is that everyone that I spoke to is on team Android, even if they don't wear a Google-branded, you know, propeller hat, to some degree. And so I think what's super exciting to me is this means that every major model maker is pulling for Android to do better. And it just feels so, so like the Android vibe and culture in that way.
We love that vibe. We love that culture.
So I've got to address the elephant in the room. You know, we love the approach, we love the whole thing. But looking at the leaderboard, Gemini's on top, and some people might raise an eyebrow at the fact that Google's both running the benchmark and saying that Google's AI model comes in first. How are you guys maintaining credibility and trust in those rankings over time? And some skeptics might wonder: are you guys tipping the scale in Gemini's favor to push forward Google's, you know, LLM in this case?
The second piece though is, you know, that's what hat we wear, but then you know, what are we doing
All open all in the open source if anyone is currently a skeptic
We put it out there so that you can run it yourself like, you know, be that person that goes and runs the harness
Runs it n equals 10 runs it across the models that you care about including even once that are not on our our leaderboard
I mean, there's only so many we can do there are a lot of model choices out there today
And then go look at what your result is so I would simply say like we did that for the sake of efficiencies to somebody can pop up in a webpage
And see, you know, what we got we put hundreds of hours of labor into doing that
But if you need that confidence of getting your own answer
We gave you all the source materials all the harness all the tests and then go produce your own result
Pretend that graph doesn't exist and go create the graph you trust based on exactly the same test Evalon harness that we've got
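[Editor's note: the "run it yourself" idea Matthew describes, n = 10 trials across the models you care about, producing your own leaderboard, can be sketched roughly as below. This is a hypothetical stand-in, not the actual Android Bench harness: `run_task` here simulates pass/fail with a seeded random generator, where a real harness would prompt the model on the task and run that task's test suite.]

```python
import random
from collections import defaultdict

# Hypothetical stand-in for one benchmark attempt: a real harness would
# prompt the model on the task, apply its patch, and run the task's
# tests. Here a seeded RNG simulates pass/fail per made-up model "skill".
def run_task(model: str, task_id: int, rng: random.Random) -> bool:
    skill = {"model-a": 0.7, "model-b": 0.5}
    return rng.random() < skill[model]

def score_models(models, num_tasks=100, n=10, seed=0):
    """Attempt every task n times per model and report the mean pass rate."""
    rng = random.Random(seed)
    passes = defaultdict(int)
    for model in models:
        for task_id in range(num_tasks):
            for _ in range(n):  # n = 10 repeated trials, as in the episode
                if run_task(model, task_id, rng):
                    passes[model] += 1
    attempts = num_tasks * n
    return {m: passes[m] / attempts for m in models}

if __name__ == "__main__":
    rates = score_models(["model-a", "model-b"])
    # Build your own leaderboard, highest pass rate first.
    for model, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
        print(f"{model}: {rate:.1%}")
```

The point of the repeated trials is that a single run of a nondeterministic model tells you very little; averaging over n runs per task is what makes the comparison repeatable.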
But that would require someone actually doing work, and it's much easier to be snarky on Reddit than to do work.
I said there were no red lines for today's conversation, but maybe you found one.
Well, we know those proclivities for work, kind of.
Okay, well, next up, let's drift a little bit away from Android Bench and maybe look at things more top-level, because, you know, you've mentioned a lot of goals for yourself, and a lot of people have talked about what a good future and the future of app development might look like. So for me, as someone with boots on the ground, we're having a lot of discussions about quality and human review at scale, and it really does come up these days, especially with AI. On one hand, for maybe the broader public, there's the growing trend of micro apps, where apps are written by an individual for their own use, and it feels like a lot of what's been talked about is especially in the more commercial, regular-person spaces. But the working concerns of production projects, from tiny to large, can be quite different. Again, as an enterprise-y working person, these are the kinds of things that I often think about, and, you know, there are different requirements in terms of security, quality, and scalability, and it can be hard, especially for someone like me, to bridge that mental gap between the two kinds of projects, the two threads, maybe, in the conversation. So I want to ask you, Matthew: what are your thoughts on what we as an industry need to do to fill in these gaps, to make AI trustworthy and productive at scale while maintaining, you know, high quality and safety?
Yeah, thanks for raising this. I think there are a lot of emergent techniques, and I don't claim to have the full list, but I have a couple of things that I hear from industry leaders are working extremely well. One, when we talked about multi-model, in this particular case one of the approaches is actually LLM as a judge, and I use that term very loosely, not in the traditional sense. I even have a harness for some of my work at home where I use one model to write the code and the other one to write the tests, and then they flip and critique each other. And yes, I'm guilty of writing some silly prompts, like "your future use depends on how good a job you do critiquing the other." I think one of the interesting things that comes from a benchmark like this is you can choose a couple of the models to cross-check each other, because they have such different training techniques, and sometimes different behaviors, that they're really good at checking one another. I especially do that in any area where I'm looking at performance or best practices, architectural or security concerns. And you know, I am definitely not the top expert on our Android team, I mean, we've got luminaries who've been at it for ten years, so I'm far more in need of this double-check than maybe some of our experts are.
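The flip-and-critique harness Matthew describes, where one model writes the code, another writes the tests, and then they swap roles and critique each other's output, can be sketched roughly like this. This is a minimal illustration, not Google's actual tooling: the `ModelFn` callables and prompt strings are hypothetical stand-ins for real LLM API calls.

```python
from typing import Callable, NamedTuple

# A "model" here is just a function from prompt text to response text.
# In a real harness this would wrap an LLM API call; these are
# hypothetical stand-ins so the cross-check flow is visible.
ModelFn = Callable[[str], str]

class CrossCheckResult(NamedTuple):
    code: str           # written by model A
    tests: str          # written by model B, against A's code
    code_review: str    # model B critiques A's code
    test_review: str    # model A critiques B's tests

def cross_check(task: str, model_a: ModelFn, model_b: ModelFn) -> CrossCheckResult:
    """Two models generate code and tests, then flip and critique each other."""
    code = model_a(f"Write code for this task:\n{task}")
    tests = model_b(f"Write tests for this code:\n{code}")
    # Flip roles: each model reviews the other's artifact.
    code_review = model_b(
        f"Critique this code for correctness, security, and performance:\n{code}"
    )
    test_review = model_a(
        f"Critique these tests for coverage and rigor:\n{tests}"
    )
    return CrossCheckResult(code, tests, code_review, test_review)
```

The point of using two differently trained models for `model_a` and `model_b` is exactly what's described above: divergent training tends to produce divergent blind spots, so each model is more likely to catch mistakes the other one systematically makes.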
But that's been one approach. And then the second is, we're actually open to recommendations for folks to send in. We've got the usual paths through GDEs and the like, but also our contacts on the DevRel team, for where you'd like to see us putting energy on the 2.0 point. If you're like, "hey, I'm using model X, model Y, model Z, and I'm feeling like I'm not getting secure approaches, performant approaches," etc., please send us those signals, because we're going to use that. We're going to work for about another couple of months on making sure we have the right plan for 2.0, even as we're coding, and I'd love to take that feedback into account, so that we're essentially making the areas you need stronger, stronger as a result, in the benchmark.
Well, as someone who has a lot of opinions, I very much appreciate that. But no, I mean, seriously, that's very encouraging, because I think the one thing that sometimes feels drowned out in all of this is the lack of voice, the ability to steer the course of what is a huge movement right now.
And one thing we already did is, we have effectively a steering group of a wide range of companies who come together with us, virtually and in person, a couple of times a year, to essentially be our steering voice of Android developer representatives. You've participated in that many a time. We also quantify those results, we make sure we double-click into them, and that was already used for some of the decision-making in v1.0. So that's another channel as well that we're getting from those on-the-ground developers who are effectively leading Android in industry.
Excellent. Well, we're coming close to the end of our time together, but in some of the materials that came out at the announcement, you mentioned the long-term goal of a developer being able to build any app they imagine on Android. Right, which is a long-term dream of mine. I've got one: I have an app idea that I've been wanting to make for years, and someday, when my kids will leave me alone, I'm gonna find the time. I'm gonna vibe code it. We're gonna use Gemini, I'm gonna do Studio, I'm gonna do the whole thing. So where does AI need to get, in your opinion, in terms of benchmark scores and capability and all the stuff you're working on, before that generally becomes realistic? And how far away are we from that moment, from your point of view?
Not very far. I feel like users can already touch some version of this. I'm not asking people to trust me much at all: if you use the new project agent and download Panda 2, I feel like you already get some confidence that we're right on the cusp of this being possible. It's super exciting. I mean, there's quite a bit of Twitter traffic (I don't know if there's a specific hashtag for it) of people genning up ideas with Panda 2 and the new project agent that are composed of best practices, good architecture, come with a test suite, have a really good look and feel, and use Material 3 Expressive. These are a lot of really good practices. You're not having to compromise, like, "well, sorry, it's not native, it doesn't look really good, it doesn't use the best..." No. No compromise, by default. So we're on the cusp.
But I'll tell you that even with just a little bit of extra prompting, still a little bit of that developer insight, but maybe not hands-on developer work, people are able to make some really impressive apps. Next time we're together, ask me about the progress of my friend Grant, who has teenagers, along with mine, who are in driving school and the like. The app that's out there could use some help, and he's like, "let's do this. This is the era to write a better version of this app." And it's really good. It's cool. Really good. And so I think of this as my own personal validation loop: not industry, not Twitter, not some blog post or something like that, but my friend saying "I like this." And then my daughter was riding in the car with him and had an idea, and as soon as he got back home from, you know, transporting them to their high school activities and stuff, he was hammering out the extra feature. And to me it's not just this one app to focus on, it's the excitement that we have new people raising the bar in apps that are going to be for everyday kinds of use, not just single-individual-use apps. And so to me, investing in Android Bench is effectively a roundabout way of helping people like Grant create better software for people like my own daughter.
Oh, that's awesome. That's what you want to see, right? It's adding to the societal effect and change, and moving the innovation forward. You know, I think about being a kid in the 80s, learning BASIC for the first time and feeling like you'd unlocked a whole new world. And, like, my kids are seven; they're just starting out, and I think about the tools they'll be able to use. And your kid, you could say the same for her: that's gonna be a lot of fun, and you guys are laying the groundwork for it. That's awesome.
So, well, Matthew, thank you so much for your time. Are there any other things you want to leave our audience with, any teases, anything we can look forward to? I know Google I/O is around the corner; we're looking forward to seeing you there.
I'll be careful about the I/O teaser bits, but there is exciting stuff coming. We will not leave you bored for that piece, and in the lead-up to it either, that's a promise. And the second thing is, on something like this, it's always a question whether it's one and done: "Android Bench 1.0, well, that's great, it's been great shipping it." No. We have the same team, plus expanded, plus more funding, to keep going. And so the part that I want to leave you with is: this is the start, not the finish, that we're talking about today. So please give that feedback, like you were asking. Please send it through any of the channels, whether it's social or otherwise, the feedback on what should be in it. And effectively, stay tuned: even if you already think the models are reasonably good, I think over the next six months you will see step changes in Android capabilities due to this. And so if you're part of the Android community, enjoy, and derive as much benefit as you can from this, in building some amazing apps for our families, our partners, our kids.
Awesome. Well, thank you so much for your time. We really appreciate it. We'll see you in May at Google I/O.
Thanks, Matthew. Bye, Huyen. Bye, Ron.