Don’t Give Developers a Leaderboard
At work we measure the PR count of every software developer. Confusingly enough, we also measure the PR count of designers, because in the AI world we expect our designers to create pull requests (true story, I’m devastated to say). In this environment I’ve got PRs out to “improve” test classes that don’t need to be changed at all.
So Amazon ranked their devs by how much they used AI tools. The result was entirely predictable.
You Knows It
At Amazon, employees started creating unnecessary work to increase their scores.
Apparently some workers were giving AI agents tasks that didn’t need to be done at all, producing work that nobody needed.
They probably spent time asking Claude, on max mode, what the weather is likely to be in Mexico. At least that’s the sort of thing that I do in my job to make sure that I *appear* AI productive, and appearing productive is the crucial thing.
I suppose that while the same productivity theatre continues at my company, at least Amazon has cancelled this charade.
My complaint? They could have produced this outcome from the beginning. They didn’t need to waste so many tokens and so much of the earth’s natural resources in the process.
Optimization
Amazon had a good idea that this would happen. They knew their staff, and they knew how software developers would behave.
They knew that software developers are optimizers. Optimization is what reduces runtime and gets your features working properly (not working in a slooow way that annoys users).
So when you measure developers on their AI usage, you don’t just get increased test coverage. You get increased AI usage. AI usage does correlate with productivity (https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/, https://www.nber.org/papers/w31161, https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-economic-potential-of-generative-ai-the-next-productivity-frontier), but those studies define productivity narrowly, and in every case the usage burns tokens. That means we (as a technology sector) create an incentive to use AI without a promise of better code.
It’s the wrong metric. It’s such a common thing that it has a name: Goodhart’s Law.
“When a measure becomes a target, it ceases to be a good measure.”
The moment somebody attaches importance to a number, people start optimizing for the number instead of the thing the number was supposed to represent.
Measure AI usage and people find ways to use more AI. That’s not great. It doesn’t actually help any of us.
The AI Arms Race
Many companies are currently trying to work out how much AI their developers should be using. Our organization is all-in and seems to want us all to become AI-native developers.
Some are pretending AI doesn’t exist (banks, I’m looking at you). Yet we are all under more and more pressure to use AI in our jobs. I’ll give you an example.
I ran an interview this week and the candidate explained away their poor performance with “because I don’t write, or even see code anymore”. That has all sorts of ramifications, both for our interview process and for how we prepare for a new job ourselves. But that discussion is for another day.
The general pressure is to produce more and more code, and to prove that you’re using AI to do so, because “if not our competitors will get ahead”. Who cares about the consequences? Certainly leadership and management don’t.
Nobody Gets Promoted For Saving Tokens
The funny thing is that AI usage itself is a terrible metric.
Two developers.
One spends all day prompting an AI assistant and generates thousands of tokens.
The other spends twenty minutes using AI to solve a problem and then spends the rest of the day implementing a solution.
Which one provided more value?
Does anyone care?
As far as companies are concerned, it seems not. Leaderboards don’t know. Metrics can’t tell if what is being done is valuable.
Software development has always struggled to measure productivity, and that isn’t really changing. AI isn’t fixing it, and no-one seems to really care.
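To make the two-developer comparison concrete, here is a minimal sketch of what a usage leaderboard does. Everything in it is an assumption for illustration: the Developer class, the leaderboard function, the names and the numbers are all made up, and problems_solved is a hypothetical stand-in for value that no real dashboard actually has.

```python
from dataclasses import dataclass


@dataclass
class Developer:
    name: str
    tokens_used: int      # the only thing the leaderboard can see
    problems_solved: int  # hypothetical stand-in for value (not on any dashboard)


def leaderboard(devs):
    """Rank developers by token usage, highest first."""
    return sorted(devs, key=lambda d: d.tokens_used, reverse=True)


# Made-up numbers, for illustration only.
all_day_prompter = Developer("all_day_prompter", tokens_used=900_000, problems_solved=0)
twenty_minute_user = Developer("twenty_minute_user", tokens_used=15_000, problems_solved=1)

ranking = leaderboard([twenty_minute_user, all_day_prompter])
top = ranking[0]
print(top.name)
```

The ranking never reads problems_solved, so whoever burns the most tokens comes out on top regardless of what they delivered.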
The things we need to measure are easy to say and difficult to rank.
Judgment.
Experience.
Understanding business requirements.
Preventing disasters before they happen.
Writing code that somebody can still understand two years later.
None of those fit neatly into a dashboard. The measurement problems persist and just aren’t being fixed.
And if you put them into a leaderboard, software developers would simply game that leaderboard in any case.
Which is a whole other class of problem.
Sources
Financial Times https://www.ft.com/content/b1a62a7f-6df5-4c90-94ce-64ce9c9961b6
Business Insider (summary with additional comments from Amazon) https://www.businessinsider.com/amazon-ai-leaderboard-tokenmaxxing-2026-5
About The Author
Professional Software Developer “The Secret Developer” can be found on Twitter @TheSDeveloper and regularly publishes articles through Medium.com
The Secret Developer doesn’t usually reveal anything about their employer. In this case they are pleased to say they don’t work for Amazon.