Code Reviews in the Age of AI
Code reviews are an essential part of good software development practice.
Typically, after a developer has implemented some code change for a new feature or a bug fix, he or she will ask a colleague to look at it to confirm it is done correctly. The second pair of eyes can spot problems the author was blind to, misunderstood, or didn't know about. This is analogous to the work editors do for written text.
Common things that code reviews help with are:
- align on the best approach for a specific task - how something should work
- ensure that the organization's preferred patterns and best practices are followed - how it will fit with the rest of the system
- knowledge transfer between coworkers - increasing the bus factor
- check code correctness by reasoning about the process flow - done in addition to automated tests and manual QA
- learning opportunity for junior developers
- code style checks - beyond the standard linting rules
While not every organization or team does code reviews, they are considered a standard best practice. In some teams, or for high-risk changes, the workflow may even require two reviewers for each code change.
Code reviews are hard
Doing good code reviews is hard, and requires effort from both the person who did the original work (the PR submitter) and the reviewer.
The reviewer must deduce the new behaviour just by looking at the changes. He or she needs not only to spot problems in the new or modified code, but also to recognize how it will interact with existing code that isn't visible in the PR. The reviewer also has to balance ensuring code quality against nitpicking or second-guessing the author just because he or she would have done it slightly differently.
The author needs to take the comments in stride and not get defensive or take the criticism personally. The point is not to tear them down, but to work with the reviewer to ensure high code quality. At the same time, the code author needs to be able to explain why they did something, stand by their decision if they believe they're right, yet be open to having their mind changed if the argument is persuasive.
The smaller the change, the easier the code review will be. If the PR consists of several related changes, reviewing them one by one is typically easier. If there are unrelated changes, many things being changed at once, or there are hundreds or thousands of changed lines, the review becomes increasingly hard.
Sloppy AI coding kills code reviews
Using AI for coding turns this upside down. Today it's easier than ever to churn out thousands of lines of code changes without a good grasp of what's actually being done. Whereas previously "ready for review" meant the author was happy with the result, now some people just take whatever Claude or Codex or Cursor produces and send the pull request.
This places the entire responsibility on the reviewer, and saddles him or her with thousands of lines of code that, as noted previously, are very hard to review properly. What's worse, the review comments get copy-pasted back into the AI agent to fix the code, because the author simply doesn't understand either their change or what the comments mean.
This is unacceptable. A software engineer worthy of the name should always stand behind their output and guarantee that it's the best they were able to make it. It doesn't matter if they wrote it manually, had help from a coworker, found the solution on Stack Overflow, or asked AI to build it. They are still the person in charge of that particular change and must completely and confidently understand what's going on.
To reiterate: as a software engineer, you are responsible for the code you commit.
Responsible AI coding still kills code review
Even if the code author is practicing responsible AI-assisted software development and has an excellent understanding of all the changes, using AI still spells trouble for traditional code reviews. The reason is simple: volume.
A typical AI coding session involves a planning step, where a developer discusses the codebase, the task at hand, and the requirements and constraints, and develops a detailed code plan or specification (for that particular task, not the entire project - scope must be limited for this to work well). The AI agent can then autonomously implement the changes outlined in the plan, including writing tests or doing "manual" QA.
The developer tests and inspects the code, asks for adjustments if needed, and the AI agent and developer iterate on the task. When satisfied, the developer sends a pull request (PR) for a review.
The implementation phase is now massively compressed. A task that might have taken two weeks to code can now often be done in one day. But if the review is still fully manual, it will still take the same amount of time. Worse, since a lot more new code can be created in the same amount of time - even if it's good quality - the burden on the reviewer grows proportionally. This is a known process flow problem: increasing the capacity in one part of the system (coding) creates a bottleneck in another (code review).
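To put rough numbers on it, here is a back-of-the-envelope sketch (the day counts are purely illustrative assumptions, not measurements):

```python
# Illustrative arithmetic only: the day counts are assumptions, not data.
def review_share(coding_days: float, review_days: float) -> float:
    """Fraction of the total cycle time spent in code review."""
    return review_days / (coding_days + review_days)

before = review_share(coding_days=10, review_days=2)  # manual coding: review is ~17% of the cycle
after = review_share(coding_days=1, review_days=2)    # AI-assisted coding: review is ~67% of the cycle

print(f"review share before: {before:.0%}, after: {after:.0%}")
```

The exact numbers don't matter; the shape does. Speed up coding without touching review, and review becomes the constraint.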
"Looks Good To Me, I Guess?"
This leads to reviewer fatigue. People stop being as careful when reviewing code - whether their own (which their AI agent wrote) or others' (in pull request reviews). They reach for hacks like handing the review off to yet another AI tool, or stop reviewing anything but the most critical parts of the code.
This is an unfortunate development that inevitably leads to worse code quality and cognitive debt.
For all their advancements, AI coding agents can still write subpar code. For popular languages, frameworks, and typical code patterns it happens occasionally; for more complex code involving obscure libraries or languages, more often. But it does happen. If it's allowed to go completely unchecked, the code quality will slide, slowly but steadily.
A more insidious problem is cognitive debt: the team's shared knowledge of the code not keeping up with the changes. After a while, this leads to the team no longer understanding what's going on in the project. This is as paralyzing as technical debt: developers become afraid to make any changes for fear of breaking a black box nobody understands anymore.
To put it succinctly: Velocity without understanding is not sustainable.
Rethinking the code review
If we are to fully take advantage of the productivity boost that the AI coding tools are promising, we must find a better way - we must go back to the drawing board.
From a high-level, a code review accomplishes three things:
- doing the right thing - tying in correctly with the rest of the system, using the proper data flows, structures, patterns, frameworks or packages, and so on
- doing the thing right - bug-free, following code style and best practices
- understanding the thing - improving a shared understanding of the codebase
How can we redesign our code review to still achieve these goals? I've seen different teams and individual developers tackle subsets of these challenges and I think some patterns are starting to emerge:
- review the plan - to ensure we're doing the right thing
- automate the coding part
- lean on the AI reviewers - to help with the details
Review the plan
Taking advantage of the planning vs implementation split in the new AI-powered workflows, some teams are now reviewing the plans. The process looks something like this:
- A developer assigned to the task takes in all the input and iterates on the development plan with the AI. For complex cases or tasks with some unknowns, this may involve "vibe-coding" a spike or two (an experimental, throw-away implementation to verify the feasibility of a solution).
- The detailed implementation plan is stored in a repository and sent for an initial review by another person. It may be accompanied by (or contain) an Architecture Decision Record (ADR) documenting the decision process, the rationale for the decision, and the alternatives that were discussed but rejected (and why).
- The reviewer goes through the plan and points out any problems or omissions that the author (and their AI assistant) hadn't noticed. Note there's no code involved at this stage!
- The feedback is incorporated into the plan. If it requires a massive rework, the iteration continues from the first step, now including the reviewer's feedback.
The focus of this review is to ensure the author is doing the right thing and that the others in the team understand the proposed changes. At the end of the process, the team has a good shared understanding of what is being done and how it will be implemented.
Automate the implementation
Since the plan is detailed and eliminates ambiguity, unknowns and guesswork, no serious problems are expected at this stage, and the implementation can be handed over to an AI agent. The agent may act autonomously or be guided by the developer. In both cases, the ultimate responsibility for the implementation falls on the developer, and he or she is expected to understand all the details.
Should the author also review all the code changes before submitting them? There's no consensus here, and things are likely to change in the coming months. Personally, I would argue the author should do a full code review, but there are some arguments that for simple or non-critical changes this constraint can be relaxed.
The complete implementation should consist not only of the code implementing the changes, but also the automated tests verifying the changes are correct and the relevant documentation. Both are useful for the AI implementation process itself and help with the long-term maintainability of the code.
Automate the code review
The focus of the second review - the actual code review - is to ensure the thing is done right.
Automated code checkers - linters, unit tests, and AI code reviewers - help verify that the changes look good, work correctly, and are built according to the established plan.
AI code reviewers are (still) bad at looking at the big picture or at domain-specific edge cases and gotchas. But these are exactly the things that should have been caught in the first, manual review of the plan! At this stage, the resulting code should closely follow the plan. AI-based reviewers are good at spotting typical code smells or deviations from the plan, and they can be very effective tools here.
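As a simplified illustration, such a gate can be a single script that runs in CI before any human looks at the PR. This sketch assumes a Python project linted with ruff and tested with pytest; the AI review step is a hypothetical placeholder, not a specific product:

```python
# A minimal pre-merge gate, run in CI before the human review.
# Assumptions: a Python project linted with ruff and tested with pytest;
# the commented-out AI review step is a hypothetical placeholder.
import subprocess
import sys

CHECKS = [
    ["ruff", "check", "."],   # linting: style issues and common code smells
    ["pytest", "-q"],         # unit tests: does the change work as planned?
    # ["my-ai-reviewer", "--diff", "origin/main...HEAD"],  # hypothetical AI review hook
]

def main() -> int:
    for cmd in CHECKS:
        print("running:", " ".join(cmd))
        if subprocess.run(cmd).returncode != 0:
            print("check failed:", " ".join(cmd))
            return 1
    print("all automated checks passed; over to the human reviewer for the details")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

The point is not this particular script, but the ordering: everything mechanical runs before a person spends any attention on the PR.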
If the author is less experienced or less trustworthy, the same reviewer who did the original plan review might go over the code as well. Even if this sounds like duplicate work, it's not: the reviewer is helped by the fact that they already know the overall plan and how everything fits together (from the first review), so they can focus on the details here.
Aside from ensuring quality, the human feedback here is important for knowledge transfer and upskilling the less experienced coworkers.
Won't this still overwhelm the reviewer? This is a tricky issue. If the point of the second manual review is to help a junior or less skilled colleague, that colleague should avoid churning out large amounts of code in the first place. Instead, they should spend more time learning and improving their understanding of the system.
This will mean they're slower - and management should recognize, support and encourage that! In the long term, the perceived loss of short-term productivity from such an employee will be more than offset by their faster learning curve. As a happy side effect, they won't overwhelm the senior colleagues reviewing their output.
Tell me a story
A well-prepared pull request should tell a story of what's being done, how, and why.
This is straightforward when reviewing the plan (the first code review) - you can argue the plan is the story. But the point can get lost in the weeds in the (second) code review phase. The excellent talk I linked above shows how to approach preparing a PR so that it gets easier. Trading some time here to speed up the review process is a worthwhile effort, and your colleagues will thank you.
This is also an area where AI can shine. If you have a messy commit history, AI can easily rework or reorganize it so it's more readable "chapter by chapter" (commit by commit). I have also experimented with AI producing a "review walkthrough" document, describing the big things to check first and how everything fits together.
It's not a replacement for human judgement, but - along with the initial plan review - it does help with understanding big changes.
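For the curious, here is a minimal sketch of how the raw material for such a walkthrough could be gathered and turned into a prompt. The base branch and the prompt wording are my own assumptions, not any particular tool's API:

```python
# A rough sketch: collect the shape of a branch's changes so an AI assistant
# can be asked for a "review walkthrough". The base branch and the prompt
# wording are assumptions; adapt both to your own workflow and tooling.
import subprocess

def change_summary(base: str = "origin/main") -> str:
    """Per-file diff statistics for the current branch versus the base branch."""
    result = subprocess.run(
        ["git", "diff", "--stat", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    prompt = (
        "Write a review walkthrough for this pull request: which changes to "
        "read first, how the pieces fit together, and what to double-check.\n\n"
        + change_summary()
    )
    print(prompt)  # paste into (or pipe to) the AI assistant of your choice
```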
Shuffling pieces around?
This may look like we're just shuffling pieces around, but it's not.
The key insight is that we can separate the goals of a code review, automate what can be automated, and put more human effort into what needs to be done by a living, breathing expert: decision making, judgement, good taste and accountability.