TLDR: Code reviewer recommendation tools are boring, so why should you read this post? You don’t have to! Just know these tools are imperfect, like us humans. If you’re curious but don’t feel like reading, you can listen to the podcasts and explore the mind maps from our papers linked below.
- https://farshad-kazemi.me/characterizing-the-prevalence-distribution-and-duration-of-stale-reviewer-recommendations/
- https://farshad-kazemi.me/exploring-the-notion-of-risk-in-code-reviewer-recommendation/
I’ve always been a fan of practical things, whether it’s warm socks for a freezing Canadian winter or a system (some might call it an LLM) that makes our dumb digital assistants smarter. In my PhD, I followed the same principle, though with a critical eye: I looked into code reviewer recommendation tools and why they aren’t as widely adopted in practice as the research suggests they should be.
I started by noticing that defect-proneness is often ignored when recommending reviewers for a change, which clearly limits the usefulness of the suggestions. We proposed a framework that adapts reviewer recommendations to the project’s risk tolerance and the change’s defect-proneness, and presented it at ICSME 2022. You can read the full paper and try it yourself at: https://rebels.cs.uwaterloo.ca/papers/icsme2022_kazemi.pdf
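To give a rough idea of what "adapting to risk" can mean, here is a toy sketch of risk-aware reviewer ranking. It is my own illustration, not the actual framework from the paper: the field names, weights, and scoring formula are all assumptions.

```python
# Hypothetical sketch of risk-aware reviewer ranking; the names, weights, and
# scoring formula are illustrative, not the ICSME 2022 framework itself.

def rank_reviewers(candidates, change_defect_proneness, project_risk_tolerance):
    """Rank candidate reviewers for a change.

    candidates: list of dicts with 'name', 'expertise' (0..1, familiarity with
      the changed files), and 'defect_find_rate' (0..1, how often the reviewer
      has caught defects in past reviews).
    change_defect_proneness: 0..1, how likely the change is to introduce a defect.
    project_risk_tolerance: 0..1, higher means the project tolerates more risk.
    """
    # For a risky change in a risk-averse project, weigh the reviewer's track
    # record of catching defects more heavily than plain familiarity.
    risk_weight = change_defect_proneness * (1.0 - project_risk_tolerance)

    def score(candidate):
        return ((1.0 - risk_weight) * candidate["expertise"]
                + risk_weight * candidate["defect_find_rate"])

    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    candidates = [
        {"name": "alice", "expertise": 0.9, "defect_find_rate": 0.3},
        {"name": "bob", "expertise": 0.6, "defect_find_rate": 0.8},
    ]
    # A risky change in a risk-averse project: bob's defect-finding record wins.
    for c in rank_reviewers(candidates, change_defect_proneness=0.9,
                            project_risk_tolerance=0.1):
        print(c["name"])
```

The point of the sketch is simply that the same candidate pool can be ranked differently depending on how much risk the project is willing to absorb for a given change.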
Next, we noticed that even though some code reviewer recommendation tools consider the history of developers’ contributions and account for multiple aspects of their behaviour, they still produce stale recommendations, i.e., they recommend developers who are no longer actively contributing to the project. Digging deeper, we found that inactive developers keep being suggested because they were the sole authors of certain files or made noticeable contributions during their tenure. We published our findings in the TSE journal and presented them at ICSE 2025. A detailed report of our study is available at https://rebels.cs.uwaterloo.ca/papers/tse2024_kazemi.pdf.
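For intuition only, here is a minimal sketch of how a recommender could avoid stale suggestions by checking when each candidate last contributed. The 180-day activity window and the data shapes are assumptions of mine, not the approach studied in the TSE paper.

```python
# Toy illustration of filtering stale reviewer recommendations; the activity
# window and data shapes are assumptions, not the approach from the TSE paper.
from datetime import datetime, timedelta

def drop_stale(recommendations, last_activity, now=None, window_days=180):
    """Keep only reviewers who contributed within the activity window.

    recommendations: list of reviewer names, ranked best-first.
    last_activity: dict mapping reviewer name -> datetime of their last
      commit or review in the project.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=window_days)
    return [r for r in recommendations
            if r in last_activity and last_activity[r] >= cutoff]


if __name__ == "__main__":
    ranked = ["carol", "dave", "erin"]
    activity = {
        "carol": datetime(2020, 1, 15),                   # sole author of a file, long gone
        "dave": datetime.now() - timedelta(days=10),
        "erin": datetime.now() - timedelta(days=400),
    }
    print(drop_stale(ranked, activity))  # -> ['dave']
```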
Finally, we studied code review generation tools, both BERT-based and LLM-based. We evaluated them quantitatively and qualitatively, paying particular attention to how they ask questions, a key part of reviewing. This work is still under review, and I’ll update this post once it’s published.