Using mutation testing to improve trust in LLM-assisted development
I believe mutation testing will grow in popularity due to LLMs.
Let’s say you have code that needs refactoring, and you are tempted to let an LLM do it. Two trust problems show up:
- You don’t trust the LLM to not change behavior, so you need good test coverage before the refactor
- You could generate tests, but you don’t trust the LLM to write proper tests
Test coverage is (arguably) more important when the refactor is made by an LLM, since humans are (arguably) less likely to make silly, arbitrary changes to business logic.
Mutation testing
Mutation testing … is used to design new software tests and evaluate the quality of existing software tests.
Mutation testing involves making small changes to the program being tested.
In other words, it tests the tests: “If a change introduces a subtle but catastrophic bug, will the tests catch it?”
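To make the idea concrete, here is a minimal hand-rolled sketch of what a mutation testing tool automates. All names (`is_expired`, `is_expired_mutant`, `weak_suite`, `boundary_suite`) are hypothetical and exist only for illustration; a real tool generates the mutants and runs your actual test suite against each one.

```rust
/// Original logic: an item is expired once `sell_in` drops below zero.
fn is_expired(sell_in: i32) -> bool {
    sell_in < 0
}

/// Hand-written mutant: `<` replaced with `<=`, the kind of small change a
/// mutation testing tool would generate automatically.
fn is_expired_mutant(sell_in: i32) -> bool {
    sell_in <= 0
}

/// A weak "test suite": checks only values far from the boundary.
/// Returns true when every check passes for the given implementation.
fn weak_suite(f: fn(i32) -> bool) -> bool {
    !f(10) && f(-5)
}

/// A stronger suite that also pins down the boundary value 0.
fn boundary_suite(f: fn(i32) -> bool) -> bool {
    weak_suite(f) && !f(0)
}

fn main() {
    // Both suites pass on the original code.
    assert!(weak_suite(is_expired));
    assert!(boundary_suite(is_expired));

    // The weak suite still passes on the mutant: the mutant is MISSED.
    assert!(weak_suite(is_expired_mutant));

    // The boundary suite fails on the mutant: the mutant is CAUGHT.
    assert!(!boundary_suite(is_expired_mutant));

    println!("weak suite misses the mutant, boundary suite catches it");
}
```

A suite that keeps passing after the code has been deliberately broken is not protecting that piece of behavior, which is exactly the signal we want before a refactor.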
So the full refactor flow would be:
- Write tests
- Run mutation testing, improve tests
- Refactor business logic
The semi-automated way
We will use cargo-mutants, which is installed with cargo install --locked cargo-mutants.
As a refactoring exercise, let’s walk through the classic Gilded Rose kata. The premise is that we have correct code, but want to safely refactor it.
$ git clone git@github.com:emilybache/GildedRose-Refactoring-Kata.git
$ cd GildedRose-Refactoring-Kata/rust/
Low-effort prompting (Cursor CLI + Claude 4.5 Sonnet in this example):
$ cursor-agent -p "write tests"
This produces reasonable tests, for example:
diff --git a/rust/src/gildedrose.rs b/rust/src/gildedrose.rs
index c2ecea0..4ebc8ce 100644
--- a/rust/src/gildedrose.rs
+++ b/rust/src/gildedrose.rs
@@ -86,4 +86,183 @@ impl GildedRose {
 #[cfg(test)]
 mod tests {
+    use super::{GildedRose, Item};
+
+    #[test]
+    fn test_normal_item_before_sell_date() {
+        let items = vec![Item::new("+5 Dexterity Vest", 10, 20)];
+        let mut rose = GildedRose::new(items);
+        rose.update_quality();
+
+        assert_eq!(rose.items[0].sell_in, 9);
+        assert_eq!(rose.items[0].quality, 19);
+    }
Seems fine, and it would probably pass a light human review.
But mutation testing reveals untested code paths:
$ cargo mutants
Found 55 mutants to test
...
MISSED src/main.rs:6:5: replace main with ()
MISSED src/gildedrose.rs:20:9: replace <impl Display for Item> with Ok(Default::default())
MISSED src/gildedrose.rs:47:50: replace < with <= in update_quality
MISSED src/gildedrose.rs:53:50: replace < with <= in update_quality
MISSED src/gildedrose.rs:66:38: replace < with <= in update_quality
MISSED src/gildedrose.rs:69:50: replace > with >= in update_quality
MISSED src/gildedrose.rs:78:46: replace < with <= in update_quality
55 mutants tested in 8s: 7 missed, 48 caught
Some missed mutants might not be worth testing:
- Replacing main with a no-op
- Messing with the display formatting
But the integer comparison changes would break the business logic, which is what we are trying to avoid.
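To see why these comparison mutants matter, here is a simplified sketch of the backstage-pass rule. It is a hypothetical model written for this post, not the kata’s actual code: it ignores the quality cap and the drop to zero after the concert, and the function names are my own.

```rust
/// Simplified model of the backstage-pass rule: how much quality grows in one
/// update. Ignores the quality cap and the drop to zero after the concert.
fn backstage_increase(sell_in: i32) -> i32 {
    let mut increase = 1;
    if sell_in < 11 {
        increase += 1;
    }
    if sell_in < 6 {
        increase += 1;
    }
    increase
}

/// The same rule with both `<` comparisons mutated to `<=`.
fn backstage_increase_mutant(sell_in: i32) -> i32 {
    let mut increase = 1;
    if sell_in <= 11 {
        increase += 1;
    }
    if sell_in <= 6 {
        increase += 1;
    }
    increase
}

fn main() {
    // Away from the thresholds, the original and the mutant agree.
    for sell_in in [15, 8, 3] {
        assert_eq!(backstage_increase(sell_in), backstage_increase_mutant(sell_in));
    }
    // Only the boundary inputs tell them apart.
    assert_eq!(backstage_increase(11), 1);
    assert_eq!(backstage_increase_mutant(11), 2);
    assert_eq!(backstage_increase(6), 2);
    assert_eq!(backstage_increase_mutant(6), 3);
}
```

Tests that only use "typical" values such as 15, 8 or 3 days pass against both versions, so a refactor that shifts a threshold by one would go unnoticed.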
By feeding the mutation output back into the agent, we get better coverage:
I can see you have 7 missed mutants in your Rust code. Let me examine the code to understand what’s happening and help improve the test coverage. Now I can see the issue. The mutants are revealing gaps in test coverage for boundary conditions. Let me add tests to catch these missed mutants:
- Line 6 (main function): Not tested at all
- Line 20 (Display trait): Not tested
- Lines 47, 53, 66, 78: Boundary conditions with < vs <=
- Line 69: Boundary condition with > vs >=
diff --git a/rust/src/gildedrose.rs b/rust/src/gildedrose.rs
index 4ebc8ce..5b670f9 100644
--- a/rust/src/gildedrose.rs
+++ b/rust/src/gildedrose.rs
@@ -265,4 +265,119 @@ mod tests {
// Current implementation treats it as a normal item
make
}
+
+    #[test]
+    fn test_item_display_format() {
+        let item = Item::new("Test Item", 10, 20);
+        let formatted = format!("{}", item);
+        assert_eq!(formatted, "Test Item, 10, 20");
+    }
+
+    #[test]
+    fn test_backstage_pass_at_11_days() {
+        // Test boundary: sell_in = 11 should increase by 1, not 2
+        let items = vec![Item::new("Backstage passes to a TAFKAL80ETC concert", 11, 20)];
+        let mut rose = GildedRose::new(items);
+        rose.update_quality();
+
+        assert_eq!(rose.items[0].sell_in, 10);
+        assert_eq!(rose.items[0].quality, 21); // Should be 21, not 22
+    }
+
+    #[test]
+    fn test_backstage_pass_at_6_days() {
+        // Test boundary: sell_in = 6 should increase by 2, not 3
+        let items = vec![Item::new("Backstage passes to a TAFKAL80ETC concert", 6, 20)];
+        let mut rose = GildedRose::new(items);
+        rose.update_quality();
+
+        assert_eq!(rose.items[0].sell_in, 5);
+        assert_eq!(rose.items[0].quality, 22); // Should be 22, not 23
+    }
Rerunning cargo mutants now leaves only one missed mutant: the no-op main().
Value so far:
- a low-effort way to determine test trustworthiness
- a low-effort way to tell the LLM how the tests are insufficient
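If you would rather hand the agent a filtered list than the raw console output, the MISSED lines are regular enough to parse. The sketch below assumes the line shape seen in the run above (`MISSED <file>:<line>:<column>: <description>`); that shape is taken from this one run rather than from a documented format, and `MissedMutant` and `parse_missed` are hypothetical names.

```rust
/// One missed mutant, as parsed from a line of `cargo mutants` console output.
#[derive(Debug, PartialEq)]
struct MissedMutant {
    file: String,
    line: u32,
    column: u32,
    description: String,
}

/// Parses a line shaped like
/// `MISSED src/gildedrose.rs:47:50: replace < with <= in update_quality`.
/// Returns None for any line that does not match that assumed shape.
fn parse_missed(line: &str) -> Option<MissedMutant> {
    let rest = line.trim().strip_prefix("MISSED")?.trim_start();
    // Split into file, line, column and the free-text description.
    // Limiting to 4 parts keeps any later `:` inside the description.
    let mut parts = rest.splitn(4, ':');
    let file = parts.next()?.to_string();
    let line_no = parts.next()?.parse().ok()?;
    let column = parts.next()?.parse().ok()?;
    let description = parts.next()?.trim().to_string();
    Some(MissedMutant { file, line: line_no, column, description })
}

fn main() {
    let output = "\
Found 55 mutants to test
MISSED src/main.rs:6:5: replace main with ()
MISSED src/gildedrose.rs:47:50: replace < with <= in update_quality
55 mutants tested in 8s: 7 missed, 48 caught";

    // Keep only the lines that parse as missed mutants.
    let missed: Vec<MissedMutant> = output.lines().filter_map(parse_missed).collect();
    assert_eq!(missed.len(), 2);
    for m in &missed {
        println!("{}:{}:{} -> {}", m.file, m.line, m.column, m.description);
    }
}
```

From here it is a small step to drop the mutants you do not care about (for example everything in src/main.rs) before passing the rest to the agent. Paths that themselves contain a colon would need more careful handling.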
Fully automated?
Let’s reset the state and try again with a minimal AGENTS.md:
Tests are run with `cargo test`
Mutation tests are run with `cargo mutants`. Mutation tests must pass.
For each MISSED mutation, either add a test case or annotate the function with `#[mutants::skip]` if testing is not appropriate.
Rerunning the same prompt (cursor-agent -p "write tests") now makes the agent run the mutation tests and iterate until all mutants are caught:
$ cargo mutants
Found 54 mutants to test
ok Unmutated baseline in 0s build + 0s test
INFO Auto-set test timeout to 20s
54 mutants tested in 9s: 54 caught
It also added a skip annotation, as instructed:
diff --git a/rust/src/main.rs b/rust/src/main.rs
index 3a552a8..88b1371 100644
--- a/rust/src/main.rs
+++ b/rust/src/main.rs
@@ -2,6 +2,7 @@ mod gildedrose;
 use gildedrose::{GildedRose, Item};
+#[mutants::skip]
 fn main() {
     let items = vec![
         Item::new("+5 Dexterity Vest", 10, 20),
Takeaways
- LLM-assisted development improves mutation testing:
  - Parsing and acting on the mutation testing output is now cheap
- Mutation testing improves LLM-assisted development:
  - It provides an automated feedback loop for LLM test generation, and increases trust in the output
Let’s see if it provides value in practice. I will start using this loop as part of my normal development flow, and hopefully find out.