King Abdulaziz University
Faculty of Computing and Information Technology

An Automated Evaluation Framework for Unit Test-Driven LLM Code Generation

Muhammad Adnan Rizqullah


Advisor: Dr. Emad Yosif Albassam

The Critical Gap in LLM Code Generation Evaluation

Test-Driven Prompting (TDP) shows promise, but existing evaluations of it have critical limitations:

  • Evaluation bias: TDP applied only to cases that first failed (selection bias)
  • Limited scope: single-language focus (Python bias)
  • Narrow model coverage: only 1-2 models, missing open-source alternatives
  • Lack of explainability: TDP works, but when and why?

Prior Work on LLM Code Generation: Benchmarks

  • HumanEval (OpenAI) - 164 Python programming problems
  • MBPP (Google) - 974 Python problems, beginner to intermediate
  • APPS (UC Berkeley) - Algorithmic problem-solving tasks

Limitations: single language (Python only), results reported on now-outdated LLMs, low per-problem test coverage
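
These benchmarks also popularized the standard evaluation metric. For background, the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021) is computed as follows; this sketch is reference material, not part of the proposed framework:

  # Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021):
  # given n generated samples of which c pass all unit tests, estimate the
  # probability that at least one of k randomly drawn samples passes.
  from math import comb

  def pass_at_k(n: int, c: int, k: int) -> float:
      if n - c < k:  # every size-k draw must contain a passing sample
          return 1.0
      return 1.0 - comb(n - c, k) / comb(n, k)

  print(pass_at_k(n=20, c=5, k=1))  # 0.25: expected single-sample pass rate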

Prior Work on LLM Code Generation: Generation Approaches

  • AlphaCode (DeepMind) - Competition-level code generation
  • CodeT (Microsoft) - Ranks candidate solutions using model-generated tests
  • RL-based approaches (Huawei) - Fine-tuning code models with reinforcement learning

Limitations: Single language focus, limited model diversity, no difficulty analysis

Prior Work on LLM Code Generation: Test-Driven Prompting

Mathews et al. (Waterloo): First to show that providing tests in the prompt improves accuracy (+7-18%)

Piya et al. (Texas): TDD workflow with LLMs

Critical Gaps:

  • Only 1-2 modern LLMs evaluated
  • Single language (Python bias)
  • No difficulty analysis across problem complexity
  • Low test coverage in experiments
  • No study of model characteristics (size, type, specialization)

From Test-Driven Development to Test-Driven Prompting

TDD in Software Engineering

Tests are written first; they guide the implementation and ensure adherence to the specification

TDP Adaptation

Test cases are embedded in the prompt as executable specifications
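
As a concrete illustration, a test-driven prompt simply places the unit tests next to the problem statement. The template below is a minimal sketch; the prompt format and example problem are illustrative, not necessarily those used in this study:

  # Minimal sketch of a test-driven prompt builder; the template and the
  # example problem are illustrative, not the exact format of this study.
  def build_tdp_prompt(problem: str, tests: list[str]) -> str:
      """Embed unit tests in the prompt as an executable specification."""
      test_block = "\n".join(tests)
      return (
          f"{problem}\n\n"
          "Your solution must pass all of the following unit tests:\n"
          f"{test_block}\n"
      )

  prompt = build_tdp_prompt(
      "Write a function is_palindrome(s) that returns True if s reads the "
      "same forwards and backwards, ignoring case.",
      [
          "assert is_palindrome('Level') == True",
          "assert is_palindrome('hello') == False",
          "assert is_palindrome('') == True",
      ],
  )
  print(prompt)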

Does specification clarity translate to performance gains?

Research Questions: Primary Focus

RQ1 Cross-Language Performance

How does test-driven code generation perform across programming languages that differ in popularity and typing discipline?

Scope: Python, JavaScript, C++, TypeScript, PHP, Ruby, Go, C#

RQ2 Model-Agnostic Effectiveness

How does test-driven code generation perform across models with differing characteristics?

Scope: closed- and open-source, varying sizes, general-purpose vs. code-specialized
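
Answering RQ1 and RQ2 at scale requires an automated harness that executes each generated solution against its unit tests in every target language. Below is a minimal sketch of such a runner; the command table and file handling are simplifying assumptions, and compiled languages (C++, Go, C#) would additionally need a build step:

  # Hypothetical language-agnostic test runner: concatenate the generated
  # solution with its unit tests, execute with the language's interpreter,
  # and treat exit code 0 as "all tests passed". Commands are illustrative.
  import os
  import subprocess
  import tempfile

  RUNNERS = {
      "python": (".py", ["python"]),
      "javascript": (".js", ["node"]),
      "ruby": (".rb", ["ruby"]),
  }

  def run_tests(language: str, solution: str, tests: str,
                timeout: int = 10) -> bool:
      suffix, cmd = RUNNERS[language]
      with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as f:
          f.write(solution + "\n" + tests)
          path = f.name
      try:
          result = subprocess.run(cmd + [path], timeout=timeout,
                                  capture_output=True)
          return result.returncode == 0
      except subprocess.TimeoutExpired:
          return False  # non-terminating solutions count as failures
      finally:
          os.unlink(path)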

Research Questions: Additional Dimensions

RQ3 Problem Difficulty

What is the relationship between programming problem difficulty and LLM performance?

RQ4 Test Suite Completeness

What is the relationship between test suite completeness and LLM performance?
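
RQ4 presupposes a way to quantify test suite completeness, which the deck does not fix. One reasonable operationalization, sketched below under that assumption, is branch coverage of a reference solution measured with coverage.py; mutation score or test-count buckets would be alternatives:

  # Hypothetical metric for "test suite completeness": branch coverage of a
  # reference solution under coverage.py. The thesis may define it differently.
  import coverage

  def suite_completeness(run_suite) -> float:
      """Run the test suite under coverage; return the percentage covered."""
      cov = coverage.Coverage(branch=True)
      cov.start()
      run_suite()          # executes the tests against the reference solution
      cov.stop()
      return cov.report()  # total coverage percentage (also prints a table)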

RQ5 Decision Framework

How can a decision framework guide developers in selecting appropriate LLMs for platform-specific development?
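
To make RQ5 concrete: one shape such a framework could take is a rule that filters candidate models by hard constraints (language support, licensing) and ranks the rest by measured pass rates. The sketch below is purely illustrative; its fields and selection rule are hypothetical, not findings:

  # Purely illustrative model-selection rule for RQ5; fields and logic are
  # hypothetical placeholders, not results of this study.
  from dataclasses import dataclass

  @dataclass
  class ModelProfile:
      name: str
      open_source: bool
      pass_at_1: dict[str, float]  # per-language pass@1 from the framework

  def recommend(models: list[ModelProfile], language: str,
                require_open_source: bool = False) -> ModelProfile | None:
      candidates = [
          m for m in models
          if language in m.pass_at_1
          and (m.open_source or not require_open_source)
      ]
      return max(candidates, key=lambda m: m.pass_at_1[language], default=None)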