King Abdulaziz University
Faculty of Computing and Information Technology

An Automated Evaluation Framework for Unit Test-Driven LLM Code Generation

Muhammad Adnan Rizqullah


Advisor: Dr. Emad Yosif Albassam

The Critical Gap in LLM Code Generation Evaluation

Test-Driven Prompting (TDP) shows promise, but existing evaluations of it have critical limitations:

  • Evaluation bias: TDP applied only to cases that first failed (selection bias)
  • Limited scope: single-language focus (Python bias)
  • Narrow model coverage: only 1-2 models, missing open-source alternatives
  • Lack of explainability: TDP works, but when and why?

Prior Work on LLM Code Generation: Benchmarks

  • HumanEval (OpenAI) - 164 Python programming problems
  • MBPP (Google) - 974 Python problems, beginner to intermediate
  • APPS (UC Berkeley) - Algorithmic problem-solving tasks

Limitations: single language (Python only), results reported on now-outdated LLMs, low per-problem test coverage
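
These benchmarks also popularized the standard evaluation metric. For background, the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021) is computed as follows; this sketch is reference material, not part of the proposed framework:

  # Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021):
  # given n generated samples of which c pass all unit tests, estimate the
  # probability that at least one of k randomly drawn samples passes.
  from math import comb

  def pass_at_k(n: int, c: int, k: int) -> float:
      if n - c < k:  # every size-k draw must contain a passing sample
          return 1.0
      return 1.0 - comb(n - c, k) / comb(n, k)

  print(pass_at_k(n=20, c=5, k=1))  # 0.25: expected single-sample pass rate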

Prior Work on LLM Code Generation: Generation Approaches

  • AlphaCode (DeepMind) - Competition-level code generation
  • CodeT (Microsoft) - Ranks candidate solutions using model-generated tests
  • RL-based approaches (Huawei) - Fine-tuning code models with reinforcement learning

Limitations: Single language focus, limited model diversity, no difficulty analysis

Prior Work on LLM Code Generation: Test-Driven Prompting

Mathews et al. (Waterloo): First to show that providing tests in the prompt improves accuracy (+7-18%)

Piya et al. (Texas): TDD workflow with LLMs

Critical Gaps:

  • Only 1-2 modern LLMs evaluated
  • Single language (Python bias)
  • No difficulty analysis across problem complexity
  • Low test coverage in experiments
  • No study of model characteristics (size, type, specialization)

From Test-Driven Development to Test-Driven Prompting

TDD in Software Engineering

Tests are written first; they guide the implementation and ensure adherence to the specification

TDP Adaptation

Test cases are embedded in the prompt as executable specifications
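
As a concrete illustration, a test-driven prompt simply places the unit tests next to the problem statement. The template below is a minimal sketch; the prompt format and example problem are illustrative, not necessarily those used in this study:

  # Minimal sketch of a test-driven prompt builder; the template and the
  # example problem are illustrative, not the exact format of this study.
  def build_tdp_prompt(problem: str, tests: list[str]) -> str:
      """Embed unit tests in the prompt as an executable specification."""
      test_block = "\n".join(tests)
      return (
          f"{problem}\n\n"
          "Your solution must pass all of the following unit tests:\n"
          f"{test_block}\n"
      )

  prompt = build_tdp_prompt(
      "Write a function is_palindrome(s) that returns True if s reads the "
      "same forwards and backwards, ignoring case.",
      [
          "assert is_palindrome('Level') == True",
          "assert is_palindrome('hello') == False",
          "assert is_palindrome('') == True",
      ],
  )
  print(prompt)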

Does specification clarity translate to performance gains?

Research Questions: Primary Focus

RQ1 Cross-Language Performance

How does test-driven code generation perform across programming languages that differ in popularity and typing discipline?

Scope: Python, JavaScript, C++, TypeScript, PHP, Ruby, Go, C#

RQ2 Model-Agnostic Effectiveness

How does test-driven code generation perform across models with differing characteristics?

Scope: closed- and open-source, varying sizes, general-purpose vs. code-specialized
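
Answering RQ1 and RQ2 at scale requires an automated harness that executes each generated solution against its unit tests in every target language. Below is a minimal sketch of such a runner; the command table and file handling are simplifying assumptions, and compiled languages (C++, Go, C#) would additionally need a build step:

  # Hypothetical language-agnostic test runner: concatenate the generated
  # solution with its unit tests, execute with the language's interpreter,
  # and treat exit code 0 as "all tests passed". Commands are illustrative.
  import os
  import subprocess
  import tempfile

  RUNNERS = {
      "python": (".py", ["python"]),
      "javascript": (".js", ["node"]),
      "ruby": (".rb", ["ruby"]),
  }

  def run_tests(language: str, solution: str, tests: str,
                timeout: int = 10) -> bool:
      suffix, cmd = RUNNERS[language]
      with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as f:
          f.write(solution + "\n" + tests)
          path = f.name
      try:
          result = subprocess.run(cmd + [path], timeout=timeout,
                                  capture_output=True)
          return result.returncode == 0
      except subprocess.TimeoutExpired:
          return False  # non-terminating solutions count as failures
      finally:
          os.unlink(path)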

Research Questions: Additional Dimensions

RQ3 Problem Difficulty

What is the relationship between programming problem difficulty and LLM performance?

RQ4 Test Suite Completeness

What is the relationship between test suite completeness and LLM performance?
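
RQ4 presupposes a way to quantify test suite completeness, which the deck does not fix. One reasonable operationalization, sketched below under that assumption, is branch coverage of a reference solution measured with coverage.py; mutation score or test-count buckets would be alternatives:

  # Hypothetical metric for "test suite completeness": branch coverage of a
  # reference solution under coverage.py. The thesis may define it differently.
  import coverage

  def suite_completeness(run_suite) -> float:
      """Run the test suite under coverage; return the percentage covered."""
      cov = coverage.Coverage(branch=True)
      cov.start()
      run_suite()          # executes the tests against the reference solution
      cov.stop()
      return cov.report()  # total coverage percentage (also prints a table)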

RQ5 Decision Framework

How can a decision framework guide developers in selecting appropriate LLMs for platform-specific development?
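
To make RQ5 concrete: one shape such a framework could take is a rule that filters candidate models by hard constraints (language support, licensing) and ranks the rest by measured pass rates. The sketch below is purely illustrative; its fields and selection rule are hypothetical, not findings:

  # Purely illustrative model-selection rule for RQ5; fields and logic are
  # hypothetical placeholders, not results of this study.
  from dataclasses import dataclass

  @dataclass
  class ModelProfile:
      name: str
      open_source: bool
      pass_at_1: dict[str, float]  # per-language pass@1 from the framework

  def recommend(models: list[ModelProfile], language: str,
                require_open_source: bool = False) -> ModelProfile | None:
      candidates = [
          m for m in models
          if language in m.pass_at_1
          and (m.open_source or not require_open_source)
      ]
      return max(candidates, key=lambda m: m.pass_at_1[language], default=None)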