I Benchmarked Naive vs. Optimized Prompts Across 10 Job Roles. Here's What Surprised Me.

Most advice on prompt engineering is anecdotal. “Add a role.” “Use few-shot examples.” “Be specific.” All reasonable. None of it measured.

I wanted numbers. So I ran a benchmark: 10 job roles, two prompt variants each — naive (vague, unstructured) and optimized (role-specified, structured output, explicit constraints) — evaluated using PromptFoo with Claude Haiku as the model and explicit LLM-graded assertions as the scoring mechanism. Ten test cases per role, three assertions each.

Here’s what the data actually said.


The Setup

Each role had a PromptFoo YAML config defining two prompt variants and ten test cases. The test cases follow a single variable schema per role — the same slots, different values. For Sales, that schema was: product, prospect_role, company_size, company_type, hook, pain_point, differentiator, cta. Test 1 injects an AI contract review tool targeting a Head of Legal at a 300-person fintech. Test 5 injects a cloud security tool targeting a CISO at an 800-person financial services firm. Same prompt structure. Diverse inputs.
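
To make that concrete, here’s a minimal sketch of what the Sales config looks like in PromptFoo YAML. The file layout, model ID, and the slot values beyond those named above are illustrative, not copied from the actual repo configs:

```yaml
# sales sketch: two prompt variants, one shared variable schema, many test cases
prompts:
  - file://prompts/sales_naive.txt        # hypothetical file names
  - file://prompts/sales_optimized.txt

providers:
  - anthropic:messages:claude-3-haiku-20240307   # model ID assumed

tests:
  - vars:   # Test 1: AI contract review tool -> Head of Legal, 300-person fintech
      product: AI contract review tool
      prospect_role: Head of Legal
      company_size: "300"
      company_type: fintech
      hook: "..."              # remaining slot values are illustrative placeholders
      pain_point: "..."
      differentiator: "..."
      cta: "..."
  - vars:   # Test 5: cloud security tool -> CISO, 800-person financial services firm
      product: cloud security tool
      prospect_role: CISO
      company_size: "800"
      company_type: financial services firm
      # ...same remaining slots, different values
```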

This matters. It’s the same principle as pytest’s @pytest.mark.parametrize or table-driven tests in Go: the prompt is the function, the variables are the parameters, the ten test cases are the test suite. A prompt that passes on one hand-crafted example tells you almost nothing. One that passes 8/10 across diverse, realistic inputs tells you something real.

Each test case had three assertions:

  • Structural check (contains-any): Does the output contain what I asked for? For Sales: does it include a subject line, an opening hook, a CTA? Non-negotiable.
  • Rubric check (llm-rubric): Does the output follow the principle I care about? For Sales: “Does the opener reference the prospect’s specific pain point, not just the product?” — grounded in SPIN Selling methodology. For Marketing: “Does the copy lead with customer desire, not a feature list?” — Ogilvy. For HR: “Does the job description avoid gendered language that reduces applicant diversity?” — Gaucher et al. (2011).
  • Negative constraint: What should the output never do? Marketing prompts explicitly prohibited the words “innovative” and “powerful.” These negative constraints matter as much as the positive ones.

One evaluator per dimension. Binary pass/fail. This is the right default — a “God Evaluator” that tries to assess five criteria in one rubric is impossible to calibrate and impossible to debug.
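
In PromptFoo terms, each test case’s assert block looks roughly like this. The rubric text is paraphrased from the Sales example above, and the not-contains line borrows the Marketing banned-word constraint as an illustration:

```yaml
assert:
  - type: contains-any       # structural check: a subject line must be present
    value: ["Subject:", "Subject line"]
  - type: llm-rubric         # principle check, graded by the LLM judge
    value: >-
      The opener references the prospect's specific pain point,
      not just the product. Fail if it leads with features.
  - type: not-contains       # negative constraint (Marketing's banned-word example)
    value: "innovative"
```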

Grading note: A scenario is marked PASS only if all assertions pass — AND logic, not average. A test case that passes structural and rubric checks but violates a negative constraint is a FAIL. This is conservative by design. Partial correctness in an LLM output often means the error is still there. The pass rates below reflect this strict bar.


The Results

Role              | Naive | Optimized | Naive Pass | Opt Pass | Cost Δ
------------------|-------|-----------|------------|----------|-------
Sales             | 3.02  | 7.86      | 1/10       | 6/10     | −48%
HR                | 6.60  | 8.35      | 3/10       | 5/10     | +22%
Product Manager   | 7.27  | 8.03      | 2/10       | 4/10     | +65%
Customer Support  | 8.08  | 8.38      | 5/10       | 6/10     | −34%
Financial Analyst | 9.00  | 9.43      | 8/10       | 8/10     | +63%
Marketing         | 8.93  | 8.82      | 7/10       | 7/10     | −44%
Software Engineer | 9.98  | 10.00     | 10/10      | 10/10    | +22%
Data Scientist    | 9.67  | 9.50      | 9/10       | 8/10     | +69%
Legal             | 8.33  | 8.33      | 6/10       | 5/10     | +58%
Content Writer    | 7.80  | 8.10      | 6/10       | 4/10     | +17%

Scores are out of 10. Cost Δ = change in token cost, optimized vs. naive.


Three Patterns

Structured prompts win most, but not all, of the time. In 7 of 10 roles, the optimized prompt scored higher. In two (Data Scientist, Marketing) the naive prompt scored better, and in one (Legal) the two tied. The Software Engineer role was effectively a ceiling test: the two prompts scored 9.98 and 10.00, which suggests that for highly structured domains where a capable base model already knows the schema (code review, statistical analysis), the optimized prompt adds tokens and cost without proportional quality gain. Know your ceiling before you optimize.

The biggest gains are in open-ended, relationship-driven roles. Sales had the most dramatic gap: 3.02 → 7.86, pass rate 1/10 → 6/10. The naive prompt was a single line: Write a cold outreach email for {{product}}. No persona, no ICP, no objection to address, no format constraint. The model produced generic copy that failed 9 out of 10 rubric checks. The optimized version specified the SDR persona, required a subject line under 8 words (Gong research: subjects under 7 words get 33% higher open rates), named the prospect’s likely pain point, and constrained length. Same model, same temperature, same test cases. Pass rate went from 1 to 6.
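
Side by side, the two variants look something like this. The naive prompt is the one quoted above; the optimized one is a paraphrase of the constraints just listed, with the word-count cap as an assumed example rather than the exact figure from the repo:

```yaml
prompts:
  # naive variant (verbatim from above)
  - "Write a cold outreach email for {{product}}."
  # optimized variant (paraphrased; the 120-word cap is illustrative)
  - >-
    You are an SDR writing a cold outreach email for {{product}}
    to the {{prospect_role}} at a {{company_size}}-person {{company_type}}.
    Subject line: under 8 words. Open by naming their likely pain point
    ({{pain_point}}), reference {{hook}}, position {{differentiator}},
    and close with this CTA: {{cta}}. Keep the body under 120 words.
```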

HR followed the same logic: 6.60 → 8.35. The naive prompt was Write a job description for {{role}}. The optimized version added team context, culture signals, and an explicit constraint against gendered language. Gaucher et al. showed that gender-coded language in job descriptions measurably reduces the diversity of applicants. Baking that finding into the rubric means the eval is testing for something that actually matters in production, not a vibe.

Cost is not correlated with quality, and this is the number that matters at scale. The optimized Marketing prompt scored slightly lower but cost 44% less and ran in 28 seconds instead of 52. The optimized Customer Support prompt scored higher and cost 34% less. If you’re generating marketing copy across 10,000 SKUs or handling thousands of support tickets daily, that delta is not a footnote. It’s the budget conversation.


The Failure That Surprised Me Most

Legal. Both prompts scored exactly 8.33. Pass rates: 6/10 naive, 5/10 optimized. Identical quality by every metric — except latency. The optimized prompt took 410 seconds to complete. The naive prompt took 0.1 seconds.

This is the LLM equivalent of a query that returns the right answer but does a full table scan to get there. Structured prompts with multiple required output sections — contract risk summary, liability cap analysis, recommended redlines — can trigger long reasoning chains that cost you in latency even when they’re producing correct output. For latency-sensitive applications, efficiency is a quality dimension, and it needs to be an assertion, not an afterthought.


A Note on the Grading Setup

The grader and generator are the same model: Claude Haiku. This is the honest limitation of the current setup. A model can have systematic blind spots that cause it to consistently produce and consistently approve certain patterns — self-confirmation bias at the model level.

Two things partially mitigate this here. First, the rubrics are specific, not subjective. “Does the copy avoid greenwashing clichés like ‘save the planet’ or ‘eco-friendly’?” leaves less room for the grader to be lenient than “Is this good marketing copy?” Second, the AND-logic strict grading means even a single rubric failure marks the case FAIL — there’s no averaging that papers over a miss. That said: meta-evaluations show LLM judges correlate roughly 0.82 with human annotators on structured output tasks (Zheng et al., 2023). Not perfect, but more consistent and scalable than grading 100 outputs manually every time you tweak a prompt. The right production setup would use a stronger model as grader — Sonnet or Opus grading Haiku output — or spot-check a sample of results with human review.
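
In PromptFoo, the grading model for llm-rubric assertions can be set separately from the generation provider, so that split is a small config change. A sketch, with illustrative model IDs:

```yaml
providers:
  - anthropic:messages:claude-3-haiku-20240307       # generator stays on Haiku

defaultTest:
  options:
    # grader used for llm-rubric assertions; a stronger model than the generator
    provider: anthropic:messages:claude-3-5-sonnet-20240620
```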


What I’d Do Differently

The Legal result points to an obvious addition: latency as an explicit assertion. PromptFoo supports response time constraints. A prompt that produces correct output in 410 seconds when the naive version takes 0.1 seconds is not better — it’s a regression hiding behind a good score.
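
A sketch of what that looks like as a PromptFoo assertion; the threshold here is an illustrative number, not a tuned one:

```yaml
assert:
  - type: latency        # fails the test case if the response takes too long
    threshold: 30000     # milliseconds; illustrative ceiling of 30 s
```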

The Data Scientist and Content Writer results surface a subtler point: for domains where the base model already has strong priors about output structure, the optimized prompt adds cost without proportional quality gain. The right move is to benchmark first, then optimize. Don’t write an elaborate prompt for a role where the naive version is already at 9.67.

The Sales role is the one I’d show anyone who says prompt engineering is just intuition. The difference between 3.02 and 7.86 isn’t a better model or a different temperature. It’s knowing what you want precisely enough to write it down — and then measuring whether you got it.


The Point

The benchmark is reproducible. The configs, prompts, and test cases are all in YAML files in version control. Swap the model, change a role, add a new assertion. The feedback loop is 60 seconds.

The teams that will move fast with LLMs aren’t the ones with the most creative prompts. They’re the ones who treat prompt iteration like a pipeline — with a defined test suite, pass/fail criteria, and version history.

Vibing is how you ship a demo. Testing is how you ship production.


Tools used: PromptFoo (open-source LLM eval framework), Claude Haiku (generator + grader), Ollama/qwen2.5-coder:14b (local fallback). All 10 role configs on GitHub.

References: Zheng et al. (2023) — LLM-as-a-Judge; Gaucher et al. (2011) — Journal of Personality and Social Psychology; Gong Labs — subject line research; SPIN Selling — Neil Rackham; Ogilvy on Advertising.