v0.2.1 is now live - start testing

Test your prompts like you test your code.

A deterministic workbench for AI Product Managers. Build your prompts, run bulk test cases, and evaluate outputs objectively before pushing to production.

The iteration loop, perfected.

Stop pasting outputs into spreadsheets. PromptEval systemizes your prompt engineering into three logical steps.

01 / Build

Workbench

A clean playground environment. Configure your system prompt, insert user variables, and tune model parameters with a focused workbench that behaves like your production setup.

Prompt Edit · Version 4 (saved)
Model: gpt-5-mini · temp=0.5 · max=600 · {{variables}}: 4

System
You are a support agent for B2B SaaS billing.

User
Account: {{account_name}}
Plan: {{plan_tier}}
Issue: {{ticket_summary}}
Region: {{region}}

Response (completed)
- Identified probable root cause in billing sync
- Recommended corrective action and customer reply draft
Latency: 742 ms · Tokens: 388
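
The {{variable}} placeholders above follow a standard substitution pattern. As a minimal sketch of that pattern in Python (an illustration only, not PromptEval's internal code):

import re

def render_prompt(template, variables):
    # Replace each {{name}} placeholder with its value; fail loudly on gaps.
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing variable: {name}")
        return variables[name]
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)

user_prompt = render_prompt(
    "Account: {{account_name}}\nPlan: {{plan_tier}}\n"
    "Issue: {{ticket_summary}}\nRegion: {{region}}",
    {
        "account_name": "Northwind",
        "plan_tier": "Enterprise",
        "ticket_summary": "Unexpected invoice spike",
        "region": "US",
    },
)

Rendering fails loudly on a missing variable, which is exactly the class of gap bulk testing is meant to surface.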
02 / Test

Bulk Execution

Upload edge cases or generate them with an LLM. Run hundreds of variable sets in one batch, track latency per case, and compare prompt versions side by side to catch regressions early.

Case 1
  Variables: account_name=Northwind · plan_tier=Enterprise · ticket_summary=Unexpected invoice spike · region=US
  v3 Output: Root cause unclear, missing next actions. (1012 ms, 482 tokens)
  v4 Output: Clear billing diagnosis with action checklist. (842 ms, 436 tokens)

Case 2
  Variables: account_name=BrightLoop · plan_tier=Pro · ticket_summary=Need downgrade policy · region=EU
  v3 Output: Generic answer; lacks policy nuance. (790 ms, 358 tokens)
  v4 Output: Includes compliant downgrade timing and template. (617 ms, 312 tokens)

Version compare: v1.3 -> v1.4 · -18% latency · -9% tokens
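
Conceptually, a bulk run is a loop over test cases with per-case timing. A hedged sketch of that pattern, where call_model stands in for your provider call and none of the names are PromptEval APIs:

import csv
import time

def run_bulk(cases_path, call_model, prompt_version):
    # Execute every row of a CSV of variables and record latency per case.
    results = []
    with open(cases_path, newline="") as f:
        for variables in csv.DictReader(f):
            start = time.perf_counter()
            output = call_model(variables)  # stand-in for the actual model call
            elapsed_ms = round((time.perf_counter() - start) * 1000)
            results.append({
                "version": prompt_version,
                "variables": variables,
                "output": output,
                "latency_ms": elapsed_ms,
            })
    return results

# Comparing versions is then just two runs over the same cases:
# v3_rows = run_bulk("edge_cases.csv", call_v3, "v3")
# v4_rows = run_bulk("edge_cases.csv", call_v4, "v4")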
03 / Evaluate

Scoring UIs

Stop manually reading hundreds of outputs. Use LLM-as-a-judge scoring to automatically grade your bulk runs against your criteria, or rate outputs yourself with our human-in-the-loop interface.

LLM Judge
  factuality: 9/10
  compliance: 8/10

Human Review
  policy fit: approved
  voice: approved

Decision
  ship candidate: yes
  confidence: high
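
The LLM-as-a-judge pattern amounts to a second model call that grades each output against your criteria and returns structured scores. A minimal sketch under that assumption (judge_model is a hypothetical callable, not a PromptEval API):

import json

JUDGE_TEMPLATE = """You are an impartial evaluator.
Criteria: {criteria}

Output to grade:
{output}

Reply with JSON only: {{"factuality": <0-10>, "compliance": <0-10>}}"""

def judge(output, criteria, judge_model):
    # Ask a judge model to score one output; expect a JSON object back.
    reply = judge_model(JUDGE_TEMPLATE.format(criteria=criteria, output=output))
    return json.loads(reply)

# judge(output, "factual accuracy; billing-policy compliance", judge_model)
# -> {"factuality": 9, "compliance": 8}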

Note: Evaluation functionality is currently under active development and not fully available yet.

Transparent pricing.

Predictable scaling for individuals and organizations.

Starter

For individual builders validating ideas.

$48.00/yr

Save $12.00 per year

  • Up to 10 prompts
  • 1,000 eval runs / month
  • BYOK only
Choose Starter
Recommended

Pro

For PMs building production AI features.

$199.00/yr

Save $39.80 per year

  • Up to 50 prompts
  • 20,000 eval runs / month
  • BYOK supported
  • Managed tokens supported*
Choose Pro

Team

For organizations scaling AI operations.

$1,800.00/yr

Save $598.80 per year

  • Unlimited prompts
  • Unlimited eval runs / month
  • BYOK supported
  • Managed tokens supported*
  • Priority execution queue
Contact Us

Managed Tokens Add-on

Managed tokens are billed separately from subscription plans. BYOK usage remains available based on your plan entitlements.

Package         Included usage        Price
Token Pack S    1M managed tokens     $19.00
Token Pack M    5M managed tokens     $89.00
Token Pack L    20M managed tokens    $299.00

* Managed tokens are pre-paid packages and are not unlimited under Pro or Team subscriptions.

Checkout is securely handled by Stripe. Prices are in USD; taxes may apply.

Frequently Asked Questions

Who is PromptEval designed for?
PromptEval is built for AI product managers, builders, and engineers responsible for shipping AI-powered features. You don't need deep ML expertise — if you define prompts and care about output quality, PromptEval helps you systemize that workflow.
Is PromptEval like unit testing for prompts?
Yes — PromptEval applies a testing workflow to AI features. Instead of manually checking outputs, you define test cases, run them in bulk, and evaluate results consistently before shipping changes to production.
Why not just test prompts manually or in spreadsheets?
Manual testing becomes inconsistent, slow, and hard to compare across versions or models. PromptEval makes evaluations repeatable, measurable, and shareable — so you can detect regressions, compare iterations, and ship with confidence.
What does 'Bring your own keys' (BYOK) mean?
BYOK lets you connect your own provider API keys. PromptEval handles the workflow layer, while token usage is billed directly by your LLM provider.
How do managed tokens work?
Managed tokens let you run evaluations through PromptEval's shared infrastructure instead of your own provider keys. This simplifies billing and avoids distributing raw keys across teams. Managed tokens are purchased as separate usage packages.
Are managed tokens included in Pro or Team plans?
No. Plans include platform features and usage limits, while managed tokens are optional add-ons billed based on consumption.
Is automated evaluation available yet?
Automated scoring is currently being finalized and will roll out progressively. PromptEval already supports the full build-and-test workflow today, and early users will gain access to evaluation features as they become production-ready.
Can I export test results?
Yes. You can export full run artifacts—including inputs, outputs, and scores—in CSV or JSON for analysis or reporting.
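
Because exports are plain CSV or JSON, they drop straight into scripts or notebooks. A short sketch, assuming an illustrative record shape with a scores field (field names hypothetical; check the actual export for the exact schema):

import json

with open("run_export.json") as f:
    rows = json.load(f)

# Field names are illustrative; the real export schema may differ.
low = [r for r in rows if r.get("scores", {}).get("factuality", 10) < 7]
print(f"{len(low)} of {len(rows)} outputs fell below the factuality bar")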