v0.2.1 is now live - start testing

Test your prompts like you test your code.

A deterministic workbench for AI Product Managers. Build your prompts, run bulk test cases, and evaluate outputs objectively before pushing to production.

The iteration loop, perfected.

Stop pasting outputs into spreadsheets. PromptEval systemizes your prompt engineering into three logical steps.

01 / Build

Workbench

A clean playground environment. Configure your system prompt, insert user variables, and tune model parameters with a focused workbench that behaves like your production setup.

Prompt Edit · Version 4 (saved)
Model: gpt-5-mini · temp=0.5 · max=600 · {{variables}}: 4

System
You are a support agent for B2B SaaS billing.

User
Account: {{account_name}}
Plan: {{plan_tier}}
Issue: {{ticket_summary}}
Region: {{region}}

Response (completed)
- Identified probable root cause in billing sync
- Recommended corrective action and customer reply draft
Latency: 742 ms · Tokens: 388
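
The {{variable}} placeholders above follow a standard substitution pattern. As a minimal sketch of that pattern in Python (an illustration only, not PromptEval's internal code):

import re

def render_prompt(template, variables):
    # Replace each {{name}} placeholder with its value; fail loudly on gaps.
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing variable: {name}")
        return variables[name]
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)

user_prompt = render_prompt(
    "Account: {{account_name}}\nPlan: {{plan_tier}}\n"
    "Issue: {{ticket_summary}}\nRegion: {{region}}",
    {
        "account_name": "Northwind",
        "plan_tier": "Enterprise",
        "ticket_summary": "Unexpected invoice spike",
        "region": "US",
    },
)

Rendering fails loudly on a missing variable, which is exactly the class of gap bulk testing is meant to surface.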
02 / Test

Bulk Execution

Upload edge cases or generate them with an LLM. Run hundreds of variable sets in one batch, track latency per case, and compare prompt versions side by side to catch regressions early.

Case 1
  Variables: account_name=Northwind · plan_tier=Enterprise · ticket_summary=Unexpected invoice spike · region=US
  v3 Output: Root cause unclear, missing next actions. (1012 ms, 482 tokens)
  v4 Output: Clear billing diagnosis with action checklist. (842 ms, 436 tokens)

Case 2
  Variables: account_name=BrightLoop · plan_tier=Pro · ticket_summary=Need downgrade policy · region=EU
  v3 Output: Generic answer; lacks policy nuance. (790 ms, 358 tokens)
  v4 Output: Includes compliant downgrade timing and template. (617 ms, 312 tokens)

Version compare: v1.3 -> v1.4 · -18% latency · -9% tokens
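
Conceptually, a bulk run is a loop over test cases with per-case timing. A hedged sketch of that pattern, where call_model stands in for your provider call and none of the names are PromptEval APIs:

import csv
import time

def run_bulk(cases_path, call_model, prompt_version):
    # Execute every row of a CSV of variables and record latency per case.
    results = []
    with open(cases_path, newline="") as f:
        for variables in csv.DictReader(f):
            start = time.perf_counter()
            output = call_model(variables)  # stand-in for the actual model call
            elapsed_ms = round((time.perf_counter() - start) * 1000)
            results.append({
                "version": prompt_version,
                "variables": variables,
                "output": output,
                "latency_ms": elapsed_ms,
            })
    return results

# Comparing versions is then just two runs over the same cases:
# v3_rows = run_bulk("edge_cases.csv", call_v3, "v3")
# v4_rows = run_bulk("edge_cases.csv", call_v4, "v4")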
03 / Evaluate

Scoring UIs

Stop manually reading hundreds of outputs. Use LLM-as-a-judge scoring to automatically grade your bulk runs against your criteria, or rate outputs yourself with our human-in-the-loop interface.

LLM Judge
  factuality: 9/10
  compliance: 8/10

Human Review
  policy fit: approved
  voice: approved

Decision
  ship candidate: yes
  confidence: high
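
The LLM-as-a-judge pattern amounts to a second model call that grades each output against your criteria and returns structured scores. A minimal sketch under that assumption (judge_model is a hypothetical callable, not a PromptEval API):

import json

JUDGE_TEMPLATE = """You are an impartial evaluator.
Criteria: {criteria}

Output to grade:
{output}

Reply with JSON only: {{"factuality": <0-10>, "compliance": <0-10>}}"""

def judge(output, criteria, judge_model):
    # Ask a judge model to score one output; expect a JSON object back.
    reply = judge_model(JUDGE_TEMPLATE.format(criteria=criteria, output=output))
    return json.loads(reply)

# judge(output, "factual accuracy; billing-policy compliance", judge_model)
# -> {"factuality": 9, "compliance": 8}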

Note: Evaluation functionality is currently under active development and not fully available yet.

Transparent pricing.

Predictable scaling for individuals and organizations.

Starter

For individual builders validating ideas.

$48.00/yr

Save $12.00 per year

  • Up to 10 prompts
  • 1,000 eval runs / month
  • BYOK only
Choose Starter
Recommended

Pro

For PMs building production AI features.

$199.00/yr

Save $39.80 per year

  • Up to 50 prompts
  • 20,000 eval runs / month
  • BYOK supported
  • Managed tokens supported*
Choose Pro

Team

For organizations scaling AI operations.

$1,800.00/yr

Save $598.80 per year

  • Unlimited prompts
  • Unlimited eval runs / month
  • BYOK supported
  • Managed tokens supported*
  • Priority execution queue
Contact Us

Managed Tokens Add-on

Managed tokens are billed separately from subscription plans. BYOK usage remains available based on your plan entitlements.

Package         Included usage        Price
Token Pack S    1M managed tokens     $19.00
Token Pack M    5M managed tokens     $89.00
Token Pack L    20M managed tokens    $299.00

* Managed tokens are pre-paid packages and are not unlimited under Pro or Team subscriptions.

Checkout is securely handled by Stripe. Prices are in USD; taxes may apply.

Frequently Asked Questions

Who is PromptEval designed for?
PromptEval is built for AI product managers, builders, and engineers responsible for shipping AI-powered features. You don't need deep ML expertise — if you define prompts and care about output quality, PromptEval helps you systemize that workflow.
Is PromptEval like unit testing for prompts?
Yes — PromptEval applies a testing workflow to AI features. Instead of manually checking outputs, you define test cases, run them in bulk, and evaluate results consistently before shipping changes to production.
Why not just test prompts manually or in spreadsheets?
Manual testing becomes inconsistent, slow, and hard to compare across versions or models. PromptEval makes evaluations repeatable, measurable, and shareable — so you can detect regressions, compare iterations, and ship with confidence.
What does 'Bring your own keys' (BYOK) mean?
BYOK lets you connect your own provider API keys. PromptEval handles the workflow layer, while token usage is billed directly by your LLM provider.
How do managed tokens work?
Managed tokens let you run evaluations through PromptEval's shared infrastructure instead of your own provider keys. This simplifies billing and avoids distributing raw keys across teams. Managed tokens are purchased as separate usage packages.
Are managed tokens included in Pro or Team plans?
No. Plans include platform features and usage limits, while managed tokens are optional add-ons billed based on consumption.
Is automated evaluation available yet?
Automated scoring is currently being finalized and will roll out progressively. PromptEval already supports the full build-and-test workflow today, and early users will gain access to evaluation features as they become production-ready.
Can I export test results?
Yes. You can export full run artifacts—including inputs, outputs, and scores—in CSV or JSON for analysis or reporting.
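
Because exports are plain CSV or JSON, they drop straight into scripts or notebooks. A short sketch, assuming an illustrative record shape with a scores field (field names hypothetical; check the actual export for the exact schema):

import json

with open("run_export.json") as f:
    rows = json.load(f)

# Field names are illustrative; the real export schema may differ.
low = [r for r in rows if r.get("scores", {}).get("factuality", 10) < 7]
print(f"{len(low)} of {len(rows)} outputs fell below the factuality bar")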