How Oracle Built Veritas: A Blueprint for Scaling Generative AI Evaluation Across Enterprise Teams

As AI teams scale, evaluation becomes the bottleneck — and Oracle built an entire framework to fix it.
When Oracle’s teams expanded their use of large language models in 2023, model evaluation quickly became chaotic. Teams were benchmarking models, testing product scenarios, and comparing systems — but every evaluation lived in its own pile of scripts, datasets, and custom reporting code. Results were useful locally but nearly impossible to compare or reproduce across teams.
The solution was Veritas: an internal evaluation framework designed to turn AI model testing from a collection of one-off scripts into a structured engineering system.
How Veritas Works
Veritas organizes every evaluation pipeline around four stages: Transformation, Generation, Evaluation, and Summarization. This consistent structure makes evaluations reusable and comparable. Instead of each team rebuilding the same infrastructure, they compose new benchmarks from shared predictors, evaluators, and datasets — and only write new code for genuinely novel parts.
Key components include:
- Testsuites defining end-to-end evaluation pipelines in YAML
- Shared predictors and evaluators reusable across workloads
- A scheduler that tracks dependencies, runs independent work in parallel, and supports retries and checkpointing
What Changed in Practice
Since launch, Oracle’s AI team has published 150+ testsuites and 50+ model evaluation reports covering reasoning, RAG, responsible AI, code generation, NL2SQL, multimodal tasks, and more. The framework also enabled an internal leaderboard — so teams can see whether a quality improvement justifies a change in cost or latency for a specific use case.
Key Takeaways
- Evaluation fragmentation is a scaling problem, not just a tooling problem — Veritas shows the value of treating it as a first-class engineering system.
- Composable YAML pipelines beat custom scripts — reuse becomes practical when components are declared, not buried in code.
- Agentic evaluation is next — Oracle is packaging benchmark-integration knowledge as “skills” that agentic tools can use to convert research papers into executable evaluations.
References: HELM, OpenAI Evals, lm-evaluation-harness
🔗 Read the full article on Oracle Blogs
Stay in Rhythm
Subscribe for insights that resonate • from strategic leadership to AI-fueled growth. The kind of content that makes your work thrum.
More from Thrum
Additional pieces exploring adjacent ideas
