What is the difference between an AI agent and an AI workflow?

A workflow is a system where the steps are predefined in code. An agent is a system where the LLM itself decides what steps to take and in what order. The key difference is who controls the logic — the developer or the model.
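
To make the distinction concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `search` and `summarize` tools are placeholders, and `fake_model` stands in for a real LLM call so the example runs offline. The only point is where the control flow lives.

```python
from typing import Callable

# Hypothetical tools; stand-ins for real API calls.
def search(query: str) -> str:
    return f"results for {query!r}"

def summarize(text: str) -> str:
    return f"summary of {text}"

TOOLS: dict[str, Callable[[str], str]] = {"search": search, "summarize": summarize}

def run_workflow(question: str) -> str:
    """Workflow: the developer fixes the steps and their order in code."""
    results = search(question)
    return summarize(results)

def fake_model(history: list[str]) -> str:
    """Stand-in for an LLM call: returns the next tool name, or 'done'.

    A real agent would send `history` to a model and parse its reply.
    """
    script = ["search", "summarize", "done"]
    return script[len(history) - 1]

def run_agent(question: str, choose_step=fake_model, max_steps: int = 5) -> str:
    """Agent: the model picks the next step on each pass through the loop."""
    history = [question]
    for _ in range(max_steps):
        step = choose_step(history)
        if step == "done":
            break
        history.append(TOOLS[step](history[-1]))
    return history[-1]

print(run_workflow("What is model drift?"))
print(run_agent("What is model drift?"))
```

With the scripted stand-in, both calls happen to produce the same result. Swap in a different `choose_step` and the agent's path changes without touching `run_agent`, whereas changing the workflow means editing its code.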

Do I need to know how to code to build an AI agent?

No. Tools like Claude Code let you describe what you want in plain language, then handle most of the implementation for you. What matters more is clarity about what the system should do and what a good result looks like.

Why is observability important for AI agents?

AI systems are non-deterministic — they can behave differently across runs. Observability traces each step of an agent's execution so you can debug failures, understand outputs, and improve performance over time.
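
As a rough illustration of what "tracing each step" means, the sketch below records one span per step with its name, attributes, status, and duration. The `Tracer` class and the `answer` function are hypothetical and are not the Arthur Engine's API; production systems typically use an established tracing library instead.

```python
import time
from contextlib import contextmanager

class Tracer:
    """Collects one record (a "span") per traced step."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name, **attributes):
        record = {"name": name, "attributes": attributes, "status": "ok"}
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            # Keep the failure on the span so it can be debugged later.
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            record["duration_s"] = time.perf_counter() - start
            self.spans.append(record)

tracer = Tracer()

def answer(question):
    with tracer.span("retrieve", query=question) as s:
        docs = ["doc-1", "doc-2"]  # placeholder for a retrieval call
        s["attributes"]["num_docs"] = len(docs)
    with tracer.span("generate", num_docs=len(docs)):
        # Placeholder for an LLM call.
        return f"answer based on {len(docs)} documents"
```

After a call to `answer(...)`, `tracer.spans` holds the ordered steps of that run, which is the raw material for debugging a failure or comparing two runs that behaved differently.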

What is the Arthur Engine?

The Arthur Engine is a free, open-source tool for AI observability and evaluation. It traces every step of an AI agent or workflow so teams can see what happened, measure performance, and catch issues before users do.

What’s New in Arthur: Custom Evals, a smarter workspace, and engine upgrades for speed and security

What’s New in Arthur – September 2025 Edition

AI teams today are juggling more complexity than ever: multiple models, shifting data, compliance needs, and business stakeholders asking, “But how do we know it’s working?”

This month’s Arthur updates are designed to give teams more control, more clarity, and more confidence in how they measure, monitor, and manage AI in production. Let’s dig into the new features and the value they unlock.

Custom Evals: Measure Success Your Way

No two AI applications define “success” the same way. A retail company wants to track how personalized its product recommendations feel. A healthcare team needs to measure false negatives in patient screenings. A finance org cares about bias across different customer segments.

That’s why we built Custom Evals.

With Custom Evals, you can:

Define your own metrics that reflect your business goals, not just generic accuracy scores.
Configure once and reuse everywhere across ML models, GenAI outputs, and agentic workflows.
Introduce LLM-as-a-Judge metrics for natural language evaluations like tone, clarity, or brand alignment.
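
For readers new to the LLM-as-a-Judge pattern, here is a minimal sketch of the general idea, not of how Custom Evals implements it. The prompt wording, the 1 to 5 scale, and the `stub_model` function are all assumptions made for illustration; `call_model` can be any function that takes a prompt and returns text.

```python
import re

JUDGE_PROMPT = (
    "Rate the following reply for {criterion} on a scale of 1 to 5.\n"
    "Respond with 'Score: <number>' followed by a one-sentence reason.\n\n"
    "Reply:\n{reply}"
)

def parse_score(judge_output: str) -> int:
    """Extract the 1-5 score from the judge's text; raise if it is missing."""
    match = re.search(r"Score:\s*([1-5])\b", judge_output)
    if match is None:
        raise ValueError(f"no score found in judge output: {judge_output!r}")
    return int(match.group(1))

def judge(reply: str, criterion: str, call_model) -> int:
    """Score `reply` on `criterion` using any prompt -> text callable."""
    prompt = JUDGE_PROMPT.format(criterion=criterion, reply=reply)
    return parse_score(call_model(prompt))

def stub_model(prompt: str) -> str:
    # Stand-in for a real LLM call so the sketch runs offline.
    return "Score: 4. The tone is friendly but slightly informal."

print(judge("Hi there! Happy to help with your refund.", "tone", stub_model))
```

Parsing the score strictly, and failing loudly when it is missing, keeps malformed judge output from silently turning into a metric value.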

Use case example: A fintech company uses Custom Evals to monitor fraud detection models. Instead of only tracking precision and recall, they configure a “risk exposure” metric that quantifies the dollar value of missed fraud cases. That custom lens helps both data scientists and executives see the true business impact that off-the-shelf metrics could never capture.
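
A “risk exposure” metric like the one described could be computed along these lines. This is a sketch under stated assumptions: the `Transaction` fields and the sample batch are invented for illustration and do not reflect Arthur's Custom Evals configuration format.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    amount_usd: float
    is_fraud: bool  # ground-truth label
    flagged: bool   # model prediction

def precision_recall(transactions):
    """Standard precision and recall for the 'fraud' class."""
    tp = sum(t.is_fraud and t.flagged for t in transactions)
    fp = sum((not t.is_fraud) and t.flagged for t in transactions)
    fn = sum(t.is_fraud and not t.flagged for t in transactions)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def risk_exposure(transactions):
    """Total dollar value of fraud the model failed to flag (false negatives)."""
    return sum(t.amount_usd for t in transactions if t.is_fraud and not t.flagged)

# Illustrative data only.
batch = [
    Transaction(20.0, is_fraud=True, flagged=True),
    Transaction(35.0, is_fraud=True, flagged=True),
    Transaction(9000.0, is_fraud=True, flagged=False),
    Transaction(50.0, is_fraud=False, flagged=False),
]

precision, recall = precision_recall(batch)
print(f"precision={precision:.2f} recall={recall:.2f} exposure=${risk_exposure(batch):,.2f}")
```

In this made-up batch the model catches two of three fraud cases, so recall is about 0.67, yet the single miss accounts for $9,000 of exposure. That gap between a count-based metric and a dollar-weighted one is what the custom metric surfaces.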

The result? More relevant insights, tighter alignment across teams, and KPIs that actually matter.

A New Workspace Home: Fewer Clicks, More Flow

Your platform homepage should feel like a control center, not a maze. The new Workspace home makes it easier to orient yourself and act quickly.

Overview of activity: See recent evals, monitoring jobs, and results at a glance.
One-click access: Jump straight into engine management and other core functions.
Simpler navigation: Spend less time hunting through menus, more time iterating.

Use case example: A data science lead starts her day with a quick scan of the Workspace home. She sees that an overnight batch of model monitoring flagged drift in one city’s data. Instead of digging through logs, she’s one click away from the span query that pinpoints the issue. By 9:30 a.m., her team is already deploying a fix.

The impact? Diagnosis in minutes, not hours, and one fewer fire drill.

Arthur Engine: Performance, Security, and Developer Experience

Under the hood, we’ve rolled out a series of improvements that make Arthur faster, more secure, and easier for developers to use:

Span query improvements:
- Filter spans by type with the new GET endpoint /v1/spans/query.
- Support for span name columns and indexed queries means faster, more flexible analysis.
Improved ingestion stability: Handles complex trace structures without failures.
Unified API schema and client libraries: A smoother experience across ML, GenAI, and Platform engines.
Security hardening: The ML Engine now runs as a non-root user.
Artifact management: ML Engine artifacts are now pushed to Nexus for cleaner CI/CD pipelines.
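
As a rough idea of what calling the new span query endpoint might look like, the sketch below builds a request URL. Only the path `/v1/spans/query` and the GET method come from this release. The `span_types` parameter name, the span type values, and the host are assumptions for illustration, so check the API reference for the actual query parameters.

```python
from urllib.parse import urlencode, urljoin

def build_span_query_url(base_url, span_types, extra_params=None):
    """Build a GET URL for /v1/spans/query.

    ASSUMPTION: the type filter is passed as a repeated `span_types`
    query parameter. That name is hypothetical, not taken from the docs.
    """
    params = [("span_types", t) for t in span_types]
    params += list((extra_params or {}).items())
    return urljoin(base_url, "/v1/spans/query") + "?" + urlencode(params)

# Placeholder host and hypothetical span type values.
url = build_span_query_url("https://arthur.example.com", ["LLM", "TOOL"])
print(url)
```

The function only constructs the URL. Sending it would also require whatever authentication your deployment uses, which is outside the scope of this sketch.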

Use case example: An engineering team running GenAI apps needs to debug latency issues in production. With the optimized span queries, they can quickly isolate the slowest components instead of combing through unstructured logs. Meanwhile, security officers rest easier knowing that all ML engines are running with stricter permissions by default.
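
Once spans are in hand, isolating the slowest components is a small aggregation. The sketch below assumes each span is a dictionary with `name` and `duration_ms` keys. Those field names and the sample data are hypothetical and may not match the schema the span query returns.

```python
from collections import defaultdict
from statistics import mean

def slowest_components(spans, top_n=3):
    """Group spans by name and rank by mean duration, slowest first.

    Returns (name, mean_duration_ms, span_count) tuples.
    """
    durations = defaultdict(list)
    for span in spans:
        durations[span["name"]].append(span["duration_ms"])
    ranked = sorted(
        ((name, mean(values), len(values)) for name, values in durations.items()),
        key=lambda row: row[1],
        reverse=True,
    )
    return ranked[:top_n]

# Illustrative data only.
spans = [
    {"name": "retrieve", "duration_ms": 120},
    {"name": "llm_call", "duration_ms": 1800},
    {"name": "retrieve", "duration_ms": 80},
    {"name": "llm_call", "duration_ms": 2200},
    {"name": "rerank", "duration_ms": 40},
]

for name, avg_ms, count in slowest_components(spans):
    print(f"{name}: {avg_ms:.0f} ms average over {count} spans")
```

Ranking by mean is the simplest choice. For latency work, a high percentile such as p95 is often more informative because a few slow calls can hide behind a healthy average.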

The outcome? Faster troubleshooting, stronger security posture, and smoother developer workflows.

The Bigger Picture

September’s updates all ladder up to one core promise: Arthur makes AI oversight fit the shape of your business, not the other way around.

Custom Evals give you the freedom to measure what matters.
The new Workspace home helps you act faster and smarter.
Engine enhancements tighten security and speed up your workflow.

Together, these features ensure that whether you’re managing traditional ML models, GenAI outputs, or agentic systems, you have control, visibility, and trust in every decision your AI makes.

Stay tuned for the October release, which will bring even more ways to simplify, secure, and scale your AI operations.

See the full platform release notes for September 2025 here.