Frequently asked questions about agentic evaluations
Find answers to common questions about setting up and running evaluations.
- Do I need to keep anything ready before an automated evaluation?
- Before you begin, make sure you:
- Test your agent or workflow in the playground. Catch the obvious issues early—automated evaluations are best for deeper validation.
- Ensure your table has all the required inputs if you're generating test scenarios or using scenarios from previous agent or workflow runs during setup.
- Prep enough scenarios. We recommend at least 100. Your evaluation is only as strong as the situations you put your agent through.
- Define what success means. Be clear on what the right output for your agent should be.
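The checklist above can be sketched as a small pre-flight check. This is an illustrative Python sketch, not a ServiceNow API: the scenario dictionaries and the field names `short_description` and `expected_output` are hypothetical stand-ins for whatever inputs your agent or workflow requires. Only the minimum of 100 scenarios comes from the guidance above.

```python
from typing import Any

# Recommended minimum number of scenarios, taken from the guidance above.
MIN_SCENARIOS = 100


def check_scenarios(
    scenarios: list[dict[str, Any]], required_inputs: list[str]
) -> list[str]:
    """Return a list of readable problems; an empty list means the set looks ready."""
    problems: list[str] = []

    if len(scenarios) < MIN_SCENARIOS:
        problems.append(
            f"only {len(scenarios)} scenarios; at least {MIN_SCENARIOS} recommended"
        )

    for index, scenario in enumerate(scenarios):
        # Treat absent keys, None, and empty strings as missing inputs.
        missing = [
            name for name in required_inputs if scenario.get(name) in (None, "")
        ]
        if missing:
            problems.append(f"scenario {index} is missing: {', '.join(missing)}")

    return problems


if __name__ == "__main__":
    # Hypothetical field names, used only for illustration.
    required = ["short_description", "expected_output"]
    sample = [{"short_description": "VPN drops hourly", "expected_output": ""}]
    for problem in check_scenarios(sample, required):
        print(problem)
```

Including the expected output as a required field is one way to make "define what success means" checkable before the evaluation starts.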
- How do I set up my first automated evaluation?
- To set up an evaluation, follow the guided flow:
- Select your agent or workflow and its version.
- Choose your metrics—built-in or custom.
- Use an existing dataset or decide how you want to build one.
- When should I create a custom metric?
- Create a custom metric when you have unique evaluation criteria and want to measure workflow-specific or agent-specific behaviors that ServiceNow's built-in metrics don't cover. For example, you might want to:
- Check whether a particular phrase appears in the agent's response.
- Measure response length to assess verbosity or brevity.
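The two examples above could look roughly like the following. This is a minimal sketch in Python for readability. It does not use ServiceNow's scripting API, and the function names, the 0-to-1 score scale, and the word-count bounds are all assumptions chosen for illustration.

```python
def phrase_present(response: str, phrase: str) -> float:
    """Score 1.0 if the phrase appears in the response (case-insensitive), else 0.0."""
    return 1.0 if phrase.lower() in response.lower() else 0.0


def length_score(response: str, min_words: int = 20, max_words: int = 150) -> float:
    """Score 1.0 inside the target word range, with lower scores the further outside it.

    The default bounds are arbitrary illustrative values, not recommendations.
    """
    words = len(response.split())
    if words == 0:
        return 0.0
    if words < min_words:
        # Too brief: score is the fraction of the minimum that was reached.
        return words / min_words
    if words > max_words:
        # Too verbose: score shrinks as the response overshoots the maximum.
        return max_words / words
    return 1.0


if __name__ == "__main__":
    reply = "Your incident has been resolved. Please restart the VPN client."
    print(phrase_present(reply, "has been resolved"))
    print(length_score(reply))
```

Returning a number rather than a yes/no value keeps both checks comparable with other scored metrics. A strict pass/fail version would simply return 0.0 outside the range.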
- How do I build a dataset for agentic evaluations?
- A dataset is a set of execution logs that capture what happens when your AI agent or workflow processes records such as incidents, cases, or tasks. You can build one in either of two ways:
- Using logs from previous agent or workflow runs, or
- Generating new logs by running the agent or workflow after setup.
- What's next after an automated evaluation?
- Review your evaluation results to:
- Identify configuration gaps in your agent or workflow
- Assess deployment readiness
- Analyze tool performance for issues with inputs or descriptions
- Drill down into individual executions and metric scores
- How do I create a custom metric?
- Create a custom metric in a few steps:
- Name and describe your metric.
- Define its evaluation scope—agentic workflow, agents, or both.
- Specify what it measures, how it works, and its output format.
- Add metric inputs and write your script-based metric.
- Save and publish to make it available for use.
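To make the steps above concrete, here is one way to picture the pieces a custom metric brings together. This is a hypothetical Python model, not the product's actual data structure: the class, its field names, and the `publish` checks are assumptions. The scope values mirror the three options listed above.

```python
from dataclasses import dataclass, field
from typing import Callable

# Scope options as described in the steps above.
VALID_SCOPES = {"agentic workflow", "agents", "both"}


@dataclass
class CustomMetric:
    """Illustrative container for the parts of a custom metric definition."""

    name: str
    description: str
    scope: str
    output_format: str  # for example, "score between 0 and 1"
    script: Callable[[dict], float]  # receives the metric inputs, returns a score
    inputs: list[str] = field(default_factory=list)
    published: bool = False

    def publish(self) -> None:
        """Validate the definition, then mark it as available for use."""
        if not self.name or not self.description:
            raise ValueError("name and description are required")
        if self.scope not in VALID_SCOPES:
            raise ValueError(f"unknown scope: {self.scope}")
        self.published = True


if __name__ == "__main__":
    metric = CustomMetric(
        name="Mentions resolution",
        description="Checks whether the response says the issue was resolved.",
        scope="agents",
        output_format="score between 0 and 1",
        script=lambda data: 1.0 if "resolved" in data["response"].lower() else 0.0,
        inputs=["response"],
    )
    metric.publish()
    print(metric.published, metric.script({"response": "Issue resolved."}))
```

Keeping validation in the publish step reflects the order of the steps: a metric can be drafted freely and only needs to be complete when it is made available.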
- How do I interpret evaluation results?
- Based on the metrics you select, each execution will display a score for every metric. Refer to the "Metric guide" to understand what the scores mean. You can also customize metric thresholds to align with your organization's definitions of success and failure.
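As a rough illustration of how per-metric thresholds turn scores into success or failure, consider the sketch below. It is written in Python under stated assumptions: scores are taken to be numbers where higher is better, the metric names `task_completion` and `tool_accuracy` are hypothetical, and the default threshold of 0.5 is arbitrary. The "Metric guide" remains the reference for what actual scores mean.

```python
def classify(
    scores: dict[str, float],
    thresholds: dict[str, float],
    default_threshold: float = 0.5,
) -> dict[str, str]:
    """Label each metric score for one execution as 'pass' or 'fail'."""
    return {
        metric: "pass" if score >= thresholds.get(metric, default_threshold) else "fail"
        for metric, score in scores.items()
    }


def pass_rate(executions: list[dict[str, float]], metric: str, threshold: float) -> float:
    """Fraction of executions whose score for one metric meets the threshold."""
    relevant = [scores[metric] for scores in executions if metric in scores]
    if not relevant:
        return 0.0
    return sum(score >= threshold for score in relevant) / len(relevant)


if __name__ == "__main__":
    # Hypothetical metric names and scores for a single execution.
    execution = {"task_completion": 0.8, "tool_accuracy": 0.6}
    print(classify(execution, thresholds={"tool_accuracy": 0.7}))
```

Raising a threshold makes the same scores fail more often, which is why the thresholds should follow your organization's own definition of success rather than a generic default.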
- How do I track the progress of my evaluations?
- Evaluations may take some time, but you don't need to stay on the page. From the homepage, you can track all evaluations and even see if any action is required.
- How is the parser tool used during custom metric creation?
- When you create a custom metric for agentic evaluations, providing a metric input is optional because the 'execution plan record sys_id' is included by default. A parser tool is also available that pulls structured data from your execution logs, so you don't need to parse the XML or JSON manually. You can access the parser tool's outputs with tool output.