Agentic evaluation run results
Learn how agentic evaluation runs work and what the scores on the agentic evaluation results page mean.
Agentic evaluations overview
Agentic evaluations measure how well agentic workflows accomplish their objectives. A Now LLM Service model judges each agentic workflow based on its execution logs. The results page of an evaluation run shows multiple metrics and scores that measure task completeness and tool use.
If you run an overall task completion evaluation, the results page also shows recommended actions for the workflow. These actions suggest whether to deploy or improve the workflow, to help ensure that the agentic workflows you deploy meet your standards.
For more information on AI agent usage and other analytics, you can review the AI Agent Analytics dashboard in the AI Agent Studio.
Evaluation results overview
For each evaluation method that you execute, the results page displays an overall score for the agentic workflow. The score is shown as the percentage of successful record evaluations, together with a label of Excellent, Good, Moderate, or Poor. To change the metric thresholds for each label, select Customize metric thresholds.
| Label | Description | Recommended action | Default threshold |
|---|---|---|---|
| Excellent | Tasks were consistently performed at a high standard. The agentic workflow is working well. | Proceed with confidence | 90%–100% |
| Good | Most tasks were performed successfully, but some performance inconsistencies suggest areas for improvement. | Deploy with caution | 70%–89% |
| Moderate | A significant number of tasks weren’t fully completed. Performance is below the desired level. | Investigate the root causes of poor task completion | 50%–69% |
| Poor | The agentic workflow is consistently failing to complete tasks adequately. Major issues are present. | Do not deploy | 0%–49% |
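
The default thresholds above can be read as a simple lookup from score to label. The following Python sketch is illustrative only and is not the platform's implementation. The names `DEFAULT_THRESHOLDS`, `RECOMMENDED_ACTIONS`, `percent_successful`, and `label_for` are hypothetical. The sketch assumes that each label's lower bound is inclusive and that the overall score is the share of record evaluations marked successful.

```python
from typing import Sequence

# Hypothetical representation of the default thresholds from the table above:
# (label, inclusive lower bound in percent), ordered from highest to lowest.
DEFAULT_THRESHOLDS = (
    ("Excellent", 90.0),
    ("Good", 70.0),
    ("Moderate", 50.0),
    ("Poor", 0.0),
)

# Recommended actions copied from the table above.
RECOMMENDED_ACTIONS = {
    "Excellent": "Proceed with confidence",
    "Good": "Deploy with caution",
    "Moderate": "Investigate the root causes of poor task completion",
    "Poor": "Do not deploy",
}


def percent_successful(record_results: Sequence[bool]) -> float:
    """Return the percentage of record evaluations that were successful."""
    if not record_results:
        raise ValueError("at least one record evaluation is required")
    return 100.0 * sum(record_results) / len(record_results)


def label_for(score_percent: float, thresholds=DEFAULT_THRESHOLDS) -> str:
    """Map an overall score (0-100) to a label using inclusive lower bounds."""
    if not 0.0 <= score_percent <= 100.0:
        raise ValueError("score must be between 0 and 100")
    for label, lower_bound in thresholds:
        if score_percent >= lower_bound:
            return label
    raise ValueError("thresholds must include a label with a lower bound of 0")


if __name__ == "__main__":
    results = [True] * 8 + [False] * 2  # 8 of 10 records successful
    score = percent_successful(results)
    label = label_for(score)
    print(f"{score:.0f}% -> {label}: {RECOMMENDED_ACTIONS[label]}")
```

Passing a different `thresholds` tuple to `label_for` mirrors, in spirit, what Customize metric thresholds does on the results page: the score stays the same and only the cutoffs for each label move.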
Individual record metric scores
Evaluations are run against the log tables of agentic workflow executions. Each record is individually scored for each evaluation plan that you run. Individual record evaluations are scored according to the following metrics.
| Number | Score | Description |
|---|---|---|
| 3 | Successful | The main task was fully completed. All subtasks were resolved, and the steps followed a logical sequence with no critical errors. |
| 2 | Partially successful | The task was partially completed. Some subtasks remain unresolved or inefficiencies affected the process. |
| 1 | Unsuccessful | The task wasn’t completed. Critical subtasks were abandoned or unresolved or the execution failed entirely. |

| Number | Score | Description |
|---|---|---|
| 1 | True | The right tool was chosen for the action in the plan. |
| 0 | False | The right tool wasn’t chosen. |

| Number | Score | Description |
|---|---|---|
| 1 | True | Input key completeness, input value completeness, and input format completeness were successful. |
| 0 | False | One or more of input key completeness, input value completeness, or input format completeness wasn’t successful. |
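
The record-level scores above can be modeled as small value types. The following Python sketch is an illustration under stated assumptions, not the product's data model. The names `TaskCompletion`, `ToolInputCheck`, and `score` are hypothetical; only the numeric values and their meanings come from the tables.

```python
from dataclasses import dataclass
from enum import IntEnum


class TaskCompletion(IntEnum):
    """Task completion scores, numbered as in the first table above."""
    UNSUCCESSFUL = 1
    PARTIALLY_SUCCESSFUL = 2
    SUCCESSFUL = 3


@dataclass(frozen=True)
class ToolInputCheck:
    """Hypothetical container for the three input checks in the last table."""
    key_complete: bool
    value_complete: bool
    format_complete: bool

    def score(self) -> int:
        """Return 1 (True) only if all three checks succeeded, else 0 (False)."""
        return int(
            self.key_complete and self.value_complete and self.format_complete
        )


if __name__ == "__main__":
    # A record scored 2 is "Partially successful".
    print(TaskCompletion(2).name)
    # One failed check is enough to make the tool input score 0 (False).
    print(ToolInputCheck(True, True, False).score())
```

The source does not state how partially successful records count toward the overall percentage, so this sketch leaves that aggregation out.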