Evaluation metrics and calculations

Release version: Yokohama

Updated September 3, 2025

Summary of Evaluation metrics and calculations

This documentation details the metrics used to evaluate virtual agent conversations and explains how adjusted scores are calculated to align automated evaluations with human judgment. ServiceNow customers using virtual agents can leverage these metrics and calculations to accurately assess and improve virtual agent performance over time.

Metrics

The evaluation is based on several key metrics that measure different aspects of the virtual agent's conversational abilities. Each metric is rated on a 3- or 5-point scale, then scaled to 5 points. The metrics include:

Request Completion: Measures the agent’s ability to fulfill user requests by correctly identifying intent and gathering necessary information.
Intent Accuracy: Assesses how well the agent understands user requests to provide relevant responses.
Slot Filling: Evaluates the agent’s extraction of structured answers from user inputs.
Smooth Flowing Conversation (Deadlock avoidance): Checks if the agent progresses the conversation dynamically without repetition.
Context Retention: Measures the agent’s ability to remember and use information throughout the conversation.
Truthfulness (Hallucination Prevention): Ensures the agent provides genuine, fact-based responses without fabrications.
Conciseness (Redundancy Avoidance): Evaluates if responses avoid unnecessary verbosity or generic content.
Coherence: Checks for logical flow and clear structure in agent responses.
User Satisfaction: A weighted average of the above metrics reflecting overall conversation quality.

Users can filter evaluation trends by these metrics to pinpoint areas of strength or improvement.

Calculations for Adjusted Scores

To better reflect human judgment, the system calculates deviations between automated and human-labeled scores over the past six months and adjusts scores accordingly.

Upper Deviation: Calculated if over 30 cases exist where human scores exceed automated scores. The top 90% of differences are averaged to determine this deviation.
Lower Deviation: Calculated if over 30 cases exist where human scores are lower than automated scores, similarly averaged from the top 90% differences.

The Adjusted Score is computed by adding the error band (difference between average human and automated scores) to the auto-evaluation score when both deviations have sufficient data. If not, the adjusted score defaults to the automated score.

User Satisfaction Score Calculations

User satisfaction at the evaluation level is derived by combining metric scores weighted appropriately:

Auto Eval User Satisfaction Score: Weighted average of machine-generated metric scores per evaluation.
Human User Satisfaction Score: Calculated using human-labeled scores where available; otherwise, automated scores are used.
Gap: The difference between the human and automated satisfaction scores, which drives the deviation calculations.
When sufficient data (over 30 records) exists, positive gaps contribute to upper deviation adjustments and negative gaps to lower deviation adjustments.
The Adjusted User Satisfaction Score incorporates these deviations to better align with human evaluations.

Practical Implications for ServiceNow Customers

Understanding these metrics and adjusted scoring calculations enables customers to:

Monitor virtual agent performance across multiple dimensions of conversational quality.
Identify specific areas where the agent may be underperforming or excelling.
Utilize adjusted scores that reflect more accurate alignment with human judgment, improving confidence in evaluation results.
Leverage Performance Analytics indicators to track score trends over time for continuous improvement.

Note that batch processing historical data assigns evaluation scores to evaluation dates rather than chat dates, which is important when interpreting trends.

Metrics against which conversations are evaluated and calculation of adjusted scores.

Metrics

The Select metric list shows all the metrics against which each conversation is evaluated for the selected date range. You can filter the evaluation trend based on each metric. The following metrics are available:


Metric	Description
Request Completion	Measures the virtual agent's ability to complete user requests by accurately identifying the user's intent and gathering all required information (slot filling).
Intent Accuracy	Shows the virtual agent's ability to comprehend user requests, resulting in relevant responses.
Slot Filling	Shows the virtual agent's capability to interpret user responses and extract structured answers to the required questions.
Smooth Flowing Conversation (Deadlock avoidance)	Checks if the virtual agent responds dynamically, successfully moving the conversation forward without repetition.
Context Retention	Shows if the virtual agent succeeds in retaining and using information provided during the conversation, including request interpretation and slot filling.
Truthfulness (Hallucination Prevention)	Shows if the virtual agent generated genuine responses grounded in the conversation, excluding fabrication or memory and comprehension failures.
Conciseness (Redundancy Avoidance)	Checks the virtual agent's ability to avoid superfluous, verbose, or generic responses that don't contribute to the core intent of the conversation.
Coherence	Checks for clear logical flow, structure, and organization of the virtual agent's responses.
User Satisfaction	Weighted average of all the other metrics on which the conversation was evaluated.

Note:

Each metric is rated on a 3-point or 5-point scale, and the result is then scaled to a 5-point scale.
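
As a rough illustration of the scaling step, the sketch below rescales a raw rating onto the 5-point scale. The function name `scale_to_five` is hypothetical, and the proportional mapping (score × 5 / scale maximum) is an assumption: the documentation states only that scores are scaled up to 5, not which mapping is used.

```python
def scale_to_five(score: float, scale_max: int) -> float:
    """Rescale a rating from a 3- or 5-point scale onto a 5-point scale.

    Assumption: proportional scaling (score * 5 / scale_max). The source
    does not say which mapping the product actually applies.
    """
    if scale_max not in (3, 5):
        raise ValueError("scale_max must be 3 or 5")
    if not 0 <= score <= scale_max:
        raise ValueError("score is outside the rating scale")
    return score * 5 / scale_max
```

Under this assumption, a rating of 3 on a 3-point scale maps to 5, and a rating on a 5-point scale is returned unchanged.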

Calculations

Calculation of deviations and Adjusted Score:

To align the auto-evaluation scores with human judgment over time, a deviation is calculated and used to produce an adjusted score at the metric level.

Upper Deviation
Condition: If the number of human-labeled scores that are higher than the auto-evaluated scores in the last 6 months is more than 30.
Calculation: The top 90% of these cases are taken and the difference (delta) between the human score and the auto-evaluated score is averaged. This delta is the Upper Deviation.
Lower Deviation
Condition: If the number of human-labeled scores that are lower than the auto-evaluated scores in the last 6 months is more than 30.
Calculation: The top 90% of these cases are taken and the difference (delta) between the human score and the auto-evaluated score is averaged. This delta is the Lower Deviation.
Adjusted Score
The final Adjusted Score is calculated based on the availability of the deviations.
- If both the upper and lower deviations have at least 30 distinct labeled evaluations for a given metric, the error band is calculated as SUM(Avg labeling score – LLM score) / Distinct evaluations. This error band is added to the Auto-Eval score to get the Adjusted Score.
- If neither deviation is available, then Adjusted Score = Auto-Eval Score.
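
The metric-level rules above can be sketched in Python as follows. This is a minimal illustration, not the product's implementation. The function names, the `(human_score, auto_score)` pair layout, and the reading of "top 90%" as the 90% of cases with the largest absolute delta are all assumptions. The text also does not say what happens when only one deviation is available; the sketch falls back to the auto-evaluation score in that case.

```python
from statistics import mean

MIN_CASES = 30      # from the text: more than 30 qualifying cases are required
TOP_FRACTION = 0.9  # from the text: the "top 90%" of cases are averaged


def deviation(pairs, direction):
    """Average delta (human - auto) over the top 90% of qualifying cases.

    pairs: list of (human_score, auto_score) from the last 6 months,
           one pair per distinct evaluation (assumed layout).
    direction: "upper" keeps human > auto, "lower" keeps human < auto.
    Returns None when there are not more than MIN_CASES qualifying cases.
    Assumption: "top 90%" means the cases with the largest |delta|.
    """
    deltas = [human - auto for human, auto in pairs]
    if direction == "upper":
        kept = [d for d in deltas if d > 0]
    elif direction == "lower":
        kept = [d for d in deltas if d < 0]
    else:
        raise ValueError("direction must be 'upper' or 'lower'")
    if len(kept) <= MIN_CASES:
        return None
    kept.sort(key=abs, reverse=True)
    return mean(kept[: int(len(kept) * TOP_FRACTION)])


def adjusted_score(auto_score, pairs):
    """Add the error band to the auto-eval score when both deviations exist."""
    upper = deviation(pairs, "upper")
    lower = deviation(pairs, "lower")
    if upper is None or lower is None:
        # Assumption: fall back to the auto-eval score unless both exist.
        return auto_score
    # Error band: SUM(labeling score - LLM score) / distinct evaluations.
    error_band = sum(human - auto for human, auto in pairs) / len(pairs)
    return auto_score + error_band
```

For example, with 40 cases where the human score is 4.0 against an auto score of 3.0, and 40 cases where the human score is 2.5 against an auto score of 3.0, the error band is 0.25, so `adjusted_score(3.0, pairs)` returns 3.25.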

Calculation of Auto eval user satisfaction score, Human user satisfaction score, and Upper and Lower deviation at the evaluation level:

Auto eval user satisfaction score: For a given evaluation, get all the scores for each metric that are LLM generated and calculate SUM(metric score * metric weight)/SUM(metric weights).
Human user satisfaction score: A given evaluation is included in this calculation if at least one of its metrics is labeled. For each metric, the labeling score is used if the metric is labeled; otherwise, the LLM score is used. The score is calculated as SUM(metric score * metric weight)/SUM(metric weights).
Gap: Gap is calculated as (Human user satisfaction score – Auto eval satisfaction score).
Upper Deviation: If the Gap is positive and there are more than 30 records, the error band is calculated over the top 90% of records as SUM(Positive Gap) / Distinct evaluations. This error band is added to the Auto eval user satisfaction score.
Lower Deviation: If the Gap is negative and there are more than 30 records, the error band is calculated over the top 90% of records as SUM(Negative Gap) / Distinct evaluations. This error band is added to the Auto eval user satisfaction score.
Adjusted user satisfaction score is calculated as SUM(Gap) / Distinct evaluations.
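
The weighted-average and Gap formulas above can be illustrated with a short Python sketch. The function names, the dictionary layout, and the metric weights in the usage note are hypothetical; the documentation does not publish the actual weights.

```python
def weighted_satisfaction(scores, weights):
    """SUM(metric score * metric weight) / SUM(metric weights)."""
    total_weight = sum(weights[metric] for metric in scores)
    if total_weight == 0:
        raise ValueError("metric weights must not sum to zero")
    weighted_sum = sum(scores[metric] * weights[metric] for metric in scores)
    return weighted_sum / total_weight


def satisfaction_gap(llm_scores, human_labels, weights):
    """Return (auto, human, gap) for one evaluation.

    llm_scores:   {metric: LLM-generated score}
    human_labels: {metric: labeling score} for the labeled metrics only
    human and gap are None when no metric is labeled.
    """
    auto = weighted_satisfaction(llm_scores, weights)
    if not human_labels:
        return auto, None, None
    # Labeled metrics use the labeling score; the rest keep the LLM score.
    merged = {**llm_scores, **human_labels}
    human = weighted_satisfaction(merged, weights)
    return auto, human, human - auto
```

With hypothetical weights `{"intent_accuracy": 2.0, "coherence": 1.0, "conciseness": 1.0}`, LLM scores of 4.0, 3.0, and 5.0 for those metrics, and a single human label of 5.0 for coherence, the auto score is 4.0, the human score is 4.5, and the Gap is 0.5.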

Note:

The evaluator provides one aggregated score per chat, even if the user makes multiple different requests in that chat.
Performance Analytics indicators are used to calculate the average score over time. If you run batch jobs on historical data, the resulting evaluations are counted in the aggregated scores for the evaluation date, not the actual chat date, because of how Performance Analytics indicators are defined.