Enabling evaluations

  • Release version: Yokohama
  • Updated September 2, 2025
  • 3 minutes to read

    Evaluate random conversations by enabling continuous monitoring.

    Role required: admin

    Enable evaluations and set the number of evaluations to be performed daily.

    Enable evaluations

    1. Activate Skills:
      1. Navigate to Admin > Now Assist Admin > Now Assist Skills > Platform.
      2. Turn on the following skills:
        • Intent Accuracy Chat Eval
        • Inadequate Slot Filling Chat Eval
        • Smoothness (Deadlock avoidance)
        • Context Retention
        • Coherence Chat Evaluation
        • Truthfulness Hallucination Chat Eval
        • Conciseness Chat Eval
        • Chat topic classifier
        Note:
        You can get the filtered list using the filter condition Conversation Evaluator under Features.
    2. Set the value for the system property sn_na_conv_eval.maxEvaluateCount.
      1. Navigate to All > System Properties > All Properties.
      2. Search for and select the property sn_na_conv_eval.maxEvaluateCount.
      3. Update the Value field to set the maximum number of conversations to be evaluated daily.

      4. Select Save.
    3. Activate the following associated scheduled jobs:
      1. Navigate to All > System Definition > Scheduled Jobs.
      2. Apply the filter condition Application is Conversation Evaluator, and exclude the job Evaluation Value Calcuation - Runs Only once after install from the results.

      3. Activate all the scheduled jobs in the filtered list.
    4. Activate the Execute Evaluation flow in Flow Designer.
      Note:

      By default, the Execute Evaluation flow is deactivated. You can use the nightly scheduled job, Execute Evaluations, to evaluate the chats instead. The nightly job doesn't take precedence over LLM calls from other services, whereas real-time evaluation through the Execute Evaluation flow might conflict with LLM calls from other applications.

      If you want to evaluate chats in real time when a chat completes, activate the Execute Evaluation flow. Domain separation is not supported.

      1. Navigate to All > Process Automation > Flow Designer and select Flows.
      2. Select the Execute Evaluation flow.
      3. Select Edit flow.
      4. Select Activate.

    Note:
    • If you want to configure some of the evaluation parameters based on your requirements, see Configuring evaluations.
    • If you want to import historical data to be evaluated, you must run batch evaluations by activating the Execute Batch Evaluation flow. For more information on the batch evaluation workflow, see Evaluation flow for batch evaluations.
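
    The property update in step 2 can also be scripted rather than done in the UI. The following is a minimal sketch, not an official procedure: it assumes that the sys_properties table is reachable through the REST Table API on your instance and that the calling account has the admin role. The instance URL and sys_id shown in the usage note are hypothetical placeholders, and authentication is deliberately left out. The sketch only builds the two requests (look up the property record by name, then patch its value) and does not send them.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Property name taken from step 2 of the procedure above.
PROPERTY_NAME = "sn_na_conv_eval.maxEvaluateCount"


def property_lookup_request(instance_url: str) -> Request:
    """Build a Table API GET that finds the property record by name.

    Assumption: sys_properties is exposed through /api/now/table on the instance.
    """
    query = urlencode({
        "sysparm_query": f"name={PROPERTY_NAME}",
        "sysparm_fields": "sys_id,name,value",
        "sysparm_limit": "1",
    })
    url = f"{instance_url.rstrip('/')}/api/now/table/sys_properties?{query}"
    return Request(url, headers={"Accept": "application/json"}, method="GET")


def property_update_request(instance_url: str, sys_id: str, max_count: int) -> Request:
    """Build a Table API PATCH that sets the daily evaluation limit.

    sys_id is the record identifier returned by the lookup request.
    """
    if max_count < 0:
        raise ValueError("max_count must be zero or a positive integer")
    # System property values are stored as strings.
    body = json.dumps({"value": str(max_count)}).encode("utf-8")
    url = f"{instance_url.rstrip('/')}/api/now/table/sys_properties/{sys_id}"
    headers = {"Accept": "application/json", "Content-Type": "application/json"}
    return Request(url, data=body, headers=headers, method="PATCH")
```

    For example, property_lookup_request("https://example.service-now.com") builds the lookup, and property_update_request("https://example.service-now.com", "<sys_id from the lookup>", 50) builds the update. To send either request, add an authorization header that matches your instance's authentication setup and pass the request to urllib.request.urlopen.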

    Evaluation dashboard vs. Conversation Insights

    You can use the Evaluation dashboard and the Conversation Insights (CI) application together to gain a complete picture of virtual agent effectiveness, from system performance to end-user satisfaction.

    For more information about Conversation Insights, see Conversation Insights.

    Metrics captured by the Evaluation dashboard

    The Evaluation dashboard provides granular diagnostic explanations that help improve virtual agent design, dialog flows, and model accuracy. It evaluates performance along dimensions critical to task success and trustworthiness. For example, it answers the question, "Is the system working properly and performing the expected task?"

    • Intent Recognition: Whether the virtual agent correctly interprets the user's primary goal.
    • Slot Filling: Accuracy of extracting the structured inputs required to complete tasks.
    • Conciseness: Avoidance of verbose or repetitive responses.
    • Coherence: Logical flow and organization of the virtual agent's responses.
    • Truthfulness: Whether responses are grounded in context or knowledge sources rather than hallucinated.
    • Context Retention: The virtual agent's ability to remember earlier user inputs in the same session.
    • Deadlock Avoidance: Detection of loops where the virtual agent gets stuck repeating questions or responses.
    • User Satisfaction (proxy): Correlation of critical failures (such as intent recognition or slot filling failures) with likely negative user perception.

    Metrics captured by Conversation Insights

    Conversation Insights focuses on measuring customer satisfaction and effort. It uses inferred customer satisfaction (CSAT) and supporting signals to show how end users perceive their interaction with the virtual agent. For example, it answers the question, "Is the end user happy with the virtual agent's performance?"

    • Inferred CSAT: Composite score (1–5) estimating overall satisfaction with the conversation.
    • Effort Score: Measures how much work the user had to do (low/medium/high), based on transfers, escalations, repeated data collection, or wait times.
    • Resolution: Tracks whether the user's issue was fully, partially, or not resolved.
    • Frustration: Detects explicit signs of user frustration such as rude language, sarcasm, or complaints about difficulty.
    • Confusion: Identifies misunderstandings between the user and virtual agent or the live agent.
    • Transfers and Escalations: Flags when a conversation is passed to another agent, team, or supervisor.
    • Empathy: Scores how polite, personable, and supportive the virtual agent or live agent was (low/medium/high).
    • Next Steps: Measures how clearly the virtual agent or live agent explained outcomes, instructions, or follow-up actions (low/medium/high).

    Together, the Evaluation dashboard and Conversation Insights provide complementary value for virtual agent implementations:
    • Conversation Insights offers a lightweight, cost-free view of the customer experience across all conversations.
    • The Evaluation dashboard delivers granular, task-focused diagnostics that enable targeted improvements to virtual agent design and performance.
    • Consolidated in AI Agent Analytics and AI Control Tower dashboards, these metrics give users complementary views into virtual agent system health and end-user satisfaction.