Set up an automated evaluation

  • Release version: Australia
  • Updated Mar 12, 2026
  • 6 min read

  To set up an automated evaluation, select Set up automated evaluation or Create evaluation from the Evaluations tab.

    Before you begin

    Role required: virtual_agent_admin or admin

    Procedure

    1. Add basic information.
      • Name: Enter a unique name to identify your evaluation.
      • Description: Add a brief description to explain the purpose of your evaluation, to help you identify the evaluation later.
      • Select assistant to evaluate: Select the assistant you want to evaluate from the list.

    2. Provide evaluation details.
      Select the table that provides the inputs for the automatically generated conversation scenarios.
      1. Select a table with evaluation inputs.
        • Table: Search for and select your table.
        • Number of rows to use: Specify the number of rows to use. You can select up to 500 rows per evaluation run.
      2. Map the table columns.
        Map your table columns to the following required fields so the automated evaluation knows which data to use:
        • User utterance: The user's question or comment.
        • Scenario context: Details that will support the assistant in its response.
        • Ground truth: task completion: The correct answer or result to check if the task was successfully completed.
        • Ground truth: skill discovery: The expected assets or skills needed to display an ideal response or outcome.

        Select Apply mappings to generate the scenarios.

      3. Review scenarios for evaluation.

        Review the scenarios to confirm they include the information you expect.
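As a rough illustration of the table selection and mapping steps above, the sketch below models table rows as dictionaries and renames their columns to the four evaluation fields. The column names (`u_question` and so on), the snake_case field keys, and the `map_row` and `build_scenarios` helpers are hypothetical. Only the four field names and the 500-row limit come from this page, and this is not the platform's API.

```python
# Hypothetical sketch: how table columns might map to the evaluation fields.
MAX_ROWS = 500  # per-run row limit stated in this procedure

# Assumed column names on the left; evaluation fields from this page on the right.
COLUMN_MAPPING = {
    "u_question": "user_utterance",
    "u_context": "scenario_context",
    "u_expected_result": "ground_truth_task_completion",
    "u_expected_skill": "ground_truth_skill_discovery",
}


def map_row(row: dict) -> dict:
    """Rename the mapped columns of one table row; unmapped columns are dropped."""
    return {
        field: row[column]
        for column, field in COLUMN_MAPPING.items()
        if column in row
    }


def build_scenarios(rows: list, limit: int = MAX_ROWS) -> list:
    """Map at most `limit` rows (never more than MAX_ROWS) into scenario inputs."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [map_row(row) for row in rows[: min(limit, MAX_ROWS)]]
```

Reviewing the output of `build_scenarios` for a handful of rows is a quick way to confirm that each scenario carries the utterance, context, and ground truth values you expect.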

    3. Choose evaluation metrics that define success.
      Select the metrics that define success for your evaluation. The metrics are organized into 3 categories: Quality, Accuracy, and Performance. You must select at least one metric to continue.
      • Conversation success (requires ground truth): Measures whether your assistant understands user requests and delivers the correct outcome. It checks whether the assistant completes each user request from start to finish, understanding the intent and delivering the correct response or outcome. Ground truth, a provided correct answer or intended outcome that indicates the assistant performed successfully, is recommended for more accurate results.

        The evaluation scores each user intent in a conversation to verify successful completion. If a user makes multiple requests in one conversation, each request is evaluated individually. Scores include:
        • Success: The assistant identified the correct intent and delivered the required outcome (score 1).
        • Failure: The assistant did not complete the task, or missing information led to an incorrect outcome (score 0).
        The output of this metric is success or failure (1/0) per conversation. It can be measured with or without reference data. The evaluation examines each user intent in a conversation and checks whether it is supported by the available context.
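One way to picture how per-intent scores could roll up into a per-conversation result is sketched below. This page does not state the roll-up rule, so the sketch assumes that a conversation succeeds only when every intent in it scored 1. The `conversation_success` function is hypothetical.

```python
def conversation_success(intent_scores: list) -> int:
    """Return 1 if every intent in the conversation scored 1, otherwise 0.

    Assumption: an all-intents-must-succeed roll-up. The page only says that
    each intent is scored individually and the output is 1/0 per conversation.
    """
    if not intent_scores:
        raise ValueError("a conversation needs at least one scored intent")
    if any(score not in (0, 1) for score in intent_scores):
        raise ValueError("intent scores must be 0 or 1")
    return int(all(score == 1 for score in intent_scores))
```

For example, `conversation_success([1, 1])` gives 1, while a single failed intent, as in `conversation_success([1, 0])`, gives 0.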
      • Faithfulness: The faithfulness score shows the proportion of claims that are grounded in the context. A higher score means more, or all, of the claims are supported, indicating that the assistant is not inventing unsupported information (hallucinating).

        Faithfulness score measures whether your assistant's responses stay grounded in the information it has access to, without making things up. Faithfulness checks if everything your assistant says comes from its available content: retrieved knowledge, conversation history, or data sources like ticket records. A faithful response only includes information supported by these sources. An unfaithful response adds claims or details that weren't in the provided context, even if they're actually true.

        The evaluation examines each response from your assistant and checks whether its claims are supported by the available context. Higher scores mean more, or all, of the claims are supported. Lower scores indicate that the assistant invented unsupported information (hallucinated). The faithfulness score is reported as a percentage (0-100%) or as a binary value (faithful/unfaithful).
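The proportion described above can be sketched as a simple ratio. The example assumes each claim in a response has already been judged as supported or unsupported. How the platform extracts and judges claims is not described on this page, and the binary rule (faithful only when every claim is supported) is an assumption. Both function names are hypothetical.

```python
def faithfulness_score(claims_supported: list) -> float:
    """Percentage (0-100) of claims that are supported by the context."""
    if not claims_supported:
        raise ValueError("no claims to score")
    supported = sum(1 for is_supported in claims_supported if is_supported)
    return 100.0 * supported / len(claims_supported)


def is_faithful(claims_supported: list) -> bool:
    """Binary view. Assumption: faithful only if every claim is supported."""
    return faithfulness_score(claims_supported) == 100.0
```

With three of four claims supported, `faithfulness_score([True, True, False, True])` evaluates to 75.0, and `is_faithful` returns `False` for the same input.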

      • Conversation fluency: Measures whether your assistant's responses are clear, natural, and grammatically correct.

        Conversation fluency checks if your assistant's overall output is easy to understand. It looks at whether responses read smoothly, use correct grammar, and feel like real conversation instead of robotic or confusing text. This helps you identify responses that might frustrate users because of awkward wording, repetition, or unclear language.

        Each response gets scored on a 3-point scale. Scores include:

        • Score of 3 (sounds great): Easy to read and understand, feels like natural conversation, right amount of detail, no grammar issues.
        • Score of 2 (minor issues): Slightly awkward phrasing, a bit robotic, small grammar hiccups that don't block understanding.
        • Score of 1 (hard to follow): Confusing or unclear, major grammar problems, too much repetition, doesn't make sense.

        You can check conversation fluency for single responses or look at patterns across whole conversations. Each response gets a numerical score from 1 to 3. Results can be shown per response or aggregated for entire conversations.
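As a minimal sketch of the aggregation mentioned above, the example below averages per-response fluency scores into one conversation-level value. Using the arithmetic mean is an assumption, since this page does not say how scores are aggregated. The `aggregate_fluency` function is hypothetical.

```python
from statistics import mean


def aggregate_fluency(response_scores: list) -> float:
    """Average the per-response fluency scores (each 1, 2, or 3).

    Assumption: conversation-level fluency is the arithmetic mean.
    """
    if not response_scores:
        raise ValueError("no responses to score")
    if any(score not in (1, 2, 3) for score in response_scores):
        raise ValueError("fluency scores must be 1, 2, or 3")
    return float(mean(response_scores))
```

A conversation scored `[3, 2, 3, 1]` averages to 2.25, which hides the single hard-to-follow response, so the per-response view is still worth checking.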

      • Skill selection accuracy (requires ground truth): Measures whether your assistant selects the correct assistant skill, AI agent, or QnA module to handle each user's request.

        Skill selection accuracy tracks routing decisions: it verifies that your assistant calls on the right assistant skill, AI agent, or QnA module from your available options, based on what the user asked for. This includes correctly making no selection when a question doesn't match any available option. The metric requires ground truth, which in this case is the predefined skill that the assistant is expected to select.

        The evaluation examines each user utterance to verify that your assistant made the correct selection. It looks at:
        • Initial selections: When users first make a request, did the assistant choose the right option?
        • Mid-conversation switches: When users change topics, did the assistant recognize this and switch to the appropriate skill, agent, or module?

        Accuracy is calculated as follows:
        • With ground truth data: Accuracy = correct selections divided by total utterances evaluated.
        • Without ground truth data: The system compares selections against available options using relevancy as a measure of correctness.

        Ground truth is reference data that shows the correct response (or outcome) for each utterance, letting you measure accuracy against source data.

        The result is an accuracy percentage (0-100%) and can be measured with or without ground truth.
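The ground-truth formula above (correct selections divided by total utterances evaluated) can be sketched as follows. The skill names are made up for illustration, `None` stands for "no selection", and the `skill_selection_accuracy` function is hypothetical rather than part of the product.

```python
from typing import List, Optional


def skill_selection_accuracy(
    selected: List[Optional[str]], expected: List[Optional[str]]
) -> float:
    """Accuracy percentage: correct selections / total utterances evaluated.

    None represents "no selection", so correctly selecting nothing for an
    utterance that matches no option counts as a correct selection.
    """
    if len(selected) != len(expected):
        raise ValueError("selected and expected must have the same length")
    if not expected:
        raise ValueError("no utterances to evaluate")
    correct = sum(1 for got, want in zip(selected, expected) if got == want)
    return 100.0 * correct / len(expected)
```

For four utterances where the assistant picks the wrong skill once, for example `["Password reset", "Order laptop", None, "HR QnA"]` against an expected `["Password reset", "VPN access", None, "HR QnA"]`, the accuracy is 75.0.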

      • Latency: Measures how long users wait for your assistant to respond.

        Latency tracks the time between when the user submits their question and when they get a response. This helps you understand if your assistant feels responsive or if delays are impacting user experience.

        The evaluation examines two types of latency:

        • First byte latency (perceived latency): Time from when the user submits their question to when the assistant starts responding.

        • End-to-end latency: Total time from when the user submits their question to when the complete response is delivered.

        Latency is measured in seconds for each response your assistant provides. You can track it at two levels:
        • Individual response (turn level): How long each single response takes.

        • Full conversation: Total latency accumulated across all responses in one conversation.

        This helps you spot whether delays happen in specific responses or build over longer interactions.

        Latency is reported as time in seconds, either per response or aggregated across conversations.
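The two latency types and the two reporting levels described above can be sketched from three timestamps per turn. The `Turn` record and its field names are hypothetical, and summing end-to-end latency is one reading of "total latency accumulated across all responses".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """Hypothetical timestamps for one assistant response, in seconds."""
    submitted_at: float   # user submits the question
    first_byte_at: float  # assistant starts responding
    completed_at: float   # complete response is delivered


def first_byte_latency(turn: Turn) -> float:
    """Perceived latency: submission until the response starts."""
    return turn.first_byte_at - turn.submitted_at


def end_to_end_latency(turn: Turn) -> float:
    """Total latency: submission until the complete response is delivered."""
    return turn.completed_at - turn.submitted_at


def conversation_latency(turns: List[Turn]) -> float:
    """Conversation level: end-to-end latency summed over all turns."""
    return sum(end_to_end_latency(turn) for turn in turns)
```

Comparing the turn-level values against the conversation total shows whether one slow response dominates or the delay builds gradually across the interaction.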

      • Turn count: Measures how quickly your assistant resolves user requests compared to the ideal number of conversation turns.

        Turn count compares the minimum number of conversation turns needed to resolve a request against how many turns your assistant actually took. It is ideal for:

        • Question-and-answer interactions.
        • Structured task flows (like booking appointments or submitting requests).

        • Scenarios where there's a clear goal and defined path.
        Note: The metric isn't designed for open-ended conversations where there's no single "right" number of turns.

        The evaluation establishes a baseline for the ideal number of turns needed, then compares it to how many turns actually occurred. This helps you identify where your assistant might be:
        • Asking redundant questions.

        • Failing to capture information efficiently.

        • Requiring users to repeat themselves.

        • Taking indirect paths to the solution.

        The result is reported as the total number of turns.
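The comparison between the ideal baseline and the actual turn count can be sketched as below. This page reports only the total number of turns, so both derived values (extra turns and an efficiency ratio) are illustrative assumptions, and the function names are hypothetical.

```python
def extra_turns(actual_turns: int, ideal_turns: int) -> int:
    """Turns taken beyond the ideal baseline (never negative)."""
    if actual_turns < 1 or ideal_turns < 1:
        raise ValueError("turn counts must be at least 1")
    return max(actual_turns - ideal_turns, 0)


def turn_efficiency(actual_turns: int, ideal_turns: int) -> float:
    """Assumed ratio of ideal to actual turns, capped at 1.0."""
    if actual_turns < 1 or ideal_turns < 1:
        raise ValueError("turn counts must be at least 1")
    return min(ideal_turns / actual_turns, 1.0)
```

A request with an ideal baseline of 4 turns that took 8 has 4 extra turns and an efficiency of 0.5, which points to redundant questions or an indirect path to the solution.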

    4. Review the automated evaluation setup.
      Review your selections before starting the evaluation.

    5. Select Start evaluation to begin the evaluation.

    Result

    After an evaluation completes, you can view the results from the Testing tab. To view the results of a completed evaluation, select the evaluation name from the list. The evaluation detail page displays the scores and reasoning for each metric you selected.

    Review the insights to identify patterns in conversation quality and areas where your assistant performs well or needs improvement. The conversation transcripts and metric explanations help you understand the context behind each score.

    Use these insights to refine your virtual agent topics, improve AI agent configurations, or adjust your knowledge base content. Regular evaluation helps you maintain and improve the quality of your conversational assistant over time.