Now Assist Guardian

Yokohama Enable AI

Release version: Yokohama

Updated July 31, 2025

Now Assist Guardian is built on the ServiceNow Small Language Model (SLM) and monitors generative AI interactions to detect offensive content, prompt injection attacks, and sensitive topics.

Now Assist Guardian Overview

Generative AI is an emerging technology. Human interactions are unpredictable, and outputs generated by large language models (LLMs) are probabilistic: running the same input twice may generate two different outputs. Managing this risk is an important consideration when implementing generative AI on your instance. Now Assist Guardian evaluates requests sent to LLMs and their responses in real time to reduce that risk.

Guardrails

Now Assist Guardian provides three guardrails. Each guardrail has a different scope of applicability:


Guardrail	What it detects	Scope
Offensiveness detection	Offensive or harmful content in AI inputs and outputs	Specific Now Assist skills and workflows
Prompt injection detection	Attempts to override LLM instructions or expose restricted information	All generative AI applications and features
Sensitive topic filters	Subjects not suited for AI responses, such as workplace safety or employee compensation	Virtual Agent conversational skills only (requires HR Service Delivery)

Note:

The scope of each guardrail differs. Prompt injection detection applies to all generative AI applications and features. Offensiveness detection applies only to supported Now Assist skills and workflows. Sensitive topic filters apply only to Virtual Agent conversations and require HR Service Delivery.

Offensive content: Due to the probabilistic nature of generative AI, it's possible for an LLM to generate offensive content. If there's offensive content in the input of the request, offensive content can also occur in the response. Examples of offensive content include language that is toxic, defamatory, or fraudulent.
When offensive content is detected, Now Assist Guardian logs the event. You can also configure it to block the content. This guardrail applies to specific Now Assist skills and workflows.
Prompt injection: Prompt injection is a type of security attack where someone tries to override the normal instructions of an LLM to access restricted information or cause unintended behaviors. Now Assist Guardian detects prompt injection attempts by using an LLM trained on various types of prompt injection techniques, such as role playing, paraphrasing, repetition, instructions to ignore other instructions, and persuasion.
Note:
Due to the probabilistic nature of the model and evolving attack techniques, Now Assist Guardian may not identify every prompt injection attempt.
Prompt injection protection applies to all generative AI applications and features on your instance. It is not limited to specific skills or workflows.
Filtered subjects: Certain subjects, such as workplace safety, employee compensation, or personal well-being, may not be best suited for generative AI responses. You can activate filters that detect these subjects in Virtual Agent conversations and redirect users to the Sensitivity Detection: Fallback Virtual Agent topic instead of generating an AI response.
Note:
Sensitive topic filters are available only with HR Service Delivery and apply only to Virtual Agent conversational skills.

Logging and blocking

Now Assist Guardian logs detected events for offensiveness and prompt injection. You can access logs from Now Assist Admin > Settings > Now Assist Guardian. Log data includes information about the request, the conversation that contains the offensive content, and any user feedback.

In addition to logging, you can configure Now Assist Guardian to block offensive content or prompt injection attempts. When blocking is enabled and content is detected, you see a standard error message instead of the generated response. The message states that the request couldn’t be completed, and the AI-generated output isn’t shown. Before enabling blocking, review logs for a period of time to understand how frequently these issues occur in your environment.
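The log-always, block-optional behavior can be modeled as a small decision function. This is a minimal sketch for illustration, not the ServiceNow implementation: `GuardrailLog`, `handle_generated_response`, and the `STANDARD_ERROR` text are hypothetical names and values.

```python
from dataclasses import dataclass, field

# Hypothetical placeholder; the message text shown on an instance may differ.
STANDARD_ERROR = "The request couldn't be completed."


@dataclass
class GuardrailLog:
    """Hypothetical in-memory stand-in for the Now Assist Guardian log."""
    entries: list = field(default_factory=list)

    def record(self, guardrail: str, request_id: str) -> None:
        self.entries.append({"guardrail": guardrail, "request_id": request_id})


def handle_generated_response(
    request_id: str,
    generated_text: str,
    detected_guardrails: list[str],
    blocking_enabled: bool,
    log: GuardrailLog,
) -> str:
    """Return what the user sees for one response.

    Every detection is logged. Only when blocking is enabled does the user
    see the standard error message instead of the generated text.
    """
    for guardrail in detected_guardrails:
        log.record(guardrail, request_id)
    if detected_guardrails and blocking_enabled:
        return STANDARD_ERROR
    return generated_text
```

Running with `blocking_enabled=False` first mirrors the advice to review logs before turning on blocking: detections accumulate in the log while users still see the generated response.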

Redirection for sensitive filtered topics

After a filter detects a sensitive topic, Now Assist Guardian redirects you to the Sensitivity Detection: Fallback topic in Virtual Agent. This topic can redirect you to a live agent or help you create an HR case.

You can override the redirection by selecting Proceed, not sensitive. This returns you to your original topic without initiating the fallback flow.

Note:

After you continue with the fallback topic, for example, by starting the flow to create an HR case, Virtual Agent does not continue detecting sensitive topics in that conversation.
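The redirection rules can be summarized as a per-conversation state model. This is an illustrative sketch with hypothetical names (`Conversation`, `on_user_message`, `choose_proceed_not_sensitive`, `start_fallback_flow`). It assumes that selecting Proceed, not sensitive leaves detection active for later messages, which this page doesn't state.

```python
from dataclasses import dataclass
from typing import Callable, Optional

FALLBACK_TOPIC = "Sensitivity Detection: Fallback"


@dataclass
class Conversation:
    current_topic: str
    detection_active: bool = True
    # Topic to return to if the user overrides the redirection.
    previous_topic: Optional[str] = None

    def on_user_message(self, text: str, is_sensitive: Callable[[str], bool]) -> str:
        """Redirect to the fallback topic when detection is active and a filter matches."""
        if self.detection_active and is_sensitive(text):
            self.previous_topic = self.current_topic
            self.current_topic = FALLBACK_TOPIC
        return self.current_topic

    def choose_proceed_not_sensitive(self) -> str:
        """User override: return to the original topic without the fallback flow."""
        if self.current_topic == FALLBACK_TOPIC and self.previous_topic is not None:
            self.current_topic = self.previous_topic
            self.previous_topic = None
        return self.current_topic

    def start_fallback_flow(self) -> None:
        """User continues with the fallback, for example by creating an HR case.

        Sensitive topic detection then stops for the rest of the conversation.
        """
        self.detection_active = False
```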

Now Assist Guardian at runtime

All skills that use Now Assist Guardian remove personally identifiable information (PII) before the request reaches the LLM. You can configure what type of data is anonymized. For more information, see Configuring Now Assist for Data Privacy.

For conversational skills, semantic search processes each request to determine whether it matches an active sensitive topic filter. If it does, the user is redirected to a Virtual Agent topic that asks whether they want to create an HR case or speak to a live agent.
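This page doesn't describe how semantic search matches a request to a filter. As a rough illustration only, the sketch below swaps in bag-of-words cosine similarity; production semantic search generally compares learned embeddings instead. `FILTER_EXAMPLES`, `THRESHOLD`, and `detect_filter` are hypothetical, and the example phrases and cutoff are invented for the demo.

```python
import math
import re
from collections import Counter
from typing import Optional

# Hypothetical example phrases per filter; not ServiceNow configuration.
FILTER_EXAMPLES = {
    "employee compensation": ["what is my salary", "how is my bonus calculated"],
    "workplace safety": ["i was injured at work", "report an unsafe workplace"],
}
THRESHOLD = 0.5  # Hypothetical similarity cutoff.


def _vector(text: str) -> Counter:
    """Bag-of-words term counts (a stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def detect_filter(request: str) -> Optional[str]:
    """Return the best-matching filter name, or None if no match reaches THRESHOLD."""
    query = _vector(request)
    best_name, best_score = None, 0.0
    for name, examples in FILTER_EXAMPLES.items():
        for example in examples:
            score = _cosine(query, _vector(example))
            if score > best_score:
                best_name, best_score = name, score
    return best_name if best_score >= THRESHOLD else None
```

A non-`None` result corresponds to the redirect case; `None` lets the conversational skill respond normally.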

Infographic showing Now Assist Guardian at runtime with sensitivity filter guardrail

For catalog item generation and agent skills, such as summarization and resolution note generation, offensiveness and prompt injection guardrails run on inputs and outputs of requests. If either is detected, Now Assist Guardian logs the request. If you’ve chosen to block this content, then a standard error message appears, and the user doesn’t see the result of the request.
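The runtime order for agent skills can be sketched as follows: PII is anonymized, the LLM is called, and the offensiveness and prompt injection guardrails check both the input and the output. This is a minimal sketch under stated assumptions: `redact_pii`, `detect`, and `call_llm` are hypothetical callables standing in for the data privacy configuration, the Guardian models, and the LLM service. The sketch also assumes the input check runs on the anonymized request and always calls the LLM; neither detail is specified on this page.

```python
from typing import Callable

STANDARD_ERROR = "The request couldn't be completed."  # Hypothetical text.
GUARDRAILS = ("offensiveness", "prompt_injection")


def run_skill(
    request: str,
    redact_pii: Callable[[str], str],
    detect: Callable[[str, str], bool],
    call_llm: Callable[[str], str],
    blocking_enabled: bool,
    log: list,
) -> str:
    """Hypothetical runtime flow for a summarization-style skill."""
    safe_request = redact_pii(request)  # PII is removed before the LLM sees it.
    response = call_llm(safe_request)

    # Run every guardrail on both the input and the output.
    detections = [
        (stage, guardrail)
        for stage, text in (("input", safe_request), ("output", response))
        for guardrail in GUARDRAILS
        if detect(guardrail, text)
    ]
    log.extend(detections)  # Every detection is logged.
    if detections and blocking_enabled:
        return STANDARD_ERROR
    return response
```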

Infographic showing Now Assist Guardian at runtime with offensiveness and security guardrails

Safety categories captured in Now Assist Guardian logs

The following categories describe content that can be flagged as offensive or as a security risk. The Now Assist Guardian logs show which categories are being caught by each guardrail.
Category	Category ID	Description
Toxic Content	O1	Should not: Promote hate speech, incite violence like child abuse or bullying or engage in personal attacks or harassment. Can: Discuss the impacts and societal issues related to toxic content for educational purposes.
Unfair Representation	O2	Should not: Assist in perpetuating stereotypes or encourage discrimination or bias. Can: Discuss issues of representation and promote awareness and inclusivity.
Adult Content	O3	Should not: Display sexually explicit material or content inappropriate for minors including pornography, erotic chat and sexual activity. Can: Offer educational discussions on topics related to sexuality and consent.
Erosion of Trust in Public Information	O4	Should not: Spread misinformation or disinformation undermining public trust such as misleading medical advice, illegal drugs and advice. Can: Critically analyze misinformation trends and educate about media literacy.
Propagating Misconceptions/False Beliefs	O5	Should not: Disseminate unverified, misleading information, conspiracy theories, science denialism, historical revisionism, or content that could lead to radicalization. Can: Provide critical analysis and debunking of common misconceptions and false beliefs.
Risky Financial Practices	O6	Should not: Promote unethical trade practices such as multi-level marketing, deceptive lending practices, or gambling. Can: Educate about associated risks, ethical lending, responsible gambling, trade compliance, and fair business practices.
Trade and Compliance	O7	Should not: Involve labor issues, engage in unethical trade affairs, contribute to unfair labor practices or promote fraudulent trade practices. Can: Provide information and insights into labor issues, ethical trade practices, trade compliance, and fair business conduct.
Dissemination of Dangerous Information	O8	Should not: Spread fake news, propagate social media rumors, disclose government secrets, or leak financial data and other content that contributes to public panic or threatens national security. Can: Educate about the dangers of such information and promote safety.
Privacy Infringement	O9	Should not: Involve doxxing, unauthorized data sharing, or surveillance like private information inferring or leaking. Can: Discuss the importance of privacy rights and digital security.
Security Threats	O10	Should not: Spread malware or encourage terrorism or other security threats such as weapon generation, cyberattack or chemical harms. Can: Provide information and methods to identify and mitigate security threats.
Defamation	O11	Should not: Publish false statements damaging to individuals, groups or organizations. Can: Explore the legal and ethical aspects of defamation and free speech.
Fraud or Deceptive Action	O12	Should not: Engage in academic dishonesty, perpetrate scams, promote pseudo-pharmaceuticals, or disseminate fake online content. Content should not deceive or defraud individuals. Can: Educate about recognizing and preventing fraudulent activities.
Influence Operations	O13	Should not: Manipulate public opinion or disrupt societal harmony for political or ideological purposes, such as through surveillance or campaigns. Can: Analyze and educate about the nature and impact of influence operations.
Illegal Activities	O14	Should not: Promote or provide guidance on engaging in illegal actions, such as crimes or terrorism. Can: Educate about the consequences and prevention of illegal activities.
Persuasion and Manipulation	O15	Should not: Involve persuasion or manipulation techniques that promote self-harm, psychological manipulation, or any form of harm to individuals' well-being. Can: Educate about recognizing and resisting manipulation, offer support and resources for those affected.
Violation of Personal Property	O16	Should not: Engage in violating an individual's personal property, malicious mimicking, plagiarism, copyright infringement, or identity forgery. Can: Promote awareness and discussions on respecting personal property and preventing such violations.
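When reading logs, it can help to translate category IDs into names. The mapping copies the table above. The function name `describe_categories` is hypothetical, and so is the assumption that a log entry exposes its categories as a list of ID strings such as `["O1", "O11"]`; this page doesn't describe the log field format.

```python
# Category IDs and names from the safety category table.
SAFETY_CATEGORIES = {
    "O1": "Toxic Content",
    "O2": "Unfair Representation",
    "O3": "Adult Content",
    "O4": "Erosion of Trust in Public Information",
    "O5": "Propagating Misconceptions/False Beliefs",
    "O6": "Risky Financial Practices",
    "O7": "Trade and Compliance",
    "O8": "Dissemination of Dangerous Information",
    "O9": "Privacy Infringement",
    "O10": "Security Threats",
    "O11": "Defamation",
    "O12": "Fraud or Deceptive Action",
    "O13": "Influence Operations",
    "O14": "Illegal Activities",
    "O15": "Persuasion and Manipulation",
    "O16": "Violation of Personal Property",
}


def describe_categories(category_ids: list[str]) -> list[str]:
    """Translate logged category IDs to names; unknown IDs are returned unchanged."""
    return [SAFETY_CATEGORIES.get(cid, cid) for cid in category_ids]
```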

Skills that support offensiveness detection

Table 1. Supported skills by workflow
Workflow	Application	Supported skills
Technology	Now Assist for Configuration Management Database (CMDB)	Configuration item (CI) summarization, Manage duplicate CIs, Service Graph Connector diagnosis
Technology	Now Assist for IT Operations Management (ITOM)	Alert analysis, Alert investigation
Technology	Now Assist for IT Service Management (ITSM)	Change request risk explanation, Change request summarization, Chat reply recommendation, Chat summarization, Incident assist, Incident summarization, KB generation, Resolution notes generation, Sidebar discussion summarization
Technology	Now Assist for Security Incident Response	Post-incident analysis, Resolution notes generation, Security incident recommended actions, Security incident summarization
Technology	Now Assist for Strategic Portfolio Management (SPM)	Multi feedback summarization, Planning item doc summarization, Project doc summarization, Project summary emails, Story generation, Write planning item
Customer	Now Assist for Customer Service Management (CSM)	Case summarization, Chat recommendation, Chat summarization, Email recommendation, KB generation, Resolution notes generation, Sidebar summarization
Customer	Now Assist for Field Service Management (FSM)	KB generation, Sidebar summarization, Work order task summarization
Customer	Now Assist for Financial Services Operations (FSO)	Case summarization, Disputes intake via Virtual Agent
Customer	Now Assist for PSDS	Government case summarization, Chat summarization
Employee	Now Assist for Health and Safety	Incident summarization
Employee	Now Assist for HR Service Delivery (HRSD)	Case summarization, Chat summarization, KB generation, Resolution notes generation
Employee	Now Assist for Legal Service Delivery (LSD)	Legal request summarization
Employee	Now Assist in Contract Management	Contract analysis, Contract metadata extraction
Creator	Now Assist for Creator	Catalog item generation
Finance & Supply Chain	Now Assist for Accounts Payable Operations (APO)	Record summarization
Finance & Supply Chain	Now Assist for Supplier Lifecycle Operations (SLO)	Supplier case summarization
Finance & Supply Chain	Now Assist for Sourcing and Procurement Operations (SPO)	Record summarization

Now Assist Guardian analytics

Monitor the performance of guardrails enabled through Now Assist Guardian.

The Now Assist Guardian analytics dashboard helps admins monitor and evaluate how effectively the offensive content and prompt injection guardrails track and analyze requests sent to large language models (LLMs) and their responses.

Figure 1. Now Assist Guardian dashboard page

The indicators on the Now Assist Guardian dashboard page provide the following insights.

Average latency as a result of active offensive content and prompt injection guardrails. High latency could mean increased guardrail activity in the period.
Count and percentage of offensive content and prompt injection occurrences.
Skills where offensive content and prompt injection occurrences were detected.

Apply the filters on the dashboard to view guardrail activity for skills in a date range. See Now Assist Analytics dashboard indicator details for information on the data and calculations behind each indicator.
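The headline indicators can be approximated from log-like records, which may help when cross-checking the dashboard. This is a sketch under assumptions: the `GuardrailEvent` record shape, its field names and millisecond unit, and the definition of percentage as flagged items divided by all evaluated requests and responses are hypothetical. The indicator details page mentioned above describes the calculations the dashboard actually uses.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class GuardrailEvent:
    """One evaluated request or response (hypothetical record shape)."""
    skill: str
    added_latency_ms: float  # Time the guardrail added to this call.
    flagged: bool


def summarize(events: list[GuardrailEvent]) -> dict:
    """Approximate the headline indicators for one guardrail."""
    flagged = [e for e in events if e.flagged]
    return {
        "average_added_latency_ms": mean(e.added_latency_ms for e in events) if events else 0.0,
        "flagged_count": len(flagged),
        "flagged_percentage": 100.0 * len(flagged) / len(events) if events else 0.0,
        "skills_with_occurrences": sorted({e.skill for e in flagged}),
    }
```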

Offensive content indicators

Guardrail-added latency: This area of the dashboard shows the average latency as a result of the active offensive content guardrail for the selected skills and date range.

Figure 2. Guardrail-added latency indicator
Percentage flagged as offensive: This area of the dashboard shows the percentage of requests and responses to and from the LLM service that are flagged for offensive content.

Figure 3. Percentage flagged as offensive indicator
Total offensive content occurrences: This area of the dashboard shows the total number of offensive content occurrences for the selected skills and date range.

Figure 4. Total offensive content occurrences indicator
Categories of offensive content: This area of the dashboard shows a breakdown of offensive content occurrences by category. If content is deemed offensive under more than one category, for example, toxic and defamatory, the occurrence is counted once toward each of those categories. For more information on offensive content categories, see Now Assist Guardian.

Figure 5. Categories of offensive content indicator
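Because a multi-category occurrence counts toward every category it falls under, the category totals can exceed the number of occurrences. A minimal sketch, assuming each occurrence is available as a list of category names (a hypothetical input shape):

```python
from collections import Counter


def category_breakdown(occurrences: list[list[str]]) -> Counter:
    """Count each occurrence once toward every category it was flagged under."""
    counts = Counter()
    for categories in occurrences:
        counts.update(set(categories))  # set() ignores a category repeated within one occurrence.
    return counts
```

For example, the occurrences `[["Toxic Content", "Defamation"], ["Toxic Content"], ["Adult Content"]]` produce four category counts from three occurrences, because the first one counts under both Toxic Content and Defamation.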
Offensive content occurrences by skill: This area of the dashboard shows the number of offensive content occurrences over time by the skills in which the content is detected.

Figure 6. Offensive content occurrences by skill indicator
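The by-skill view is a count grouped by time bucket and skill. This sketch assumes occurrences are available as `(timestamp, skill)` pairs and buckets them by calendar day; both the input shape and the daily granularity are assumptions, because this page doesn't state how the dashboard buckets time.

```python
from collections import Counter
from datetime import date, datetime


def occurrences_by_skill(events: list[tuple[datetime, str]]) -> dict[tuple[date, str], int]:
    """Count occurrences per (day, skill), ordered by day and then by skill."""
    counts = Counter((ts.date(), skill) for ts, skill in events)
    return dict(sorted(counts.items()))
```

The same grouping applies to the prompt injection occurrences by skill indicator.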

Prompt injection indicators

Guardrail-added latency: This area of the dashboard shows the average latency as a result of the active prompt injection guardrail for the selected skills and date range.

Figure 7. Guardrail-added latency indicator
Percentage flagged as prompt injection: This area of the dashboard shows the percentage of requests and responses to and from the LLM service that are flagged for prompt injection.

Figure 8. Percentage flagged as prompt injection indicator
Total prompt injection occurrences: This area of the dashboard shows the total number of prompt injection occurrences for the selected skills and date range.

Figure 9. Total prompt injection occurrences indicator
Prompt injection occurrences by skill: This area of the dashboard shows the number of prompt injection occurrences over time by the skills where prompt injection attempts were detected.

Figure 10. Prompt injection occurrences by skill indicator