Lemma and Unicode normalization
Summary of Lemma and Unicode normalization
AI Search in ServiceNow automatically normalizes inflected words and Unicode glyphs both during indexing and at search query time. This normalization enhances search recall by enabling users to find content with variant forms of their search terms without requiring manual configuration.
Key Features
- Lemma normalization: AI Search expands inflected forms of words (such as plural nouns and conjugated verbs) to include their root or base form, called a lemma. This enables matches between different forms of the same word. For example, "selling" and "sold" both match through the shared lemma "sell." Lemma normalization supports multiple languages including English, Arabic, French, German, Japanese, Korean, Chinese, and others.
- Decompounding: For languages like German, Danish, Hungarian, Korean, Norwegian (Bokmål), and Swedish, AI Search indexes both compound words and their individual components. For instance, the German compound "Humanressourcen" is indexed along with "Human" and "ressourcen," so the record is indexed under its component terms as well as the full compound.
- Unicode normalization: AI Search converts Unicode characters to their nearest equivalent forms, making accented and non-accented versions of words interchangeable in searches. For example, "resumé" and "resume" are treated as equivalents. This uses Unicode normalization forms NFKD and NFKC to ensure consistent indexing and querying.
Interaction with Other Search Features
- Genius Results: Terms added by lemma or Unicode normalization do not trigger Genius Result configurations that rely on exact term matches.
- Result improvement rules: Normalized query terms can trigger rules if they match the rule’s query trigger.
- Stop words: Stop words are removed prior to normalization and are not normalized themselves.
- Synonyms: Terms defined as synonyms are excluded from normalization.
- Typo handling: Normalization is applied to auto-corrected search terms, improving recall even with misspellings.
What This Means for ServiceNow Customers
With lemma and Unicode normalization enabled by default in AI Search, customers benefit from improved search recall and relevance without additional configuration. Users can find relevant content regardless of word inflections, compound words, or Unicode variations, ensuring a more intuitive and comprehensive search experience across multiple languages.
AI Search normalizes inflected words and Unicode glyphs during indexing and at search query time. Normalization improves search recall and enables users to find content with variant forms of their search query terms.
Normalization features are automatically enabled and aren't configurable.
Lemma normalization
Many languages include inflected forms of terms, such as plural nouns or verb tenses. AI Search normalizes inflected terms found in indexed content and search queries. Normalization enables matching based on a root form, such as the singular for a plural noun or the base form for a conjugated verb. This root form is called a lemma, and this process is referred to as lemma normalization.
For example, when a source record includes the conjugated verb "selling", AI Search expands the indexed term to include the lemma form "sell" in addition to "selling". When a user searches for the past-tense conjugated form "sold", AI Search expands the search query term to include the lemma form "sell" as well as "sold". Because the indexed term and the search query term include matching forms, the user's search returns the "selling" record as a result.
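The expansion described above can be sketched in a few lines of Python. This is only an illustration: the lookup table `TOY_LEMMAS` and the helper names are hypothetical, and AI Search's actual lemmatizer is language-specific and not documented here.

```python
# Hypothetical toy lemma table (assumption). AI Search's real lemmatizer
# is language-specific and is not described in this documentation.
TOY_LEMMAS = {"selling": "sell", "sold": "sell", "sells": "sell"}


def expand_with_lemma(term: str) -> set[str]:
    """Return the term together with its lemma, if the toy table knows one."""
    lowered = term.lower()
    return {lowered, TOY_LEMMAS.get(lowered, lowered)}


def matches(indexed_term: str, query_term: str) -> bool:
    """Two terms match when their expanded sets share at least one form."""
    return bool(expand_with_lemma(indexed_term) & expand_with_lemma(query_term))


# "selling" is indexed as {selling, sell}; "sold" is queried as {sold, sell}.
print(matches("selling", "sold"))  # True
```

Because the same expansion runs on both the indexed term and the query term, the shared lemma is what produces the match.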
AI Search supports language-specific lemma normalization for Arabic, Brazilian Portuguese, Czech, Danish, Dutch, English, Finnish, French, French - Canada, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian (Bokmål), Polish, Portuguese, Russian, Simplified Chinese, Spanish, Swedish, Traditional Chinese, and Turkish.
Decompounding
For German, Danish, Hungarian, Korean, Norwegian (Bokmål), and Swedish, AI Search performs decompounding in addition to lemma normalization: it indexes each compound word along with its individual component words. For example, when indexing a German record that contains the compound word "Humanressourcen", AI Search indexes the component terms "Human" and "ressourcen" in addition to the compound term.
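As a rough illustration of decompounding, the following Python sketch splits a compound into two known components and returns all terms to index. The vocabulary `TOY_VOCABULARY`, the two-part split, and the lowercasing are assumptions made for the example, not a description of how AI Search segments compounds.

```python
# Hypothetical component vocabulary (assumption); a real decompounder
# would rely on a language-specific dictionary or model.
TOY_VOCABULARY = {"human", "ressourcen"}


def decompound(word: str, vocabulary: set[str] = TOY_VOCABULARY) -> list[str]:
    """Return the compound plus its two components when both are known."""
    lowered = word.lower()  # lowercasing is a simplification for this sketch
    terms = [lowered]
    for i in range(1, len(lowered)):
        head, tail = lowered[:i], lowered[i:]
        if head in vocabulary and tail in vocabulary:
            terms.extend([head, tail])
            break
    return terms


print(decompound("Humanressourcen"))
# ['humanressourcen', 'human', 'ressourcen']
```

A word that can't be split into known components is returned unchanged as a single term.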
Unicode normalization
AI Search performs Unicode normalization on indexed terms and search query terms. This normalization makes alphabetic Unicode glyphs, such as accented letters, searchable using their nearest equivalent characters.
For example, when indexing a record containing the term "resumé", AI Search expands the term to also include the non-accented form "resume". This record appears as a search result when users search for either "resume" or "resumé".
Unicode normalization includes NFKD (compatibility decomposition) and NFKC (compatibility composition) stages. For more information on these normalization forms, see Unicode Standard Annex #15, Unicode Normalization Forms, at https://www.unicode.org/reports/tr15/.
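The effect of these stages can be approximated with Python's standard `unicodedata` module. The sketch below decomposes a term with NFKD, drops combining marks to obtain the non-accented form, and recomposes with NFKC. Dropping combining marks is an assumption about how the non-accented variant might be derived; the helper name `unicode_variants` is hypothetical.

```python
import unicodedata


def unicode_variants(term: str) -> set[str]:
    """Return the NFKC form of a term plus a variant without accents."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", term)
    # Assumption: the non-accented form is obtained by dropping those marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return {
        unicodedata.normalize("NFKC", term),
        unicodedata.normalize("NFKC", stripped),
    }


print(sorted(unicode_variants("resum\u00e9")))  # ['resume', 'resumé']
```

Compatibility normalization also folds characters such as ligatures into their plain equivalents, so a term written with the "fi" ligature (U+FB01) yields the ordinary letters "fi".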
Interaction with other search features
The following table describes interactions between normalization and other search features.
| Feature | Interaction with lemma and Unicode normalization |
|---|---|
| Genius Results | Search query terms added by lemma or Unicode normalization can't trigger Genius Result configurations with Term trigger conditions. |
| Result improvement rules | A search query term added by lemma or Unicode normalization can trigger a result improvement rule if it matches the rule's Query trigger. |
| Stop words | If a search query term is defined as a stop word, AI Search removes that term without normalizing it. |
| Synonyms | If a search query term is defined as a synonym, AI Search doesn't normalize it. |
| Typo handling | AI Search performs lemma and Unicode normalization on auto-corrected search query terms. |
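
The stop word, synonym, and typo handling rows of the table can be pictured as a small query pipeline. The following Python sketch is illustrative only: every table in it is hypothetical toy data, and the order in which the steps run is an assumption rather than documented AI Search behavior.

```python
# All tables below are hypothetical toy data (assumptions for illustration).
TOY_TYPO_FIXES = {"seling": "selling"}
TOY_STOP_WORDS = {"the"}
TOY_SYNONYM_TERMS = {"laptops"}  # terms that appear in a synonym definition
TOY_LEMMAS = {"selling": "sell", "laptops": "laptop"}


def process_query(query: str) -> list[set[str]]:
    """Expand each query term, following the interactions in the table."""
    expanded = []
    for raw in query.lower().split():
        # Typo handling: the auto-corrected term is what gets normalized.
        term = TOY_TYPO_FIXES.get(raw, raw)
        if term in TOY_STOP_WORDS:
            continue  # stop words are removed without being normalized
        if term in TOY_SYNONYM_TERMS:
            expanded.append({term})  # synonym terms are not normalized
            continue
        expanded.append({term, TOY_LEMMAS.get(term, term)})
    return expanded


result = process_query("the seling laptops")
print(result == [{"selling", "sell"}, {"laptops"}])  # True
```

In this sketch the misspelled "seling" is corrected and then expanded with its lemma, "the" is dropped, and "laptops" passes through unchanged because it is defined as a synonym term.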