Lemma and Unicode normalization

  • Release version: Washington DC
  • Updated July 24, 2024

    AI Search normalizes inflected words and Unicode glyphs during indexing and at search query time. Normalization improves search recall and enables users to find content with variant forms of their search query terms.

    Normalization features are automatically enabled and aren't configurable.

    Lemma normalization

    Many languages include inflected forms of terms, such as plural nouns or verb tenses. AI Search normalizes inflected terms found in indexed content and search queries. Normalization enables matching based on a root form, such as the singular for a plural noun or the base form for a conjugated verb. This root form is called a lemma, and this process is referred to as lemma normalization.

    For example, when a source record includes the conjugated verb selling, AI Search expands the indexed term to include the lemma form sell in addition to selling. When a user searches for the past-tense conjugated form sold, AI Search expands the search query term to include the lemma form sell as well as sold. Because the indexed term and the search query term include matching forms, the user's search returns the selling record as a result.
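
The selling/sold example above can be sketched in a few lines of Python. This is only an illustration of the expansion idea, not how AI Search is implemented: the `LEMMAS` table, the record IDs, and the function names are all hypothetical, and a real system would use a language-specific morphological analyzer rather than a lookup table.

```python
# Toy lemma table (hypothetical); stands in for a real morphological analyzer.
LEMMAS = {"selling": "sell", "sold": "sell", "sells": "sell"}


def expand(term: str) -> set[str]:
    """Return the term together with its lemma, if one is known."""
    term = term.lower()
    return {term, LEMMAS.get(term, term)}


def build_index(records: dict[str, str]) -> dict[str, set[str]]:
    """Map every expanded term to the IDs of the records that contain it."""
    index: dict[str, set[str]] = {}
    for record_id, text in records.items():
        for word in text.split():
            for form in expand(word):
                index.setdefault(form, set()).add(record_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Expand each query term and collect records matching any form."""
    hits: set[str] = set()
    for word in query.split():
        for form in expand(word):
            hits |= index.get(form, set())
    return hits


# Hypothetical sample records.
records = {"rec1": "selling laptops", "rec2": "buying laptops"}
index = build_index(records)
print(search(index, "sold"))  # {'rec1'}
```

The query term sold never appears in the record text, but both sides expand to the shared lemma sell, which is what produces the match.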

    AI Search supports language-specific lemma normalization for Brazilian Portuguese, Dutch, English, French, French - Canada, German, Italian, Japanese, Korean, Portuguese, Simplified Chinese, Spanish, Swedish, and Traditional Chinese.

    Decompounding

    For German, Korean, and Swedish, AI Search goes beyond lemma normalization and also indexes compound words together with their individual component words. For example, when indexing a German record that contains the compound word Humanressourcen, AI Search indexes the component terms Human and ressourcen in addition to the compound term.
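
As a rough illustration of decompounding, the following Python sketch splits a compound into two known components and indexes all three terms. The `VOCAB` set and both function names are assumptions made for this example; production decompounders handle more than two components, linking elements, and ambiguity, none of which is attempted here.

```python
# Hypothetical vocabulary of known component words, lowercased.
VOCAB = {"human", "ressourcen"}


def decompound(word: str, vocab: set[str]) -> list[str]:
    """Split word into two known components; return [] if no split is found."""
    lower = word.lower()
    for i in range(1, len(lower)):
        head, tail = lower[:i], lower[i:]
        if head in vocab and tail in vocab:
            # Keep the original casing of each component.
            return [word[:i], word[i:]]
    return []


def index_terms(word: str) -> list[str]:
    """Return the compound itself followed by any components found."""
    return [word] + decompound(word, VOCAB)


print(index_terms("Humanressourcen"))
# ['Humanressourcen', 'Human', 'ressourcen']
```

Keeping the compound term in the list alongside its components mirrors the behavior described above, so a search for either the full word or one of its parts can match.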

    Unicode normalization

    AI Search performs Unicode normalization on indexed terms and search query terms. This normalization makes alphabetical Unicode glyphs searchable using their nearest equivalent characters.

    For example, when indexing a record containing the term resumé, AI Search expands the term to also include the non-accented form resume. This record appears as a search result when users search for either resume or resumé.
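
One common way to obtain the non-accented form is to decompose the term and drop the combining marks. The sketch below uses Python's standard `unicodedata` module to do that; the `fold_accents` and `expand_term` names are hypothetical, and this is an assumption about the general technique rather than a description of AI Search internals.

```python
import unicodedata


def fold_accents(term: str) -> str:
    """Decompose the term, then drop combining marks such as the acute accent."""
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand_term(term: str) -> set[str]:
    """Return the original term plus its accent-folded form."""
    return {term, fold_accents(term)}


accented = "resum\u00e9"  # "resumé" with a precomposed e-acute
print(sorted(expand_term(accented)))  # ['resume', 'resumé']
```

Because the indexed term set contains both forms, a query for resume and a query for resumé each share at least one form with the record.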

    Unicode normalization includes NFKD (compatibility decomposition) and NFKC (compatibility composition) stages. For more information on these normalization forms, see the Unicode Standard Annex #15, https://www.unicode.org/reports/tr15/.
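
To make the two normalization forms concrete, this short Python example applies them with the standard `unicodedata` module. The sample strings are chosen for illustration only, and how AI Search sequences the two stages is not specified here.

```python
import unicodedata

ligature = "\ufb01le"     # "file" written with the single-glyph fi ligature (U+FB01)
accented = "resum\u00e9"  # precomposed e-acute (U+00E9)

# Both compatibility forms replace the ligature with the plain letters "f" and "i".
ligature_nfkd = unicodedata.normalize("NFKD", ligature)
ligature_nfkc = unicodedata.normalize("NFKC", ligature)

# NFKD splits e-acute into "e" plus a combining accent (U+0301);
# NFKC recomposes the pair into the single precomposed character.
accented_nfkd = unicodedata.normalize("NFKD", accented)
accented_nfkc = unicodedata.normalize("NFKC", accented_nfkd)

print(ligature_nfkd, ligature_nfkc)            # file file
print(len(accented_nfkd), len(accented_nfkc))  # 7 6
```

The decomposed form is one code point longer because the accent becomes a separate combining character, which is what makes it possible to compare or strip accents independently of the base letter.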

    Interaction with other search features

    The following table describes interactions between normalization and other search features.

    Feature                  | Interaction with lemma and Unicode normalization
    Genius Results           | Search query terms added by lemma or Unicode normalization can't trigger Genius Result configurations with Term trigger conditions.
    Result improvement rules | A search query term added by lemma or Unicode normalization can trigger a result improvement rule if it matches the rule's Query trigger.
    Stop words               | If a search query term is defined as a stop word, AI Search removes that term without normalizing it.
    Synonyms                 | If a search query term is defined as a synonym, AI Search doesn't normalize it.
    Typo handling            | AI Search performs lemma and Unicode normalization on auto-corrected search query terms.
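
The stop word, synonym, and typo handling interactions can be pictured as a small query pipeline. The Python sketch below is a loose illustration under several assumptions: every table in it is hypothetical sample data, the function names are invented for this example, and the order of the checks is a guess, since the ordering is not documented here.

```python
import unicodedata

# Hypothetical configuration data for illustration only.
STOP_WORDS = {"the", "a"}
SYNONYM_TERMS = {"laptops"}           # terms that have a synonym definition
CORRECTIONS = {"seling": "selling"}   # stand-in for typo handling
LEMMAS = {"selling": "sell", "sold": "sell", "laptops": "laptop"}


def normalize(term: str) -> set[str]:
    """Return the term plus its accent-folded form and lemma."""
    decomposed = unicodedata.normalize("NFKD", term)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return {term, folded, LEMMAS.get(folded, folded)}


def process_query(query: str) -> set[str]:
    """Apply the interactions described above; the check order is assumed."""
    terms: set[str] = set()
    for word in query.lower().split():
        if word in STOP_WORDS:
            continue                      # removed without normalization
        if word in SYNONYM_TERMS:
            terms.add(word)               # kept as typed, not normalized
            continue
        corrected = CORRECTIONS.get(word, word)
        terms |= normalize(corrected)     # auto-corrected terms are normalized
    return terms


print(sorted(process_query("the seling laptops")))
# ['laptops', 'sell', 'selling']
```

In this run, the stop word is dropped, the misspelled term is corrected and then expanded to its lemma, and the synonym-defined term stays exactly as typed even though the sample lemma table has an entry for it.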