Wednesday, October 2, 2013

Lexer -> Basic Lexer -> Printjoins in DOMAIN index

Lexer (web definition): a fancy term for a tokenizer.

Lexer Types

Use the lexer preference to specify the language of the text to be indexed. To create a lexer preference, you must use one of the following lexer types:

Type                   Description
BASIC_LEXER            Lexer for extracting tokens from text in languages such as English and most western European languages that use whitespace-delimited words.
MULTI_LEXER            Lexer for indexing tables containing documents of different languages.
CHINESE_VGRAM_LEXER    Lexer for extracting tokens from Chinese text.
CHINESE_LEXER          Lexer for extracting tokens from Chinese text.
JAPANESE_VGRAM_LEXER   Lexer for extracting tokens from Japanese text.
JAPANESE_LEXER         Lexer for extracting tokens from Japanese text.
KOREAN_MORPH_LEXER     Lexer for extracting tokens from Korean text.
USER_LEXER             Lexer you create to index a particular language.
WORLD_LEXER            Lexer for indexing tables containing documents of different languages; autodetects the languages in a document.


Use the BASIC_LEXER type to identify tokens for creating Text indexes for English and all other
supported whitespace-delimited languages.

The BASIC_LEXER also enables base-letter conversion, composite word indexing, case-sensitive indexing,
and alternate spelling for whitespace-delimited languages that have extended character sets.
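
As a rough sketch, the features listed above map onto BASIC_LEXER attributes along the following lines. The preference name my_german_lex is hypothetical, and the German settings are assumptions chosen purely for illustration; not every combination makes sense for every language, so check the attribute reference for your Oracle release before relying on them.

```sql
begin
  -- hypothetical preference name, used only for this illustration
  ctx_ddl.create_preference('my_german_lex', 'BASIC_LEXER');
  -- base-letter conversion: index characters with diacritics under their base letter
  ctx_ddl.set_attribute('my_german_lex', 'base_letter', 'YES');
  -- composite word indexing (the attribute takes a language name; German assumed here)
  ctx_ddl.set_attribute('my_german_lex', 'composite', 'GERMAN');
  -- case-sensitive indexing: keep tokens in their original case
  ctx_ddl.set_attribute('my_german_lex', 'mixed_case', 'YES');
  -- alternate spelling (German assumed here)
  ctx_ddl.set_attribute('my_german_lex', 'alternate_spelling', 'GERMAN');
end;
/
```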

In English and French, you can use the BASIC_LEXER to enable theme indexing.

Any processing the lexer does to tokens before indexing (for example, removal of characters
and base-letter conversion) is also performed on query terms at query time. This ensures
that the query terms match the form of the tokens in the Text index.

BASIC_LEXER supports any database character set.

BASIC_LEXER attribute printjoins:

    Specify the non-alphanumeric characters that, when they appear anywhere in a word (beginning, middle, or end), are processed as alphanumeric and included with the token in the Text index. This includes printjoins that occur consecutively.

    For example, if the hyphen '-' and underscore '_' characters are defined as printjoins, terms such as pseudo-intellectual and _file_ are stored in the Text index as pseudo-intellectual and _file_.

BASIC_LEXER Example

The following example sets printjoin characters and disables theme indexing with the BASIC_LEXER:

begin
  ctx_ddl.create_preference('mylex', 'BASIC_LEXER');
  ctx_ddl.set_attribute('mylex', 'printjoins', '_-');
  ctx_ddl.set_attribute('mylex', 'index_themes', 'NO');
  ctx_ddl.set_attribute('mylex', 'index_text', 'YES');
end;
/

To create the index with no theme indexing and with the printjoin characters set as described, issue the following statement:

create index myindex on mytable ( docs ) indextype is ctxsys.context parameters ( 'LEXER mylex' );
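
Once the index exists, one quick way to see the effect of the printjoins is to look at the token table and run a query. The sketch below assumes the names from the example above (mytable, docs, myindex), and it assumes that a row containing pseudo-intellectual has been inserted and indexed. DR$MYINDEX$I is the token table Oracle Text creates for an index named myindex. Because '-' is also a query operator, the search term is wrapped in braces to escape it.

```sql
-- tokens kept whole because '-' is a printjoin
-- (stored in upper case unless mixed_case is set)
select token_text
  from dr$myindex$i
 where token_text like '%-%';

-- braces escape the hyphen so it is not parsed as the MINUS operator
select count(*)
  from mytable
 where contains(docs, '{pseudo-intellectual}') > 0;
```

If the first query shows PSEUDO and INTELLECTUAL as separate tokens instead, the index was most likely built without the mylex preference.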
