Lexer: a fancy term for a tokenizer.
Lexer Types
Use the lexer preference to specify the language of the text to be indexed. To create a lexer preference, you must use one of the following lexer types:
Type                  Description
BASIC_LEXER           Lexer for extracting tokens from text in languages, such as English and most western European languages, that use whitespace-delimited words.
MULTI_LEXER           Lexer for indexing tables containing documents of different languages.
CHINESE_VGRAM_LEXER   Lexer for extracting tokens from Chinese text.
CHINESE_LEXER         Lexer for extracting tokens from Chinese text.
JAPANESE_VGRAM_LEXER  Lexer for extracting tokens from Japanese text.
JAPANESE_LEXER        Lexer for extracting tokens from Japanese text.
KOREAN_MORPH_LEXER    Lexer for extracting tokens from Korean text.
USER_LEXER            Lexer you create to index a particular language.
WORLD_LEXER           Lexer for indexing tables containing documents of different languages; autodetects languages in a document.
BASIC_LEXER
Use the BASIC_LEXER type to identify tokens for creating Text indexes for English and all other
supported whitespace-delimited languages.
The BASIC_LEXER also enables base-letter conversion, composite word indexing, case-sensitive indexing,
and alternate spelling for whitespace-delimited languages that have extended character sets.
In English and French, you can use the BASIC_LEXER to enable theme indexing.
Note:
Any processing the lexer does to tokens before indexing (for example, removal of characters
and base-letter conversion) is also performed on query terms at query time. This ensures
that the query terms match the form of the tokens in the Text index.
BASIC_LEXER supports any database character set.
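The features listed above are switched on through lexer attributes. The following is a minimal sketch, not taken from the original text: the preference name accent_lex is hypothetical, and the base_letter and mixed_case attribute names and values are assumed from the Oracle Text reference, so check them against the documentation for your release.

```sql
begin
  -- accent_lex is a hypothetical preference name used only for illustration.
  ctx_ddl.create_preference('accent_lex', 'BASIC_LEXER');
  -- Assumed attribute: convert accented letters to their base form when indexing.
  ctx_ddl.set_attribute('accent_lex', 'base_letter', 'YES');
  -- Assumed attribute: keep the case of tokens so the index is case-sensitive.
  ctx_ddl.set_attribute('accent_lex', 'mixed_case', 'YES');
end;
/
```

As with any lexer preference, the settings take effect only for indexes created with a LEXER parameter that names the preference.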
BASIC_LEXER attribute printjoins:
Specify the non-alphanumeric characters that, when they appear anywhere in a word (beginning, middle, or end), are processed as alphanumeric and included with the token in the Text index. This includes printjoins that occur consecutively.
For example, if the hyphen '-' and underscore '_' characters are defined as printjoins, terms such as pseudo-intellectual and _file_ are stored in the Text index as pseudo-intellectual and _file_.
BASIC_LEXER Example
The following example sets printjoin characters and disables theme indexing with the BASIC_LEXER:
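begin
  -- Create a lexer preference named mylex based on BASIC_LEXER.
  ctx_ddl.create_preference('mylex', 'BASIC_LEXER');
  -- Treat the underscore and hyphen as part of the token.
  ctx_ddl.set_attribute('mylex', 'printjoins', '_-');
  -- Index the text itself, but not themes.
  ctx_ddl.set_attribute('mylex', 'index_themes', 'NO');
  ctx_ddl.set_attribute('mylex', 'index_text', 'YES');
end;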
To create the index with no theme indexing and with the printjoin characters set as described, issue the following statement:
create index myindex on mytable ( docs ) indextype is ctxsys.context parameters ( 'LEXER mylex' );
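Because query terms pass through the same lexer, a term containing a printjoin can be searched for as a single token. The query below is a sketch that assumes the mytable table and docs column from the statement above. It also assumes standard Oracle Text query syntax, in which an unescaped hyphen is read as a query operator, so the term is wrapped in braces to escape it:

```sql
-- Count the documents that contain the single token pseudo-intellectual.
-- The braces escape the hyphen so that it is not parsed as a query operator.
select count(*)
  from mytable
 where contains(docs, '{pseudo-intellectual}') > 0;
```

Without the printjoins setting, the lexer would not store pseudo-intellectual as one token, so the same query would be matched against separately indexed words instead.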