Lucene.Net.Analysis.Common
Analyzer for Arabic.
This analyzer implements light-stemming as specified by:
Light Stemming for Arabic Information Retrieval
http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf
The analysis package contains three primary components:
- ArabicNormalizationFilter: Arabic orthographic normalization.
- ArabicStemFilter: Arabic light stemming.
- Arabic stop words file: a set of default Arabic stop words.
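For orientation, here is a minimal usage sketch of the resulting ArabicAnalyzer (assuming the Lucene.NET 4.8 API; the field name and sample text are placeholders):

using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Ar;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

// Tokenize, normalize, stop-filter, and light-stem some Arabic text.
Analyzer analyzer = new ArabicAnalyzer(LuceneVersion.LUCENE_48);
using (TokenStream ts = analyzer.GetTokenStream("body", "النص العربي هنا"))
{
    var termAtt = ts.AddAttribute<ICharTermAttribute>();
    ts.Reset();
    while (ts.IncrementToken())
    {
        System.Console.WriteLine(termAtt.ToString());
    }
    ts.End();
}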
File containing default Arabic stopwords.
Default stopword list is from http://members.unine.ch/jacques.savoy/clef/index.html
The stopword list is BSD-Licensed.
Returns an unmodifiable instance of the default stop-words set.
an unmodifiable instance of the default stop-words set.
Atomically loads the DEFAULT_STOP_SET in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
Builds an analyzer with the given stop words
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates
used to tokenize all the text in the provided .
built from an filtered with
, ,
,
if a stem exclusion set is provided and .
Tokenizer that breaks text into runs of letters and diacritics.
The problem with the standard Letter tokenizer is that it fails on diacritics.
Handling similar to this is necessary for Indic Scripts, Hebrew, Thaana, etc.
You must specify the required compatibility when creating
:
- As of 3.1, uses an int based API to normalize and
detect token characters. See and
for details.
@deprecated (3.1) Use StandardTokenizer instead.
Construct a new ArabicLetterTokenizer.
to match
the input to split up into tokens
Construct a new using a given
.
Lucene version to match - See
.
the attribute factory to use for this Tokenizer
the input to split up into tokens
Allows for Letter category or NonspacingMark category
Factory for
@deprecated (3.1) Use StandardTokenizerFactory instead.
Creates a new
A TokenFilter that applies ArabicNormalizer to normalize the orthography.
Factory for .
<fieldType name="text_arnormal" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ArabicNormalizationFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Normalizer for Arabic.
Normalization is done in-place for efficiency, operating on a termbuffer.
Normalization is defined as:
- Normalization of hamza with alef seat to a bare alef.
- Normalization of teh marbuta to heh
- Normalization of dotless yeh (alef maksura) to yeh.
- Removal of Arabic diacritics (the harakat)
- Removal of tatweel (stretching character).
Normalize an input buffer of Arabic text
input buffer
length of input buffer
length of input buffer after normalization
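A small sketch of using the normalizer directly on a term buffer, the same in-place pattern the filter uses internally (the sample word is illustrative; Normalize is assumed public as in the Java original):

using Lucene.Net.Analysis.Ar;

char[] buffer = "كتاباً".ToCharArray();          // term buffer as a tokenizer would produce it
var normalizer = new ArabicNormalizer();
int newLength = normalizer.Normalize(buffer, buffer.Length);
string normalized = new string(buffer, 0, newLength);  // harakat removed, alef/yeh forms normalized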
A TokenFilter that applies ArabicStemmer to stem Arabic words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_arstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.ArabicStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Stemmer for Arabic.
Stemming is done in-place for efficiency, operating on a termbuffer.
Stemming is defined as:
- Removal of attached definite article, conjunction, and prepositions.
- Stemming of common suffixes.
Stem an input buffer of Arabic text.
input buffer
length of input buffer
length of input buffer after normalization
Stem a prefix off an Arabic word.
input buffer
length of input buffer
new length of input buffer after stemming.
Stem suffix(es) off an Arabic word.
input buffer
length of input buffer
new length of input buffer after stemming
Returns true if the prefix matches and can be stemmed
input buffer
length of input buffer
prefix to check
true if the prefix matches and can be stemmed
Returns true if the suffix matches and can be stemmed
input buffer
length of input buffer
suffix to check
true if the suffix matches and can be stemmed
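The stemmer follows the same in-place pattern as the normalizer; a sketch (the Stem(char[], int) signature is assumed to mirror the Java stemmer, and the sample word is illustrative):

using Lucene.Net.Analysis.Ar;

char[] buffer = "والكتاب".ToCharArray();       // "and the book"
var stemmer = new ArabicStemmer();
// Strips the attached conjunction/definite article and common suffixes where possible.
int newLength = stemmer.Stem(buffer, buffer.Length);
string stem = new string(buffer, 0, newLength);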
Analyzer for Bulgarian.
This analyzer implements light-stemming as specified by: Searching
Strategies for the Bulgarian Language
http://members.unine.ch/jacques.savoy/Papers/BUIR.pdf
File containing default Bulgarian stopwords.
Default stopword list is from
http://members.unine.ch/jacques.savoy/clef/index.html The stopword list is
BSD-Licensed.
Returns an unmodifiable instance of the default stop-words set.
an unmodifiable instance of the default stop-words set.
Atomically loads the DEFAULT_STOP_SET in a lazy fashion once the outer
class accesses the static final set the first time.
Builds an analyzer with the default stop words:
.
Builds an analyzer with the given stop words.
Builds an analyzer with the given stop words and a stem exclusion set.
If a stem exclusion set is provided this analyzer will add a
before .
Creates a
which tokenizes all the text in the provided .
A
built from an filtered with
, , ,
if a stem exclusion set is
provided and .
A TokenFilter that applies BulgarianStemmer to stem Bulgarian words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
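A sketch of building the same chain programmatically (Analyzer.NewAnonymous is a Lucene.NET-specific convenience assumed here; the chain mirrors the factory example below):

using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Bg;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

// StandardTokenizer -> LowerCaseFilter -> BulgarianStemFilter
Analyzer analyzer = Analyzer.NewAnonymous((fieldName, reader) =>
{
    Tokenizer source = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
    TokenStream result = new LowerCaseFilter(LuceneVersion.LUCENE_48, source);
    result = new BulgarianStemFilter(result);
    return new TokenStreamComponents(source, result);
});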
Factory for .
<fieldType name="text_bgstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.BulgarianStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Light Stemmer for Bulgarian.
Implements the algorithm described in:
Searching Strategies for the Bulgarian Language
http://members.unine.ch/jacques.savoy/Papers/BUIR.pdf
Stem an input buffer of Bulgarian text.
input buffer
length of input buffer
length of input buffer after normalization
Mainly remove the definite article
input buffer
length of input buffer
new stemmed length
Analyzer for the Brazilian Portuguese language.
Supports an external list of stopwords (words that
will not be indexed at all) and an external list of exclusions (words that will
not be stemmed, but indexed).
NOTE: This class uses the same
dependent settings as .
File containing default Brazilian Portuguese stopwords.
Returns an unmodifiable instance of the default stop-words set.
an unmodifiable instance of the default stop-words set.
Contains words that should be indexed but not stemmed.
Builds an analyzer with the default stop words ().
Builds an analyzer with the given stop words
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words and stemming exclusion words
lucene compatibility version
a stopword set
a set of terms not to be stemmed
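A sketch of the constructor just described, excluding one term from stemming (DefaultStopSet and the CharArraySet constructor shown are assumed from the Lucene.NET 4.8 API; the excluded word is illustrative):

using Lucene.Net.Analysis.Br;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Keep "brasileiro" unstemmed while still applying the default stop word list.
var stemExclusions = new CharArraySet(LuceneVersion.LUCENE_48,
    new[] { "brasileiro" }, true /* ignoreCase */);
var analyzer = new BrazilianAnalyzer(LuceneVersion.LUCENE_48,
    BrazilianAnalyzer.DefaultStopSet, stemExclusions);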
Creates
used to tokenize all the text in the provided .
built from a filtered with
, , ,
and .
A TokenFilter that applies BrazilianStemmer.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
in use by this filter.
Creates a new
the source
Factory for .
<fieldType name="text_brstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.BrazilianStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
A stemmer for Brazilian Portuguese words.
Changed term
Stems the given term to a unique discriminator.
The term that should be stemmed.
Discriminator for
Checks whether a term can be processed correctly.
true if, and only if, the given term consists of letters.
Checks whether a term can be indexed.
true if it can be indexed
Checks whether a character is a vowel ('a','e','i','o','u')
true if is vowel
Gets R1
R1 - is the region after the first non-vowel following a vowel,
or is the null region at the end of the word if there is
no such non-vowel.
null or a string representing R1
Gets RV
RV - IF the second letter is a consonant, RV is the region after
the next following vowel,
OR if the first two letters are vowels, RV is the region
after the next consonant,
AND otherwise (consonant-vowel case) RV is the region after
the third letter.
BUT RV is the end of the word if these positions cannot be
found.
null or a string representing RV
1) Turn to lowercase
2) Remove accents
3) ã -> a ; õ -> o
4) ç -> c
null or a string transformed
Check if a string ends with a suffix
true if the string ends with the specified suffix
Replace a suffix by another
the replaced
Remove a suffix
the without the suffix
See if a suffix is preceded by a
true if the suffix is preceded
Creates CT (changed term), substituting 'ã' and 'õ' for 'a~' and 'o~'.
Standard suffix removal.
Search for the longest among the following suffixes, and perform
the following actions:
false if no ending was removed
Verb suffixes.
Search for the longest among the following suffixes in RV,
and if found, delete.
false if no ending was removed
Delete suffix 'i' if in RV and preceded by 'c'
Residual suffix
If the word ends with one of the suffixes (os a i o á í ó)
in RV, delete it
If the word ends with one of (e é ê) in RV, delete it,
and if preceded by 'gu' (or 'ci') with the 'u' (or 'i') in RV,
delete the 'u' (or 'i')
Or if the word ends in 'ç', remove the cedilla
For logging and debugging purposes
TERM, CT, RV, R1 and R2
Analyzer for Catalan.
You must specify the required
compatibility when creating CatalanAnalyzer:
- As of 3.6, with a set of Catalan
contractions is used by default.
File containing default Catalan stopwords.
Returns an unmodifiable instance of the default stop words set.
default stop words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
Builds an analyzer with the given stop words.
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
stemming.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates a
which tokenizes all the text in the provided .
A
built from an filtered with
, , ,
, if a stem exclusion set is
provided and .
Base utility class for implementing a .
You subclass this, and then record mappings by calling
, and then invoke the correct
method to correct an offset.
Retrieve the corrected offset.
Adds an offset correction mapping at the given output stream offset.
Assumption: the offset given with each successive call to this method
will not be smaller than the offset given at the previous invocation.
The output stream offset at which to apply the correction
The input offset is given by adding this
to the output offset
A CharFilter that wraps another TextReader and attempts to strip out HTML constructs.
This character denotes the end of file
initial size of the lookahead buffer
ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l
ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l
at the beginning of a line
l is of the form l = 2*k, k a non negative integer
Translates characters to character classes
Translates characters to character classes
Translates DFA states to action switch labels.
Translates a state to a row index in the transition table
The transition table of the DFA
error codes
error messages for the codes above
ZZ_ATTRIBUTE[aState] contains the attributes of state aState
the input device
the current state of the DFA
the current lexical state
this buffer contains the current text to be matched and is the source of the YyText() string
the textposition at the last accepting state
the current text position in the buffer
startRead marks the beginning of the YyText() string in the buffer
endRead marks the last character in the buffer that has been read from input
the number of characters up to the start of the matched text
zzAtEOF == true <=> the scanner is at the EOF
denotes if the user-EOF-code has already been executed
user code:
Creates a new HTMLStripCharFilter over the provided TextReader.
to strip html tags from.
Creates a new over the provided
with the specified start and end tags.
to strip html tags from.
Tags in this set (both start and end tags) will not be filtered out.
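A sketch of wrapping a reader with the filter while leaving selected tags untouched (the escaped-tags constructor described above; the sample markup is illustrative):

using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.CharFilters;

// Strip markup, but pass <b>/</b> through to the tokenizer unchanged.
var escapedTags = new HashSet<string> { "b" };
TextReader stripped = new HTMLStripCharFilter(
    new StringReader("<p>some <b>bold</b> text</p>"), escapedTags);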
LUCENENET: Copied this method from the WordlistLoader class - this class requires readers
with a Reset() method (which .NET readers don't support). So, we use the Java BufferedReader
as a wrapper for whatever reader the user passes (unless it is already a BufferedReader).
The position from which the next char will be read.
Wraps the given and sets this.len to the given .
Allocates an internal buffer of the given size.
Sets len = 0 and pos = 0.
Sets pos = 0
Returns the next char in the segment.
Returns true when all characters in the text segment have been read
Unpacks the compressed character translation table.
the packed character translation table
the unpacked character translation table
Refills the input buffer.
false, iff there was new input.
if any I/O-Error occurs
Disposes the input stream.
Resets the scanner to read from a new input stream.
Does not close the old reader.
All internal variables are reset, the old input stream
cannot be reused (internal buffer is discarded and lost).
Lexical state is set to .
Internal scan buffer is resized down to its initial length, if it has grown.
the new input stream
Returns the current lexical state.
Enters a new lexical state
the new lexical state
Returns the character at position pos from the
matched text. It is equivalent to YyText[pos], but faster
the position of the character to fetch. A value from 0 to YyLength()-1.
the character at position pos
Returns the length of the matched text region.
Reports an error that occurred while scanning.
In a well-formed scanner (no or only correct usage of
YyPushBack(int) and a match-all fallback rule) this method
will only be called with things that "Can't Possibly Happen".
If this method is called, something is seriously wrong
(e.g. a JFlex bug producing a faulty scanner etc.).
Usual syntax/scanner level error handling should be done
in error fallback rules.
the code of the error message to display
Pushes the specified amount of characters back into the input stream.
They will be read again by the next call of the scanning method
the number of characters to be read again.
This number must not be greater than YyLength()!
Contains user EOF-code, which will be executed exactly once,
when the end of file is reached
Resumes scanning until the next regular expression is matched,
the end of input is encountered or an I/O-Error occurs.
the next token
if any I/O-Error occurs
Factory for .
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<charFilter class="solr.HTMLStripCharFilterFactory" escapedTags="a, title" />
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
Creates a new
Simplistic that applies the mappings
contained in a to the character
stream, correcting the resulting changes to the
offsets. Matching is greedy (longest pattern matching at
a given point wins). Replacement is allowed to be the
empty string.
LUCENENET specific support to buffer the reader.
Default constructor that takes a .
LUCENENET: Copied this method from the class - this class requires readers
with a Reset() method (which .NET readers don't support). So, we use the
(which is similar to Java BufferedReader) as a wrapper for whatever reader the user passes
(unless it is already a ).
Factory for .
<fieldType name="text_map" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
@since Solr 1.4
Creates a new
Holds a map of input to output, to be used
with . Use the
to create this.
Builds a NormalizeCharMap.
Call add() until you have added all the mappings, then call build() to get a NormalizeCharMap
@lucene.experimental
Records a replacement to be applied to the input
stream. Whenever singleMatch
occurs in
the input, it will be replaced with
replacement
.
input String to be replaced
output String
if
match
is the empty string, or was
already previously added
Builds the ; call this once you
are done calling .
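A sketch of the builder pattern described above, feeding the resulting map into a MappingCharFilter (the mappings shown are illustrative):

using System.IO;
using Lucene.Net.Analysis.CharFilters;

// Record the replacements, then build the immutable map once.
var builder = new NormalizeCharMap.Builder();
builder.Add("ph", "f");
builder.Add("&", " and ");
NormalizeCharMap map = builder.Build();

// Apply the mappings (with offset correction) in front of a tokenizer.
TextReader corrected = new MappingCharFilter(map, new StringReader("phone & fax"));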
An that tokenizes text with ,
normalizes content with , folds case with
, forms bigrams of CJK with ,
and filters stopwords with
File containing default CJK stopwords.
Currently it contains some common English words that are not usually
useful for searching and some double-byte interpunctions.
Returns an unmodifiable instance of the default stop-words set.
an unmodifiable instance of the default stop-words set.
Builds an analyzer which removes words in .
Builds an analyzer with the given stop words
lucene compatibility version
a stopword set
bigram flag for Han Ideographs
bigram flag for Hiragana
bigram flag for Katakana
bigram flag for Hangul
bigram flag for all scripts
Forms bigrams of CJK terms that are generated from
or ICUTokenizer.
CJK types are set by these tokenizers, but you can also use
to explicitly control which
of the CJK scripts are turned into bigrams.
By default, when a CJK character has no adjacent characters to form
a bigram, it is output in unigram form. If you want to always output
both unigrams and bigrams, set the outputUnigrams
flag in .
This can be used for a combined unigram+bigram approach.
In all cases, all non-CJK input is passed thru unmodified.
when we emit a bigram, it is then marked as this type
when we emit a unigram, it is then marked as this type
Calls
CJKBigramFilter(@in, CJKScript.HAN | CJKScript.HIRAGANA | CJKScript.KATAKANA | CJKScript.HANGUL)
Input
Calls
CJKBigramFilter(in, flags, false)
Input
OR'ed set from , ,
,
Create a new , specifying which writing systems should be bigrammed,
and whether or not unigrams should also be output.
Input
OR'ed set from , ,
,
true if unigrams for the selected writing systems should also be output.
when this is false, this is only done when there are no adjacent characters to form
a bigram.
looks at next input token, returning false if none is available
refills buffers with new data from the current token.
Flushes a bigram token to output from our buffer
This is the normal case, e.g. ABC -> AB BC
Flushes a unigram token to output from our buffer.
This happens when we encounter isolated CJK characters, either the whole
CJK string is a single character, or we encounter a CJK character surrounded
by space, punctuation, English, etc., but not beside any other CJK.
True if we have multiple codepoints sitting in our buffer
True if we have a single codepoint sitting in our buffer, where its future
(whether it is emitted as unigram or forms a bigram) depends upon not-yet-seen
inputs.
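A sketch of constructing the filter for Han-only bigrams while also emitting unigrams (the CJKScript flags are taken from the constructor summary above; the sample text is illustrative):

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Cjk;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

// Bigram only Han ideographs and also keep unigrams for a combined approach.
Tokenizer source = new StandardTokenizer(LuceneVersion.LUCENE_48,
    new StringReader("多くの学生が試験に落ちた"));
TokenStream stream = new CJKBigramFilter(source, CJKScript.HAN, true /* outputUnigrams */);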
Factory for .
<fieldType name="text_cjk" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"
han="true" hiragana="true"
katakana="true" hangul="true" outputUnigrams="false" />
</analyzer>
</fieldType>
Creates a new
CJKTokenizer is designed for Chinese, Japanese, and Korean languages.
The tokens returned are every two adjacent characters with overlap match.
Example: "java C1C2C3C4" will be segmented to: "java" "C1C2" "C2C3" "C3C4".
Additionally, the following is applied to Latin text (such as English):
- Text is converted to lowercase.
- Numeric digits, '+', '#', and '_' are tokenized as letters.
- Full-width forms are converted to half-width forms.
For more info on Asian language (Chinese, Japanese, and Korean) text segmentation:
please search google
@deprecated Use StandardTokenizer, CJKWidthFilter, CJKBigramFilter, and LowerCaseFilter instead.
Word token type
Single byte token type
Double byte token type
Names for token types
Max word length
buffer size:
Regular expression for testing Unicode character class \p{IsHalfwidthandFullwidthForms}.
Regular expression for testing Unicode character class \p{IsBasicLatin}.
word offset, used to imply which character (in the buffer) is parsed
the index used only for ioBuffer
data length
character buffer, store the characters which are used to compose
the returned Token
I/O buffer, used to store the content of the input (one of the
members of Tokenizer)
word type: single=>ASCII double=>non-ASCII word=>default
tag: previous character is a cached double-byte character "C1C2C3C4"
----(set the C1 isTokened) C1C2 "C2C3C4" ----(set the C2 isTokened)
C1C2 C2C3 "C3C4" ----(set the C3 isTokened) "C1C2 C2C3 C3C4"
Construct a token stream processing the given input.
I/O reader
Returns true for the next token in the stream, or false at EOS.
See http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.UnicodeBlock.html
for detail.
false for end of stream, true otherwise
when read error
happened in the InputStream
Factory for .
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.CJKTokenizerFactory"/>
</analyzer>
</fieldType>
@deprecated Use instead.
Creates a new
A that normalizes CJK width differences:
- Folds fullwidth ASCII variants into the equivalent basic latin
- Folds halfwidth Katakana variants into the equivalent kana
NOTE: this filter can be viewed as a (practical) subset of NFKC/NFKD
Unicode normalization. See the normalization support in the ICU package
for full normalization.
halfwidth kana mappings: 0xFF65-0xFF9D
note: 0xFF9C and 0xFF9D are only mapped to 0x3099 and 0x309A
as a fallback when they cannot properly combine with a preceding
character into a composed form.
kana combining diffs: 0x30A6-0x30FD
returns true if we successfully combined the voice mark
Factory for .
<fieldType name="text_cjk" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Analyzer for Sorani Kurdish.
File containing default Kurdish stopwords.
Returns an unmodifiable instance of the default stop words set.
default stop words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
Builds an analyzer with the given stop words.
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
stemming.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates a
which tokenizes all the text in the provided .
A
built from an filtered with
, ,
, ,
if a stem exclusion set is
provided and .
A TokenFilter that applies SoraniNormalizer to normalize the orthography.
Factory for .
<fieldType name="text_ckbnormal" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SoraniNormalizationFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Normalizes the Unicode representation of Sorani text.
Normalization consists of:
- Alternate forms of 'y' (064A, 0649) are converted to 06CC (FARSI YEH)
- Alternate form of 'k' (0643) is converted to 06A9 (KEHEH)
- Alternate forms of vowel 'e' (0647+200C, word-final 0647, 0629) are converted to 06D5 (AE)
- Alternate (joining) form of 'h' (06BE) is converted to 0647
- Alternate forms of 'rr' (0692, word-initial 0631) are converted to 0695 (REH WITH SMALL V BELOW)
- Harakat, tatweel, and formatting characters such as directional controls are removed.
Normalize an input buffer of Sorani text
input buffer
length of input buffer
length of input buffer after normalization
A TokenFilter that applies SoraniStemmer to stem Sorani words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_ckbstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SoraniNormalizationFilterFactory"/>
<filter class="solr.SoraniStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Light stemmer for Sorani
Stem an input buffer of Sorani text.
input buffer
length of input buffer
length of input buffer after normalization
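A sketch of assembling the Sorani chain by hand, mirroring the factory example above (the tokenizer input is a placeholder):

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Ckb;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

// Normalize the orthography first, then apply the light stemmer.
Tokenizer source = new StandardTokenizer(LuceneVersion.LUCENE_48,
    new StringReader("sorani text here"));
TokenStream result = new SoraniNormalizationFilter(source);
result = new SoraniStemFilter(result);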
An that tokenizes text with and
filters with
@deprecated (3.1) Use StandardAnalyzer instead, which has the same functionality.
This analyzer will be removed in Lucene 5.0
Creates
used to tokenize all the text in the provided .
built from a filtered with
A with a stop word table.
- Numeric tokens are removed.
- English tokens must be larger than 1 character.
- One Chinese character as one Chinese word.
TO DO:
- Add Chinese stop words, such as \ue400
- Dictionary based Chinese word extraction
- Intelligent Chinese word extraction
@deprecated (3.1) Use StopFilter instead, which has the same functionality.
This filter will be removed in Lucene 5.0
Factory for
@deprecated Use instead.
Creates a new
Tokenize Chinese text as individual Chinese characters.
The difference between and
is that they have different
token parsing logic.
For example, if the Chinese text
"C1C2C3C4" is to be indexed:
- The tokens returned from ChineseTokenizer are C1, C2, C3, C4.
- The tokens returned from the CJKTokenizer are C1C2, C2C3, C3C4.
Therefore the index created by is much larger.
The problem is that when searching for C1, C1C2, C1C3,
C4C2, C1C2C3 ... the works, but the
will not work.
@deprecated (3.1) Use StandardTokenizer instead, which has the same functionality.
This filter will be removed in Lucene 5.0
Factory for
@deprecated Use instead.
Creates a new
Construct bigrams for frequently occurring terms while indexing. Single terms
are still indexed too, with bigrams overlaid. This is achieved through the
use of . Bigrams have a type
of "gram". Example:
- input:"the quick brown fox"
- output:|"the","the-quick"|"brown"|"fox"|
- "the-quick" has a position increment of 0 so it is in the same position
as "the" "the-quick" has a term.type() of "gram"
Construct a token stream filtering the given input using a Set of common
words to create bigrams. Outputs both unigrams with position increment and
bigrams with position increment 0 type=gram where one or both of the words
in a potential bigram are in the set of common words.
lucene compatibility version
input in filter chain
The set of common words.
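A sketch of the index-time construction just described (the common-words set is illustrative):

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.CommonGrams;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Words that should be glued into bigrams at index time.
var commonWords = new CharArraySet(LuceneVersion.LUCENE_48,
    new[] { "the", "of", "in" }, true /* ignoreCase */);
Tokenizer source = new WhitespaceTokenizer(LuceneVersion.LUCENE_48,
    new StringReader("the quick brown fox"));
// Outputs unigrams plus overlaid bigrams for common-word pairs (see the example above).
TokenStream grams = new CommonGramsFilter(LuceneVersion.LUCENE_48, source, commonWords);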
Inserts bigrams for common words into a token stream. For each input token,
output the token. If the token and/or the following token are in the list
of common words, also output a bigram with position increment 0 and
type="gram"
TODO: Consider adding an option to not emit unigram stopwords
as in CDL XTF BigramStopFilter; CommonGramsQueryFilter would need to be
changed to work with this.
TODO: Consider optimizing for the case of three
commongrams i.e "man of the year" normally produces 3 bigrams: "man-of",
"of-the", "the-year" but with proper management of positions we could
eliminate the middle bigram "of-the"and save a disk seek and a whole set of
position lookups.
This method is called by a consumer before it begins consumption using
.
Resets this stream to a clean state. Stateful implementations must implement
this method so that they can be reused, just as if they had been created fresh.
If you override this method, always call base.Reset(), otherwise
some internal state will not be correctly reset (e.g., will
throw on further usage).
NOTE:
The default implementation chains the call to the input , so
be sure to call base.Reset() when overriding this method.
Determines if the current token is a common term
true if the current token is a common term, false otherwise
Saves this information to form the left part of a gram
Constructs a compound token.
Constructs a .
<fieldType name="text_cmmngrms" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.CommonGramsFilterFactory" words="commongramsstopwords.txt" ignoreCase="false"/>
</analyzer>
</fieldType>
Creates a new
Wrap a CommonGramsFilter, optimizing phrase queries by only returning single
words when they are not a member of a bigram.
Example:
- query input to CommonGramsFilter: "the rain in spain falls mainly"
- output of CommonGramsFilter/input to CommonGramsQueryFilter:
|"the", "the-rain"|"rain", "rain-in"|"in", "in-spain"|"spain"|"falls"|"mainly"
- output of CommonGramsQueryFilter:"the-rain", "rain-in" ,"in-spain",
"falls", "mainly"
See:http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//all/org/apache/lucene/analysis/TokenStream.html and
http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/analysis/package.html?revision=718798
Constructs a new CommonGramsQueryFilter based on the provided CommonGramsFilter
CommonGramsFilter the QueryFilter will use
This method is called by a consumer before it begins consumption using
.
Resets this stream to a clean state. Stateful implementations must implement
this method so that they can be reused, just as if they had been created fresh.
If you override this method, always call base.Reset(), otherwise
some internal state will not be correctly reset (e.g., will
throw on further usage).
NOTE:
The default implementation chains the call to the input , so
be sure to call base.Reset() when overriding this method.
Output bigrams whenever possible to optimize queries. Only output unigrams
when they are not a member of a bigram. Example:
- input: "the rain in spain falls mainly"
- output:"the-rain", "rain-in" ,"in-spain", "falls", "mainly"
Convenience method to check if the current type is a gram type
true if the current type is a gram type, false otherwise
Construct .
<fieldType name="text_cmmngrmsqry" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.CommonGramsQueryFilterFactory" words="commongramsquerystopwords.txt" ignoreCase="false"/>
</analyzer>
</fieldType>
Creates a new
Create a CommonGramsFilter and wrap it with a CommonGramsQueryFilter
Base class for decomposition token filters.
You must specify the required compatibility when creating
:
- As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0
supplementary characters in strings and char arrays provided as compound word
dictionaries.
- As of 4.4, doesn't update offsets.
The default for minimal word length that gets decomposed
The default for minimal length of subwords that get propagated to the output of this filter
The default for maximal length of subwords that get propagated to the output of this filter
Decomposes the current and places instances in the list.
The original token may not be placed in the list, as it is automatically passed through this filter.
Helper class to hold decompounded token information
Construct the compound token based on a slice of the current .
A that decomposes compound words found in many Germanic languages.
"Donaudampfschiff" becomes Donau, dampf, schiff so that you can find
"Donaudampfschiff" even when you only enter "schiff".
It uses a brute-force algorithm to achieve this.
You must specify the required compatibility when creating
:
- As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0
supplementary characters in strings and char arrays provided as compound word
dictionaries.
Creates a new
Lucene version to enable correct Unicode 4.0 behavior in the
dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
the to process
the word dictionary to match against.
Creates a new
Lucene version to enable correct Unicode 4.0 behavior in the
dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
the to process
the word dictionary to match against.
only words longer than this get processed
only subwords longer than this get to the output stream
only subwords shorter than this get to the output stream
Add only the longest matching subword to the stream
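A sketch of the dictionary-based decompounder described above (the dictionary and input are illustrative):

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Compound;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Simple words used to decompose German compounds.
var dictionary = new CharArraySet(LuceneVersion.LUCENE_48,
    new[] { "donau", "dampf", "schiff" }, true /* ignoreCase */);
Tokenizer source = new WhitespaceTokenizer(LuceneVersion.LUCENE_48,
    new StringReader("Donaudampfschiff"));
// Emits the original token plus the dictionary subwords it contains.
TokenStream decompounded = new DictionaryCompoundWordTokenFilter(
    LuceneVersion.LUCENE_48, source, dictionary);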
Factory for .
<fieldType name="text_dictcomp" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt"
minWordSize="5" minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="true"/>
</analyzer>
</fieldType>
Creates a new
A that decomposes compound words found in many Germanic languages.
"Donaudampfschiff" becomes Donau, dampf, schiff so that you can find
"Donaudampfschiff" even when you only enter "schiff". It uses a hyphenation
grammar and a word dictionary to achieve this.
You must specify the required compatibility when creating
:
- As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0
supplementary characters in strings and char arrays provided as compound word
dictionaries.
Creates a new instance.
Lucene version to enable correct Unicode 4.0 behavior in the
dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
the to process
the hyphenation pattern tree to use for hyphenation
the word dictionary to match against.
Creates a new instance.
Lucene version to enable correct Unicode 4.0 behavior in the
dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
the to process
the hyphenation pattern tree to use for hyphenation
the word dictionary to match against.
only words longer than this get processed
only subwords longer than this get to the output stream
only subwords shorter than this get to the output stream
Add only the longest matching subword to the stream
Create a with no dictionary.
Calls
Create a with no dictionary.
Calls
Create a hyphenator tree
the filename of the XML grammar to load
An object representing the hyphenation patterns
If there is a low-level I/O error.
Create a hyphenator tree
the filename of the XML grammar to load
The character encoding to use
An object representing the hyphenation patterns
If there is a low-level I/O error.
Create a hyphenator tree
the file of the XML grammar to load
An object representing the hyphenation patterns
If there is a low-level I/O error.
Create a hyphenator tree
the file of the XML grammar to load
The character encoding to use
An object representing the hyphenation patterns
If there is a low-level I/O error.
Create a hyphenator tree
the InputSource pointing to the XML grammar
An object representing the hyphenation patterns
If there is a low-level I/O error.
Create a hyphenator tree
the InputSource pointing to the XML grammar
The character encoding to use
An object representing the hyphenation patterns
If there is a low-level I/O error.
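A sketch tying the hyphenator tree to the filter (the grammar path is a placeholder for an FOP hyphenation pattern file; see the factory notes below):

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Compound;
using Lucene.Net.Analysis.Compound.Hyphenation;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

HyphenationTree hyphenator =
    HyphenationCompoundWordTokenFilter.GetHyphenationTree("hyphenation-de.xml");
var dictionary = new CharArraySet(LuceneVersion.LUCENE_48,
    new[] { "donau", "dampf", "schiff" }, true /* ignoreCase */);
Tokenizer source = new WhitespaceTokenizer(LuceneVersion.LUCENE_48,
    new StringReader("Donaudampfschiff"));
TokenStream decompounded = new HyphenationCompoundWordTokenFilter(
    LuceneVersion.LUCENE_48, source, hyphenator, dictionary);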
Factory for .
This factory accepts the following parameters:
hyphenator
(mandatory): path to the FOP xml hyphenation pattern.
See http://offo.sourceforge.net/hyphenation/.
encoding
(optional): encoding of the xml hyphenation file. defaults to UTF-8.
dictionary
(optional): dictionary of words. defaults to no dictionary.
minWordSize
(optional): minimal word length that gets decomposed. defaults to 5.
minSubwordSize
(optional): minimum length of subwords. defaults to 2.
maxSubwordSize
(optional): maximum length of subwords. defaults to 15.
onlyLongestMatch
(optional): if true, adds only the longest matching subword
to the stream. defaults to false.
<fieldType name="text_hyphncomp" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HyphenationCompoundWordTokenFilterFactory" hyphenator="hyphenator.xml" encoding="UTF-8"
dictionary="dictionary.txt" minWordSize="5" minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="false"/>
</analyzer>
</fieldType>
Creates a new
This class implements a simple byte vector with access to the underlying
array.
This class has been taken from the Apache FOP project (http://xmlgraphics.apache.org/fop/). They have been slightly modified.
Capacity increment size
The encapsulated array
Points to next free item
LUCENENET indexer for .NET
return number of items in array
returns current capacity of array
This is to implement memory allocation in the array. Like malloc().
This class implements a simple char vector with access to the underlying
array.
This class has been taken from the Apache FOP project (http://xmlgraphics.apache.org/fop/). They have been slightly modified.
Capacity increment size
The encapsulated array
Points to next free item
Reset Vector but don't resize or clear elements
LUCENENET indexer for .NET
return number of items in array
returns current capacity of array
This class represents a hyphen. A 'full' hyphen is made of 3 parts: the
pre-break text, post-break text and no-break. If no line-break is generated
at this position, the no-break text is used, otherwise, pre-break and
post-break are used. Typically, pre-break is equal to the hyphen character
and the others are empty. However, this general scheme allows support for
cases in some languages where words change spelling if they're split across
lines, like German's 'backen', which hyphenates as 'bak-ken'. BTW, this comes
from TeX.
This class has been taken from the Apache FOP project (http://xmlgraphics.apache.org/fop/). They have been slightly modified.
This class represents a hyphenated word.
This class has been taken from the Apache FOP project (http://xmlgraphics.apache.org/fop/). They have been slightly modified.
rawWord as made of alternating strings and instances
the number of hyphenation points in the word
the hyphenation points
This tree structure stores the hyphenation patterns in an efficient way for
fast lookup. It provides the method to hyphenate a word.
This class has been taken from the Apache FOP project (http://xmlgraphics.apache.org/fop/). They have been slightly modified.
Lucene.NET specific note:
If you are going to extend this class by inheriting from it, you should be aware that the
base class TernaryTree initializes its state in the constructor by calling its protected Init() method.
If your subclass needs to initialize its own state, you add your own "Initialize()" method
and call it both from the inside of your constructor and you will need to override the Balance() method
and call "Initialize()" before the call to base.Balance().
Your class can use the data that is initialized in the base class after the call to base.Balance().
value space: stores the interletter values
This map stores hyphenation exceptions
This map stores the character classes
Temporary map to store interletter values on pattern loading.
Packs the values by storing them in 4 bits, two values into a byte. Values
range from 0 to 9. We use zero as terminator, so we'll add 1 to the
value.
a string of digits from '0' to '9' representing the
interletter values.
the index into the vspace array where the packed values are stored.
Read hyphenation patterns from an XML file.
the filename
In case the parsing fails
Read hyphenation patterns from an XML file.
the filename
The character encoding to use
In case the parsing fails
Read hyphenation patterns from an XML file.
a object representing the file
In case the parsing fails
Read hyphenation patterns from an XML file.
a object representing the file
The character encoding to use
In case the parsing fails
Read hyphenation patterns from an XML file.
input source for the file
In case the parsing fails
Read hyphenation patterns from an XML file.
input source for the file
The character encoding to use
In case the parsing fails
Read hyphenation patterns from an .
input source for the file
In case the parsing fails
String compare, returns 0 if equal or t is a substring of s
Search for all possible partial matches of word starting at index and update
interletter values. In other words, it does something like:
for (i=0; i<patterns.Length; i++)
{
if (word.Substring(index).StartsWith(patterns[i], StringComparison.Ordinal))
update_interletter_values(patterns[i]);
}
But it is done in an efficient way since the patterns are stored in a
ternary tree. In fact, this is the whole purpose of having the tree: doing
this search without having to test every single pattern. The number of
patterns for languages such as English range from 4000 to 10000. Thus,
doing thousands of string comparisons for each word to hyphenate would be
really slow without the tree. The tradeoff is memory, but using a ternary
tree instead of a trie, almost halves the memory used by Lout or TeX.
It's also faster than using a hash table
null terminated word to match
start index from word
interletter values array to update
Hyphenate word and return a object.
the word to be hyphenated
Minimum number of characters allowed before the
hyphenation point.
Minimum number of characters allowed after the
hyphenation point.
a object representing the
hyphenated word or null if word is not hyphenated.
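A sketch of calling Hyphenate directly (the pattern file path is a placeholder; the numeric arguments are the minimum leading and trailing character counts described above):

using Lucene.Net.Analysis.Compound;
using Lucene.Net.Analysis.Compound.Hyphenation;

HyphenationTree tree =
    HyphenationCompoundWordTokenFilter.GetHyphenationTree("hyphenation-de.xml");
// Returns null when no hyphenation point satisfies the length constraints.
var result = tree.Hyphenate("Dampfschiff", 2, 2);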
Hyphenate word and return an array of hyphenation points.
w = "****nnllllllnnn*****", where n is a non-letter, l is a letter, all n
may be absent, the first n is at offset, the first l is at offset +
iIgnoreAtBeginning; word = ".llllll.'\0'***", where all l in w are copied
into word. In the first part of the routine len = w.length, in the second
part of the routine len = word.length. Three indices are used: index(w),
the index in w, index(word), the index in word, letterindex(word), the
index in the letter part of word. The following relations exist: index(w) =
offset + i - 1 index(word) = i - iIgnoreAtBeginning letterindex(word) =
index(word) - 1 (see first loop). It follows that: index(w) - index(word) =
offset - 1 + iIgnoreAtBeginning index(w) = letterindex(word) + offset +
iIgnoreAtBeginning
char array that contains the word
Offset to first character in word
Length of word
Minimum number of characters allowed before the
hyphenation point.
Minimum number of characters allowed after the
hyphenation point.
a object representing the
hyphenated word or null if word is not hyphenated.
Add a character class to the tree. It is used by
as callback to add character classes.
Character classes define the valid word characters for hyphenation. If a
word contains a character not defined in any of the classes, it is not
hyphenated. It also defines a way to normalize the characters in order to
compare them with the stored patterns. Usually pattern files use only lower
case characters, in this case a class for letter 'a', for example, should
be defined as "aA", the first character being the normalization char.
Add an exception to the tree. It is used by
class as callback to store the
hyphenation exceptions.
normalized word
a vector of alternating strings and
objects.
Add a pattern to the tree. Mainly, to be used by
class as callback to add a pattern to
the tree.
the hyphenation pattern
interletter weight values indicating the desirability and
priority of hyphenating at a given point within the pattern. It
should contain only digit characters. (i.e. '0' to '9').
This interface is used to connect the XML pattern file parser to the
hyphenation tree.
This interface has been taken from the Apache FOP project (http://xmlgraphics.apache.org/fop/). They have been slightly modified.
Add a character class. A character class defines characters that are
considered equivalent for the purpose of hyphenation (e.g. "aA"). It
usually means to ignore case.
character group
Add a hyphenation exception. An exception replaces the result obtained by
the algorithm for cases for which this fails or the user wants to provide
his own hyphenation. A hyphenated word is a vector of alternating strings
and instances
Add hyphenation patterns.
the pattern
interletter values expressed as a string of digit characters.
An XMLReader document handler to read and parse hyphenation patterns from an XML
file.
LUCENENET: This class has been refactored from its Java counterpart to use XmlReader rather
than a SAX parser.
Parses a hyphenation pattern file.
The complete file path to be read.
In case of an exception while parsing
Parses a hyphenation pattern file.
The complete file path to be read.
The character encoding to use
In case of an exception while parsing
Parses a hyphenation pattern file.
a object representing the file
In case of an exception while parsing
Parses a hyphenation pattern file.
a object representing the file
The character encoding to use
In case of an exception while parsing
Parses a hyphenation pattern file.
The stream containing the XML data.
The XML reader scans the first bytes of the stream looking for a byte order mark
or other sign of encoding. When encoding is determined, the encoding is used to continue reading
the stream, and processing continues parsing the input as a stream of (Unicode) characters.
In case of an exception while parsing
Parses a hyphenation pattern file.
input source for the file
In case of an exception while parsing
LUCENENET specific helper class to force the DTD file to be read from the embedded resource
rather than from the file system.
Receive notification of the beginning of an element.
The Parser will invoke this method at the beginning of every element in the XML document;
there will be a corresponding end-element event for every start-element event
(even when the element is empty). All of the element's content will be reported,
in order, before the corresponding endElement event.
the Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed
the local name (without prefix), or the empty string if Namespace processing is not being performed
the attributes attached to the element. If there are no attributes, it shall be an empty Attributes object. The value of this object after startElement returns is undefined
Receive notification of the end of an element.
The parser will invoke this method at the end of every element in the XML document;
there will be a corresponding start-element event for every end-element
event (even when the element is empty).
the Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed
the local name (without prefix), or the empty string if Namespace processing is not being performed
Receive notification of character data.
The Parser will call this method to report each chunk of character data. Parsers may
return all contiguous character data in a single chunk, or they may split it into
several chunks; however, all of the characters in any single event must come from
the same external entity so that the Locator provides useful information.
The application must not attempt to read from the array outside of the specified range.
Ternary Search Tree.
A ternary search tree is a hybrid between a binary tree and a digital search
tree (trie). Keys are limited to strings. A data value of type char is stored
in each leaf node. It can be used as an index (or pointer) to the data.
Branches that only contain one key are compressed to one node by storing a
pointer to the trailer substring of the key. This class is intended to serve
as base class or helper class to implement Dictionary collections or the
like. Ternary trees have some nice properties as the following: the tree can
be traversed in sorted order, partial matches (wildcard) can be implemented,
retrieval of all keys within a given distance from the target, etc. The
storage requirements are higher than a binary tree but a lot less than a
trie. Performance is comparable with a hash table, sometimes it outperforms a
hash function (most of the time can determine a miss faster than a hash).
The main purpose of this java port is to serve as a base for implementing
TeX's hyphenation algorithm (see The TeXBook, appendix H). Each language
requires from 5000 to 15000 hyphenation patterns which will be keys in this
tree. The strings patterns are usually small (from 2 to 5 characters), but
each char in the tree is stored in a node. Thus memory usage is the main
concern. We will sacrifice 'elegance' to keep memory requirements to the
minimum. Using java's char type as a pointer (yes, I know pointer is a
forbidden word in java) we can keep the size of the node to be just 8 bytes
(3 pointers and the data char). This gives room for about 65000 nodes. In my
tests the English patterns took 7694 nodes and the German patterns 10055
nodes, so I think we are safe.
All said, this is a map with strings as keys and char as value. Pretty
limited! It can be extended to a general map by using the string
representation of an object and using the char value as an index to an array
that contains the object values.
This class has been taken from the Apache FOP project (http://xmlgraphics.apache.org/fop/). They have been slightly modified.
Pointer to low branch and to rest of the key when it is stored directly in
this node, we don't have unions in java!
Pointer to high branch.
Pointer to equal branch and to data when this node is a string terminator.
The character stored in this node: splitchar. Two special values are
reserved:
- 0x0000 as string terminator
- 0xFFFF to indicate that the branch starting at this node is compressed
This shouldn't be a problem if we give the usual semantics to strings since
0xFFFF is guaranteed not to be a Unicode character.
This vector holds the trailing of the keys when the branch is compressed.
Branches are initially compressed, needing one node per key plus the size
of the string key. They are decompressed as needed when another key with
same prefix is inserted. This saves a lot of space, especially for long
keys.
The actual insertion function, recursive version.
Compares 2 null terminated char arrays
Compares a string with null terminated char array
Recursively insert the median first and then the median of the lower and
upper halves, and so on in order to get a balanced tree. The array of keys
is assumed to be sorted in ascending order.
Balance the tree for best search performance
Each node stores a character (splitchar) which is part of some key(s). In a
compressed branch (one that only contain a single string key) the trailer
of the key which is not already in nodes is stored externally in the kv
array. As items are inserted, key substrings decrease. Some substrings may
completely disappear when the whole branch is totally decompressed. The
tree is traversed to find the key substrings actually used. In addition,
duplicate substrings are removed using a map (implemented with a
TernaryTree!).
Gets an enumerator over the keys of this .
NOTE: This was keys() in Lucene.
An enumerator over the keys of this .
Enumerator for TernaryTree
LUCENENET NOTE: This differs a bit from its Java counterpart to adhere to
.NET IEnumerator semantics. In Java, when the is
instantiated, it is already positioned at the first element. However,
to act like a .NET IEnumerator, the initial state is undefined and considered
to be before the first element until is called, and
if a move took place it will return true.
current node index
current key
Node stack
key stack implemented with a
traverse upwards
traverse the tree to find next key
"Tokenizes" the entire stream as a single token. This is useful
for data like zip codes, ids, and some product names.
Emits the entire input as a single token.
Default read buffer size
Factory for .
<fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
Creates a new
A LetterTokenizer is a tokenizer that divides text at non-letters. That's to
say, it defines tokens as maximal strings of adjacent letters, as defined by
predicate.
Note: this does a decent job for most European languages, but does a terrible
job for some Asian languages, where words are not separated by spaces.
You must specify the required compatibility when creating
:
- As of 3.1, uses an based API to normalize and
detect token characters. See and
for details.
Construct a new .
to match.
the input to split up into tokens
Construct a new using a given
.
to match
the attribute factory to use for this
the input to split up into tokens
Collects only characters which satisfy
.
Factory for .
<fieldType name="text_letter" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.LetterTokenizerFactory"/>
</analyzer>
</fieldType>
Creates a new
Normalizes token text to lower case.
You must specify the required
compatibility when creating LowerCaseFilter:
- As of 3.1, supplementary characters are properly lowercased.
Create a new , that normalizes token text to lower case.
See
to filter
Factory for .
<fieldType name="text_lwrcase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
LowerCaseTokenizer performs the function of LetterTokenizer
and LowerCaseFilter together. It divides text at non-letters and converts
them to lower case. While it is functionally equivalent to the combination
of and , there is a performance advantage
to doing the two tasks at once, hence this (redundant) implementation.
Note: this does a decent job for most European languages, but does a terrible
job for some Asian languages, where words are not separated by spaces.
You must specify the required compatibility when creating
:
- As of 3.1, uses an int based API to normalize and
detect token characters. See and
for details.
Construct a new .
to match
the input to split up into tokens
Construct a new using a given
.
to match
the attribute factory to use for this
the input to split up into tokens
Converts char to lower case
in the invariant culture.
Factory for .
<fieldType name="text_lwrcase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
</fieldType>
Creates a new
An Analyzer that filters LetterTokenizer with LowerCaseFilter.
You must specify the required compatibility
when creating :
- As of 3.1, uses an int based API to normalize and
detect token codepoints. See and
for details.
Creates a new
to match
Filters LetterTokenizer with LowerCaseFilter and StopFilter.
You must specify the required
compatibility when creating :
- As of 3.1, StopFilter correctly handles Unicode 4.0
supplementary characters in stopwords
- As of 2.9, position increments are preserved
An unmodifiable set containing some common English words that are not usually useful
for searching.
Builds an analyzer which removes words in
.
See
Builds an analyzer with the stop words from the given set.
See
Set of stop words
Builds an analyzer with the stop words from the given file.
See
File to load stop words from
Builds an analyzer with the stop words from the given reader.
See
to load stop words from
Creates
used to tokenize all the text in the provided .
built from a filtered with
Removes stop words from a token stream.
You must specify the required
compatibility when creating :
- As of 3.1, StopFilter correctly handles Unicode 4.0
supplementary characters in stopwords and position
increments are preserved
Constructs a filter which removes words from the input that are
named in the .
Lucene version to enable correct Unicode 4.0 behavior in the stop
set if Version > 3.0. See LuceneVersion for details.
Input
A representing the stopwords.
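A sketch of constructing the filter (the MakeStopSet helper used to build the set is described just below; the word list and sample text are illustrative):

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Build the stop set once and reuse it across analyzer instances.
CharArraySet stops = StopFilter.MakeStopSet(LuceneVersion.LUCENE_48, "the", "a", "an");
Tokenizer source = new WhitespaceTokenizer(LuceneVersion.LUCENE_48,
    new StringReader("a quick fox and the lazy dog"));
TokenStream filtered = new StopFilter(LuceneVersion.LUCENE_48, source, stops);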
Builds a from an array of stop words,
appropriate for passing into the constructor.
This permits this construction to be cached once when
an is constructed.
to enable correct Unicode 4.0 behavior in the returned set if Version > 3.0
An array of stopwords
passing false to ignoreCase
Builds a from an array of stop words,
appropriate for passing into the constructor.
This permits this construction to be cached once when
an is constructed.
to enable correct Unicode 4.0 behavior in the returned set if Version > 3.0
A list of strings or char[] or any other ToString()-able objects representing the stopwords
A Set () containing the words
passing false to ignoreCase
Creates a stopword set from the given stopword array.
to enable correct Unicode 4.0 behavior in the returned set if Version > 3.0
An array of stopwords
If true, all words are lower cased first.
a Set () containing the words
Creates a stopword set from the given stopword list.
to enable correct Unicode 4.0 behavior in the returned set if Version > 3.0
A list of strings or char[] or any other ToString()-able objects representing the stopwords
if true, all words are lower cased first
A Set () containing the words
Creates a stopword set from the given stopword list.
to enable correct Unicode 4.0 behavior in the returned set if Version > 3.0
A list of strings or char[] or any other ToString()-able objects representing the stopwords
if true, all words are lower cased first
A Set () containing the words
Returns the next input Token whose Term is not a stop word.
Factory for .
<fieldType name="text_stop" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" format="wordset" />
</analyzer>
</fieldType>
All attributes are optional:
- ignoreCase defaults to false
- words should be the name of a stopwords file to parse, if not
specified the factory will use
- format defines how the words file will be parsed,
and defaults to wordset. If words is not specified,
then format must not be specified.
The valid values for the format option are:
- wordset - This is the default format, which supports one word per
line (including any intra-word whitespace) and allows whole line comments
beginning with the "#" character. Blank lines are ignored. See
for details.
- snowball - This format allows for multiple words specified on each
line, and trailing comments may be specified using the vertical line ("|").
Blank lines are ignored. See
for details.
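As a hedged illustration of the two formats described above (the entries are arbitrary examples, not a recommended stop list):
A stopwords file in wordset format:
# whole-line comments start with "#"; one entry per line, intra-word whitespace is kept
the
an
a priori
The same idea in snowball format:
the an of    | several entries per line; "|" starts a trailing comment
with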
Creates a new
Removes tokens whose types appear in a set of blocked types from a token stream.
@deprecated enablePositionIncrements=false is not supported anymore as of Lucene 4.4.
@deprecated enablePositionIncrements=false is not supported anymore as of Lucene 4.4.
Create a new .
the match version
the to consume
the types to filter
if true, then tokens whose type is in will
be kept, otherwise they will be filtered out
Create a new that filters tokens out
(useWhiteList=false).
By default, the token is accepted if its type is not a stop type.
When the useWhiteList parameter is set to true, the token is accepted if its type is contained in the set.
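For illustration, a minimal sketch assuming the Lucene.Net 4.8 API surface (the exact collection type accepted for the stop types may differ slightly in the port; the "<NUM>" type name is StandardTokenizer's type for numeric tokens):
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

public static class TypeTokenFilterExample
{
    // Drops tokens that StandardTokenizer typed as numbers ("<NUM>").
    public static TokenStream BuildChain(TextReader reader)
    {
        var numericTypes = new HashSet<string> { "<NUM>" };
        var source = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
        // useWhiteList = false: tokens whose type is in the set are filtered out.
        return new TypeTokenFilter(LuceneVersion.LUCENE_48, source, numericTypes, false);
    }
}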
Factory class for .
<fieldType name="chars" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt"
useWhitelist="false"/>
</analyzer>
</fieldType>
Creates a new
Normalizes token text to UPPER CASE.
You must specify the required
compatibility when creating
NOTE: In Unicode, this transformation may lose information when the
upper case character represents more than one lower case character. Use this filter
when you require uppercase tokens. Use the for
general search matching.
Create a new , that normalizes token text to upper case.
See
to filter
Factory for .
<fieldType name="text_uppercase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.UpperCaseFilterFactory"/>
</analyzer>
</fieldType>
NOTE: In Unicode, this transformation may lose information when the
upper case character represents more than one lower case character. Use this filter
when you require uppercase tokens. Use the for
general search matching.
Creates a new
An that uses .
You must specify the required compatibility
when creating :
- As of 3.1, uses an int based API to normalize and
detect token codepoints. See and
for details.
Creates a new
to match
A is a tokenizer that divides text at whitespace.
Adjacent sequences of non-Whitespace characters form tokens.
You must specify the required compatibility when creating
:
- As of 3.1, uses an int based API to normalize and
detect token characters. See and
for details.
Construct a new .
to match
the input to split up into tokens
Construct a new using a given
.
to match
the attribute factory to use for this
the input to split up into tokens
Collects only characters which do not satisfy
.
Factory for .
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
Creates a new
for Czech language.
Supports an external list of stopwords (words that will not be indexed at
all). A default set of stopwords is used unless an alternative list is
specified.
You must specify the required compatibility when creating
:
- As of 3.1, words are stemmed with
- As of 2.9, StopFilter preserves position increments
- As of 2.4, Tokens incorrectly identified as acronyms are corrected (see
LUCENE-1068)
File containing default Czech stopwords.
Returns a set of default Czech-stopwords
a set of default Czech-stopwords
Builds an analyzer with the default stop words ().
to match
Builds an analyzer with the given stop words.
to match
a stopword set
Builds an analyzer with the given stop words and a set of words to be
excluded from the .
to match
a stopword set
a stemming exclusion set
Creates
used to tokenize all the text in the provided .
built from a filtered with
, , ,
and (only if version is >= LUCENE_31). If
a version is >= LUCENE_31 and a stem exclusion set is provided via
a
is added before
.
A that applies to stem Czech words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
NOTE: Input is expected to be in lowercase,
but with diacritical marks
Factory for .
<fieldType name="text_czstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CzechStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Light Stemmer for Czech.
Implements the algorithm described in:
Indexing and stemming approaches for the Czech language
http://portal.acm.org/citation.cfm?id=1598600
Stem an input buffer of Czech text.
NOTE: Input is expected to be in lowercase,
but with diacritical marks
input buffer
length of input buffer
length of input buffer after normalization
for Danish.
File containing default Danish stopwords.
Returns an unmodifiable instance of the default stop words set.
default stop words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
Builds an analyzer with the given stop words.
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
stemming.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates a
which tokenizes all the text in the provided .
A
built from an filtered with
, , ,
if a stem exclusion set is
provided and .
for German language.
Supports an external list of stopwords (words that
will not be indexed at all) and an external list of exclusions (word that will
not be stemmed, but indexed).
A default set of stopwords is used unless an alternative list is specified, but the
exclusion list is empty by default.
You must specify the required
compatibility when creating GermanAnalyzer:
- As of 3.6, GermanLightStemFilter is used for less aggressive stemming.
- As of 3.1, Snowball stemming is done with SnowballFilter, and
Snowball stopwords are used by default.
- As of 2.9, StopFilter preserves position
increments
NOTE: This class uses the same
dependent settings as .
@deprecated in 3.1, remove in Lucene 5.0 (index bw compat)
File containing default German stopwords.
Returns a set of default German-stopwords
a set of default German-stopwords
@deprecated in 3.1, remove in Lucene 5.0 (index bw compat)
Contains words that should be indexed but not stemmed.
Builds an analyzer with the default stop words:
.
Builds an analyzer with the given stop words
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words
lucene compatibility version
a stopword set
a stemming exclusion set
Creates
used to tokenize all the text in the provided .
built from a filtered with
, , ,
if a stem exclusion set is
provided, and
A that applies to stem German
words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_delgtstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.GermanLightStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Light Stemmer for German.
This stemmer implements the "UniNE" algorithm in:
Light Stemming Approaches for the French, Portuguese, German and Hungarian Languages
Jacques Savoy
A that applies to stem German
words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_deminstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.GermanMinimalStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Minimal Stemmer for German.
This stemmer implements the following algorithm:
Morphologie et recherche d'information
Jacques Savoy.
Normalizes German characters according to the heuristics
of the http://snowball.tartarus.org/algorithms/german2/stemmer.html
German2 snowball algorithm.
It allows for the fact that ä, ö and ü are sometimes written as ae, oe and ue.
- 'ß' is replaced by 'ss'
- 'ä', 'ö', 'ü' are replaced by 'a', 'o', 'u', respectively.
- 'ae' and 'oe' are replaced by 'a', and 'o', respectively.
- 'ue' is replaced by 'u', when not following a vowel or q.
This is useful if you want this normalization without using
the German2 stemmer, or perhaps no stemming at all.
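For illustration, a minimal sketch of such a chain, assuming the Lucene.Net 4.8 API surface; with it, the spellings "Bücher" and "Buecher" should both come out as the token "bucher":
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.De;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

public static class GermanNormalizationExample
{
    public static TokenStream BuildChain(TextReader reader)
    {
        var source = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
        TokenStream stream = new LowerCaseFilter(LuceneVersion.LUCENE_48, source);
        // Applies the German2 orthographic heuristics without any stemming.
        return new GermanNormalizationFilter(stream);
    }
}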
Factory for .
<fieldType name="text_denorm" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.GermanNormalizationFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
A that stems German words.
It supports a table of words that should
not be stemmed at all. The stemmer used can be changed at runtime after the
filter object is created (as long as it is a ).
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
The actual token in the input stream.
Creates a instance
the source
Returns true for next token in the stream, or false at EOS
Set an alternative/custom for this filter.
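For illustration, a minimal sketch of protecting a term from this stemmer with a keyword marker, assuming the Lucene.Net 4.8 API surface (the protected word is just an example):
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.De;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

public static class GermanStemExclusionExample
{
    public static TokenStream BuildChain(TextReader reader)
    {
        var keywords = new CharArraySet(LuceneVersion.LUCENE_48, 1, false);
        keywords.Add("autobahn"); // already lowercased, since it is checked after LowerCaseFilter

        var source = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
        TokenStream stream = new LowerCaseFilter(LuceneVersion.LUCENE_48, source);
        stream = new SetKeywordMarkerFilter(stream, keywords); // marks "autobahn" as a keyword
        return new GermanStemFilter(stream);                   // keyword-marked terms are left unstemmed
    }
}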
Factory for .
<fieldType name="text_destem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.GermanStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
A stemmer for German words.
The algorithm is based on the report
"A Fast and Simple Stemming Algorithm for German Words" by Jörg
Caumanns (joerg.caumanns at isst.fhg.de).
Buffer for the terms while stemming them.
Amount of characters that are removed with while stemming.
Stems the given term to a unique discriminator.
The term that should be stemmed.
Discriminator for
Checks if a term could be stemmed.
true if, and only if, the given term consists only of letters.
Performs suffix stripping (stemming) on the current term. The stripping is reduced
to the seven "base" suffixes "e", "s", "n", "t", "em", "er" and "nd",
from which all regular suffixes are built. The simplification causes
some overstemming and many more irregular stems, but still provides unique
discriminators in most of those cases.
The algorithm is context free, except of the length restrictions.
Performs some optimizations on the term. These optimizations are
contextual.
Removes a particle denotation ("ge") from a term.
Do some substitutions for the term to reduce overstemming:
- Substitute Umlauts with their corresponding vowel: äöü -> aou,
"ß" is substituted by "ss"
- Substitute a second char of a pair of equal characters with
an asterisk: ?? -> ?*
- Substitute some common character combinations with a token:
sch/ch/ei/ie/ig/st -> $/§/%/&/#/!
Undoes the changes made by . Those are character pairs and
character combinations. Umlauts will remain as their corresponding vowel,
and "ß" remains as "ss".
for the Greek language.
Supports an external list of stopwords (words
that will not be indexed at all).
A default set of stopwords is used unless an alternative list is specified.
You must specify the required
compatibility when creating :
- As of 3.1, StandardFilter and GreekStemmer are used by default.
- As of 2.9, StopFilter preserves position
increments
NOTE: This class uses the same
dependent settings as .
File containing default Greek stopwords.
Returns a set of default Greek-stopwords
a set of default Greek-stopwords
Builds an analyzer with the default stop words.
Lucene compatibility version,
See
Builds an analyzer with the given stop words.
NOTE: The stopwords set should be pre-processed with the logic of
for best results.
Lucene compatibility version,
See
a stopword set
Creates
used to tokenize all the text in the provided .
built from a filtered with
, ,
, and
Normalizes token text to lower case, removes some Greek diacritics,
and standardizes final sigma to sigma.
You must specify the required
compatibility when creating :
- As of 3.1, supplementary characters are properly lowercased.
Create a that normalizes Greek token text.
Lucene compatibility version,
See
to filter
Factory for .
<fieldType name="text_glc" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.GreekLowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
A that applies to stem Greek
words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
NOTE: Input is expected to be casefolded for Greek (including folding of final
sigma to sigma), and with diacritics removed. This can be achieved by using
either or ICUFoldingFilter before .
@lucene.experimental
Factory for .
<fieldType name="text_gstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.GreekLowerCaseFilterFactory"/>
<filter class="solr.GreekStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
A stemmer for Greek words, according to: Development of a Stemmer for the
Greek Language. Georgios Ntais
NOTE: Input is expected to be casefolded for Greek (including folding of final
sigma to sigma), and with diacritics removed. This can be achieved with
either or ICUFoldingFilter.
@lucene.experimental
Stems a word contained in a leading portion of a array.
The word is passed through a number of rules that modify its length.
A array that contains the word to be stemmed.
The length of the array.
The new length of the stemmed word.
Checks if the word contained in the leading portion of the char[] array
ends with the suffix given as a parameter.
A char[] array that represents a word.
The length of the char[] array.
A object to check if the given word ends with these characters.
True if the word ends with the given suffix, false otherwise.
Checks if the word contained in the leading portion of the array
ends with a Greek vowel.
A array that represents a word.
The length of the array.
True if the word contained in the leading portion of the array
ends with a vowel, false otherwise.
Checks if the word contained in the leading portion of the array
ends with a Greek vowel.
A array that represents a word.
The length of the array.
True if the word contained in the leading portion of the array
ends with a vowel, false otherwise.
for English.
Returns an unmodifiable instance of the default stop words set.
default stop words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
lucene compatibility version
Builds an analyzer with the given stop words.
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
stemming.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates a
which tokenizes all the text in the provided .
A
built from an filtered with
, ,
, ,
if a stem exclusion set is
provided and .
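For illustration, a minimal sketch of the stem-exclusion constructor described above, assuming the Lucene.Net 4.8 API surface (the static default-stop-set property name DefaultStopSet and the excluded word are assumptions for the example):
using Lucene.Net.Analysis.En;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

public static class EnglishAnalyzerExample
{
    public static EnglishAnalyzer Create()
    {
        var exclusions = new CharArraySet(LuceneVersion.LUCENE_48, 1, true);
        exclusions.Add("marketing"); // keep "marketing" from being stemmed to "market"

        return new EnglishAnalyzer(LuceneVersion.LUCENE_48, EnglishAnalyzer.DefaultStopSet, exclusions);
    }
}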
A that applies to stem
English words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_enminstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Minimal plural stemmer for English.
This stemmer implements the "S-Stemmer" from
How Effective Is Suffixing?
Donna Harman.
TokenFilter that removes possessives (trailing 's) from words.
You must specify the required
compatibility when creating :
- As of 3.6, U+2019 RIGHT SINGLE QUOTATION MARK and
U+FF07 FULLWIDTH APOSTROPHE are also treated as
quotation marks.
@deprecated Use instead.
Factory for .
<fieldType name="text_enpossessive" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
A list of words used by Kstem
A list of words used by Kstem
A list of words used by Kstem
A list of words used by Kstem
A list of words used by Kstem
A list of words used by Kstem
A list of words used by Kstem
A list of words used by Kstem
A high-performance kstem filter for english.
See
"Viewing Morphology as an Inference Process"
(Krovetz, R., Proceedings of the Sixteenth Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, 191-203, 1993).
All terms must already be lowercased for this filter to work correctly.
Note: This filter is aware of the . To prevent
certain terms from being passed to the stemmer
should be set to true
in a previous .
Note: For including the original term as well as the stemmed version, see
Returns the next, stemmed, input Token.
The stemmed form of a token.
If there is a low-level I/O error.
Factory for .
<fieldType name="text_kstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
This class implements the Kstem algorithm
Title: Kstemmer
Description: This is a java version of Bob Krovetz' kstem stemmer
Copyright: Copyright 2008, Lucid Imagination, Inc.
Copyright: Copyright 2003, CIIR University of Massachusetts Amherst (http://ciir.cs.umass.edu)
INDEX of final letter in word. You must add 1 to k to get
the current length of word. When you want the length of
word, use the method wordLength, which returns (k+1).
length of stem within word
Convert plurals to singular form, and '-ies' to 'y'
replace old suffix with s
convert past tense (-ed) to present, and `-ied' to `y'
return TRUE if word ends with a double consonant
handle `-ing' endings
this routine deals with -ity endings. It accepts -ability, -ibility, and
-ality, even without checking the dictionary because they are so
productive. The first two are mapped to -ble, and the -ity is removed for
the latter
handle -ence and -ance
handle -ness
handle -ism
this routine deals with -ment endings.
this routine deals with -ize endings.
handle -ency and -ancy
handle -able and -ible
handle -ic endings. This is fairly straightforward, but this is also the
only place we try *expanding* an ending, -ic -> -ical. This is to handle
cases like `canonic' -> `canonical'
this routine deals with -ion, -ition, -ation, -ization, and -ication. The
-ization ending is always converted to -ize
this routine deals with -er, -or, -ier, and -eer. The -izer ending is
always converted to -ize
this routine deals with -ly endings. The -ally ending is always converted
to -al. Sometimes this will temporarily leave us with a non-word (e.g.,
heuristically maps to heuristical), but then the -al is removed in the next
step.
this routine deals with -al endings. Some of the endings from the previous
routine are finished up here.
this routine deals with -ive endings. It normalizes some of the -ative
endings directly, and also maps some -ive endings to -ion.
Returns the result of the stem (assuming the word was changed) as a .
Stems the text in the token. Returns true if changed.
Transforms the token stream as per the Porter stemming algorithm.
Note: the input to the stemming filter must already be in lower case,
so you will need to use LowerCaseFilter or LowerCaseTokenizer farther
down the Tokenizer chain in order for this to work properly!
To use this filter with other analyzers, you'll want to write an
Analyzer class that sets up the TokenStream chain as you want it.
To use this with LowerCaseTokenizer, for example, you'd write an
analyzer like this:
class MyAnalyzer : Analyzer {
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader) {
        Tokenizer source = new LowerCaseTokenizer(version, reader);
        return new TokenStreamComponents(source, new PorterStemFilter(source));
    }
}
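For illustration, a hedged sketch of consuming that analyzer's output, assuming the Lucene.Net 4.8 Analyzer API and that the version variable above has been given a concrete LuceneVersion (for example LuceneVersion.LUCENE_48); the field name and text are arbitrary:
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;

public static class PorterStemDemo
{
    public static void Main()
    {
        // MyAnalyzer is the class sketched above; "body" is just an example field name.
        using (TokenStream ts = new MyAnalyzer().GetTokenStream("body", "flowers flowing"))
        {
            ICharTermAttribute term = ts.AddAttribute<ICharTermAttribute>();
            ts.Reset();
            while (ts.IncrementToken())
            {
                Console.WriteLine(term.ToString()); // flower, flow
            }
            ts.End();
        }
    }
}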
Note: This filter is aware of the . To prevent
certain terms from being passed to the stemmer
should be set to true
in a previous .
Note: For including the original term as well as the stemmed version, see
Factory for .
<fieldType name="text_porterstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Stemmer, implementing the Porter Stemming Algorithm
The Stemmer class transforms a word into its root form. The input
word can be provided one character at a time (by calling ), or all at once
by calling one of the various Stem methods, such as .
resets the stemmer so it can stem another word. If you invoke
the stemmer by calling and then , you must call
before starting another word.
Add a character to the word being stemmed. When you are finished
adding characters, you can call to process the word.
After a word has been stemmed, it can be retrieved by ,
or a reference to the internal buffer can be retrieved by
and (which is generally more efficient).
Returns the length of the word resulting from the stemming process.
Returns a reference to a character buffer containing the results of
the stemming process. You also need to consult
to determine the length of the result.
Stem a word provided as a . Returns the result as a .
Stem a word contained in a . Returns true if the stemming process
resulted in a word different from the input. You can retrieve the
result with / or .
Stem a word contained in a portion of a array. Returns
true if the stemming process resulted in a word different from
the input. You can retrieve the result with
/ or .
Stem a word contained in a leading portion of a array.
Returns true if the stemming process resulted in a word different
from the input. You can retrieve the result with
/ or .
Stem the word placed into the Stemmer buffer through calls to .
Returns true if the stemming process resulted in a word different
from the input. You can retrieve the result with
/ or .
for Spanish.
You must specify the required
compatibility when creating :
- As of 3.6, is used for less aggressive stemming.
File containing default Spanish stopwords.
Returns an unmodifiable instance of the default stop words set.
default stop words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
Builds an analyzer with the given stop words.
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
stemming.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates a
which tokenizes all the text in the provided .
A
built from an filtered with
, , ,
if a stem exclusion set is
provided and .
A that applies to stem Spanish
words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_eslgtstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SpanishLightStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Light Stemmer for Spanish
This stemmer implements the algorithm described in:
Report on CLEF-2001 Experiments
Jacques Savoy
for Basque.
File containing default Basque stopwords.
Returns an unmodifiable instance of the default stop words set.
default stop words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
Builds an analyzer with the given stop words.
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
stemming.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates a
which tokenizes all the text in the provided .
A
built from an filtered with
, , ,
if a stem exclusion set is
provided and .
for Persian.
This Analyzer uses which implies tokenizing around
zero-width non-joiner in addition to whitespace. Some Persian-specific variant forms (such as farsi
yeh and keheh) are standardized. "Stemming" is accomplished via stopwords.
File containing default Persian stopwords.
Default stopword list is from
http://members.unine.ch/jacques.savoy/clef/index.html. The stopword list is
BSD-Licensed.
The comment character in the stopwords file. All lines prefixed with this
will be ignored
Returns an unmodifiable instance of the default stop-words set.
an unmodifiable instance of the default stop-words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words:
.
Builds an analyzer with the given stop words
lucene compatibility version
a stopword set
Creates
used to tokenize all the text in the provided .
built from a filtered with
, ,
and Persian Stop words
Wraps the with
that replaces instances of Zero-width non-joiner with an
ordinary space.
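For illustration, a minimal sketch assuming the Lucene.Net 4.8 API surface; the char filter runs before the tokenizer:
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Fa;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

public static class PersianCharFilterExample
{
    public static Tokenizer BuildTokenizer(TextReader reader)
    {
        // Zero-width non-joiners become ordinary spaces before tokenization.
        TextReader filtered = new PersianCharFilter(reader);
        return new StandardTokenizer(LuceneVersion.LUCENE_48, filtered);
    }
}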
Factory for .
<fieldType name="text_fa" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<charFilter class="solr.PersianCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
</fieldType>
Creates a new
A that applies to normalize the
orthography.
Factory for .
<fieldType name="text_fanormal" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<charFilter class="solr.PersianCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PersianNormalizationFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Normalizer for Persian.
Normalization is done in-place for efficiency, operating on a termbuffer.
Normalization is defined as:
- Normalization of various heh + hamza forms and heh goal to heh.
- Normalization of farsi yeh and yeh barree to arabic yeh
- Normalization of persian keheh to arabic kaf
Normalize an input buffer of Persian text
input buffer
length of input buffer
length of input buffer after normalization
A that applies to stem Persian words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_arstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PersianNormalizationFilterFactory"/>
<filter class="solr.PersianStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Stemmer for Persian.
Stemming is done in-place for efficiency, operating on a termbuffer.
Stemming is defined as:
- Removal of attached definite article, conjunction, and prepositions.
- Stemming of common suffixes.
Stem an input buffer of Persian text.
input buffer
length of input buffer
length of input buffer after normalization
Stem suffix(es) off a Persian word.
input buffer
length of input buffer
new length of input buffer after stemming
Returns true if the suffix matches and can be stemmed
input buffer
length of input buffer
suffix to check
true if the suffix matches and can be stemmed
for Finnish.
File containing default Finnish stopwords.
Returns an unmodifiable instance of the default stop words set.
default stop words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
Builds an analyzer with the given stop words.
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
stemming.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates a
which tokenizes all the text in the provided .
A
built from an filtered with
, , ,
if a stem exclusion set is
provided and .
A that applies to stem Finnish
words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_filgtstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.FinnishLightStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Light Stemmer for Finnish.
This stemmer implements the algorithm described in:
Report on CLEF-2003 Monolingual Tracks
Jacques Savoy
for French language.
Supports an external list of stopwords (words that
will not be indexed at all) and an external list of exclusions (word that will
not be stemmed, but indexed).
A default set of stopwords is used unless an alternative list is specified, but the
exclusion list is empty by default.
You must specify the required
compatibility when creating FrenchAnalyzer:
- As of 3.6, is used for less aggressive stemming.
- As of 3.1, Snowball stemming is done with ,
is used prior to , and and
Snowball stopwords are used by default.
- As of 2.9, preserves position
increments
NOTE: This class uses the same
dependent settings as .
Extended list of typical French stopwords.
@deprecated (3.1) remove in Lucene 5.0 (index bw compat)
File containing default French stopwords.
Default set of articles for
Contains words that should be indexed but not stemmed.
Returns an unmodifiable instance of the default stop-words set.
an unmodifiable instance of the default stop-words set.
@deprecated (3.1) remove this in Lucene 5.0, index bw compat
Builds an analyzer with the default stop words ().
Builds an analyzer with the given stop words
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words
lucene compatibility version
a stopword set
a stemming exclusion set
Creates
used to tokenize all the text in the provided .
built from a filtered with
, ,
, ,
if a stem exclusion set is
provided, and
A that applies to stem French
words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_frlgtstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ElisionFilterFactory"/>
<filter class="solr.FrenchLightStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Light Stemmer for French.
This stemmer implements the "UniNE" algorithm in:
Light Stemming Approaches for the French, Portuguese, German and Hungarian Languages
Jacques Savoy
A that applies to stem French
words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_frminstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ElisionFilterFactory"/>
<filter class="solr.FrenchMinimalStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Light Stemmer for French.
This stemmer implements the following algorithm:
A Stemming procedure and stopword list for general French corpora.
Jacques Savoy.
A that stems french words.
The used stemmer can be changed at runtime after the
filter object is created (as long as it is a ).
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
@deprecated (3.1) Use with
instead, which has the
same functionality. This filter will be removed in Lucene 5.0
The actual token in the input stream.
Returns true for the next token in the stream, or false at EOS
Set an alternative/custom for this filter.
A stemmer for French words.
The algorithm is based on the work of
Dr Martin Porter on his snowball project
refer to http://snowball.sourceforge.net/french/stemmer.html
(French stemming algorithm) for details
@deprecated Use instead,
which has the same functionality. This filter will be removed in Lucene 4.0
Buffer for the terms while stemming them.
A temporary buffer, used to reconstruct R2
Region R0 is equal to the whole buffer
Region RV
"If the word begins with two vowels, RV is the region after the third letter,
otherwise the region after the first vowel not at the beginning of the word,
or the end of the word if these positions cannot be found."
Region R1
"R1 is the region after the first non-vowel following a vowel
or is the null region at the end of the word if there is no such non-vowel"
Region R2
"R2 is the region after the first non-vowel in R1 following a vowel
or is the null region at the end of the word if there is no such non-vowel"
Set to true if we need to perform step 2
Set to true if the buffer was modified
Stems the given term to a unique discriminator.
The term that should be stemmed
Discriminator for
Sets the search region strings
it needs to be done each time the buffer was modified
First step of the Porter Algorithm
refer to http://snowball.sourceforge.net/french/stemmer.html for an explanation
Second step (A) of the Porter Algorithm
Will be performed if nothing changed from the first step
or changes were made in the amment, emment, ments or ment suffixes
refer to http://snowball.sourceforge.net/french/stemmer.html for an explanation
true if something changed in the
Second step (B) of the Porter Algorithm
Will be performed if step 2 A was performed unsuccessfully
refer to http://snowball.sourceforge.net/french/stemmer.html for an explanation
Third step of the Porter Algorithm
refer to http://snowball.sourceforge.net/french/stemmer.html for an explanation
Fourth step of the Porter Algorithm
refer to http://snowball.sourceforge.net/french/stemmer.html for an explanation
Fifth step of the Porter Algorithm
refer to http://snowball.sourceforge.net/french/stemmer.html for an explanation
Sixth (and last!) step of the Porter Algorithm
refer to http://snowball.sourceforge.net/french/stemmer.html for an explanation
Delete a suffix searched in zone "source" if zone "from" contains prefix + search string
the primary source zone for search
the strings to search for suppression
the secondary source zone for search
the prefix to add to the search string to test
true if modified
Delete a suffix searched in zone "source" if the preceding letter is (or isn't) a vowel
the primary source zone for search
the strings to search for suppression
true if we need a vowel before the search string
the secondary source zone for search (where vowel could be)
true if modified
Delete a suffix searched in zone "source" if preceded by the prefix
the primary source zone for search
the strings to search for suppression
the prefix to add to the search string to test
true if it will be deleted even without prefix found
Delete a suffix searched in zone "source" if preceded by prefix
or replace it with the replace string if preceded by the prefix in the zone "from"
or delete the suffix if specified
the primary source zone for search
the strings to search for suppression
the prefix to add to the search string to test
true if it will be deleted even without prefix found
the secondary source zone for search
the replacement string
Replace a search string with another within the source zone
the source zone for search
the strings to search for replacement
the replacement string
Delete a search string within the source zone
the source zone for search
the strings to search for suppression
Test if a char is a French vowel, including accented ones
the char to test
true if the char is a vowel
Retrieve the "R zone" (1 or 2 depending on the buffer) and return the corresponding string
"R is the region after the first non-vowel following a vowel
or is the null region at the end of the word if there is no such non-vowel"
the in buffer
the resulting string
Retrieve the "RV zone" from a buffer an return the corresponding string
"If the word begins with two vowels, RV is the region after the third letter,
otherwise the region after the first vowel not at the beginning of the word,
or the end of the word if these positions cannot be found."
the in buffer
the resulting string
Turns u and i preceded AND followed by a vowel to UpperCase
Turns y preceded OR followed by a vowel to UpperCase
Turns u preceded by q to UpperCase
the buffer to treat
the treated buffer
Checks whether a term can be processed correctly.
true if, and only if, the given term consists only of letters.
for Irish.
File containing default Irish stopwords.
When StandardTokenizer splits t-athair into {t, athair}, we don't
want to cause a position increment, otherwise there will be problems
with phrase queries versus tAthair (which would not have a gap).
Returns an unmodifiable instance of the default stop words set.
default stop words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
Builds an analyzer with the given stop words.
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
stemming.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates a
which tokenizes all the text in the provided .
A
built from an filtered with
, , ,
if a stem exclusion set is
provided and .
Normalises token text to lower case, handling t-prothesis
and n-eclipsis (i.e., that 'nAthair' should become 'n-athair')
Create an that normalises Irish token text.
Factory for .
<fieldType name="text_ga" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.IrishLowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
for Galician.
File containing default Galician stopwords.
Returns an unmodifiable instance of the default stop words set.
default stop words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
Builds an analyzer with the given stop words.
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
stemming.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates a
which tokenizes all the text in the provided .
A
built from an filtered with
, , ,
if a stem exclusion set is
provided and .
A that applies to stem
Galician words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_glplural" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.GalicianMinimalStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Minimal Stemmer for Galician
This follows the "RSLP-S" algorithm, but modified for Galician.
Hence this stemmer only applies the plural reduction step of:
"Regras do lematizador para o galego"
A that applies to stem
Galician words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_glstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.GalicianStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Galician stemmer implementing "Regras do lematizador para o galego".
Description of rules
buffer, oversized to at least len+1
initial valid length of buffer
new valid length, stemmed
Analyzer for Hindi.
You must specify the required
compatibility when creating HindiAnalyzer:
- As of 3.6, StandardTokenizer is used for tokenization
File containing default Hindi stopwords.
Default stopword list is from http://members.unine.ch/jacques.savoy/clef/index.html
The stopword list is BSD-Licensed.
Returns an unmodifiable instance of the default stop-words set.
an unmodifiable instance of the default stop-words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the given stop words
lucene compatibility version
a stopword set
a stemming exclusion set
Builds an analyzer with the given stop words
lucene compatibility version
a stopword set
Builds an analyzer with the default stop words:
.
Creates
used to tokenize all the text in the provided .
built from a filtered with
, ,
,
if a stem exclusion set is provided, , and
Hindi Stop words
A that applies to normalize the
orthography.
In some cases the normalization may cause unrelated terms to conflate, so
to prevent terms from being normalized use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_hinormal" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.HindiNormalizationFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Normalizer for Hindi.
Normalizes text to remove some differences in spelling variations.
Implements the Hindi-language specific algorithm specified in:
Word normalization in Indian languages
Prasad Pingali and Vasudeva Varma.
http://web2py.iiit.ac.in/publications/default/download/inproceedings.pdf.3fe5b38c-02ee-41ce-9a8f-3e745670be32.pdf
with the following additions from Hindi CLIR in Thirty Days
Leah S. Larkey, Margaret E. Connell, and Nasreen AbdulJaleel.
http://maroo.cs.umass.edu/pub/web/getpdf.php?id=454:
- Internal Zero-width joiner and Zero-width non-joiners are removed
- In addition to chandrabindu, NA+halant is normalized to anusvara
Normalize an input buffer of Hindi text
input buffer
length of input buffer
length of input buffer after normalization
A that applies to stem Hindi words.
Factory for .
<fieldType name="text_histem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.HindiStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Light Stemmer for Hindi.
Implements the algorithm specified in:
A Lightweight Stemmer for Hindi
Ananthakrishnan Ramanathan and Durgesh D Rao.
http://computing.open.ac.uk/Sites/EACLSouthAsia/Papers/p6-Ramanathan.pdf
In-memory structure for the dictionary (.dic) and affix (.aff)
data of a hunspell dictionary.
Creates a new containing the information read from the provided s to hunspell affix
and dictionary files.
You have to dispose the provided s yourself.
for reading the hunspell affix file (won't be disposed).
for reading the hunspell dictionary file (won't be disposed).
Can be thrown while reading from the s
Can be thrown if the content of the files does not meet expected formats
Creates a new containing the information read from the provided s to hunspell affix
and dictionary files.
You have to dispose the provided s yourself.
for reading the hunspell affix file (won't be disposed).
for reading the hunspell dictionary files (won't be disposed).
ignore case?
Can be thrown while reading from the s
Can be thrown if the content of the files does not meet expected formats
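For illustration, a minimal sketch assuming the Lucene.Net 4.8 Hunspell API; the en_GB file names mirror the factory example below and stand in for whatever affix/dictionary pair is actually used:
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Hunspell;

public static class HunspellDictionaryExample
{
    public static TokenStream Wrap(TokenStream input)
    {
        Dictionary dictionary;
        using (Stream affix = File.OpenRead("en_GB.aff"))
        using (Stream words = File.OpenRead("en_GB.dic"))
        {
            // The constructor reads from the streams; the caller still disposes them.
            dictionary = new Dictionary(affix, words);
        }
        return new HunspellStemFilter(input, dictionary);
    }
}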
Looks up HunspellAffix suffixes that have an append that matches the created from the given array, offset and length
array to generate the from
Offset in the char array that the starts at
Length from the offset that the is
List of HunspellAffix suffixes with an append that matches the , or null if none are found
Reads the affix file through the provided , building up the prefix and suffix maps
to read the content of the affix file from
to decode the content of the file
Can be thrown while reading from the InputStream
Parses a specific affix rule putting the result into the provided affix map
where the result of the parsing will be put
Header line of the affix rule
to read the content of the rule from
pattern to be used to generate the condition regex
pattern
map from condition -> index of patterns, for deduplication.
Can be thrown while reading the rule
pattern accepts optional BOM + SET + any whitespace
Parses the encoding specified in the affix file readable through the provided
for reading the affix file
Encoding specified in the affix file
Can be thrown while reading from the
Thrown if the first non-empty non-comment line read from the file does not adhere to the format SET <encoding>
Retrieves the for the given encoding. Note, This isn't perfect as I think ISCII-DEVANAGARI and
MICROSOFT-CP1251 etc are allowed...
Encoding to retrieve the instance for
for the given encoding
Determines the appropriate based on the FLAG definition line taken from the affix file
Line containing the flag information
that handles parsing flags in the way specified in the FLAG definition
Reads the dictionary file through the provided s, building up the words map
s to read the dictionary file through
used to decode the contents of the file
Can be thrown while reading from the file
Abstraction of the process of parsing flags taken from the affix and dic files
Parses the given into a single flag
to parse into a flag
Parsed flag
Parses the given into multiple flags
to parse into flags
Parsed flags
Simple implementation of that treats the chars in each as individual flags.
Can be used with both the ASCII and UTF-8 flag types.
Implementation of that assumes each flag is encoded in its numerical form. In the case
of multiple flags, each number is separated by a comma.
Implementation of that assumes each flag is encoded as two ASCII characters whose codes
must be combined into a single character.
that uses hunspell affix rules and words to stem tokens.
Since hunspell supports a word having multiple stems, this filter can emit
multiple tokens for each consumed token
Note: This filter is aware of the . To prevent
certain terms from being passed to the stemmer
should be set to true
in a previous .
Note: For including the original term as well as the stemmed version, see
@lucene.experimental
Create a outputting all possible stems.
Create a outputting all possible stems.
Creates a new HunspellStemFilter that will stem tokens from the given using affix rules in the provided
Dictionary
whose tokens will be stemmed
Hunspell containing the affix rules and words that will be used to stem the tokens
remove duplicates
true if only the longest term should be output.
that creates instances of .
Example config for British English:
<filter class="solr.HunspellStemFilterFactory"
dictionary="en_GB.dic,my_custom.dic"
affix="en_GB.aff"
ignoreCase="false"
longestOnly="false" />
Both parameters dictionary and affix are mandatory.
Dictionaries for many languages are available through the OpenOffice project.
See http://wiki.apache.org/solr/Hunspell
@lucene.experimental
Creates a new
Stemmer uses the affix rules declared in the to generate one or more stems for a word. It
conforms to the algorithm in the original hunspell algorithm, including recursive suffix stripping.
Constructs a new Stemmer which will use the provided to create its stems.
that will be used to create the stems
Find the stem(s) of the provided word.
Word to find the stems for
of stems for the word
Find the stem(s) of the provided word
Word to find the stems for
length
of stems for the word
Find the unique stem(s) of the provided word
Word to find the stems for
length
of stems for the word
Generates a list of stems for the provided word
Word to generate the stems for
length
previous affix that was removed (so we don't remove the same one twice)
Flag from a previous stemming step that need to be cross-checked with any affixes in this recursive step
flag of the innermost removed prefix, so that when removing a suffix, it is also checked against the word
current recursiondepth
true if we should remove prefixes
true if we should remove suffixes
true if the previous removal was a prefix:
if we are removing a suffix, and it has no continuation requirements, it's OK,
but two prefixes (COMPLEXPREFIXES) or two suffixes must have continuation requirements to recurse.
true if the previous prefix removal was signed as a circumfix
this means inner most suffix must also contain circumfix flag.
true if we are searching for a case variant. If the word has the KEEPCASE flag it cannot succeed.
of stems, or empty list if no stems are found
checks condition of the concatenation of two strings
Applies the affix rule to the given word, producing a list of stems if any are found
Word the affix has been removed and the strip added
valid length of stripped word
HunspellAffix representing the affix rule itself
when we have already stripped a prefix, we can't simply recurse and check the suffix, unless both are compatible,
so we must check the dictionary form against both to add it as a stem!
current recursion depth
true if we are removing a prefix (false if it is a suffix)
true if the previous prefix removal was signed as a circumfix
this means inner most suffix must also contain circumfix flag.
true if we are searching for a case variant. If the word has the KEEPCASE flag it cannot succeed.
of stems for the word, or an empty list if none are found
Checks if the given flag cross checks with the given array of flags
Flag to cross check with the array of flags
Array of flags to cross check against. Can be null
If true, will match a zero length flags array.
true if the flag is found in the array or the array is null, false otherwise
for Hungarian.
File containing default Hungarian stopwords.
Returns an unmodifiable instance of the default stop words set.
default stop words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
lucene compatibility version
Builds an analyzer with the given stop words.
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
stemming.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates a
which tokenizes all the text in the provided .
A
built from an filtered with
, , ,
if a stem exclusion set is
provided and .
A that applies to stem
Hungarian words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_hulgtstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.HungarianLightStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Light Stemmer for Hungarian.
This stemmer implements the "UniNE" algorithm in:
Light Stemming Approaches for the French, Portuguese, German and Hungarian Languages
Jacques Savoy
for Armenian.
File containing default Armenian stopwords.
Returns an unmodifiable instance of the default stop words set.
default stop words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
lucene compatibility version
Builds an analyzer with the given stop words.
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
stemming.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates a
which tokenizes all the text in the provided .
A
built from an filtered with
, , ,
if a stem exclusion set is
provided and .
for Indonesian (Bahasa)
File containing default Indonesian stopwords.
Returns an unmodifiable instance of the default stop-words set.
an unmodifiable instance of the default stop-words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
Builds an analyzer with the given stop words
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates
used to tokenize all the text in the provided .
built from an filtered with
, ,
,
if a stem exclusion set is provided and .
A that applies to stem Indonesian words.
Calls IndonesianStemFilter(input, true)
Create a new .
If is false,
only inflectional suffixes (particles and possessive pronouns) are stemmed.
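For illustration, a minimal sketch assuming the Lucene.Net 4.8 API surface:
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Id;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

public static class IndonesianStemExample
{
    public static TokenStream BuildChain(TextReader reader)
    {
        var source = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
        TokenStream stream = new LowerCaseFilter(LuceneVersion.LUCENE_48, source);
        // false = stem only inflectional suffixes; true (the default) also stems derivational ones.
        return new IndonesianStemFilter(stream, false);
    }
}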
Factory for .
<fieldType name="text_idstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.IndonesianStemFilterFactory" stemDerivational="true"/>
</analyzer>
</fieldType>
Creates a new
Stemmer for Indonesian.
Stems Indonesian words with the algorithm presented in:
A Study of Stemming Effects on Information Retrieval in
Bahasa Indonesia, Fadillah Z Tala.
http://www.illc.uva.nl/Publications/ResearchReports/MoL-2003-02.text.pdf
Stem a term (returning its new length).
Use to control whether full stemming
or only light inflectional stemming is done.
A that applies to normalize text
in Indian Languages.
Factory for .
<fieldType name="text_innormal" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.IndicNormalizationFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Normalizes the Unicode representation of text in Indian languages.
Follows guidelines from Unicode 5.2, chapter 6, South Asian Scripts I
and graphical decompositions from http://ldc.upenn.edu/myl/IndianScriptsUnicode.html
Decompositions according to Unicode 5.2,
and http://ldc.upenn.edu/myl/IndianScriptsUnicode.html
Most of these are not handled by Unicode normalization anyway.
The numbers here represent offsets into the respective codepages,
with -1 representing null and 0xFF representing zero-width joiner.
the columns are: ch1, ch2, ch3, res, flags
ch1, ch2, and ch3 are the decomposition
res is the composition, and flags are the scripts to which it applies.
Normalizes input text, and returns the new length.
The length will always be less than or equal to the existing length.
input text
valid length
normalized length
Compose into standard form any compositions in the decompositions table.
LUCENENET: Returns the unicode block for the specified character. Caches the
last script and script data used on the current thread to optimize performance
when not switching between scripts.
Simple Tokenizer for text in Indian Languages.
@deprecated (3.6) Use instead.
for Italian.
You must specify the required
compatibility when creating :
- As of 3.6, is used for less aggressive stemming.
- As of 3.2, with a set of Italian
contractions is used by default.
File containing default Italian stopwords.
Returns an unmodifiable instance of the default stop words set.
default stop words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
lucene compatibility version
Builds an analyzer with the given stop words.
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
stemming.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates a
which tokenizes all the text in the provided .
A
built from an filtered with
, , , ,
if a stem exclusion set is
provided and .
A that applies to stem Italian
words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_itlgtstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ItalianLightStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Light Stemmer for Italian.
This stemmer implements the algorithm described in:
Report on CLEF-2001 Experiments
Jacques Savoy
for Latvian.
File containing default Latvian stopwords.
Returns an unmodifiable instance of the default stop words set.
default stop words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
lucene compatibility version
Builds an analyzer with the given stop words.
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
stemming.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates a
which tokenizes all the text in the provided .
A
built from an filtered with
, , ,
if a stem exclusion set is
provided and .
A that applies to stem Latvian
words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_lvstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.LatvianStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Light stemmer for Latvian.
This is a light version of the algorithm in Karlis Kreslin's PhD thesis
A stemming algorithm for Latvian with the following modifications:
- Only explicitly stems noun and adjective morphology
- Stricter length/vowel checks for the resulting stems (verb etc suffix stripping is removed)
- Removes only the primary inflectional suffixes: case and number for nouns;
case, number, gender, and definiteness for adjectives.
- Palatalization is only handled when a declension II,V,VI noun suffix is removed.
Stem a Latvian word; returns the new adjusted length.
Most cases are handled except for the ambiguous ones:
- s -> š
- t -> š
- d -> ž
- z -> ž
Count the vowels in the string; we always require at least
one in the remaining stem to accept it.
This class converts alphabetic, numeric, and symbolic Unicode characters
which are not in the first 127 ASCII characters (the "Basic Latin" Unicode
block) into their ASCII equivalents, if one exists.
Characters from the following Unicode blocks are converted; however, only
those characters with reasonable ASCII alternatives are converted:
See: http://en.wikipedia.org/wiki/Latin_characters_in_Unicode
For example, 'à' will be replaced by 'a'.
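A minimal C# sketch of how such a folding chain is typically consumed (assuming the Lucene.NET 4.8 API surface; the sample input text is illustrative):
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

TokenStream ts = new WhitespaceTokenizer(LuceneVersion.LUCENE_48,
    new StringReader("crème brûlée à la carte"));
ts = new ASCIIFoldingFilter(ts, false); // false = do not keep the original tokens
ICharTermAttribute termAtt = ts.AddAttribute<ICharTermAttribute>();
ts.Reset();
while (ts.IncrementToken())
{
    Console.WriteLine(termAtt.ToString()); // creme, brulee, a, la, carte
}
ts.End();
ts.Dispose();
The same Reset/IncrementToken/End/Dispose sequence is required whenever a TokenStream is consumed directly.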
Create a new .
TokenStream to filter
should the original tokens be kept on the input stream with a 0 position increment
from the folded tokens?
Does the filter preserve the original tokens?
Converts characters above ASCII to their ASCII equivalents. For example,
accents are removed from accented characters.
The string to fold
The number of characters in the input string
Converts characters above ASCII to their ASCII equivalents. For example,
accents are removed from accented characters.
@lucene.internal
The characters to fold
Index of the first character to fold
The result of the folding. Should be of size >= length * 4.
Index of output where to put the result of the folding
The number of characters to fold
length of output
Factory for .
<fieldType name="text_ascii" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>
</analyzer>
</fieldType>
Creates a new
A filter to apply normal capitalization rules to Tokens. It will make the first letter
capital and the rest lower case.
This filter is particularly useful to build nice looking facet parameters. This filter
is not appropriate if you intend to use a prefix query.
Creates a with the default parameters using the invariant culture.
Calls
CapitalizationFilter(in, true, null, true, null, 0, DEFAULT_MAX_WORD_COUNT, DEFAULT_MAX_TOKEN_LENGTH, null)
Creates a with the default parameters and the specified .
Calls
CapitalizationFilter(in, true, null, true, null, 0, DEFAULT_MAX_WORD_COUNT, DEFAULT_MAX_TOKEN_LENGTH)
input tokenstream
The culture to use for the casing operation. If null, will be used.
Creates a with the specified parameters using the invariant culture.
input tokenstream
should each word be capitalized or all of the words?
a keep word list. Each word that should be kept separated by whitespace.
Force the first letter to be capitalized even if it is in the keep list.
do not change word capitalization if a word begins with something in this list.
how long the word needs to be to get capitalization applied. If the
minWordLength is 3, "and" > "And" but "or" stays "or".
if the token contains more than maxWordCount words, the capitalization is
assumed to be correct.
The maximum length for an individual token. Tokens that exceed this length will not have the capitalization operation performed.
Creates a with the specified parameters and the specified .
input tokenstream
should each word be capitalized or all of the words?
a keep word list. Each word that should be kept separated by whitespace.
Force the first letter to be capitalized even if it is in the keep list.
do not change word capitalization if a word begins with something in this list.
how long the word needs to be to get capitalization applied. If the
minWordLength is 3, "and" > "And" but "or" stays "or".
if the token contains more than maxWordCount words, the capitalization is
assumed to be correct.
The maximum length for an individual token. Tokens that exceed this length will not have the capitalization operation performed.
The culture to use for the casing operation. If null, will be used.
Factory for .
The factory takes parameters:
"onlyFirstWord" - should each word be capitalized or all of the words?
"keep" - a keep word list. Each word that should be kept separated by whitespace.
"keepIgnoreCase - true or false. If true, the keep list will be considered case-insensitive.
"forceFirstLetter" - Force the first letter to be capitalized even if it is in the keep list
"okPrefix" - do not change word capitalization if a word begins with something in this list.
for example if "McK" is on the okPrefix list, the word "McKinley" should not be changed to
"Mckinley"
"minWordLength" - how long the word needs to be to get capitalization applied. If the
minWordLength is 3, "and" > "And" but "or" stays "or"
"maxWordCount" - if the token contains more then maxWordCount words, the capitalization is
assumed to be correct.
"culture" - the culture to use to apply the capitalization rules. If not supplied or the string
"invariant" is supplied, the invariant culture is used.
<fieldType name="text_cptlztn" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.CapitalizationFilterFactory" onlyFirstWord="true"
keep="java solr lucene" keepIgnoreCase="false"
okPrefix="McK McD McA"/>
</analyzer>
</fieldType>
@since solr 1.3
Creates a new
Removes words that are too long or too short from the stream.
Note: Length is calculated as the number of Unicode codepoints.
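For example (a hedged sketch against the assumed Lucene.NET 4.8 constructor documented below), a term made of a single supplementary character counts as one code point here, whereas a UTF-16-based length filter would count it as two code units:
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Util;

// "𠀋" is a single code point encoded as a surrogate pair (two UTF-16 units)
TokenStream ts = new WhitespaceTokenizer(LuceneVersion.LUCENE_48,
    new StringReader("a 𠀋 abcdef"));
ts = new CodepointCountFilter(LuceneVersion.LUCENE_48, ts, 1, 3);
// keeps "a" (1 code point) and "𠀋" (1 code point), drops "abcdef" (6)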
Create a new . This will filter out tokens whose
is either too short (
< min) or too long ( > max).
the Lucene match version
the to consume
the minimum length
the maximum length
Factory for .
<fieldType name="text_lngth" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.CodepointCountFilterFactory" min="0" max="1" />
</analyzer>
</fieldType>
Creates a new
An always exhausted token stream.
When the plain text is extracted from documents, we will often have many words hyphenated and broken into
two lines. This is often the case with documents where narrow text columns are used, such as newsletters.
In order to increase search efficiency, this filter puts hyphenated words broken into two lines back together.
This filter should be used at indexing time only.
Example field definition in schema.xml:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldtype>
Creates a new
that will be filtered
Consumers (i.e., ) use this method to advance the stream to
the next token. Implementing classes must implement this method and update
the appropriate s with the attributes of the next
token.
The producer must make no assumptions about the attributes after the method
has been returned: the caller may arbitrarily change it. If the producer
needs to preserve the state for subsequent calls, it can use
to create a copy of the current attribute state.
this method is called for every token of a document, so an efficient
implementation is crucial for good performance. To avoid calls to
and ,
references to all s that this stream uses should be
retrieved during instantiation.
To ensure that filters and consumers know which attributes are available,
the attributes must be added during instantiation. Filters and consumers
are not required to check for availability of attributes in
.
false for end of stream; true otherwise
This method is called by a consumer before it begins consumption using
.
Resets this stream to a clean state. Stateful implementations must implement
this method so that they can be reused, just as if they had been created fresh.
If you override this method, always call base.Reset(), otherwise
some internal state will not be correctly reset (e.g., will
throw on further usage).
NOTE:
The default implementation chains the call to the input , so
be sure to call base.Reset() when overriding this method.
Writes the joined unhyphenated term
Factory for .
<fieldType name="text_hyphn" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
A that only keeps tokens with text contained in the
required words. This filter behaves like the inverse of .
@since solr 1.3
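A short sketch (assuming the Lucene.NET 4.8 API; the word list is illustrative) of building the filter from a CharArraySet:
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

var keep = new CharArraySet(LuceneVersion.LUCENE_48,
    new[] { "lucene", "solr" }, true); // true = ignore case
TokenStream ts = new WhitespaceTokenizer(LuceneVersion.LUCENE_48,
    new StringReader("Lucene and Solr are search libraries"));
ts = new KeepWordFilter(LuceneVersion.LUCENE_48, ts, keep);
// only "Lucene" and "Solr" survive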
@deprecated enablePositionIncrements=false is not supported anymore as of Lucene 4.4.
Create a new .
NOTE: The words set passed to this constructor will be directly
used by this filter and should not be modified.
the Lucene match version
the to consume
the words to keep
Factory for .
<fieldType name="text_keepword" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="false"/>
</analyzer>
</fieldType>
Creates a new
Marks terms as keywords via the .
Creates a new
the input stream
Factory for .
<fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protectedkeyword.txt" pattern="^.+er$" ignoreCase="false"/>
</analyzer>
</fieldType>
Creates a new
This TokenFilter emits each incoming token twice, once as a keyword and once as a non-keyword; in other words, once with
set to true and once set to false.
This is useful if used with a stem filter that respects the to index the stemmed and the
un-stemmed version of a term into the same field.
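A sketch of that pattern (assuming the Lucene.NET 4.8 API; PorterStemFilter stands in for any keyword-aware stemmer):
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.En;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Util;

TokenStream ts = new WhitespaceTokenizer(LuceneVersion.LUCENE_48,
    new StringReader("running dogs"));
ts = new KeywordRepeatFilter(ts);         // every token is emitted twice
ts = new PorterStemFilter(ts);            // stems only the non-keyword copy
ts = new RemoveDuplicatesTokenFilter(ts); // drops copies the stemmer left unchanged
// emitted terms: running, run, dogs, dog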
Construct a token stream filtering the given input.
Factory for .
Since emits two tokens for every input token, any tokens that aren't transformed
later in the analysis chain will be in the document twice. Therefore, consider adding
later in the analysis chain.
Creates a new
Removes words that are too long or too short from the stream.
Note: Length is calculated as the number of UTF-16 code units.
Create a new . This will filter out tokens whose
is either too short (
< min) or too long ( > max).
the Lucene match version
the to consume
the minimum length
the maximum length
Factory for .
<fieldType name="text_lngth" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="0" max="1" />
</analyzer>
</fieldType>
Creates a new
This limits the number of tokens while indexing. It is
a replacement for the maximum field length setting inside .
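A minimal sketch (assuming the Lucene.NET 4.8 API; the token limit is illustrative):
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

Analyzer baseAnalyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
// index at most 10000 tokens per field; further tokens are discarded
Analyzer limited = new LimitTokenCountAnalyzer(baseAnalyzer, 10000);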
Build an analyzer that limits the maximum number of tokens per field.
This analyzer will not consume any tokens beyond the maxTokenCount limit
Build an analyzer that limits the maximum number of tokens per field.
the analyzer to wrap
max number of tokens to produce
whether all tokens from the delegate should be consumed even if maxTokenCount is reached.
This limits the number of tokens while indexing. It is
a replacement for the maximum field length setting inside .
By default, this filter ignores any tokens in the wrapped
once the limit has been reached, which can result in being
called prior to returning false. For most
implementations this should be acceptable, and faster
than consuming the full stream. If you are wrapping a
which requires that the full stream of tokens be exhausted in order to
function properly, use the
consumeAllTokens
option.
Build a filter that only accepts tokens up to a maximum number.
This filter will not consume any tokens beyond the limit
the stream to wrap
max number of tokens to produce
Build an filter that limits the maximum number of tokens per field.
the stream to wrap
max number of tokens to produce
whether all tokens from the input must be consumed even if is reached.
Factory for .
<fieldType name="text_lngthcnt" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10" consumeAllTokens="false" />
</analyzer>
</fieldType>
The property is optional and defaults to false.
See for an explanation of its use.
Creates a new
This limits its emitted tokens to those with positions that
are not greater than the configured limit.
By default, this filter ignores any tokens in the wrapped
once the limit has been exceeded, which can result in being
called prior to returning false. For most
implementations this should be acceptable, and faster
than consuming the full stream. If you are wrapping a
which requires that the full stream of tokens be exhausted in order to
function properly, use the
consumeAllTokens
option.
Build a filter that only accepts tokens up to and including the given maximum position.
This filter will not consume any tokens with position greater than the limit.
the stream to wrap
max position of tokens to produce (1st token always has position 1)
Build a filter that limits the maximum position of tokens to emit.
the stream to wrap
max position of tokens to produce (1st token always has position 1)
whether all tokens from the wrapped input stream must be consumed
even if maxTokenPosition is exceeded.
Factory for .
<fieldType name="text_limit_pos" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LimitTokenPositionFilterFactory" maxTokenPosition="3" consumeAllTokens="false" />
</analyzer>
</fieldType>
The property is optional and defaults to false.
See for an explanation of its use.
Creates a new
Old Broken version of
If not null is the set of tokens to protect from being delimited
Creates a new
to be filtered
table containing character types
Flags configuring the filter
If not null is the set of tokens to protect from being delimited
Creates a new using
as its charTypeTable
to be filtered
Flags configuring the filter
If not null is the set of tokens to protect from being delimited
This method is called by a consumer before it begins consumption using
.
Resets this stream to a clean state. Stateful implementations must implement
this method so that they can be reused, just as if they had been created fresh.
If you override this method, always call base.Reset(), otherwise
some internal state will not be correctly reset (e.g., will
throw on further usage).
NOTE:
The default implementation chains the call to the input , so
be sure to call base.Reset() when overriding this method.
Saves the existing attribute states
Flushes the given by either writing its concat and then clearing, or just clearing.
that will be flushed
true if the concatenation was written before it was cleared, false otherwise
Determines whether to concatenate a word or number if the current word is the given type
Type of the current word used to determine if it should be concatenated
true if concatenation should occur, false otherwise
Determines whether a word/number part should be generated for a word of the given type
Type of the word used to determine if a word/number part should be generated
true if a word/number part should be generated, false otherwise
Concatenates the saved buffer to the given WordDelimiterConcatenation
WordDelimiterConcatenation to concatenate the buffer to
Generates a word/number part, updating the appropriate attributes
true if the generation is occurring from a single word, false otherwise
Get the position increment gap for a subword or concatenation
true if this token wants to be injected
position increment gap
Checks if the given word type includes
Word type to check
true if the type contains , false otherwise
Checks if the given word type includes
Word type to check
true if the type contains , false otherwise
Checks if the given word type includes
Word type to check
true if the type contains , false otherwise
Checks if the given word type includes
Word type to check
true if the type contains , false otherwise
Determines whether the given flag is set
Flag to see if set
true if flag is set
A WDF concatenated 'run'
Appends the given text of the given length, to the concatenation at the given offset
Text to append
Offset in the concatenation to add the text
Length of the text to append
Writes the concatenation to the attributes
Determines if the concatenation is empty
true if the concatenation is empty, false otherwise
Clears the concatenation and resets its state
Convenience method for the common scenario of having to write the concatenation and then clearing its state
Efficient Lucene analyzer/tokenizer that preferably operates on a rather than a
, that can flexibly separate text into terms via a regular expression
(with behaviour similar to ),
and that combines the functionality of
,
,
,
into a single efficient
multi-purpose class.
If you are unsure how exactly a regular expression should look, consider
prototyping by simply trying various expressions on some test texts via
. Once you are satisfied, give that regex to
. Also see Regular Expression Tutorial.
This class can be considerably faster than the "normal" Lucene tokenizers.
It can also serve as a building block in a compound Lucene
chain. For example as in this
stemming example:
PatternAnalyzer pat = ...
TokenStream tokenStream = new SnowballFilter(
pat.GetTokenStream("content", "James is running round in the woods"),
"English"));
@deprecated (4.0) use the pattern-based analysis in the analysis/pattern package instead.
"\\W+"; Divides text at non-letters (NOT Character.isLetter(c))
"\\s+"; Divides text at whitespaces (Character.isWhitespace(c))
A lower-casing word analyzer with English stop words (can be shared
freely across threads without harm); global per class loader.
A lower-casing word analyzer with extended English stop words
(can be shared freely across threads without harm); global per class
loader. The stop words are borrowed from
http://thomas.loc.gov/home/stopwords.html, see
http://thomas.loc.gov/home/all.about.inquery.html
Constructs a new instance with the given parameters.
currently does nothing
a regular expression delimiting tokens
if true
returns tokens after applying
String.toLowerCase()
if non-null, ignores all tokens that are contained in the
given stop set (after previously having applied toLowerCase()
if applicable). For example, created via
and/or
as in
WordlistLoader.getWordSet(new File("samples/fulltext/stopwords.txt"))
or other stop word
lists.
Creates a token stream that tokenizes the given string into token terms
(aka words).
the name of the field to tokenize (currently ignored).
reader (e.g. charfilter) of the original text. can be null.
the string to tokenize
a new token stream
Creates a token stream that tokenizes all the text in the given SetReader;
This implementation forwards to and is
less efficient than .
the name of the field to tokenize (currently ignored).
the reader delivering the text
a new token stream
Indicates whether some other object is "equal to" this one.
the reference object with which to compare.
true if equal, false otherwise
Returns a hash code value for the object.
the hash code.
equality where o1 and/or o2 can be null
assumes p1 and p2 are not null
Reads until end-of-stream and returns all read chars, finally closes the stream.
the input stream
if an I/O error occurs while reading the stream
The work horse; performance isn't fantastic, but it's not nearly as bad
as one might think - kudos to the Sun regex developers.
Special-case class for best performance in common cases; this class is
otherwise unnecessary.
A that exposes its contained string for fast direct access.
Might make sense to generalize this to ICharSequence and make it public?
Marks terms as keywords via the . Each token
that matches the provided pattern is marked as a keyword by setting
to true.
Create a new , that marks the current
token as a keyword if the tokens term buffer matches the provided
via the .
to filter
the pattern to apply to the incoming term buffer
This analyzer is used to facilitate scenarios where different
fields require different analysis techniques. Use the Map
argument in
to add non-default analyzers for fields.
Example usage:
IDictionary<string, Analyzer> analyzerPerField = new Dictionary<string, Analyzer>();
analyzerPerField["firstname"] = new KeywordAnalyzer();
analyzerPerField["lastname"] = new KeywordAnalyzer();
PerFieldAnalyzerWrapper aWrapper =
new PerFieldAnalyzerWrapper(new StandardAnalyzer(version), analyzerPerField);
In this example, will be used for all fields except "firstname"
and "lastname", for which will be used.
A can be used like any other analyzer, for both indexing
and query parsing.
Constructs with default analyzer.
Any fields not specifically
defined to use a different analyzer will use the one provided here.
Constructs with default analyzer and a map of analyzers to use for
specific fields.
The type of supplied will determine the type of behavior.
-
General use. null keys are not supported.
-
Use when sorted keys are required. null keys are not supported.
-
Similar behavior as . null keys are supported.
-
Use when sorted keys are required. null keys are supported.
-
Use when insertion order must be preserved ( preserves insertion
order only until items are removed). null keys are supported.
Or, use a 3rd party or custom if other behavior is desired.
Any fields not specifically
defined to use a different analyzer will use the one provided here.
A (String field name to the Analyzer) to be
used for those fields.
Links two .
NOTE: This filter might not behave correctly if used with custom
s, i.e. s other than
the ones located in Lucene.Net.Analysis.TokenAttributes.
Joins two token streams and leaves the last token of the first stream available
to be used when updating the token values in the second stream based on that token.
The default implementation adds last prefix token end offset to the suffix token start and end offsets.
NOTE: This filter might not behave correctly if used with custom
s, i.e. s other than
the ones located in Lucene.Net.Analysis.TokenAttributes.
The default implementation adds last prefix token end offset to the suffix token start and end offsets.
a token from the suffix stream
the last token from the prefix stream
consumer token
A which filters out s at the same position and Term text as the previous token in the stream.
Creates a new RemoveDuplicatesTokenFilter
TokenStream that will be filtered
Consumers (i.e., ) use this method to advance the stream to
the next token. Implementing classes must implement this method and update
the appropriate s with the attributes of the next
token.
The producer must make no assumptions about the attributes after the method
has been returned: the caller may arbitrarily change it. If the producer
needs to preserve the state for subsequent calls, it can use
to create a copy of the current attribute state.
this method is called for every token of a document, so an efficient
implementation is crucial for good performance. To avoid calls to
and ,
references to all s that this stream uses should be
retrieved during instantiation.
To ensure that filters and consumers know which attributes are available,
the attributes must be added during instantiation. Filters and consumers
are not required to check for availability of attributes in
.
false for end of stream; true otherwise
This method is called by a consumer before it begins consumption using
.
Resets this stream to a clean state. Stateful implementations must implement
this method so that they can be reused, just as if they had been created fresh.
If you override this method, always call base.Reset(), otherwise
some internal state will not be correctly reset (e.g., will
throw on further usage).
Factory for .
<fieldType name="text_rmdup" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
This filter folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o.
It also discriminates against the use of double vowels aa, ae, ao, oe and oo, leaving just the first one.
It is a semantically more destructive solution than but
can in addition help with matching raksmorgas as räksmörgås.
blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej == blabarsyltetoj
räksmörgås == ræksmørgås == ræksmörgaos == raeksmoergaas == raksmorgas
Background:
Swedish åäö are in fact the same letters as Norwegian and Danish åæø and thus interchangeable
when used between these languages. They are however folded differently when people type
them on a keyboard lacking these characters.
In that situation almost all Swedish people use a, a, o instead of å, ä, ö.
Norwegians and Danes on the other hand usually type aa, ae and oe instead of å, æ and ø.
Some do however use a, a, o, oo, ao and sometimes permutations of everything above.
This filter solves that mismatch problem, but might also cause new ones.
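A short sketch (assuming the Lucene.NET 4.8 API):
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Util;

TokenStream ts = new WhitespaceTokenizer(LuceneVersion.LUCENE_48,
    new StringReader("räksmörgås blåbærsyltetøj"));
ts = new ScandinavianFoldingFilter(ts);
// emitted terms: raksmorgas, blabarsyltetoj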
Factory for .
<fieldType name="text_scandfold" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ScandinavianFoldingFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
This filter normalizes use of the interchangeable Scandinavian characters æÆäÄöÖøØ
and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.
It's a semantically less destructive solution than ,
most useful when a person with a Norwegian or Danish keyboard queries a Swedish index
and vice versa. This filter does not perform the common Swedish folds of å and ä to a nor ö to o.
blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej but not blabarsyltetoj
räksmörgås == ræksmørgås == ræksmörgaos == raeksmoergaas but not raksmorgas
Factory for .
<fieldType name="text_scandnorm" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ScandinavianNormalizationFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Marks terms as keywords via the . Each token
contained in the provided set is marked as a keyword by setting
to true.
Create a new , that marks the current token as a
keyword if the tokens term buffer is contained in the given set via the
.
to filter
the keywords set to lookup the current termbuffer
A containing a single token.
Provides the ability to override any aware stemmer
with custom dictionary-based stemming.
Create a new , performing dictionary-based stemming
with the provided dictionary ().
Any dictionary-stemmed terms will be marked with
so that they will not be stemmed with stemmers down the chain.
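A sketch of building the override map and wiring the filter ahead of an algorithmic stemmer (assuming the Lucene.NET 4.8 API; the dictionary entries are illustrative):
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.En;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Util;

var builder = new StemmerOverrideFilter.Builder(true); // true = ignore case
builder.Add("mice", "mouse");
builder.Add("feet", "foot");
var map = builder.Build();

TokenStream ts = new WhitespaceTokenizer(LuceneVersion.LUCENE_48,
    new StringReader("mice feet running"));
ts = new StemmerOverrideFilter(ts, map); // dictionary stems mice/feet and marks them as keywords
ts = new PorterStemFilter(ts);           // algorithmic stemming for everything else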
A read-only 4-byte FST backed map that allows fast case-insensitive key
value lookups for
Creates a new
the fst to lookup the overrides
if the key's case should be ignored
Returns a to pass to the method.
Returns the value mapped to the given key or null
if the key is not in the FST dictionary.
This builder builds an for the
Creates a new with set to false
Creates a new
if the input case should be ignored.
Adds an input string and its stemmer override output to this builder.
the input char sequence
the stemmer override output char sequence
false if the input has already been added to this builder, otherwise true.
or is null.
Adds an input string and its stemmer override output to this builder.
the input char sequence
the stemmer override output char sequence
false if the input has already been added to this builder, otherwise true.
or is null.
Adds an input string and its stemmer override output to this builder.
the input char sequence
the stemmer override output char sequence
false if the input has already been added to this builder, otherwise true.
or is null.
Returns a to be used with the
a to be used with the
if an occurs;
Factory for .
<fieldType name="text_dicstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StemmerOverrideFilterFactory" dictionary="dictionary.txt" ignoreCase="false"/>
</analyzer>
</fieldType>
Creates a new
Trims leading and trailing whitespace from Tokens in the stream.
As of Lucene 4.4, this filter does not support updateOffsets=true anymore
as it can lead to broken token streams.
Create a new .
the Lucene match version
the stream to consume
whether to update offsets
@deprecated Offset updates are not supported anymore as of Lucene 4.4.
Create a new on top of .
Factory for .
<fieldType name="text_trm" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.NGramTokenizerFactory"/>
<filter class="solr.TrimFilterFactory" />
</analyzer>
</fieldType>
Creates a new
A token filter for truncating the terms into a specific length.
Fixed prefix truncation, as a stemming method, produces good results on Turkish language.
It is reported that F5, using the first 5 characters, produced the best results in
Information Retrieval on Turkish Texts
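A sketch of the F5 approach (assuming the Lucene.NET 4.8 API):
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Util;

TokenStream ts = new WhitespaceTokenizer(LuceneVersion.LUCENE_48,
    new StringReader("kitaplarımdan"));
ts = new TruncateTokenFilter(ts, 5); // emits "kitap"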
Factory for . The following type is recommended for "diacritics-insensitive search" for Turkish.
<fieldType name="text_tr_ascii_f5" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ApostropheFilterFactory"/>
<filter class="solr.TurkishLowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Adds the as a synonym,
i.e. another token at the same position, optionally with a specified prefix prepended.
Initializes a new instance of with
the specified token stream.
Input token stream.
Initializes a new instance of with
the specified token stream and prefix.
Input token stream.
Prepend this string to every token type emitted as token text.
If null, nothing will be prepended.
Factory for .
<fieldType name="text_type_as_synonym" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
<filter class="solr.TypeAsSynonymFilterFactory" prefix="_type_" />
</analyzer>
</fieldType>
If the optional prefix parameter is used, the specified value will be prepended
to the type, e.g. with prefix = "_type_", for a token "example.com" with type "<URL>",
the emitted synonym will have text "_type_<URL>".
Configuration options for the .
LUCENENET specific - these options were passed as int constant flags in Lucene.
Causes parts of words to be generated:
"PowerShot" => "Power" "Shot"
Causes number subwords to be generated:
"500-42" => "500" "42"
Causes maximum runs of word parts to be catenated:
"wi-fi" => "wifi"
Causes maximum runs of word parts to be catenated:
"wi-fi" => "wifi"
Causes all subword parts to be catenated:
"wi-fi-4000" => "wifi4000"
Causes original words to be preserved and added to the subword list (defaults to false)
"500-42" => "500" "42" "500-42"
If not set, causes case changes to be ignored (subwords will only be generated
given SUBWORD_DELIM tokens)
If not set, causes numeric changes to be ignored (subwords will only be generated
given SUBWORD_DELIM tokens).
Causes trailing "'s" to be removed for each subword
"O'Neil's" => "O", "Neil"
Splits words into subwords and performs optional transformations on subword
groups. Words are split into subwords with the following rules:
- split on intra-word delimiters (by default, all non-alphanumeric
characters): "Wi-Fi" → "Wi", "Fi"
- split on case transitions: "PowerShot" →
"Power", "Shot"
- split on letter-number transitions: "SD500" →
"SD", "500"
- leading and trailing intra-word delimiters on each subword are ignored:
"//hello---there, 'dude'" →
"hello", "there", "dude"
- trailing "'s" are removed for each subword: "O'Neil's"
→ "O", "Neil"
- Note: this step isn't performed in a separate filter because of possible
subword combinations.
The combinations parameter affects how subwords are combined:
- combinations="0" causes no subword combinations:
"PowerShot"
→ 0:"Power", 1:"Shot" (0 and 1 are the token positions)
- combinations="1" means that in addition to the subwords, maximum runs of
non-numeric subwords are catenated and produced at the same position of the
last subword in the run:
- "PowerShot" →
0:"Power", 1:"Shot" 1:"PowerShot"
- "A's+B's&C's" -gt; 0:"A", 1:"B", 2:"C", 2:"ABC"
- "Super-Duper-XL500-42-AutoCoder!" →
0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500" 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"
One use for is to help match words with different
subword delimiters. For example, if the source text contained "wi-fi" one may
want "wifi" "WiFi" "wi-fi" "wi+fi" queries to all match. One way of doing so
is to specify combinations="1" in the analyzer used for indexing, and
combinations="0" (the default) in the analyzer used for querying. Given that
the current immediately removes many intra-word
delimiters, it is recommended that this filter be used after a tokenizer that
does not do this (such as ).
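A sketch of configuring the filter programmatically (assuming the Lucene.NET 4.8 API; the WordDelimiterFlags member names are assumed to mirror the original Lucene flag constants):
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Util;

var flags = WordDelimiterFlags.GENERATE_WORD_PARTS
          | WordDelimiterFlags.GENERATE_NUMBER_PARTS
          | WordDelimiterFlags.SPLIT_ON_CASE_CHANGE
          | WordDelimiterFlags.CATENATE_WORDS;

TokenStream ts = new WhitespaceTokenizer(LuceneVersion.LUCENE_48,
    new StringReader("Wi-Fi PowerShot"));
ts = new WordDelimiterFilter(LuceneVersion.LUCENE_48, ts, flags, null); // null = no protected words
// "Wi-Fi" -> Wi, Fi, WiFi; "PowerShot" -> Power, Shot, PowerShot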
If not null is the set of tokens to protect from being delimited
Creates a new WordDelimiterFilter
lucene compatibility version
TokenStream to be filtered
table containing character types
Flags configuring the filter
If not null is the set of tokens to protect from being delimited
Creates a new WordDelimiterFilter using
as its charTypeTable
lucene compatibility version
to be filtered
Flags configuring the filter
If not null is the set of tokens to protect from being delimited
Saves the existing attribute states
Flushes the given by either writing its concat and then clearing, or just clearing.
that will be flushed
true if the concatenation was written before it was cleared, false otherwise
Determines whether to concatenate a word or number if the current word is the given type
Type of the current word used to determine if it should be concatenated
true if concatenation should occur, false otherwise
Determines whether a word/number part should be generated for a word of the given type
Type of the word used to determine if a word/number part should be generated
true if a word/number part should be generated, false otherwise
Concatenates the saved buffer to the given
to concatenate the buffer to
Generates a word/number part, updating the appropriate attributes
true if the generation is occurring from a single word, false otherwise
Get the position increment gap for a subword or concatenation
true if this token wants to be injected
position increment gap
Checks if the given word type includes
Word type to check
true if the type contains , false otherwise
Checks if the given word type includes
Word type to check
true if the type contains , false otherwise
Checks if the given word type includes
Word type to check
true if the type contains , false otherwise
Checks if the given word type includes
Word type to check
true if the type contains , false otherwise
Determines whether the given flag is set
Flag to see if set
true if flag is set
A WDF concatenated 'run'
Appends the given text of the given length, to the concatenation at the given offset
Text to append
Offset in the concatenation to add the text
Length of the text to append
Writes the concatenation to the attributes
Determines if the concatenation is empty
true if the concatenation is empty, false otherwise
Clears the concatenation and resets its state
Convenience method for the common scenario of having to write the concatenation and then clearing its state
Factory for .
<fieldType name="text_wd" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" protected="protectedword.txt"
preserveOriginal="0" splitOnNumerics="1" splitOnCaseChange="1"
catenateWords="0" catenateNumbers="0" catenateAll="0"
generateWordParts="1" generateNumberParts="1" stemEnglishPossessive="1"
types="wdfftypes.txt" />
</analyzer>
</fieldType>
Creates a new
A BreakIterator-like API for iterating over subwords in text, according to rules.
@lucene.internal
Indicates the end of iteration
start position of text, excluding leading delimiters
end position of text, excluding trailing delimiters
Beginning of subword
End of subword
does this string end with a possessive such as 's
If false, causes case changes to be ignored (subwords will only be generated
given SUBWORD_DELIM tokens). (Defaults to true)
If false, causes numeric changes to be ignored (subwords will only be generated
given SUBWORD_DELIM tokens). (Defaults to true)
If true, causes trailing "'s" to be removed for each subword. (Defaults to true)
"O'Neil's" => "O", "Neil"
if true, need to skip over a possessive found in the last call to next()
Create a new operating with the supplied rules.
table containing character types
if true, causes "PowerShot" to be two tokens; ("Power-Shot" remains two parts regards)
if true, causes "j2se" to be three tokens; "j" "2" "se"
if true, causes trailing "'s" to be removed for each subword: "O'Neil's" => "O", "Neil"
Advance to the next subword in the string.
index of the next subword, or if all subwords have been returned
Return the type of the current subword.
This currently uses the type of the first character in the subword.
type of the current word
Reset the text to a new value, and reset all state
New text
length of the text
Determines whether the transition from lastType to type indicates a break
Last subword type
Current subword type
true if the transition indicates a break, false otherwise
Determines if the current word contains only one subword. Note, it could be potentially surrounded by delimiters
true if the current word contains only one subword, false otherwise
Set the internal word bounds (remove leading and trailing delimiters). Note, if a possessive is found, don't remove
it yet, simply note it.
Determines if the text at the given position indicates an English possessive which should be removed
Position in the text to check if it indicates an English possessive
true if the text at the position indicates an English possessive, false otherwise
Determines the type of the given character
Character whose type is to be determined
Type of the character
Computes the type of the given character
Character whose type is to be determined
Type of the character
Creates new instances of .
<fieldType name="text_edgngrm" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="1"/>
</analyzer>
</fieldType>
Creates a new
Tokenizes the given token into n-grams of given size(s).
This creates n-grams from the beginning edge or ending edge of an input token.
As of Lucene 4.4, this filter does not support
(you can use up-front and
afterward to get the same behavior), handles supplementary characters
correctly and does not update offsets anymore.
Specifies which side of the input the n-gram should be generated from
Get the n-gram from the front of the input
Get the n-gram from the end of the input
Get the appropriate from a string
Creates that can generate n-grams in the sizes of the given range
the Lucene match version - See
holding the input to be tokenized
the from which to chop off an n-gram
the smallest n-gram to generate
the largest n-gram to generate
Creates that can generate n-grams in the sizes of the given range
the Lucene match version - See
holding the input to be tokenized
the name of the from which to chop off an n-gram
the smallest n-gram to generate
the largest n-gram to generate
Creates that can generate n-grams in the sizes of the given range
the Lucene match version - See
holding the input to be tokenized
the smallest n-gram to generate
the largest n-gram to generate
Tokenizes the input from an edge into n-grams of given size(s).
This creates n-grams from the beginning edge or ending edge of an input token.
As of Lucene 4.4, this tokenizer
- can handle
maxGram
larger than 1024 chars, but beware that this will result in increased memory usage
- doesn't trim the input,
- sets position increments equal to 1 instead of 1 for the first token and 0 for all other ones
- doesn't support backward n-grams anymore.
- supports pre-tokenization,
- correctly handles supplementary characters.
Although highly discouraged, it is still possible
to use the old behavior through .
Creates that can generate n-grams in the sizes of the given range
the Lucene match version - See
holding the input to be tokenized
the smallest n-gram to generate
the largest n-gram to generate
Creates EdgeNGramTokenizer that can generate n-grams in the sizes of the given range
the Lucene match version - See
to use
holding the input to be tokenized
the smallest n-gram to generate
the largest n-gram to generate
Creates new instances of .
<fieldType name="text_edgngrm" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="1" maxGramSize="1"/>
</analyzer>
</fieldType>
Creates a new
Old version of which doesn't correctly handle
supplementary characters.
Specifies which side of the input the n-gram should be generated from
Get the n-gram from the front of the input
Get the n-gram from the end of the input
Creates that can generate n-grams in the sizes of the given range
the Lucene match version - See
holding the input to be tokenized
the from which to chop off an n-gram
the smallest n-gram to generate
the largest n-gram to generate
Creates that can generate n-grams in the sizes of the given range
the Lucene match version - See
to use
holding the input to be tokenized
the from which to chop off an n-gram
the smallest n-gram to generate
the largest n-gram to generate
Creates that can generate n-grams in the sizes of the given range
the Lucene match version - See
holding the input to be tokenized
the name of the from which to chop off an n-gram
the smallest n-gram to generate
the largest n-gram to generate
Creates that can generate n-grams in the sizes of the given range
the Lucene match version - See
to use
holding the input to be tokenized
the name of the from which to chop off an n-gram
the smallest n-gram to generate
the largest n-gram to generate
Creates that can generate n-grams in the sizes of the given range
the Lucene match version - See
holding the input to be tokenized
the smallest n-gram to generate
the largest n-gram to generate
Creates that can generate n-grams in the sizes of the given range
the Lucene match version - See
to use
holding the input to be tokenized
the smallest n-gram to generate
the largest n-gram to generate
Returns the next token in the stream, or null at EOS.
Old broken version of .
Creates with given min and max n-grams.
holding the input to be tokenized
the smallest n-gram to generate
the largest n-gram to generate
Creates with given min and max n-grams.
to use
holding the input to be tokenized
the smallest n-gram to generate
the largest n-gram to generate
Creates with default min and max n-grams.
holding the input to be tokenized
Returns the next token in the stream, or null at EOS.
Factory for .
<fieldType name="text_ngrm" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="2"/>
</analyzer>
</fieldType>
Creates a new
Tokenizes the input into n-grams of the given size(s).
You must specify the required compatibility when
creating a . As of Lucene 4.4, this token filters:
- handles supplementary characters correctly,
- emits all n-grams for the same token at the same position,
- does not modify offsets,
- sorts n-grams by their offset in the original token first, then
increasing length (meaning that "abc" will give "a", "ab", "abc", "b", "bc",
"c").
You can make this filter use the old behavior by providing a version <
in the constructor but this is not recommended as
it will lead to broken s that will cause highlighting
bugs.
If you were using this to perform partial highlighting,
this won't work anymore since this filter doesn't update offsets. You should
modify your analysis chain to use , and potentially
override to perform pre-tokenization.
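A sketch matching the ordering described above (assuming the Lucene.NET 4.8 API):
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.NGram;
using Lucene.Net.Util;

TokenStream ts = new WhitespaceTokenizer(LuceneVersion.LUCENE_48,
    new StringReader("abc"));
ts = new NGramTokenFilter(LuceneVersion.LUCENE_48, ts, 1, 3);
// emitted terms: a, ab, abc, b, bc, c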
Creates with given min and max n-grams.
Lucene version to enable correct position increments.
See for details.
holding the input to be tokenized
the smallest n-gram to generate
the largest n-gram to generate
Creates with default min and max n-grams.
Lucene version to enable correct position increments.
See for details.
holding the input to be tokenized
Returns the next token in the stream, or null at EOS.
Tokenizes the input into n-grams of the given size(s).
Unlike , this class sets offsets so
that characters between startOffset and endOffset in the original stream are
the same as the term chars.
For example, "abcde" would be tokenized as (minGram=2, maxGram=3):
Term    Position increment    Position length    Offsets
ab      1                     1                  [0,2[
abc     1                     1                  [0,3[
bc      1                     1                  [1,3[
bcd     1                     1                  [1,4[
cd      1                     1                  [2,4[
cde     1                     1                  [2,5[
de      1                     1                  [3,5[
This tokenizer changed a lot in Lucene 4.4 in order to:
- tokenize in a streaming fashion to support streams which are larger
than 1024 chars (limit of the previous version),
- count grams based on unicode code points instead of java chars (and
never split in the middle of surrogate pairs),
- give the ability to pre-tokenize the stream ()
before computing n-grams.
Additionally, this class doesn't trim trailing whitespace and emits
tokens in a different order: tokens are now emitted by increasing start
offsets while they used to be emitted by increasing lengths (which prevented
from supporting large input streams).
Although highly discouraged, it is still possible
to use the old behavior through .
Creates with given min and max n-grams.
the lucene compatibility version
holding the input to be tokenized
the smallest n-gram to generate
the largest n-gram to generate
Creates with given min and max n-grams.
the lucene compatibility version
to use
holding the input to be tokenized
the smallest n-gram to generate
the largest n-gram to generate
Creates with default min and max n-grams.
the lucene compatibility version
holding the input to be tokenized
Consume one code point.
Only collect characters which satisfy this condition.
Factory for .
<fieldType name="text_ngrm" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="2"/>
</analyzer>
</fieldType>
Creates a new
Creates the of n-grams from the given and .
for Dutch language.
Supports an external list of stopwords (words that
will not be indexed at all), an external list of exclusions (word that will
not be stemmed, but indexed) and an external list of word-stem pairs that overrule
the algorithm (dictionary stemming).
A default set of stopwords is used unless an alternative list is specified, but the
exclusion list is empty by default.
You must specify the required
compatibility when creating :
- As of 3.6, and
also populate
the default entries for the stem override dictionary
- As of 3.1, Snowball stemming is done with SnowballFilter,
LowerCaseFilter is used prior to StopFilter, and Snowball
stopwords are used by default.
- As of 2.9, StopFilter preserves position
increments
NOTE: This class uses the same
dependent settings as .
File containing default Dutch stopwords.
Returns an unmodifiable instance of the default stop-words set.
an unmodifiable instance of the default stop-words set.
Contains the stopwords used with the .
Contains words that should be indexed but not stemmed.
Builds an analyzer with the default stop words ()
and a few default entries for the stem exclusion table.
Returns a (possibly reused) which tokenizes all the
text in the provided .
A built from a
filtered with , ,
, if a stem exclusion set is provided,
, and
A that stems Dutch words.
It supports a table of words that should
not be stemmed at all. The stemmer used can be changed at runtime after the
filter object is created (as long as it is a ).
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
@deprecated (3.1) Use with
instead, which has the
same functionality. This filter will be removed in Lucene 5.0
The actual token in the input stream.
Input
Input
Dictionary of word stem pairs, that overrule the algorithm
Returns the next token in the stream, or null at EOS
Set a alternative/custom for this filter.
Set dictionary for stemming, this dictionary overrules the algorithm,
so you can correct for a particular unwanted word-stem pair.
A stemmer for Dutch words.
The algorithm is an implementation of
the Dutch stemming
algorithm in Martin Porter's Snowball project.
@deprecated (3.1) Use instead,
which has the same functionality. This filter will be removed in Lucene 5.0
Buffer for the terms while stemming them.
Stems the given term to a unique discriminator.
The term that should be stemmed.
Discriminator for
Delete suffix e if in R1 and
preceded by a non-vowel, and then undouble the ending
String being stemmed
Delete "heid"
String being stemmed
A d-suffix, or derivational suffix, enables a new word,
often with a different grammatical category, or with a different
sense, to be built from another word. Whether a d-suffix can be
attached is discovered not from the rules of grammar, but by
referring to a dictionary. So in English, ness can be added to
certain adjectives to form corresponding nouns (littleness,
kindness, foolishness ...) but not to all adjectives
(not for example, to big, cruel, wise ...) d-suffixes can be
used to change meaning, often in rather exotic ways.
Remove "ing", "end", "ig", "lijk", "baar" and "bar"
String being stemmed
undouble vowel
If the word ends CVD, where C is a non-vowel, D is a non-vowel other than I, and V is double a, e, o or u, remove one of the vowels from V (for example, maan -> man, brood -> brod).
String being stemmed
Checks if a term could be stemmed.
true if, and only if, the given term consists only of letters.
Substitute ä, ë, ï, ö, ü, á, é, í, ó, ú
for Norwegian.
File containing default Norwegian stopwords.
Returns an unmodifiable instance of the default stop words set.
default stop words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
Builds an analyzer with the given stop words.
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
stemming.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates a
which tokenizes all the text in the provided .
A
built from an filtered with
, , ,
if a stem exclusion set is
provided and .
A that applies to stem Norwegian
words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Calls
- NorwegianLightStemFilter(input, BOKMAAL)
the source to filter
Creates a new
the source to filter
set to ,
, or both.
Factory for .
<fieldType name="text_svlgtstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NorwegianLightStemFilterFactory" variant="nb"/>
</analyzer>
</fieldType>
Creates a new
Constant to remove Bokmål-specific endings
Constant to remove Nynorsk-specific endings
Light Stemmer for Norwegian.
Parts of this stemmer are adapted from , except
that while the Swedish one has a pre-defined rule set and a corresponding
corpus to validate against, the Norwegian one is hand crafted.
Creates a new
set to , , or both.
A that applies to stem Norwegian
words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Calls -
NorwegianMinimalStemFilter(input, BOKMAAL)
Creates a new
the source to filter
set to ,
, or both.
Factory for .
<fieldType name="text_svlgtstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NorwegianMinimalStemFilterFactory" variant="nb"/>
</analyzer>
</fieldType>
Creates a new
Minimal Stemmer for Norwegian Bokmål (no-nb) and Nynorsk (no-nn)
Stems known plural forms for Norwegian nouns only, together with the genitive -s
Creates a new
set to ,
, or both.
Tokenizer for path-like hierarchies.
Take something like:
/something/something/else
and make:
/something
/something/something
/something/something/else
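A sketch (assuming the Lucene.NET 4.8 API, with the default '/' delimiter):
using System.IO;
using Lucene.Net.Analysis.Path;

var tok = new PathHierarchyTokenizer(
    new StringReader("/something/something/else"));
// emitted terms: /something, /something/something, /something/something/else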
Factory for .
This factory is typically configured for use only in the index
Analyzer (or only in the query Analyzer, but never both).
For example, in the configuration below a query for
Books/NonFic will match documents indexed with values like
Books/NonFic, Books/NonFic/Law,
Books/NonFic/Science/Physics, etc. But it will not match
documents indexed with values like Books, or
Books/Fic...
<fieldType name="descendent_path" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory" />
</analyzer>
</fieldType>
In this example however we see the opposite configuration, so that a query
for Books/NonFic/Science/Physics would match documents
containing Books/NonFic, Books/NonFic/Science,
or Books/NonFic/Science/Physics, but not
Books/NonFic/Science/Physics/Theory or
Books/NonFic/Law.
<fieldType name="descendent_path" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/" />
</analyzer>
</fieldType>
Creates a new
Tokenizer for domain-like hierarchies.
Take something like:
www.site.co.uk
and make:
www.site.co.uk
site.co.uk
co.uk
uk
Factory for .
<fieldType name="text_ptncapturegroup" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternCaptureGroupFilterFactory" pattern="([^a-z])" preserve_original="true"/>
</analyzer>
</fieldType>
CaptureGroup uses .NET regexes to emit multiple tokens - one for each capture
group in one or more patterns.
For example, a pattern like:
"(https?://([a-zA-Z\-_0-9.]+))"
when matched against the string "http://www.foo.com/index" would return the
tokens "http://www.foo.com" and "www.foo.com".
If none of the patterns match, or if preserveOriginal is true, the original
token will be preserved.
Each pattern is matched as often as it can be, so the pattern
"(...)", when matched against "abcdefghi" would
produce ["abc","def","ghi"]
A camelCaseFilter could be written as:
"([A-Z]{2,})",
"(?<![A-Z])([A-Z][a-z]+)",
"(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
"([0-9]+)"
plus if is true, it would also return
camelCaseFilter
Creates a new
the input
set to true to return the original token even if one of the
patterns matches
an array of objects to match against each token
that uses a regular expression for the target of replace string.
The pattern match will be done on each "block" in the char stream.
ex1) source="aa bb aa bb", pattern="(aa)\\s+(bb)" replacement="$1#$2"
output="aa#bb aa#bb"
NOTE: If the replacement produces a phrase whose length differs from that of the
source string, and the field is used for highlighting a term of that phrase, you
may run into trouble.
ex2) source="aa123bb", pattern="(aa)\\d+(bb)" replacement="$1 $2"
output="aa bb"
If you then search for bb and highlight it, you will get the
highlight snippet="aa1<em>23bb</em>"
@since Solr 1.5
Replace pattern in input and mark correction offsets.
Factory for .
<fieldType name="text_ptnreplace" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="([^a-z])" replacement=""/>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
@since Solr 3.1
Creates a new
A TokenFilter which applies a to each token in the stream,
replacing match occurrences with the specified replacement string.
Note: Depending on the input and the pattern used and the input
, this may produce s whose text is the empty
string.
Constructs an instance to replace either the first, or all, occurrences
the to process
the pattern (a object) to apply to each
the "replacement string" to substitute, if null a
blank string will be used. Note that this is not the literal
string that will be used, '$' and '\' have special meaning.
if true, all matches will be replaced otherwise just the first match.
Factory for .
<fieldType name="text_ptnreplace" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement=""
replace="all"/>
</analyzer>
</fieldType>
Creates a new
This tokenizer uses regex pattern matching to construct distinct tokens
for the input stream. It takes two arguments: "pattern" and "group".
- "pattern" is the regular expression.
- "group" says which group to extract into tokens.
group=-1 (the default) is equivalent to "split". In this case, the tokens will
be equivalent to the output from (without empty tokens):
Using group >= 0 selects the matching group as the token. For example, if you have:
pattern = \'([^\']+)\'
group = 0
input = aaa 'bbb' 'ccc'
the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input
but using group=1, the output would be: bbb and ccc (no ' marks)
NOTE: This does not output tokens that are of zero length.
creates a new returning tokens from group (-1 for split functionality)
creates a new returning tokens from group (-1 for split functionality)
Factory for .
This tokenizer uses regex pattern matching to construct distinct tokens
for the input stream. It takes two arguments: "pattern" and "group".
- "pattern" is the regular expression.
- "group" says which group to extract into tokens.
group=-1 (the default) is equivalent to "split". In this case, the tokens will
be equivalent to the output from (without empty tokens):
Using group >= 0 selects the matching group as the token. For example, if you have:
pattern = \'([^\']+)\'
group = 0
input = aaa 'bbb' 'ccc'
the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input
but using group=1, the output would be: bbb and ccc (no ' marks)
NOTE: This Tokenizer does not output tokens that are of zero length.
<fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="\'([^\']+)\'" group="1"/>
</analyzer>
</fieldType>
@since solr1.2
Creates a new
Split the input using configured pattern
Base class for payload encoders.
Characters before the delimiter are the "token", those after are the payload.
For example, if the delimiter is '|', then for the string "foo|bar", foo is the token
and "bar" is a payload.
Note: you can also include a to convert the payload in an appropriate way (from characters to bytes).
Note: make sure your does not split on the delimiter, or this won't work.
Factory for .
<fieldType name="text_dlmtd" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float" delimiter="|"/>
</analyzer>
</fieldType>
Creates a new
Encode a character array as a .
NOTE: This was FloatEncoder in Lucene
Does nothing other than convert the char array to a byte array using the specified encoding.
Encode a character array as a .
See .
Assigns a payload to a token based on the
Factory for .
<fieldType name="text_numpayload" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.NumericPayloadTokenFilterFactory" payload="24" typeMatch="word"/>
</analyzer>
</fieldType>
Creates a new
Mainly for use with the , converts char buffers to
.
NOTE: This interface is subject to change
Convert a char array to a
encoded
Utility methods for encoding payloads.
NOTE: This was encodeFloat() in Lucene
NOTE: This was encodeFloat() in Lucene
NOTE: This was encodeInt() in Lucene
NOTE: This was encodeInt() in Lucene
NOTE: This was decodeFloat() in Lucene
the decoded float
Decode the payload that was encoded using .
NOTE: the length of the array must be at least offset + 4.
NOTE: This was decodeFloat() in Lucene
The bytes to decode
The offset into the array.
The float that was encoded
NOTE: This was decodeInt() in Lucene
Adds the
and
First 4 bytes are the start
Factory for .
<fieldType name="text_tokenoffset" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.TokenOffsetPayloadTokenFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Makes the a payload.
Encodes the type using System.Text.Encoding.UTF8.GetBytes(string)
Factory for .
<fieldType name="text_typeaspayload" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.TypeAsPayloadTokenFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Set the positionIncrement of all tokens to the "positionIncrement",
except the first return token which retains its original positionIncrement value.
The default positionIncrement value is zero.
@deprecated (4.4) makes graphs inconsistent
which can cause highlighting bugs. Its main use-case being to make
QueryParser
generate boolean queries instead of phrase queries, it is now advised to use
QueryParser.AutoGeneratePhraseQueries = true
(for simple cases) or to override QueryParser.NewFieldQuery.
Position increment to assign to all but the first token - default = 0
The first token must have non-zero positionIncrement *
Constructs a that assigns a position increment of zero to
all but the first token from the given input stream.
the input stream
Constructs a that assigns the given position increment to
all but the first token from the given input stream.
the input stream
position increment to assign to all but the first
token from the input stream
Factory for .
Set the positionIncrement of all tokens to the "positionIncrement", except the first return token which retains its
original positionIncrement value. The default positionIncrement value is zero.
<fieldType name="text_position" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.PositionFilterFactory" positionIncrement="0"/>
</analyzer>
</fieldType>
Creates a new
for Portuguese.
You must specify the required
compatibility when creating :
- As of 3.6, PortugueseLightStemFilter is used for less aggressive stemming.
File containing default Portuguese stopwords.
Returns an unmodifiable instance of the default stop words set.
default stop words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
Builds an analyzer with the given stop words.
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
stemming.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates a
which tokenizes all the text in the provided .
A
built from an filtered with
, ,
, if a stem exclusion set is
provided and .
A that applies to stem
Portuguese words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_ptlgtstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PortugueseLightStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Light Stemmer for Portuguese
This stemmer implements the "UniNE" algorithm in:
Light Stemming Approaches for the French, Portuguese, German and Hungarian Languages
Jacques Savoy
A that applies to stem
Portuguese words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_ptminstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PortugueseMinimalStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Minimal Stemmer for Portuguese
This follows the "RSLP-S" algorithm presented in:
A study on the Use of Stemming for Monolingual Ad-Hoc Portuguese
Information Retrieval (Orengo et al.)
which is just the plural reduction step of the RSLP
algorithm from A Stemming Algorithm for the Portuguese Language,
Orengo et al.
A that applies to stem
Portuguese words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_ptstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PortugueseStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Portuguese stemmer implementing the RSLP (Removedor de Sufixos da Lingua Portuguesa)
algorithm. This is sometimes also referred to as the Orengo stemmer.
buffer, oversized to at least len+1
initial valid length of buffer
new valid length, stemmed
Base class for stemmers that use a set of RSLP-like stemming steps.
RSLP (Removedor de Sufixos da Lingua Portuguesa) is an algorithm designed
originally for stemming the Portuguese language, described in the paper
A Stemming Algorithm for the Portuguese Language, Orengo et al.
Since then, a plural-only modification (RSLP-S) as well as a modification
for the Galician language have been implemented. This class parses a configuration
file that describes steps, where each step contains a set of rules.
The general rule format is:
{ "suffix", N, "replacement", { "exception1", "exception2", ...}}
where:
- suffix is the suffix to be removed (such as "inho").
- N is the min stem size, where stem is defined as the candidate stem
after removing the suffix (but before appending the replacement!)
- replacement is an optional string to append after removing the suffix.
This can be the empty string.
- exceptions is an optional list of exceptions, patterns that should
not be stemmed. These patterns can be specified as whole word or suffix (ends-with)
patterns, depending upon the exceptions format flag in the step header.
A step is an ordered list of rules, with a structure in this format:
{ "name", N, B, { "cond1", "cond2", ... }
... rules ... };
where:
- name is a name for the step (such as "Plural").
- N is the min word size. Words that are shorter than this length bypass
the step completely, as an optimization. Note: N can be zero; in this case this
implementation will automatically calculate the appropriate value from the underlying
rules.
- B is a "boolean" flag specifying how exceptions in the rules are matched.
A value of 1 indicates whole-word pattern matching, a value of 0 indicates that
exceptions are actually suffixes and should be matched with ends-with.
- conds are an optional list of conditions to enter the step at all. If
the list is non-empty, then a word must end with one of these conditions or it will
bypass the step completely as an optimization.
RSLP description
@lucene.internal
A basic rule, with no exceptions.
Create a rule.
suffix to remove
minimum stem length
replacement string
true if the word matches this rule.
new valid length of the string after firing this rule.
A rule with a set of whole-word exceptions.
A rule with a set of exceptional suffixes.
A step containing a list of rules.
Create a new step
Step's name.
an ordered list of rules.
minimum word size. if this is 0 it is automatically calculated.
optional list of conditional suffixes. may be null.
new valid length of the string after applying the entire step.
Parse a resource file into an RSLP stemmer description.
a Map containing the named s in this description.
An used primarily at query time to wrap another analyzer and provide a layer of protection
which prevents very common words from being passed into queries.
For very large indexes the cost
of reading TermDocs for a very common word can be high. This analyzer was created after experience with
a 38 million doc index which had a term in around 50% of docs and was causing TermQueries for
this term to take 2 seconds.
Creates a new with stopwords calculated for all
indexed fields from terms with a document frequency percentage greater than
Version to be used in
whose will be filtered
to identify the stopwords from
Can be thrown while reading from the
Creates a new with stopwords calculated for all
indexed fields from terms with a document frequency greater than the given
Version to be used in
whose will be filtered
to identify the stopwords from
Document frequency terms should be above in order to be stopwords
Can be thrown while reading from the
Creates a new with stopwords calculated for all
indexed fields from terms with a document frequency percentage greater than
the given
Version to be used in
whose will be filtered
to identify the stopwords from
The maximum percentage (between 0.0 and 1.0) of index documents which
contain a term, after which the word is considered to be a stop word
Can be thrown while reading from the
Creates a new with stopwords calculated for the
given selection of fields from terms with a document frequency percentage
greater than the given
Version to be used in
whose will be filtered
to identify the stopwords from
Selection of fields to calculate stopwords for
The maximum percentage (between 0.0 and 1.0) of index documents which
contain a term, after which the word is considered to be a stop word
Can be thrown while reading from the
Creates a new with stopwords calculated for the
given selection of fields from terms with a document frequency greater than
the given
Version to be used in
Analyzer whose TokenStream will be filtered
to identify the stopwords from
Selection of fields to calculate stopwords for
Document frequency terms should be above in order to be stopwords
Can be thrown while reading from the
Provides information on which stop words have been identified for a field
The field for which stop words identified in "addStopWords"
method calls will be returned
the stop words identified for a field
Provides information on which stop words have been identified for all fields
the stop words (as terms)
Reverse token string, for example "country" => "yrtnuoc".
If is supplied, then tokens will be also prepended by
that character. For example, with a marker of \u0001, "country" =>
"\u0001yrtnuoc". This is useful when implementing efficient leading
wildcards search.
You must specify the required
compatibility when creating , or when using any of
its static methods:
- As of 3.1, supplementary characters are handled correctly
Example marker character: U+0001 (START OF HEADING)
Example marker character: U+001F (INFORMATION SEPARATOR ONE)
Example marker character: U+EC00 (PRIVATE USE AREA: EC00)
Example marker character: U+200F (RIGHT-TO-LEFT MARK)
Create a new that reverses all tokens in the
supplied .
The reversed tokens will not be marked.
lucene compatibility version
to filter
Create a new that reverses and marks all tokens in the
supplied .
The reversed tokens will be prepended (marked) by the
character.
lucene compatibility version
to filter
A character used to mark reversed tokens
Reverses the given input string
lucene compatibility version
the string to reverse
the given input string in reversed order
Reverses the given input buffer in-place
lucene compatibility version
the input char array to reverse
Partially reverses the given input buffer in-place from offset 0
up to the given length.
lucene compatibility version
the input char array to reverse
the length in the buffer up to where the
buffer should be reversed
@deprecated (3.1) Remove this when support for 3.0 indexes is no longer needed.
Partially reverses the given input buffer in-place from the given offset
up to the given length.
lucene compatibility version
the input char array to reverse
the offset from where to reverse the buffer
the length in the buffer up to where the
buffer should be reversed
Factory for .
<fieldType name="text_rvsstr" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ReverseStringFilterFactory"/>
</analyzer>
</fieldType>
@since solr 1.4
Creates a new
for Romanian.
File containing default Romanian stopwords.
The comment character in the stopwords file.
All lines prefixed with this will be ignored.
Returns an unmodifiable instance of the default stop words set.
default stop words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
lucene compatibility version
Builds an analyzer with the given stop words.
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
stemming.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates a
which tokenizes all the text in the provided .
A
built from an filtered with
, , ,
if a stem exclusion set is
provided and .
for Russian language.
Supports an external list of stopwords (words that
will not be indexed at all).
A default set of stopwords is used unless an alternative list is specified.
You must specify the required
compatibility when creating :
- As of 3.1, is used, Snowball stemming is done with
, and Snowball stopwords are used by default.
List of typical Russian stopwords. (for backwards compatibility)
@deprecated (3.1) Remove this for LUCENE 5.0
File containing default Russian stopwords.
@deprecated (3.1) remove this for Lucene 5.0
Returns an unmodifiable instance of the default stop-words set.
an unmodifiable instance of the default stop-words set.
Builds an analyzer with the given stop words
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words
lucene compatibility version
a stopword set
a set of words not to be stemmed
Creates
used to tokenize all the text in the provided .
built from a filtered with
, ,
, if a stem exclusion set is
provided, and
A is a that extends
by also allowing the basic Latin digits 0-9.
You must specify the required compatibility when creating
:
- As of 3.1, uses an int based API to normalize and
detect token characters. See and
for details.
@deprecated (3.1) Use instead, which has the same functionality.
This filter will be removed in Lucene 5.0
Construct a new .
lucene compatibility version
the input to split up into tokens
Construct a new RussianLetterTokenizer using a given
.
lucene compatibility version
the attribute factory to use for this
the input to split up into tokens
Collects only characters which satisfy
.
@deprecated Use instead.
This tokenizer has no Russian-specific functionality.
Creates a new
A that applies to stem Russian
words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
Factory for .
<fieldType name="text_rulgtstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RussianLightStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Light Stemmer for Russian.
This stemmer implements the following algorithm:
Indexing and Searching Strategies for the Russian Language.
Ljiljana Dolamic and Jacques Savoy.
A ShingleAnalyzerWrapper wraps a around another .
A shingle is another name for a token based n-gram.
Creates a new
whose is to be filtered
Min shingle (token ngram) size
Max shingle size
Used to separate input stream tokens in output shingles
Whether or not the filter shall pass the original
tokens to the output stream
Overrides the behavior of outputUnigrams==false for those
times when no shingles are available (because there are fewer than
minShingleSize tokens in the input stream)?
Note that if outputUnigrams==true, then unigrams are always output,
regardless of whether any shingles are available.
filler token to use when positionIncrement is more than 1
Wraps .
Wraps .
The max shingle (token ngram) size
The max shingle (token ngram) size
The min shingle (token ngram) size
The min shingle (token ngram) size
A constructs shingles (token n-grams) from a token stream.
In other words, it creates combinations of tokens as a single token.
For example, the sentence "please divide this sentence into shingles"
might be tokenized into shingles "please divide", "divide this",
"this sentence", "sentence into", and "into shingles".
This filter handles position increments > 1 by inserting filler tokens
(tokens with termtext "_"). It does not handle a position increment of 0.
filler token for when positionIncrement is more than 1
default maximum shingle size is 2.
default minimum shingle size is 2.
default token type attribute value is "shingle"
The default string to use when joining adjacent tokens to form a shingle
The sequence of input stream tokens (or filler tokens, if necessary)
that will be composed to form output shingles.
The number of input tokens in the next output token. This is the "n" in
"token n-grams".
Shingle and unigram text is composed here.
The token type attribute value to use - default is "shingle"
The string to use when joining adjacent tokens to form a shingle
The string to insert for each position at which there is no token
(i.e., when position increment is greater than one).
By default, we output unigrams (individual tokens) as well as shingles
(token n-grams).
By default, we don't override behavior of outputUnigrams.
maximum shingle size (number of tokens)
minimum shingle size (number of tokens)
The remaining number of filler tokens to be inserted into the input stream
from which shingles are composed, to handle position increments greater
than one.
When the next input stream token has a position increment greater than
one, it is stored in this field until sufficient filler tokens have been
inserted to account for the position increment.
Whether or not there is a next input stream token.
Whether at least one unigram or shingle has been output at the current
position.
true if no shingles have been output yet (for outputUnigramsIfNoShingles).
Holds the State after input.end() was called, so we can
restore it in our end() impl.
Constructs a with the specified shingle size from the
input stream
minimum shingle size produced by the filter.
maximum shingle size produced by the filter.
Constructs a with the specified shingle size from the
input stream
maximum shingle size produced by the filter.
Construct a with default shingle size: 2.
input stream
Construct a with the specified token type for shingle tokens
and the default shingle size: 2
input stream
token type for shingle tokens
Set the type of the shingle tokens produced by this filter.
(default: "shingle")
token tokenType
Shall the output stream contain the input tokens (unigrams) as well as
shingles? (default: true.)
Whether or not the output stream shall contain
the input tokens (unigrams)
Shall we override the behavior of outputUnigrams==false for those
times when no shingles are available (because there are fewer than
minShingleSize tokens in the input stream)? (default: false.)
Note that if outputUnigrams==true, then unigrams are always output,
regardless of whether any shingles are available.
Whether or not to output a single
unigram when no shingles are available.
Set the max shingle size (default: 2)
max size of output shingles
Set the min shingle size (default: 2).
This method requires that the passed in minShingleSize is not greater
than maxShingleSize, so make sure that maxShingleSize is set before
calling this method.
The unigram output option is independent of the min shingle size.
min size of output shingles
Sets the string to use when joining adjacent tokens to form a shingle
used to separate input stream tokens in output shingles
Sets the string to insert for each position at which there is no token
(i.e., when position increment is greater than one).
string to insert at each position where there is no token
Get the next token from the input stream.
If the next token has positionIncrement > 1,
positionIncrement - 1 s are
inserted first.
Where to put the new token; if null, a new instance is created.
On success, the populated token; null otherwise
if the input stream has a problem
Fills with input stream tokens, if available,
shifting to the right if the window was previously full.
Resets to its minimum value.
if there's a problem getting the next token
An instance of this class is used to maintain the number of input
stream tokens that will be used to compose the next unigram or shingle:
.
gramSize
will take on values from the circular sequence
{ [ 1, ] [ , ... , ] }.
1 is included in the circular sequence only if
= true.
the current value.
Increments this circular number's value to the next member in the
circular sequence
gramSize
will take on values from the circular sequence
{ [ 1, ] [ , ... , ] }.
1 is included in the circular sequence only if
= true.
Sets this circular number's value to the first member of the
circular sequence
gramSize
will take on values from the circular sequence
{ [ 1, ] [ , ... , ] }.
1 is included in the circular sequence only if
= true.
Returns true if the current value is the first member of the circular
sequence.
If = true, the first member of the circular
sequence will be 1; otherwise, it will be .
true if the current value is the first member of the circular
sequence; false otherwise
the value this instance had before the last call
Factory for .
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2"
outputUnigrams="true" outputUnigramsIfNoShingles="false" tokenSeparator=" " fillerToken="_"/>
</analyzer>
</fieldType>
Creates a new
Attempts to parse the as a Date using either the
or
methods.
If a format is passed,
will be used, and the value must strictly match one of the specified formats, as described in the MSDN documentation.
If the value is a Date, it will add it to the sink.
Creates a new instance of using the current culture and .
Loosely matches standard DateTime formats using .
Creates a new instance of using the supplied culture and .
Loosely matches standard DateTime formats using .
An object that supplies culture-specific format information
Creates a new instance of using the current culture and .
Strictly matches the supplied DateTime formats using .
The allowable format of the .
If supplied, it must match the format of the date exactly to get a match.
Creates a new instance of using the current culture and .
Strictly matches the supplied DateTime formats using .
An array of allowable formats of the .
If supplied, one of them must match the format of the date exactly to get a match.
Creates a new instance of using the supplied culture and .
Loosely matches standard DateTime formats using .
An object that supplies culture-specific format information
A bitwise combination of enumeration values that indicates the permitted format of s.
A typical value to specify is
Creates a new instance of using the supplied format, culture and .
Strictly matches the supplied DateTime formats using .
The allowable format of the .
If supplied, it must match the format of the date exactly to get a match.
An object that supplies culture-specific format information
Creates a new instance of using the supplied formats, culture and .
Strictly matches the supplied DateTime formats using .
An array of allowable formats of the .
If supplied, one of them must match the format of the date exactly to get a match.
An object that supplies culture-specific format information
Creates a new instance of using the supplied format, culture and .
Strictly matches the supplied DateTime formats using .
The allowable format of the .
If supplied, it must match the format of the date exactly to get a match.
An object that supplies culture-specific format information
A bitwise combination of enumeration values that indicates the permitted format of s.
A typical value to specify is
Creates a new instance of using the supplied formats, culture and .
Strictly matches the supplied DateTime formats using .
An array of allowable formats of the .
If supplied, one of them must match the format of the date exactly to get a match.
An object that supplies culture-specific format information
A bitwise combination of enumeration values that indicates the permitted format of s.
A typical value to specify is
This TokenFilter provides the ability to set aside attribute states
that have already been analyzed. This is useful in situations where multiple fields share
many common analysis steps and then go their separate ways.
It is also useful for doing things like entity extraction or proper noun analysis as
part of the analysis workflow and saving off those tokens for use in another field.
TeeSinkTokenFilter source1 = new TeeSinkTokenFilter(new WhitespaceTokenizer(version, reader1));
TeeSinkTokenFilter.SinkTokenStream sink1 = source1.NewSinkTokenStream();
TeeSinkTokenFilter.SinkTokenStream sink2 = source1.NewSinkTokenStream();
TeeSinkTokenFilter source2 = new TeeSinkTokenFilter(new WhitespaceTokenizer(version, reader2));
source2.AddSinkTokenStream(sink1);
source2.AddSinkTokenStream(sink2);
TokenStream final1 = new LowerCaseFilter(version, source1);
TokenStream final2 = source2;
TokenStream final3 = new EntityDetect(sink1);
TokenStream final4 = new URLDetect(sink2);
d.Add(new TextField("f1", final1, Field.Store.NO));
d.Add(new TextField("f2", final2, Field.Store.NO));
d.Add(new TextField("f3", final3, Field.Store.NO));
d.Add(new TextField("f4", final4, Field.Store.NO));
In this example, sink1 and sink2 will both get tokens from both
reader1 and reader2 after whitespace tokenizer
and now we can further wrap any of these in extra analysis, and more "sources" can be inserted if desired.
It is important that tees are consumed before sinks (in the above example, the tee field names must
sort before the sink field names). If you are not sure which stream is consumed first, you can simply
add another sink and then pass all tokens to the sinks at once using .
The tee is exhausted after this. For that case, change
the example above to:
...
TokenStream final1 = new LowerCaseFilter(version, source1.NewSinkTokenStream());
TokenStream final2 = source2.NewSinkTokenStream();
sink1.ConsumeAllTokens();
sink2.ConsumeAllTokens();
...
In this case, the fields can be added in any order, because the sources are not used anymore and all sinks are ready.
Note, the EntityDetect and URLDetect TokenStreams are for the example and do not currently exist in Lucene.
Instantiates a new .
Returns a new that receives all tokens consumed by this stream.
Returns a new that receives all tokens consumed by this stream
that pass the supplied filter.
Adds a created by another
to this one. The supplied stream will also receive all consumed tokens.
This method can be used to pass tokens from two different tees to one sink.
passes all tokens to the added sinks
when it is itself consumed. To be sure that all tokens from the input
stream are passed to the sinks, you can call this method.
This instance is exhausted after that, but all sinks are immediately available.
A filter that decides which states to store in the sink.
Returns true, iff the current state of the passed-in shall be stored
in the sink.
Called by . This method does nothing by default
and can optionally be overridden.
output from a tee with optional filtering.
Counts the tokens as they go by and saves to the internal list those between the range of lower and upper, exclusive of upper
Adds a token to the sink if it has a specific type.
Filters with ,
, and .
Available stemmers are listed in Lucene.Net.Tartarus.Snowball.Ext. The name of a
stemmer is the part of the class name before "Stemmer", e.g., the stemmer in
is named "English".
NOTE: This class uses the same
dependent settings as , with the following addition:
- As of 3.1, uses for Turkish language.
@deprecated (3.1) Use the language-specific analyzer in modules/analysis instead.
This analyzer will be removed in Lucene 5.0
Builds the named analyzer with no stop words.
Builds the named analyzer with the given stop words.
Constructs a filtered by a
, a , a ,
and a
A filter that stems words using a Snowball-generated stemmer.
Available stemmers are listed in Lucene.Net.Tartarus.Snowball.Ext.
NOTE: expects lowercased text.
- For the Turkish language, see .
- For other languages, see .
Note: This filter is aware of the . To prevent
certain terms from being passed to the stemmer
should be set to true
in a previous .
Note: For including the original term as well as the stemmed version, see
Construct the named stemming filter.
Available stemmers are listed in Lucene.Net.Tartarus.Snowball.Ext.
The name of a stemmer is the part of the class name before "Stemmer",
e.g., the stemmer in is named "English".
the input tokens to stem
the name of a stemmer
Returns the next input , after being stemmed
Factory for , with configurable language
Note: Use of the "Lovins" stemmer is not recommended, as it is implemented with reflection.
<fieldType name="text_snowballstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" protected="protectedkeyword.txt" language="English"/>
</analyzer>
</fieldType>
Creates a new
Filters with ,
and , using a list of
English stop words.
You must specify the required
compatibility when creating :
- As of 3.1, correctly handles Unicode 4.0
supplementary characters in stopwords
- As of 2.9, preserves position
increments
- As of 2.4, s incorrectly identified as acronyms
are corrected (see LUCENE-1068)
was named in Lucene versions prior to 3.1.
As of 3.1, implements Unicode text segmentation,
as specified by UAX#29.
Default maximum allowed token length
An unmodifiable set containing some common English words that are usually not
useful for searching.
Builds an analyzer with the given stop words.
Lucene compatibility version - See
stop words
Builds an analyzer with the default stop words ().
Lucene compatibility version - See
Builds an analyzer with the stop words from the given reader.
Lucene compatibility version - See
to read stop words from
Gets or sets maximum allowed token length. If a token is seen
that exceeds this length then it is discarded. This
setting only takes effect the next time tokenStream or
tokenStream is called.
Normalizes tokens extracted with .
Construct filtering .
Returns the next token in the stream, or null at EOS.
Removes 's from the end of words.
Removes dots from acronyms.
Factory for .
<fieldType name="text_clssc" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.ClassicFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
A grammar-based tokenizer constructed with JFlex (and then ported to .NET)
This should be a good tokenizer for most European-language documents:
- Splits words at punctuation characters, removing punctuation. However, a
dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case
the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.
Many applications have specific tokenizer needs. If this tokenizer does
not suit your application, please consider copying this source code
directory to your project and maintaining your own grammar-based tokenizer.
was named in Lucene versions prior to 3.1.
As of 3.1, implements Unicode text segmentation,
as specified by UAX#29.
A private instance of the JFlex-constructed scanner
String token types that correspond to token type int constants
Set the max allowed token length. Any token longer
than this is skipped.
Creates a new instance of the . Attaches
the to the newly created JFlex scanner.
lucene compatibility version
The input reader
See http://issues.apache.org/jira/browse/LUCENE-1068
Creates a new with a given
Factory for .
<fieldType name="text_clssc" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.ClassicTokenizerFactory" maxTokenLength="120"/>
</analyzer>
</fieldType>
Creates a new
This class implements the classic lucene up until 3.0
This character denotes the end of file
initial size of the lookahead buffer
lexical states
ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l
ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l
at the beginning of a line
l is of the form l = 2*k, k a non-negative integer
Translates characters to character classes
Translates characters to character classes
Translates DFA states to action switch labels.
Translates a state to a row index in the transition table
The transition table of the DFA
ZZ_ATTRIBUTE[aState] contains the attributes of state aState
the input device
the current state of the DFA
the current lexical state
this buffer contains the current text to be matched and is
the source of the YyText string
the textposition at the last accepting state
the current text position in the buffer
startRead marks the beginning of the YyText string in the buffer
endRead marks the last character in the buffer, that has been read
from input
the number of characters up to the start of the matched text
zzAtEOF == true <=> the scanner is at the EOF
Fills ICharTermAttribute with the current token text.
Creates a new scanner
the TextReader to read input from.
Unpacks the compressed character translation table.
the packed character translation table
the unpacked character translation table
Refills the input buffer.
false, iff there was new input.
if any I/O-Error occurs
Disposes the input stream.
Resets the scanner to read from a new input stream.
Does not close the old reader.
All internal variables are reset, the old input stream
cannot be reused (internal buffer is discarded and lost).
Lexical state is set to .
Internal scan buffer is resized down to its initial length, if it has grown.
the new input stream
Returns the current lexical state.
Enters a new lexical state
the new lexical state
Returns the text matched by the current regular expression.
Returns the character at position pos from the
matched text.
It is equivalent to YyText[pos], but faster
the position of the character to fetch.
A value from 0 to YyLength-1.
the character at position pos
Returns the length of the matched text region.
Reports an error that occurred while scanning.
In a well-formed scanner (no or only correct usage of
YyPushBack(int) and a match-all fallback rule) this method
will only be called with things that "Can't Possibly Happen".
If this method is called, something is seriously wrong
(e.g. a JFlex bug producing a faulty scanner etc.).
Usual syntax/scanner level error handling should be done
in error fallback rules.
the code of the errormessage to display
Pushes the specified amount of characters back into the input stream.
They will be read again by the next call of the scanning method
the number of characters to be read again.
This number must not be greater than YyLength!
Resumes scanning until the next regular expression is matched,
the end of input is encountered or an I/O-Error occurs.
the next token
if any I/O-Error occurs
Filters with ,
and , using a list of
English stop words.
You must specify the required
compatibility when creating :
- As of 3.4, Hiragana and Han characters are no longer wrongly split
from their combining characters. If you use a previous version number,
you get the exact broken behavior for backwards compatibility.
- As of 3.1, implements Unicode text segmentation,
and correctly handles Unicode 4.0 supplementary characters
in stopwords. and
are the pre-3.1 implementations of and
.
- As of 2.9, preserves position increments
- As of 2.4, s incorrectly identified as acronyms
are corrected (see LUCENE-1068)
Default maximum allowed token length
An unmodifiable set containing some common English words that are usually not
useful for searching.
Builds an analyzer with the given stop words.
Lucene compatibility version - See
stop words
Builds an analyzer with the default stop words ().
Lucene compatibility version - See
Builds an analyzer with the stop words from the given reader.
Lucene compatibility version - See
to read stop words from
Set maximum allowed token length. If a token is seen
that exceeds this length then it is discarded. This
setting only takes effect the next time tokenStream or
tokenStream is called.
Normalizes tokens extracted with .
Factory for .
<fieldType name="text_stndrd" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
A grammar-based tokenizer constructed with JFlex.
As of Lucene version 3.1, this class implements the Word Break rules from the
Unicode Text Segmentation algorithm, as specified in
Unicode Standard Annex #29.
Many applications have specific tokenizer needs. If this tokenizer does
not suit your application, please consider copying this source code
directory to your project and maintaining your own grammar-based tokenizer.
You must specify the required
compatibility when creating :
- As of 3.4, Hiragana and Han characters are no longer wrongly split
from their combining characters. If you use a previous version number,
you get the exact broken behavior for backwards compatibility.
- As of 3.1, StandardTokenizer implements Unicode text segmentation.
If you use a previous version number, you get the exact behavior of
for backwards compatibility.
A private instance of the JFlex-constructed scanner
@deprecated (3.1)
@deprecated (3.1)
@deprecated (3.1)
@deprecated (3.1)
@deprecated (3.1)
@deprecated (3.1)
String token types that correspond to token type int constants
Set the max allowed token length. Any token longer
than this is skipped.
Creates a new instance of the . Attaches
the to the newly created JFlex-generated (then ported to .NET) scanner.
Lucene compatibility version - See
The input reader
See http://issues.apache.org/jira/browse/LUCENE-1068
Creates a new with a given
Factory for .
<fieldType name="text_stndrd" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="255"/>
</analyzer>
</fieldType>
Creates a new
This class implements Word Break rules from the Unicode Text Segmentation
algorithm, as specified in
Unicode Standard Annex #29.
Tokens produced are of the following types:
- <ALPHANUM>: A sequence of alphabetic and numeric characters
- <NUM>: A number
- <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast
Asian languages, including Thai, Lao, Myanmar, and Khmer
- <IDEOGRAPHIC>: A single CJKV ideographic character
- <HIRAGANA>: A single hiragana character
- <KATAKANA>: A sequence of katakana characters
- <HANGUL>: A sequence of Hangul characters
This character denotes the end of file
initial size of the lookahead buffer
lexical states
ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l
ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l
at the beginning of a line
l is of the form l = 2*k, k a non-negative integer
Translates characters to character classes
Translates characters to character classes
Translates DFA states to action switch labels.
Translates a state to a row index in the transition table
The transition table of the DFA
ZZ_ATTRIBUTE[aState] contains the attributes of state aState
the input device
the current state of the DFA
the current lexical state
this buffer contains the current text to be matched and is
the source of the YyText() string
the textposition at the last accepting state
the current text position in the buffer
startRead marks the beginning of the YyText() string in the buffer
endRead marks the last character in the buffer, that has been read
from input
the number of characters up to the start of the matched text
zzAtEOF == true <=> the scanner is at the EOF
Alphanumeric sequences
Numbers
Chars in class \p{Line_Break = Complex_Context} are from South East Asian
scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept
together as a single token rather than broken up, because the logic
required to break them at word boundaries is too complex for UAX#29.
See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
Fills with the current token text.
Creates a new scanner
the to read input from.
Unpacks the compressed character translation table.
the packed character translation table
the unpacked character translation table
Refills the input buffer.
false, iff there was new input.
if any I/O-Error occurs
Disposes the input stream.
Resets the scanner to read from a new input stream.
Does not close the old reader.
All internal variables are reset, the old input stream
cannot be reused (internal buffer is discarded and lost).
Lexical state is set to .
Internal scan buffer is resized down to its initial length, if it has grown.
the new input stream
Returns the current lexical state.
Enters a new lexical state
the new lexical state
Returns the text matched by the current regular expression.
Returns the character at position from the
matched text.
It is equivalent to YyText[pos], but faster
the position of the character to fetch.
A value from 0 to YyLength-1.
the character at position pos
Returns the length of the matched text region.
Reports an error that occurred while scanning.
In a well-formed scanner (no or only correct usage of
YyPushBack(int) and a match-all fallback rule) this method
will only be called with things that "Can't Possibly Happen".
If this method is called, something is seriously wrong
(e.g. a JFlex bug producing a faulty scanner etc.).
Usual syntax/scanner level error handling should be done
in error fallback rules.
the code of the errormessage to display
Pushes the specified amount of characters back into the input stream.
They will be read again by the next call of the scanning method
the number of characters to be read again.
This number must not be greater than YyLength!
Resumes scanning until the next regular expression is matched,
the end of input is encountered or an I/O-Error occurs.
the next token
if any I/O-Error occurs
Internal interface for supporting versioned grammars.
@lucene.internal
Copies the matched text into the
Returns the current position.
Resets the scanner to read from a new input stream.
Does not close the old reader.
All internal variables are reset, the old input stream
cannot be reused (internal buffer is discarded and lost).
Lexical state is set to YYINITIAL.
the new input stream
Returns the length of the matched text region.
Resumes scanning until the next regular expression is matched,
the end of input is encountered or an I/O-Error occurs.
the next token, on end of stream
if any I/O-Error occurs
This character denotes the end of file
This class implements StandardTokenizer, except with a bug
(https://issues.apache.org/jira/browse/LUCENE-3358) where Han and Hiragana
characters would be split from combining characters:
@deprecated This class is only for exact backwards compatibility
This character denotes the end of file
initial size of the lookahead buffer
lexical states
ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l
ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l
at the beginning of a line
l is of the form l = 2*k, k a non-negative integer
Translates characters to character classes
Translates characters to character classes
Translates DFA states to action switch labels.
Translates a state to a row index in the transition table
The transition table of the DFA
ZZ_ATTRIBUTE[aState] contains the attributes of state aState
the input device
the current state of the DFA
the current lexical state
this buffer contains the current text to be matched and is
the source of the YyText() string
the textposition at the last accepting state
the current text position in the buffer
startRead marks the beginning of the YyText string in the buffer
endRead marks the last character in the buffer, that has been read
from input
the number of characters up to the start of the matched text
zzAtEOF == true <=> the scanner is at the EOF
Alphanumeric sequences
Numbers
Chars in class \p{Line_Break = Complex_Context} are from South East Asian
scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept
together as a single token rather than broken up, because the logic
required to break them at word boundaries is too complex for UAX#29.
See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
Fills ICharTermAttribute with the current token text.
Creates a new scanner
the TextReader to read input from.
Unpacks the compressed character translation table.
the packed character translation table
the unpacked character translation table
Refills the input buffer.
false, iff there was new input.
if any I/O-Error occurs
Disposes the input stream.
Resets the scanner to read from a new input stream.
Does not close the old reader.
All internal variables are reset, the old input stream
cannot be reused (internal buffer is discarded and lost).
Lexical state is set to .
Internal scan buffer is resized down to its initial length, if it has grown.
the new input stream
Returns the current lexical state.
Enters a new lexical state
the new lexical state
Returns the text matched by the current regular expression.
Returns the character at position from the
matched text.
It is equivalent to YyText[pos], but faster
the position of the character to fetch.
A value from 0 to YyLength-1.
the character at position pos
Returns the length of the matched text region.
Reports an error that occurred while scanning.
In a well-formed scanner (no or only correct usage of
YyPushBack(int) and a match-all fallback rule) this method
will only be called with things that "Can't Possibly Happen".
If this method is called, something is seriously wrong
(e.g. a JFlex bug producing a faulty scanner etc.).
Usual syntax/scanner level error handling should be done
in error fallback rules.
the code of the errormessage to display
Pushes the specified amount of characters back into the input stream.
They will be read again by the next call of the scanning method
the number of characters to be read again.
This number must not be greater than YyLength!
Resumes scanning until the next regular expression is matched,
the end of input is encountered or an I/O-Error occurs.
the next token
if any I/O-Error occurs
This class implements UAX29URLEmailTokenizer, except with a bug
(https://issues.apache.org/jira/browse/LUCENE-3358) where Han and Hiragana
characters would be split from combining characters:
@deprecated This class is only for exact backwards compatibility
This character denotes the end of file
initial size of the lookahead buffer
lexical states
ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l
ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l
at the beginning of a line
l is of the form l = 2*k, k a non-negative integer
Translates characters to character classes
Translates characters to character classes
Translates DFA states to action switch labels.
The transition table of the DFA
error codes
error messages for the codes above
ZZ_ATTRIBUTE[aState] contains the attributes of state aState
the input device
the current state of the DFA
the current lexical state
this buffer contains the current text to be matched and is
the source of the YyText string
the textposition at the last accepting state
the current text position in the buffer
startRead marks the beginning of the YyText string in the buffer
endRead marks the last character in the buffer, that has been read
from input
the number of characters up to the start of the matched text
zzAtEOF == true <=> the scanner is at the EOF
Alphanumeric sequences
Numbers
Chars in class \p{Line_Break = Complex_Context} are from South East Asian
scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept
together as a single token rather than broken up, because the logic
required to break them at word boundaries is too complex for UAX#29.
See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
Fills with the current token text.
Creates a new scanner
the to read input from.
Unpacks the compressed character translation table.
the packed character translation table
the unpacked character translation table
Refills the input buffer.
false, iff there was new input.
if any I/O-Error occurs
Closes the input stream.
Resets the scanner to read from a new input stream.
Does not close the old reader.
All internal variables are reset, the old input stream
cannot be reused (internal buffer is discarded and lost).
Lexical state is set to .
Internal scan buffer is resized down to its initial length, if it has grown.
the new input stream
Returns the current lexical state.
Enters a new lexical state
the new lexical state
Returns the text matched by the current regular expression.
Returns the character at position pos from the
matched text.
It is equivalent to YyText[pos], but faster
the position of the character to fetch.
A value from 0 to YyLength-1.
the character at position pos
Returns the length of the matched text region.
Reports an error that occurred while scanning.
In a well-formed scanner (no or only correct usage of
YyPushBack(int) and a match-all fallback rule) this method
will only be called with things that "Can't Possibly Happen".
If this method is called, something is seriously wrong
(e.g. a JFlex bug producing a faulty scanner etc.).
Usual syntax/scanner level error handling should be done
in error fallback rules.
the code of the errormessage to display
Pushes the specified amount of characters back into the input stream.
They will be read again by the next call of the scanning method
the number of characters to be read again.
This number must not be greater than YyLength!
Resumes scanning until the next regular expression is matched,
the end of input is encountered or an I/O-Error occurs.
the next token
if any I/O-Error occurs
This class implements StandardTokenizer using Unicode 6.0.0.
@deprecated This class is only for exact backwards compatibility
This character denotes the end of file
initial size of the lookahead buffer
lexical states
ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l
ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l
at the beginning of a line
l is of the form l = 2*k, k a non negative integer
Translates characters to character classes
Translates characters to character classes
Translates DFA states to action switch labels.
Translates a state to a row index in the transition table
The transition table of the DFA
ZZ_ATTRIBUTE[aState] contains the attributes of state aState
the input device
the current state of the DFA
the current lexical state
this buffer contains the current text to be matched and is
the source of the YyText string
the text position at the last accepting state
the current text position in the buffer
startRead marks the beginning of the YyText string in the buffer
endRead marks the last character in the buffer that has been read
from input
the number of characters up to the start of the matched text
zzAtEOF == true <=> the scanner is at the EOF
Alphanumeric sequences
Numbers
Chars in class \p{Line_Break = Complex_Context} are from South East Asian
scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept
together as a single token rather than broken up, because the logic
required to break them at word boundaries is too complex for UAX#29.
See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
Fills ICharTermAttribute with the current token text.
Creates a new scanner
the TextReader to read input from.
Unpacks the compressed character translation table.
the packed character translation table
the unpacked character translation table
Refills the input buffer.
false, iff there was new input.
if any I/O-Error occurs
Disposes the input stream.
Resets the scanner to read from a new input stream.
Does not close the old reader.
All internal variables are reset, the old input stream
cannot be reused (internal buffer is discarded and lost).
Lexical state is set to .
Internal scan buffer is resized down to its initial length, if it has grown.
the new input stream
Returns the current lexical state.
Enters a new lexical state
the new lexical state
Returns the text matched by the current regular expression.
Returns the character at position from the
matched text.
It is equivalent to YyText[pos], but faster
the position of the character to fetch.
A value from 0 to YyLength-1.
the character at position pos
Returns the length of the matched text region.
Reports an error that occurred while scanning.
In a well-formed scanner (no or only correct usage of
YyPushBack(int) and a match-all fallback rule) this method
will only be called with things that "Can't Possibly Happen".
If this method is called, something is seriously wrong
Usual syntax/scanner level error handling should be done
in error fallback rules.
the code of the error message to display
Pushes the specified amount of characters back into the input stream.
They will be read again by the next call of the scanning method
the number of characters to be read again.
This number must not be greater than YyLength!
Resumes scanning until the next regular expression is matched,
the end of input is encountered or an I/O-Error occurs.
the next token
if any I/O-Error occurs
This class implements UAX29URLEmailTokenizer, except with a bug
(https://issues.apache.org/jira/browse/LUCENE-3880) where a "mailto:"
URI scheme prepended to an email address will disrupt recognition
of the email address.
@deprecated This class is only for exact backwards compatibility
This character denotes the end of file
initial size of the lookahead buffer
lexical states
ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l
ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l
at the beginning of a line
l is of the form l = 2*k, k a non negative integer
Translates characters to character classes
Translates characters to character classes
Translates DFA states to action switch labels.
Translates a state to a row index in the transition table
The transition table of the DFA
ZZ_ATTRIBUTE[aState] contains the attributes of state aState
the input device
the current state of the DFA
the current lexical state
this buffer contains the current text to be matched and is
the source of the YyText string
the text position at the last accepting state
the current text position in the buffer
startRead marks the beginning of the YyText string in the buffer
endRead marks the last character in the buffer that has been read
from input
the number of characters up to the start of the matched text
zzAtEOF == true <=> the scanner is at the EOF
Alphanumeric sequences
Numbers
Chars in class \p{Line_Break = Complex_Context} are from South East Asian
scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept
together as a single token rather than broken up, because the logic
required to break them at word boundaries is too complex for UAX#29.
See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
Fills ICharTermAttribute with the current token text.
Creates a new scanner
the TextReader to read input from.
Unpacks the compressed character translation table.
the packed character translation table
the unpacked character translation table
Refills the input buffer.
false, iff there was new input.
if any I/O-Error occurs
Disposes the input stream.
Resets the scanner to read from a new input stream.
Does not close the old reader.
All internal variables are reset, the old input stream
cannot be reused (internal buffer is discarded and lost).
Lexical state is set to .
Internal scan buffer is resized down to its initial length, if it has grown.
the new input stream
Returns the current lexical state.
Enters a new lexical state
the new lexical state
Returns the text matched by the current regular expression.
Returns the character at position from the
matched text.
It is equivalent to YyText[pos], but faster
the position of the character to fetch.
A value from 0 to YyLength-1.
the character at position pos
Returns the length of the matched text region.
Reports an error that occurred while scanning.
In a well-formed scanner (no or only correct usage of
YyPushBack(int) and a match-all fallback rule) this method
will only be called with things that "Can't Possibly Happen".
If this method is called, something is seriously wrong
(e.g. a JFlex bug producing a faulty scanner etc.).
Usual syntax/scanner level error handling should be done
in error fallback rules.
the code of the error message to display
Pushes the specified amount of characters back into the input stream.
They will be read again by the next call of the scanning method
the number of characters to be read again.
This number must not be greater than YyLength!
Resumes scanning until the next regular expression is matched,
the end of input is encountered or an I/O-Error occurs.
the next token
if any I/O-Error occurs
This class implements UAX29URLEmailTokenizer using Unicode 6.0.0.
@deprecated This class is only for exact backwards compatibility
This character denotes the end of file
initial size of the lookahead buffer
lexical states
ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l
ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l
at the beginning of a line
l is of the form l = 2*k, k a non negative integer
Translates characters to character classes
Translates characters to character classes
Translates DFA states to action switch labels.
Translates a state to a row index in the transition table
The transition table of the DFA
ZZ_ATTRIBUTE[aState] contains the attributes of state aState
the input device
the current state of the DFA
the current lexical state
this buffer contains the current text to be matched and is
the source of the YyText string
the text position at the last accepting state
the current text position in the buffer
startRead marks the beginning of the YyText string in the buffer
endRead marks the last character in the buffer that has been read
from input
the number of characters up to the start of the matched text
zzAtEOF == true <=> the scanner is at the EOF
Alphanumeric sequences
Numbers
Chars in class \p{Line_Break = Complex_Context} are from South East Asian
scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept
together as a single token rather than broken up, because the logic
required to break them at word boundaries is too complex for UAX#29.
See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
Fills ICharTermAttribute with the current token text.
Creates a new scanner
the TextReader to read input from.
Unpacks the compressed character translation table.
the packed character translation table
the unpacked character translation table
Refills the input buffer.
false, iff there was new input.
if any I/O-Error occurs
Disposes the input stream.
Resets the scanner to read from a new input stream.
Does not close the old reader.
All internal variables are reset, the old input stream
cannot be reused (internal buffer is discarded and lost).
Lexical state is set to .
Internal scan buffer is resized down to its initial length, if it has grown.
the new input stream
Returns the current lexical state.
Enters a new lexical state
the new lexical state
Returns the text matched by the current regular expression.
Returns the character at position from the
matched text.
It is equivalent to YyText[pos], but faster
the position of the character to fetch.
A value from 0 to YyLength-1.
the character at position pos
Returns the length of the matched text region.
Reports an error that occurred while scanning.
In a well-formed scanner (no or only correct usage of
YyPushBack(int) and a match-all fallback rule) this method
will only be called with things that "Can't Possibly Happen".
If this method is called, something is seriously wrong
(e.g. a JFlex bug producing a faulty scanner etc.).
Usual syntax/scanner level error handling should be done
in error fallback rules.
the code of the error message to display
Pushes the specified amount of characters back into the input stream.
They will be read again by the next call of the scanning method
the number of characters to be read again.
This number must not be greater than YyLength!
Resumes scanning until the next regular expression is matched,
the end of input is encountered or an I/O-Error occurs.
the next token
if any I/O-Error occurs
This class implements StandardTokenizer using Unicode 6.1.0.
@deprecated This class is only for exact backwards compatibility
This character denotes the end of file
initial size of the lookahead buffer
lexical states
ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l
ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l
at the beginning of a line
l is of the form l = 2*k, k a non negative integer
Translates characters to character classes
Translates characters to character classes
Translates DFA states to action switch labels.
Translates a state to a row index in the transition table
The transition table of the DFA
ZZ_ATTRIBUTE[aState] contains the attributes of state aState
the input device
the current state of the DFA
the current lexical state
this buffer contains the current text to be matched and is
the source of the YyText string
the text position at the last accepting state
the current text position in the buffer
startRead marks the beginning of the YyText string in the buffer
endRead marks the last character in the buffer that has been read
from input
the number of characters up to the start of the matched text
zzAtEOF == true <=> the scanner is at the EOF
Alphanumeric sequences
Numbers
Chars in class \p{Line_Break = Complex_Context} are from South East Asian
scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept
together as a single token rather than broken up, because the logic
required to break them at word boundaries is too complex for UAX#29.
See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
Fills ICharTermAttribute with the current token text.
Creates a new scanner
the TextReader to read input from.
Unpacks the compressed character translation table.
the packed character translation table
the unpacked character translation table
Refills the input buffer.
false, iff there was new input.
if any I/O-Error occurs
Disposes the input stream.
Resets the scanner to read from a new input stream.
Does not close the old reader.
All internal variables are reset, the old input stream
cannot be reused (internal buffer is discarded and lost).
Lexical state is set to .
Internal scan buffer is resized down to its initial length, if it has grown.
the new input stream
Returns the current lexical state.
Enters a new lexical state
the new lexical state
Returns the text matched by the current regular expression.
Returns the character at position from the
matched text.
It is equivalent to YyText[pos], but faster
the position of the character to fetch.
A value from 0 to YyLength-1.
the character at position pos
Returns the length of the matched text region.
Reports an error that occurred while scanning.
In a well-formed scanner (no or only correct usage of
YyPushBack(int) and a match-all fallback rule) this method
will only be called with things that "Can't Possibly Happen".
If this method is called, something is seriously wrong
(e.g. a JFlex bug producing a faulty scanner etc.).
Usual syntax/scanner level error handling should be done
in error fallback rules.
the code of the error message to display
Pushes the specified amount of characters back into the input stream.
They will be read again by the next call of the scanning method
the number of characters to be read again.
This number must not be greater than YyLength!
Resumes scanning until the next regular expression is matched,
the end of input is encountered or an I/O-Error occurs.
the next token
if any I/O-Error occurs
This class implements using Unicode 6.1.0.
@deprecated This class is only for exact backwards compatibility
This character denotes the end of file
initial size of the lookahead buffer
lexical states
ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l
ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l
at the beginning of a line
l is of the form l = 2*k, k a non negative integer
Translates characters to character classes
Translates characters to character classes
Translates DFA states to action switch labels.
Translates a state to a row index in the transition table
The transition table of the DFA
ZZ_ATTRIBUTE[aState] contains the attributes of state aState
the input device
the current state of the DFA
the current lexical state
this buffer contains the current text to be matched and is
the source of the YyText string
the text position at the last accepting state
the current text position in the buffer
startRead marks the beginning of the YyText string in the buffer
endRead marks the last character in the buffer that has been read
from input
the number of characters up to the start of the matched text
zzAtEOF == true <=> the scanner is at the EOF
Alphanumeric sequences
Numbers
Chars in class \p{Line_Break = Complex_Context} are from South East Asian
scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept
together as a single token rather than broken up, because the logic
required to break them at word boundaries is too complex for UAX#29.
See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
Fills ICharTermAttribute with the current token text.
Creates a new scanner
the TextReader to read input from.
Unpacks the compressed character translation table.
the packed character translation table
the unpacked character translation table
Refills the input buffer.
false, iff there was new input.
if any I/O-Error occurs
Disposes the input stream.
Resets the scanner to read from a new input stream.
Does not close the old reader.
All internal variables are reset, the old input stream
cannot be reused (internal buffer is discarded and lost).
Lexical state is set to .
Internal scan buffer is resized down to its initial length, if it has grown.
the new input stream
Returns the current lexical state.
Enters a new lexical state
the new lexical state
Returns the text matched by the current regular expression.
Returns the character at position from the
matched text.
It is equivalent to YyText[pos], but faster
the position of the character to fetch.
A value from 0 to YyLength-1.
the character at position pos
Returns the length of the matched text region.
Reports an error that occurred while scanning.
In a well-formed scanner (no or only correct usage of
YyPushBack(int) and a match-all fallback rule) this method
will only be called with things that "Can't Possibly Happen".
If this method is called, something is seriously wrong
(e.g. a JFlex bug producing a faulty scanner etc.).
Usual syntax/scanner level error handling should be done
in error fallback rules.
the code of the error message to display
Pushes the specified amount of characters back into the input stream.
They will be read again by the next call of the scanning method
the number of characters to be read again.
This number must not be greater than YyLength!
Resumes scanning until the next regular expression is matched,
the end of input is encountered or an I/O-Error occurs.
the next token
if any I/O-Error occurs
Filters
with ,
and
, using a list of
English stop words.
You must specify the required
compatibility when creating
Default maximum allowed token length
An unmodifiable set containing some common English words that are usually not
useful for searching.
Builds an analyzer with the given stop words.
Lucene version to match - See
stop words
Builds an analyzer with the default stop words: .
Lucene version to match - See
Builds an analyzer with the stop words from the given reader.
Lucene version to match - See
to read stop words from
Set maximum allowed token length. If a token is seen
that exceeds this length then it is discarded. This
setting only takes effect the next time tokenStream is called.
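As a rough illustration of the options above, the following is a minimal sketch (not part of this library's documentation) showing how such an analyzer might be constructed with a custom stop word set and a token-length limit; the StandardAnalyzer and CharArraySet names, namespaces, and the MaxTokenLength property are assumptions based on the public Lucene.NET 4.8 API.
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Hedged sketch: an analyzer with a custom stop word set and a max token length.
var stopWords = new CharArraySet(LuceneVersion.LUCENE_48,
    new[] { "the", "and", "of" }, true /* ignoreCase */);

using var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48, stopWords)
{
    MaxTokenLength = 255 // tokens longer than this are discarded
};

using TokenStream ts = analyzer.GetTokenStream("body",
    new StringReader("The quick and lazy fox of Lucene.NET"));
var term = ts.AddAttribute<ICharTermAttribute>();

ts.Reset();
while (ts.IncrementToken())
{
    Console.WriteLine(term); // roughly: quick, lazy, fox, lucene.net
}
ts.End();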
This class implements Word Break rules from the Unicode Text Segmentation
algorithm, as specified in
Unicode Standard Annex #29
URLs and email addresses are also tokenized according to the relevant RFCs.
Tokens produced are of the following types:
- <ALPHANUM>: A sequence of alphabetic and numeric characters
- <NUM>: A number
- <URL>: A URL
- <EMAIL>: An email address
- <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast
Asian languages, including Thai, Lao, Myanmar, and Khmer
- <IDEOGRAPHIC>: A single CJKV ideographic character
- <HIRAGANA>: A single hiragana character
You must specify the required
compatibility when creating :
- As of 3.4, Hiragana and Han characters are no longer wrongly split
from their combining characters. If you use a previous version number,
you get the exact broken behavior for backwards compatibility.
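To make the token typing above concrete, here is a small hedged sketch (not from the original documentation) that runs the tokenizer over mixed text and prints each term with its type; it assumes the Lucene.NET 4.8 UAX29URLEmailTokenizer, ICharTermAttribute, and ITypeAttribute APIs.
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

var reader = new StringReader("Contact admin@example.com or visit https://example.com/docs");
using var tokenizer = new UAX29URLEmailTokenizer(LuceneVersion.LUCENE_48, reader);

var term = tokenizer.AddAttribute<ICharTermAttribute>();
var type = tokenizer.AddAttribute<ITypeAttribute>();

tokenizer.Reset();
while (tokenizer.IncrementToken())
{
    // e.g. "admin@example.com" -> <EMAIL>, "https://example.com/docs" -> <URL>
    Console.WriteLine($"{term} [{type.Type}]");
}
tokenizer.End();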
A private instance of the JFlex-constructed scanner
String token types that correspond to token type int constants
Set the max allowed token length. Any token longer
than this is skipped.
Creates a new instance of the . Attaches
the to the newly created JFlex scanner.
Lucene compatibility version
The input reader
Creates a new with a given
LUCENENET specific: This method was added in .NET to prevent having to repeat code in the constructors.
Factory for .
<fieldType name="text_urlemail" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.UAX29URLEmailTokenizerFactory" maxTokenLength="255"/>
</analyzer>
</fieldType>
Creates a new
This class implements Word Break rules from the Unicode Text Segmentation
algorithm, as specified in
Unicode Standard Annex #29
URLs and email addresses are also tokenized according to the relevant RFCs.
Tokens produced are of the following types:
- <ALPHANUM>: A sequence of alphabetic and numeric characters
- <NUM>: A number
- <URL>: A URL
- <EMAIL>: An email address
- <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast
Asian languages, including Thai, Lao, Myanmar, and Khmer
- <IDEOGRAPHIC>: A single CJKV ideographic character
- <HIRAGANA>: A single hiragana character
- <KATAKANA>: A sequence of katakana characters
- <HANGUL>: A sequence of Hangul characters
This character denotes the end of file
initial size of the lookahead buffer
lexical states
ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l
ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l
at the beginning of a line
l is of the form l = 2*k, k a non negative integer
Translates characters to character classes
Translates characters to character classes
Translates DFA states to action switch labels.
Translates a state to a row index in the transition table
The transition table of the DFA
ZZ_ATTRIBUTE[aState] contains the attributes of state aState
the input device
the current state of the DFA
the current lexical state
this buffer contains the current text to be matched and is
the source of the YyText string
the text position at the last accepting state
the current text position in the buffer
startRead marks the beginning of the YyText string in the buffer
endRead marks the last character in the buffer that has been read
from input
the number of characters up to the start of the matched text
zzAtEOF == true <=> the scanner is at the EOF
Alphanumeric sequences
Numbers
Chars in class \p{Line_Break = Complex_Context} are from South East Asian
scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept
together as a single token rather than broken up, because the logic
required to break them at word boundaries is too complex for UAX#29.
See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
Fills ICharTermAttribute with the current token text.
Creates a new scanner
the TextReader to read input from.
Unpacks the compressed character translation table.
the packed character translation table
the unpacked character translation table
Refills the input buffer.
false, iff there was new input.
if any I/O-Error occurs
Disposes the input stream.
Resets the scanner to read from a new input stream.
Does not close the old reader.
All internal variables are reset, the old input stream
cannot be reused (internal buffer is discarded and lost).
Lexical state is set to .
Internal scan buffer is resized down to its initial length, if it has grown.
the new input stream
Returns the current lexical state.
Enters a new lexical state
the new lexical state
Returns the text matched by the current regular expression.
Returns the character at position from the
matched text.
It is equivalent to YyText[pos], but faster
the position of the character to fetch.
A value from 0 to YyLength-1.
the character at position pos
Returns the length of the matched text region.
Reports an error that occurred while scanning.
In a well-formed scanner (no or only correct usage of
YyPushBack(int) and a match-all fallback rule) this method
will only be called with things that "Can't Possibly Happen".
If this method is called, something is seriously wrong
(e.g. a JFlex bug producing a faulty scanner etc.).
Usual syntax/scanner level error handling should be done
in error fallback rules.
the code of the error message to display
Pushes the specified amount of characters back into the input stream.
They will be read again by the next call of the scanning method
the number of characters to be read again.
This number must not be greater than YyLength!
Resumes scanning until the next regular expression is matched,
the end of input is encountered or an I/O-Error occurs.
the next token
if any I/O-Error occurs
for Swedish.
File containing default Swedish stopwords.
Returns an unmodifiable instance of the default stop words set.
default stop words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
lucene compatibility version
Builds an analyzer with the given stop words.
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
stemming.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates a
which tokenizes all the text in the provided .
A
built from an filtered with
, , ,
if a stem exclusion set is
provided and .
A that applies to stem Swedish
words.
To prevent terms from being stemmed use an instance of
or a custom that sets
the before this .
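The keyword-marking approach mentioned above can be sketched roughly as follows (a hedged example, not taken from this library's docs); the StandardTokenizer, LowerCaseFilter, SetKeywordMarkerFilter, and SwedishLightStemFilter names and signatures are assumed from the Lucene.NET 4.8 API.
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Sv;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Terms in this set keep their original form; everything else is light-stemmed.
var keywords = new CharArraySet(LuceneVersion.LUCENE_48, new[] { "jordnära" }, true);

TokenStream Chain(TextReader reader)
{
    TokenStream ts = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
    ts = new LowerCaseFilter(LuceneVersion.LUCENE_48, ts);
    ts = new SetKeywordMarkerFilter(ts, keywords); // marks protected terms
    return new SwedishLightStemFilter(ts);         // skips keyword-marked terms
}

using TokenStream stream = Chain(new StringReader("jordnära bilarna"));
var term = stream.AddAttribute<ICharTermAttribute>();
stream.Reset();
while (stream.IncrementToken()) Console.WriteLine(term);
stream.End();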
Factory for .
<fieldType name="text_svlgtstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SwedishLightStemFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Light Stemmer for Swedish.
This stemmer implements the algorithm described in:
Report on CLEF-2003 Monolingual Tracks
Jacques Savoy
Load synonyms with the given class.
handles multi-token synonyms with variable position increment offsets.
The matched tokens from the input stream may be optionally passed through (includeOrig=true)
or discarded. If the original tokens are included, the position increments may be modified
to retain absolute positions after merging with the synonym tokenstream.
Generated synonyms will start at the same position as the first matched source token.
@deprecated (3.4) use SynonymFilterFactory instead. only for precise index backwards compatibility. this factory will be removed in Lucene 5.0
Factory for (only used with luceneMatchVersion < 3.4)
<fieldType name="text_synonym" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="false"
expand="true" tokenizerFactory="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
@deprecated (3.4) use SynonymFilterFactory instead. only for precise index backwards compatibility. this factory will be removed in Lucene 5.0
a list of all rules
Splits a backslash escaped string on the separator.
Current backslash escaping supported:
\n \t \r \b \f are escaped the same as a .NET string
Other characters following a backslash are produced verbatim (\c => c)
the string to split
the separator to split on
decode backslash escaping
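For illustration only, the splitting behavior described above could be approximated by something like the following sketch (this is not the internal helper itself, just a hedged re-statement of the rules in C#):
using System;
using System.Collections.Generic;
using System.Text;

static IList<string> SplitEscaped(string s, char separator)
{
    var result = new List<string>();
    var sb = new StringBuilder();
    for (int i = 0; i < s.Length; i++)
    {
        char ch = s[i];
        if (ch == '\\' && i + 1 < s.Length)
        {
            char next = s[++i];
            sb.Append(next switch
            {
                'n' => '\n', 't' => '\t', 'r' => '\r', 'b' => '\b', 'f' => '\f',
                _ => next // any other escaped character is produced verbatim (\c => c)
            });
        }
        else if (ch == separator)
        {
            result.Add(sb.ToString());
            sb.Clear();
        }
        else
        {
            sb.Append(ch);
        }
    }
    result.Add(sb.ToString());
    return result;
}

Console.WriteLine(string.Join(" | ", SplitEscaped(@"i\,pod,ipod", ','))); // i,pod | ipod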
Mapping rules for use with
@deprecated (3.4) use instead. only for precise index backwards compatibility. this factory will be removed in Lucene 5.0
@lucene.internal
@lucene.internal
, the sequence of strings to match
the list of tokens to use on a match
sets a flag on this mapping signaling the generation of matched tokens in addition to the replacement tokens
merge the replacement tokens with any other mappings that exist
Produces a from a
Merge two lists of tokens, producing a single list with manipulated positionIncrements so that
the tokens end up at the same position.
Example: [a b] merged with [c d] produces [a/b c/d] ('/' denotes tokens in the same position)
Example: [a,5 b,2] merged with [c d,4 e,4] produces [c a,5/d b,2 e,2] (a,n means a has posInc=n)
Parser for the Solr synonyms format.
- Blank lines and lines starting with '#' are comments.
- Explicit mappings match any token sequence on the LHS of "=>"
and replace with all alternatives on the RHS. These types of mappings
ignore the expand parameter in the constructor.
Example:
i-pod, i pod => ipod
- Equivalent synonyms may be separated with commas and give
no explicit mapping. In this case the mapping behavior will
be taken from the expand parameter in the constructor. This allows
the same synonym file to be used in different synonym handling strategies.
Example:
ipod, i-pod, i pod
- Multiple synonym mapping entries are merged.
Example:
foo => foo bar
foo => baz
is equivalent to
foo => foo bar, baz
@lucene.experimental
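As a hedged illustration (not from the original docs), rules in this format might be parsed into a synonym map roughly like this; the SolrSynonymParser and WhitespaceAnalyzer names and signatures are assumptions based on the Lucene.NET 4.8 API.
using System.IO;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Synonym;
using Lucene.Net.Util;

var rules = @"
# comment lines and blank lines are ignored
i-pod, i pod => ipod
ipod, i-pod, i pod
foo => foo bar
foo => baz";

// dedup: true, expand: true; the analyzer is used to tokenize each raw synonym.
var parser = new SolrSynonymParser(true, true, new WhitespaceAnalyzer(LuceneVersion.LUCENE_48));
parser.Parse(new StringReader(rules));
SynonymMap map = parser.Build(); // ready to hand to a SynonymFilter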
Matches single or multi word synonyms in a token stream.
This token stream cannot properly handle position
increments != 1, ie, you should place this filter before
filtering out stop words.
Note that with the current implementation, parsing is
greedy, so whenever multiple parses would apply, the rule
starting the earliest and parsing the most tokens wins.
For example if you have these rules:
a -> x
a b -> y
b c d -> z
Then input a b c d e parses to y b c
d, i.e. the 2nd rule "wins" because it started
earliest and matched more input tokens than the other
rules starting at that point.
A future improvement to this filter could allow
non-greedy parsing, such that the 3rd rule would win, and
also separately allow multiple parses, such that all 3
rules would match, perhaps even on a rule by rule
basis.
NOTE: when a match occurs, the output tokens
associated with the matching rule are "stacked" on top of
the input stream (if the rule had
keepOrig=true) and also on top of another
matched rule's output tokens. This is not a correct
solution, as really the output should be an arbitrary
graph/lattice. For example, with the above match, you
would expect an exact "y b
c" to match the parsed tokens, but it will fail to
do so. This limitation is necessary because Lucene's
(and index) cannot yet represent an arbitrary
graph.
NOTE: If multiple incoming tokens arrive on the
same position, only the first token at that position is
used for parsing. Subsequent tokens simply pass through
and are not parsed. A future improvement would be to
allow these tokens to also be matched.
input tokenstream
synonym map
case-folds input for matching with
in using .
Note, if you set this to true, it's your responsibility to lowercase
the input entries when you create the .
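A hedged sketch of wiring this filter into a small chain and observing the stacked output (class and member names assumed from the Lucene.NET 4.8 API; see also the parser sketch above for building a map from a synonyms file):
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Synonym;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

var builder = new SynonymMap.Builder(true /* dedup */);
builder.Add(new CharsRef("wifi"), new CharsRef("wireless"), true /* includeOrig */);
SynonymMap map = builder.Build();

TokenStream ts = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, new StringReader("free wifi here"));
// Place the synonym filter before any stop filter, since it cannot handle
// position increments != 1 (see the note above).
ts = new SynonymFilter(ts, map, false /* ignoreCase */);

var term = ts.AddAttribute<ICharTermAttribute>();
var posInc = ts.AddAttribute<IPositionIncrementAttribute>();
ts.Reset();
while (ts.IncrementToken())
{
    // "wireless" comes out stacked on the same position as "wifi" (increment 0).
    Console.WriteLine($"{term} (+{posInc.PositionIncrement})");
}
ts.End();
ts.Dispose();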
Factory for .
<fieldType name="text_synonym" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
format="solr" ignoreCase="false" expand="true"
tokenizerFactory="solr.WhitespaceTokenizerFactory"
[optional tokenizer factory parameters]/>
</analyzer>
</fieldType>
An optional param name prefix of "tokenizerFactory." may be used for any
init params that the needs to pass to the specified
. If the expects an init parameter with
the same name as an init param used by the , the prefix
is mandatory.
The optional format parameter controls how the synonyms will be parsed:
It supports the short names of solr for
and wordnet for and , or your own
class name. The default is solr.
A custom is expected to have a constructor taking:
- dedup - true if duplicates should be ignored, false otherwise
- expand - true if conflation groups should be expanded, false if they are one-directional
- analyzer - an analyzer used for each raw synonym
Access to the delegator for test verification
@deprecated Method exists only for testing 4x, will be removed in 5.0
@lucene.internal
A map of synonyms, keys and values are phrases.
@lucene.experimental
for multiword support, you must separate words with this separator
map<input word, list<ord>>
map<ord, outputword>
maxHorizontalContext: maximum context we need on the tokenstream
Builds an FSTSynonymMap.
Call until you have added all the mappings, then call to get an FSTSynonymMap
@lucene.experimental
If dedup is true then identical rules (same input,
same output) will be added only once.
Sugar: just joins the provided terms with
. reuse and its chars
must not be null.
only used for asserting!
Add a phrase->phrase synonym mapping.
Phrases are character sequences where words are
separated with character zero (U+0000). Empty words
(two U+0000s in a row) are not allowed in the input nor
the output!
input phrase
output phrase
true if the original should be included
Builds an and returns it.
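A minimal hedged sketch of the builder workflow just described, including the U+0000 word separator handled by the Join sugar (names assumed from the Lucene.NET 4.8 API):
using Lucene.Net.Analysis.Synonym;
using Lucene.Net.Util;

var builder = new SynonymMap.Builder(true /* dedup: identical rules are added once */);

// "i pod" => "ipod": Join concatenates the words with the U+0000 separator.
CharsRef input = SynonymMap.Builder.Join(new[] { "i", "pod" }, new CharsRef());
builder.Add(input, new CharsRef("ipod"), false /* includeOrig */);

SynonymMap map = builder.Build(); // ready to hand to a SynonymFilter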
Abstraction for parsing synonym files.
@lucene.experimental
Parse the given input, adding synonyms to the inherited .
The input to parse
Sugar: analyzes the text with the analyzer and
separates by .
reuse and its chars must not be null.
Parser for wordnet prolog format
See http://wordnet.princeton.edu/man/prologdb.5WN.html for a description of the format.
@lucene.experimental
Strips all characters after an apostrophe (including the apostrophe itself).
In Turkish, apostrophe is used to separate suffixes from proper names
(continent, sea, river, lake, mountain, upland, proper names related to
religion and mythology). This filter intended to be used before stem filters.
For more information, see
Role of Apostrophes in Turkish Information Retrieval
Factory for .
<fieldType name="text_tr_lower_apostrophes" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ApostropheFilterFactory"/>
<filter class="solr.TurkishLowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
for Turkish.
File containing default Turkish stopwords.
The comment character in the stopwords file.
All lines prefixed with this will be ignored.
Returns an unmodifiable instance of the default stop words set.
default stop words set.
Atomically loads the in a lazy fashion once the outer class
accesses the static final set the first time.
Builds an analyzer with the default stop words: .
Builds an analyzer with the given stop words.
lucene compatibility version
a stopword set
Builds an analyzer with the given stop words. If a non-empty stem exclusion set is
provided this analyzer will add a before
stemming.
lucene compatibility version
a stopword set
a set of terms not to be stemmed
Creates a
which tokenizes all the text in the provided .
A
built from an filtered with
, ,
, if a stem
exclusion set is provided and .
Normalizes Turkish token text to lower case.
Turkish and Azeri have unique casing behavior for some characters. This
filter applies Turkish lowercase rules. For more information, see http://en.wikipedia.org/wiki/Turkish_dotted_and_dotless_I
Create a new , that normalizes Turkish token text
to lower case.
to filter
lookahead for a combining dot above.
other NSMs may be in between.
delete a character in-place.
rarely happens, only if is found after an i
Factory for .
<fieldType name="text_trlwr" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.TurkishLowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Creates a new
Abstract parent class for analysis factories ,
and .
The typical lifecycle for a factory consumer is:
- Create factory via its constructor (or via XXXFactory.ForName)
- (Optional) If the factory uses resources such as files,
is called to initialize those resources.
- Consumer calls create() to obtain instances.
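A rough, hedged sketch of that lifecycle (the ForName/Inform/Create members and the FilesystemResourceLoader type are assumptions based on the Lucene.NET 4.8 analysis-factory API and may differ in your version):
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// 1. Create the factory from its SPI name and its key/value args.
var args = new Dictionary<string, string> { { "luceneMatchVersion", "4.8" } };
TokenFilterFactory factory = TokenFilterFactory.ForName("lowercase", args);

// 2. (Optional) If the factory loads resources such as files, hand it a resource
//    loader before first use.
if (factory is IResourceLoaderAware aware)
{
    aware.Inform(new FilesystemResourceLoader(new DirectoryInfo(".")));
}

// 3. Ask the factory for instances.
TokenStream ts = factory.Create(
    new WhitespaceTokenizer(LuceneVersion.LUCENE_48, new StringReader("Hello World")));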
The original args, before any processing
the luceneVersion arg
Initialize this factory via a set of key-value pairs.
this method can be called in the
or methods,
to inform the user that for this factory a is required
NOTE: This was requireInt() in Lucene
NOTE: This was getInt() in Lucene
NOTE: This was requireFloat() in Lucene
NOTE: This was getFloat() in Lucene
Returns whitespace- and/or comma-separated set of values, or null if none are found
Compiles a pattern for the value of the specified argument key
Gets a value of the specified argument key .
To specify the invariant culture, pass the string "invariant".
LUCENENET specific
Returns as from wordFiles, which
can be a comma-separated list of filenames
Returns the resource's lines (with content treated as UTF-8)
Same as ,
except the input is in snowball format.
Splits file names separated by comma character.
File names can contain comma characters escaped by backslash '\'
the string containing file names
a list of file names with the escaping backslashes removed
the string used to specify the concrete class name in a serialized representation: the class arg.
If the concrete class name was not specified via a class arg, returns GetType().Name.
Helper class for loading named SPIs from classpath (e.g. Tokenizers, TokenStreams).
@lucene.internal
Reloads the internal SPI list.
Changes to the service list are visible after the method ends, all
iterators (e.g., from ,...) stay consistent.
NOTE: Only new service providers are added, existing ones are
never removed or replaced.
this method is expensive and should only be called for discovery
of new service providers on the given classpath/classloader!
LUCENENET specific class to mimic Java's BufferedReader (that is, a reader that is seekable)
so it supports Mark() and Reset() (which are part of the Java Reader class), but also
provide the Correct() method of BaseCharFilter.
The object used to synchronize access to the reader.
The characters that can be read and refilled in bulk. We maintain three
indices into this buffer:
{ X X X X X X X X X X X X - - }
^ ^ ^
| | |
mark pos end
Pos points to the next readable character. End is one greater than the
last readable character. When pos == end, the buffer is empty and
must be filled before characters can be read.
Mark is the value pos will be set to on calls to
. Its value is in the range [0...pos]. If the mark is -1, the
buffer cannot be reset.
MarkLimit limits the distance between the mark and the pos. When this
limit is exceeded, is permitted (but not required) to
throw an exception. For shorter distances, shall not throw
(unless the reader is closed).
LUCENENET specific to throw an exception if the user calls instead of
Creates a buffering character-input stream that uses a default-sized input buffer.
A TextReader
Creates a buffering character-input stream that uses an input buffer of the specified size.
A TextReader
Input-buffer size
Disposes this reader. This implementation closes the buffered source reader
and releases the buffer. Nothing is done if this reader has already been
disposed.
if an error occurs while closing this reader.
Populates the buffer with data. It is an error to call this method when
the buffer still contains data; i.e., if pos < end.
the number of bytes read into the buffer, or -1 if the end of the
source stream has been reached.
Checks to make sure that the stream has not been closed
Indicates whether or not this reader is closed.
Sets a mark position in this reader. The parameter
indicates how many characters can be read before the mark is invalidated.
Calling will reposition the reader back to the marked
position if has not been surpassed.
the number of characters that can be read before the mark is
invalidated.
if markLimit < 0
if an error occurs while setting a mark in this reader.
Indicates whether this reader supports the and
methods. This implementation returns true.
Reads a single character from this reader and returns it with the two
higher-order bytes set to 0. If possible, returns a
character from the buffer. If there are no characters available in the
buffer, it fills the buffer and then returns a character. It returns -1
if there are no more characters in the source reader.
The character read or -1 if the end of the source reader has been reached.
If this reader is disposed or some other I/O error occurs.
Reads at most characters from this reader and stores them
at in the character array . Returns the
number of characters actually read or -1 if the end of the source reader
has been reached. If all the buffered characters have been used, a mark
has not been set and the requested number of characters is larger than
this reader's buffer size, BufferedReader bypasses the buffer and simply
places the results directly into .
the character array to store the characters read.
the initial position in to store the bytes read from this reader.
the maximum number of characters to read, must be non-negative.
number of characters read or -1 if the end of the source reader has been reached.
if offset < 0 or length < 0, or if
offset + length is greater than the size of
.
if this reader is disposed or some other I/O error occurs.
Returns the next line of text available from this reader. A line is
represented by zero or more characters followed by '\n',
'\r', "\r\n" or the end of the reader. The string does
not include the newline sequence.
The contents of the line or null if no characters were
read before the end of the reader has been reached.
if this reader is disposed or some other I/O error occurs.
Indicates whether this reader is ready to be read without blocking.
true if this reader will not block when is
called, false if unknown or blocking will occur.
Resets this reader's position to the last location.
Invocations of and will occur from this new
location.
If this reader is disposed or no mark has been set.
Skips characters in this reader. Subsequent
s will not return these characters unless
is used. Skipping characters may invalidate a mark if
is surpassed.
the maximum number of characters to skip.
the number of characters actually skipped.
if amount < 0.
If this reader is disposed or some other I/O error occurs.
Reads a single character from this reader and returns it with the two
higher-order bytes set to 0. If possible, returns a
character from the buffer. If there are no characters available in the
buffer, it fills the buffer and then returns a character. It returns -1
if there are no more characters in the source reader. Unlike ,
this method does not advance the current position.
The character read or -1 if the end of the source reader has been reached.
If this reader is disposed or some other I/O error occurs.
Not supported.
In all cases.
Not supported.
In all cases.
Not supported.
In all cases.
Not supported.
In all cases.
Not supported.
In all cases.
Not supported.
In all cases.
Not supported.
In all cases.
Not supported.
In all cases.
Not supported.
In all cases.
Not supported.
In all cases.
Not supported.
In all cases.
Not supported.
The call didn't originate from within .
provides a unified interface to Character-related
operations to implement backwards compatible character operations based on a
instance.
@lucene.internal
Returns a implementation according to the given
instance.
a version instance
a implementation according to the given
instance.
Return a instance compatible with Java 1.4.
Returns the code point at the given index of the .
Depending on the passed to
this method mimics the behavior
of Character.CodePointAt(char[], int) as it would have been
available on a Java 1.4 JVM or on a later virtual machine version.
a character sequence
the offset to the char values in the chars array to be converted
the Unicode code point at the given index
- if the sequence is null.
- if the value offset is negative or not less than the length of
the character sequence.
Returns the code point at the given index of the .
Depending on the passed to
this method mimics the behavior
of Character.CodePointAt(char[], int) as it would have been
available on a Java 1.4 JVM or on a later virtual machine version.
a character sequence
the offset to the char values in the chars array to be converted
the Unicode code point at the given index
- if the sequence is null.
- if the value offset is negative or not less than the length of
the character sequence.
Returns the code point at the given index of the char array where only elements
with index less than the limit are used.
Depending on the passed to
this method mimics the behavior
of Character.CodePointAt(char[], int) as it would have been
available on a Java 1.4 JVM or on a later virtual machine version.
a character array
the offset to the char values in the chars array to be converted
the index after the last element that should be used to calculate
the codepoint.
the Unicode code point at the given index
- if the array is null.
- if the value offset is negative or not less than the length of
the char array.
Return the number of characters in .
Return the number of characters in .
Return the number of characters in .
Return the number of characters in .
Creates a new and allocates a
of the given bufferSize.
the internal char buffer size, must be >= 2
a new instance.
Converts each unicode codepoint to lowerCase via in the invariant culture starting
at the given offset.
the char buffer to lowercase
the offset to start at
the number of characters in the buffer to lower case
Converts each unicode codepoint to UpperCase via in the invariant culture starting
at the given offset.
the char buffer to UPPERCASE
the offset to start at
the number of characters in the buffer to upper case
Converts a sequence of .NET characters to a sequence of unicode code points.
The number of code points written to the destination buffer.
Converts a sequence of unicode code points to a sequence of .NET characters.
the number of chars written to the destination buffer
Fills the with characters read from the given
reader . This method tries to read numChars
characters into the , each call to fill will start
filling the buffer from offset 0 up to .
In case code points can span across two characters, this method may
only fill numChars - 1 characters in order not to split in
the middle of a surrogate pair, even if there are remaining characters in
the .
Depending on the passed to
this method implements
supplementary character awareness when filling the given buffer. For all
> 3.0 guarantees
that the given will never contain a high surrogate
character as the last element in the buffer unless it is the last available
character in the reader. In other words, high and low surrogate pairs will
always be preserved across buffer borders.
A return value of false means that this method call exhausted
the reader, but there may be some bytes which have been read, which can be
verified by checking whether buffer.Length > 0.
the buffer to fill.
the reader to read characters from.
the number of chars to read
false
if and only if reader.read returned -1 while trying to fill the buffer
if the reader throws an .
Convenience method which calls Fill(buffer, reader, buffer.Buffer.Length).
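A hedged sketch tying these pieces together: obtain a version-aware instance, fill a CharacterBuffer from a reader without splitting surrogate pairs, and lowercase the filled region (member names assumed from the Lucene.NET 4.8 CharacterUtils API; note the buffer offset is 0 after each fill):
using System;
using System.IO;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

var utils = CharacterUtils.GetInstance(LuceneVersion.LUCENE_48);
var buffer = CharacterUtils.NewCharacterBuffer(1024); // internal char buffer, size >= 2

using var reader = new StringReader("Straße and \U0001F600 emoji");
while (utils.Fill(buffer, reader)) // true while the buffer was filled completely
{
    utils.ToLower(buffer.Buffer, buffer.Offset, buffer.Length);
    Console.Write(new string(buffer.Buffer, buffer.Offset, buffer.Length));
}
// A false return can still leave characters in the buffer (buffer.Length > 0),
// so flush whatever remains after the loop.
if (buffer.Length > 0)
{
    utils.ToLower(buffer.Buffer, buffer.Offset, buffer.Length);
    Console.Write(new string(buffer.Buffer, buffer.Offset, buffer.Length));
}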
Return the index within buf[start:start+count] which is by
code points from .
A simple IO buffer to use with
.
Returns the internal buffer
the buffer
Returns the data offset in the internal buffer.
the offset
Return the length of the data in the internal buffer starting at
the length
Resets the CharacterBuffer. All internals are reset to its default
values.
A simple class that stores text s as 's in a
hash table. Note that this is not a general purpose
class. For example, it cannot remove items from the
dictionary, nor does it resize its hash table to be smaller,
etc. It is designed to be quick to retrieve items
by keys without the necessity of converting
to a first.
You must specify the required
compatibility when creating :
- As of 3.1, supplementary characters are
properly lowercased.
Before 3.1 supplementary characters could not be
lowercased correctly due to the lack of Unicode 4
support in JDK 1.4. To use instances of
with the behavior before Lucene
3.1 pass a < 3.1 to the constructors.
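For a concrete (hedged) picture of the char[]-keyed design described above, here is a sketch using the companion CharArraySet, which shares the same lookup mechanics; the constructor and Contains overloads are assumed from the Lucene.NET 4.8 API:
using System;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

var set = new CharArraySet(LuceneVersion.LUCENE_48, 8, true /* ignoreCase */);
set.Add("the");
set.Add("Straße");

// The lookup works directly on a char[] slice; no string has to be allocated,
// and ignoreCase folds "The" to "the" (supplementary-aware for version >= 3.1).
char[] termBuffer = "The".ToCharArray();
Console.WriteLine(set.Contains(termBuffer, 0, termBuffer.Length)); // True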
Returns an empty, read-only dictionary.
LUCENENET: Moved this from CharArraySet so it doesn't need to know the generic type of CharArrayDictionary
LUCENENET SPECIFIC type used to act as a placeholder. Since null
means that our value is not populated, we need an instance of something
to indicate it is. Using an instance of would only work if
we could constrain it with the new() constraint, which isn't possible because
some types such as don't have a default constructor.
So, this is a workaround that allows any type regardless of the type of constructor.
Note also that we gain the ability to use value types for , but
also create a difference in behavior from Java Lucene where the actual values
returned could be null.
Create dictionary with enough capacity to hold terms.
lucene compatibility version - see for details.
the initial capacity
false if and only if the set should be case sensitive;
otherwise true.
is less than zero.
Creates a dictionary from the mappings in another dictionary.
compatibility match version see for details.
a dictionary () whose mappings to be copied.
false if and only if the set should be case sensitive;
otherwise true.
is null.
Creates a dictionary from the mappings in another dictionary.
compatibility match version see for details.
a dictionary () whose mappings to be copied.
false if and only if the set should be case sensitive;
otherwise true.
is null.
Creates a dictionary from the mappings in another dictionary.
compatibility match version see for details.
a dictionary () whose mappings to be copied.
false if and only if the set should be case sensitive;
otherwise true.
is null.
Create set from the supplied dictionary (used internally for readonly maps...)
Adds the for the passed in .
Note that the instance is not added to the dictionary.
A whose
will be added for the corresponding .
Adds the for the passed in .
The string-able type to be added/updated in the dictionary.
The corresponding value for the given .
is null.
An element with already exists in the dictionary.
Adds the for the passed in .
The string-able type to be added/updated in the dictionary.
The corresponding value for the given .
is null.
An element with already exists in the dictionary.
Adds the for the passed in .
The string-able type to be added/updated in the dictionary.
The corresponding value for the given .
is null.
-or-
The 's property returns false.
An element with already exists in the dictionary.
Adds the for the passed in .
The string-able type to be added/updated in the dictionary.
The corresponding value for the given .
is null.
An element with already exists in the dictionary.
Returns an unmodifiable . This allows providing
unmodifiable views of the internal dictionary for "read-only" use.
a new unmodifiable .
Clears all entries in this dictionary. This method is supported for reusing, but not
.
Not supported.
Copies all items in the current dictionary into the given array, starting at the specified index.
The array is assumed to already be dimensioned to fit the elements in this dictionary; otherwise a
will be thrown.
The array to copy the items into.
A 32-bit integer that represents the index in at which copying begins.
is null.
is less than zero.
The number of elements in the source is greater
than the available space from to the end of the destination array.
Copies all items in the current dictionary into the given array, starting at the specified index.
The array is assumed to already be dimensioned to fit the elements in this dictionary; otherwise a
will be thrown.
The array to copy the items into.
A 32-bit integer that represents the index in at which copying begins.
is null.
is less than zero.
The number of elements in the source is greater
than the available space from to the end of the destination array.
Copies all items in the current dictionary into the given array, starting at the specified index.
The array is assumed to already be dimensioned to fit the elements in this dictionary; otherwise a
will be thrown.
The array to copy the items into.
A 32-bit integer that represents the index in at which copying begins.
is null.
is less than zero.
The number of elements in the source is greater
than the available space from to the end of the destination array.
true if the chars of starting at
are in the
is null.
or is less than zero.
and refer to a position outside of .
true if the entire is the same as the
being passed in;
otherwise false.
is null.
true if the is in the ;
otherwise false
is null.
true if the is in the ;
otherwise false
is null.
-or-
The 's property returns false.
true if the (in the invariant culture)
is in the ; otherwise false
is null.
Returns the value of the mapping of chars of
starting at .
is null.
or is less than zero.
and refer to a position outside of .
The effective text is not found in the dictionary.
Returns the value of the mapping of the chars inside this .
is null.
is not found in the dictionary.
Returns the value of the mapping of the chars inside this .
is null.
-or-
The 's property returns false.
is not found in the dictionary.
Returns the value of the mapping of the chars inside this .
is null.
is not found in the dictionary.
Returns the value of the mapping of the chars inside this .
is null.
is not found in the dictionary.
Returns true if the is in the set.
is null.
-or-
The 's property returns false.
Returns true if the is in the set.
is null.
Add the given mapping.
If ignoreCase is true for this dictionary, the text array will be directly modified.
Note: The setter is more efficient than this method if
the is not required.
A text with which the specified is associated.
The position of the where the target text begins.
The total length of the .
The value to be associated with the specified .
The previous value associated with the text, or the default for the type of
parameter if there was no mapping for .
true if the mapping was added, false if the text already existed. The
will be populated if the result is false.
is null.
or is less than zero.
and refer to a position outside of .
Add the given mapping.
If ignoreCase is true for this dictionary, the text array will be directly modified.
The user should never modify this text array after calling this method.
Note: The setter is more efficient than this method if
the is not required.
A text with which the specified is associated.
The value to be associated with the specified .
The previous value associated with the text, or the default for the type of
parameter if there was no mapping for .
true if the mapping was added, false if the text already existed. The
will be populated if the result is false.
is null.
Add the given mapping.
Note: The setter is more efficient than this method if
the is not required.
A text with which the specified is associated.
The value to be associated with the specified .
The previous value associated with the text, or the default for the type of
parameter if there was no mapping for .
true if the mapping was added, false if the text already existed. The
will be populated if the result is false.
is null.
Add the given mapping.
Note: The setter is more efficient than this method if
the is not required.
A text with which the specified is associated.
The value to be associated with the specified .
The previous value associated with the text, or the default for the type of
parameter if there was no mapping for .
true if the mapping was added, false if the text already existed. The
will be populated if the result is false.
is null.
-or-
The 's property returns false.
Add the given mapping using the representation
of in the .
Note: The setter is more efficient than this method if
the is not required.
A text with which the specified is associated.
The value to be associated with the specified .
The previous value associated with the text, or the default for the type of
parameter if there was no mapping for .
true if the mapping was added, false if the text already existed. The
will be populated if the result is false.
is null.
Add the given mapping.
is null.
-or-
The 's property returns false.
Add the given mapping.
is null.
Add the given mapping.
is null.
LUCENENET specific. Centralizes the logic between Put()
implementations that accept a value and those that don't. This value is
so we know whether or not the value was set, since we can't reliably do
a check for null on the type.
is null.
Sets the value of the mapping of the chars inside this .
is null.
Sets the value of the mapping of the chars inside this .
is null.
Sets the value of the mapping of the chars inside this .
is null.
Sets the value of the mapping of the chars inside this .
is null.
Sets the value of the mapping of the chars inside this .
is null.
Sets the value of the mapping of chars of
starting at .
If ignoreCase is true for this dictionary, the text array will be directly modified.
A text with which the specified is associated.
The position of the where the target text begins.
The total length of the .
The value to be associated with the specified .
is null.
or is less than zero.
and refer to a position outside of .
Sets the value of the mapping of the chars inside this .
If ignoreCase is true for this dictionary, the text array will be directly modified.
The user should never modify this text array after calling this method.
is null.
Sets the value of the mapping of the chars inside this .
is null.
-or-
The 's property returns false.
Sets the value of the mapping of the chars inside this .
is null.
Sets the value of the mapping of the chars inside this .
is null.
LUCENENET specific. Like PutImpl, but doesn't have a return value or lookup to get the old value.
is null.
-or-
The 's property returns false.
LUCENENET specific. Like PutImpl, but doesn't have a return value or lookup to get the old value.
is null.
LUCENENET specific. Like PutImpl, but doesn't have a return value or lookup to get the old value.
is null.
LUCENENET specific. Like PutImpl, but doesn't have a return value or lookup to get the old value.
is null.
or is less than zero.
and refer to a position outside of .
This implementation enumerates over the specified 's
entries, and calls this dictionary's operation once for each entry.
If ignoreCase is true for this dictionary, the text arrays will be directly modified.
The user should never modify the text arrays after calling this method.
A dictionary of values to add/update in the current dictionary.
is null.
-or-
An element in the collection is null.
This implementation enumerates over the specified 's
entries, and calls this dictionary's operation once for each entry.
A dictionary of values to add/update in the current dictionary.
is null.
-or-
An element in the collection is null.
This implementation enumerates over the specified 's
entries, and calls this dictionary's operation once for each entry.
A dictionary of values to add/update in the current dictionary.
is null.
-or-
An element in the collection has a null text.
-or-
The text's property for a given element in the collection returns false.
This implementation enumerates over the specified 's
entries, and calls this dictionary's operation once for each entry.
A dictionary of values to add/update in the current dictionary.
is null.
-or-
An element in the collection is null.
This implementation enumerates over the specified 's
entries, and calls this dictionary's operation once for each entry.
The values to add/update in the current dictionary.
is null.
-or-
An element in the collection is null.
This implementation enumerates over the specified 's
entries, and calls this dictionary's operation once for each entry.
The values to add/update in the current dictionary.
is null.
-or-
An element in the collection is null.
This implementation enumerates over the specified 's
entries, and calls this dictionary's operation once for each entry.
The values to add/update in the current dictionary.
is null.
-or-
An element in the collection has a null text.
-or-
The text's property for a given element in the collection returns false.
This implementation enumerates over the specified 's
entries, and calls this dictionary's operation once for each entry.
The values to add/update in the current dictionary.
is null.
-or-
An element in the collection is null.
LUCENENET Specific - test for value equality similar to how it is done in Java
Another dictionary to test the values of
true if the given object is an that contains
the same text value pairs as the current dictionary
LUCENENET Specific - override required by .NET because we override Equals
to simulate Java's value equality checking.
The Lucene version corresponding to the compatibility behavior
that this instance emulates
Adds a placeholder with the given as the text.
Primarily for internal use by .
NOTE: If ignoreCase is true for this , the text array will be directly modified.
A key with which the placeholder is associated.
The position of the where the target text begins.
The total length of the .
true if the text was added, false if the text already existed.
is null.
or is less than zero.
and refer to a position outside of .
Adds a placeholder with the given as the text.
Primarily for internal use by .
NOTE: If ignoreCase is true for this , the text array will be directly modified.
The user should never modify this text array after calling this method.
true if the text was added, false if the text already existed.
is null.
Adds a placeholder with the given as the text.
Primarily for internal use by .
true if the text was added, false if the text already existed.
is null.
-or-
The 's property returns false.
Adds a placeholder with the given as the text.
Primarily for internal use by .
true if the text was added, false if the text already existed.
is null.
Adds a placeholder with the given as the text.
Primarily for internal use by .
true if the text was added, false if the text already existed.
is null.
Returns a copy of the current as a new instance of .
Preserves the value of matchVersion and ignoreCase from the current instance.
A copy of the current as a .
Returns a copy of the current as a new instance of
using the specified value. Preserves the value of ignoreCase from the current instance.
compatibility match version see Version
note above for details.
A copy of the current as a .
Returns a copy of the current as a new instance of
using the specified and values.
compatibility match version see Version
note above for details.
false if and only if the set should be case sensitive otherwise true.
A copy of the current as a .
Gets the value associated with the specified text.
The text of the value to get.
The position of the where the target text begins.
The total length of the .
When this method returns, contains the value associated with the specified text,
if the text is found; otherwise, the default value for the type of the value parameter.
This parameter is passed uninitialized.
true if the contains an element with the specified text; otherwise, false.
is null.
or is less than zero.
and refer to a position outside of .
Gets the value associated with the specified text.
The text of the value to get.
When this method returns, contains the value associated with the specified text,
if the text is found; otherwise, the default value for the type of the value parameter.
This parameter is passed uninitialized.
true if the contains an element with the specified text; otherwise, false.
is null.
Gets the value associated with the specified text.
The text of the value to get.
When this method returns, contains the value associated with the specified text,
if the text is found; otherwise, the default value for the type of the value parameter.
This parameter is passed uninitialized.
true if the contains an element with the specified text; otherwise, false.
is null.
-or-
The 's property returns false.
Gets the value associated with the specified text.
The text of the value to get.
When this method returns, contains the value associated with the specified text,
if the text is found; otherwise, the default value for the type of the value parameter.
This parameter is passed uninitialized.
true if the contains an element with the specified text; otherwise, false.
is null.
Gets the value associated with the specified text.
The text of the value to get.
When this method returns, contains the value associated with the specified text,
if the text is found; otherwise, the default value for the type of the value parameter.
This parameter is passed uninitialized.
true if the contains an element with the specified text; otherwise, false.
is null.
Gets or sets the value associated with the specified text.
Note: If ignoreCase is true for this dictionary, the text array will be directly modified.
The text of the value to get or set.
The position of the where the target text begins.
The total length of the .
is null.
or is less than zero.
and refer to a position outside of .
Gets or sets the value associated with the specified text.
Note: If ignoreCase is true for this dictionary, the text array will be directly modified.
The user should never modify this text array after calling this setter.
The text of the value to get or set.
is null.
Gets or sets the value associated with the specified text.
The text of the value to get or set.
is null.
-or-
The 's property returns false.
Gets or sets the value associated with the specified text.
The text of the value to get or set.
is null.
Gets or sets the value associated with the specified text.
The text of the value to get or set.
is null.
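To tie the indexer, TryGetValue, and char[]-slice lookups above together, here is a minimal usage sketch. The class name and constructor are assumptions: recent Lucene.NET releases expose this type as CharArrayDictionary<TValue> (older releases as CharArrayMap<TValue>), and the constructor arguments are assumed to mirror CharArraySet's (match version, initial capacity, ignoreCase).
using System;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Hypothetical: the class name and constructor depend on the Lucene.NET release in use.
var map = new CharArrayDictionary<int>(LuceneVersion.LUCENE_48, 16, true); // ignoreCase: true

// The indexer adds or updates a mapping.
map["lucene"] = 1;

// Lookups can work directly against a char[] slice without allocating a string.
char[] buffer = "searching lucene".ToCharArray();
if (map.TryGetValue(buffer, 10, 6, out int value)) // the slice "lucene"
{
    Console.WriteLine(value); // 1 -- "Lucene" or "LUCENE" would match too, because ignoreCase is true
}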
Gets a collection containing the keys in the .
Gets a collection containing the values in the .
This specialized collection can be enumerated in order to read its values and
overrides in order to display a string
representation of the values.
Class that represents the values in the .
Initializes a new instance of for the provided .
The dictionary to read the values from.
is null.
Gets the number of elements contained in the .
Retrieving the value of this property is an O(1) operation.
Determines whether the set contains a specific element.
The element to locate in the set.
true if the set contains item; otherwise, false.
Copies the elements to an existing one-dimensional
array, starting at the specified array index.
The one-dimensional array that is the destination of the elements copied from
the . The array must have zero-based indexing.
The zero-based index in at which copying begins.
is null.
is less than 0.
The number of elements in the source
is greater than the available space from to the end of the destination
.
The elements are copied to the array in the same order in which the enumerator iterates through the
.
This method is an O(n) operation, where n is .
Returns an enumerator that iterates through the .
An enumerator that iterates through the .
An enumerator remains valid as long as the collection remains unchanged. If changes are made to
the collection, such as adding, modifying, or deleting elements, the enumerator is irrecoverably
invalidated and the next call to or
throws an .
This method is an O(log n) operation.
Returns a string that represents the current collection.
The presentation has a specific format. It is enclosed by square
brackets ("[]"). Elements are separated by ', ' (comma and space).
null values are represented as the string "null".
A string that represents the current collection.
Enumerates the elements of a .
The foreach statement of the C# language (for each in C++, For Each in Visual Basic)
hides the complexity of enumerators. Therefore, using foreach is recommended instead of directly manipulating the enumerator.
Enumerators can be used to read the data in the collection, but they cannot be used to modify the underlying collection.
Initially, the enumerator is positioned before the first element in the collection. At this position, the
property is undefined. Therefore, you must call the
method to advance the enumerator to the first element
of the collection before reading the value of .
The property returns the same object until
is called.
sets to the next element.
If passes the end of the collection, the enumerator is
positioned after the last element in the collection and
returns false. When the enumerator is at this position, subsequent calls to
also return false. If the last call to returned false,
is undefined. You cannot set
to the first element of the collection again; you must create a new enumerator object instead.
An enumerator remains valid as long as the collection remains unchanged. If changes are made to the collection,
such as adding, modifying, or deleting elements, the enumerator is irrecoverably invalidated and the next call
to or throws an
.
The enumerator does not have exclusive access to the collection; therefore, enumerating through a collection is
intrinsically not a thread-safe procedure. To guarantee thread safety during enumeration, you can lock the
collection during the entire enumeration. To allow the collection to be accessed by multiple threads for
reading and writing, you must implement your own synchronization.
Gets the element at the current position of the enumerator.
is undefined under any of the following conditions:
-
The enumerator is positioned before the first element of the collection. That happens after an
enumerator is created or after the method is called. The
method must be called to advance the enumerator to the first element of the collection before reading the value of
the property.
-
The last call to returned false, which indicates the end of the collection and that the
enumerator is positioned after the last element of the collection.
-
The enumerator is invalidated due to changes made in the collection, such as adding, modifying, or deleting elements.
does not move the position of the enumerator, and consecutive calls to return
the same object until either or is called.
Releases all resources used by the .
Advances the enumerator to the next element of the .
true if the enumerator was successfully advanced to the next element;
false if the enumerator has passed the end of the collection.
The collection was modified after the enumerator was created.
After an enumerator is created, the enumerator is positioned before the first element in the collection,
and the first call to the method advances the enumerator to the first element
of the collection.
If MoveNext passes the end of the collection, the enumerator is positioned after the last element in the
collection and returns false. When the enumerator is at this position,
subsequent calls to also return false.
An enumerator remains valid as long as the collection remains unchanged. If changes are made to the
collection, such as adding, modifying, or deleting elements, the enumerator is irrecoverably invalidated
and the next call to or throws an
.
true if the is read-only; otherwise false.
Returns an enumerator that iterates through the .
A for the
.
For purposes of enumeration, each item is a structure
representing a value and its text. There are also properties allowing direct access
to the array of each element and quick conversions to or .
The foreach statement of the C# language (for each in C++, For Each in Visual Basic)
hides the complexity of enumerators. Therefore, using foreach is recommended instead of directly manipulating the enumerator.
This enumerator can be used to read the data in the collection, or modify the corresponding value at the current position.
Initially, the enumerator is positioned before the first element in the collection. At this position, the
property is undefined. Therefore, you must call the
method to advance the enumerator to the first element
of the collection before reading the value of .
The property returns the same object until
is called.
sets to the next element.
If passes the end of the collection, the enumerator is
positioned after the last element in the collection and
returns false. When the enumerator is at this position, subsequent calls to
also return false. If the last call to returned false,
is undefined. You cannot set
to the first element of the collection again; you must create a new enumerator object instead.
An enumerator remains valid as long as the collection remains unchanged. If changes are made to the collection,
such as adding, modifying, or deleting elements (other than through the method),
the enumerator is irrecoverably invalidated and the next call
to or throws an
.
The enumerator does not have exclusive access to the collection; therefore, enumerating through a collection is
intrinsically not a thread-safe procedure. To guarantee thread safety during enumeration, you can lock the
collection during the entire enumeration. To allow the collection to be accessed by multiple threads for
reading and writing, you must implement your own synchronization.
Default implementations of collections in the namespace are not synchronized.
This method is an O(1) operation.
Gets the number of text/value pairs contained in the .
Returns a string that represents the current object. (Inherited from .)
Returns an view on the dictionary's keys.
The set will use the same as this dictionary.
Enumerates the elements of a .
This enumerator exposes efficient access to the
underlying . It also has ,
, and properties for
convenience.
The foreach statement of the C# language (for each in C++, For Each in Visual Basic)
hides the complexity of enumerators. Therefore, using foreach is recommended instead of directly manipulating the enumerator.
This enumerator can be used to read the data in the collection, or modify the corresponding value at the current position.
Initially, the enumerator is positioned before the first element in the collection. At this position, the
property is undefined. Therefore, you must call the
method to advance the enumerator to the first element
of the collection before reading the value of .
The property returns the same object until
is called.
sets to the next element.
If passes the end of the collection, the enumerator is
positioned after the last element in the collection and
returns false. When the enumerator is at this position, subsequent calls to
also return false. If the last call to returned false,
is undefined. You cannot set
to the first element of the collection again; you must create a new enumerator object instead.
An enumerator remains valid as long as the collection remains unchanged. If changes are made to the collection,
such as adding, modifying, or deleting elements, the enumerator is irrecoverably invalidated and the next call
to or throws an
.
The enumerator does not have exclusive access to the collection; therefore, enumerating through a collection is
intrinsically not a thread-safe procedure. To guarantee thread safety during enumeration, you can lock the
collection during the entire enumeration. To allow the collection to be accessed by multiple threads for
reading and writing, you must implement your own synchronization.
Gets the current text as a .
Gets the current text... do not modify the returned char[].
Gets the current text as a newly created object.
Gets the value associated with the current text.
Sets the value associated with the current text.
Returns the value prior to the update.
Releases all resources used by the .
Advances the enumerator to the next element of the .
true if the enumerator was successfully advanced to the next element;
false if the enumerator has passed the end of the collection.
The collection was modified after the enumerator was created.
After an enumerator is created, the enumerator is positioned before the first element in the collection,
and the first call to the method advances the enumerator to the first element
of the collection.
If passes the end of the collection, the enumerator is positioned after the last element in the
collection and returns false. When the enumerator is at this position,
subsequent calls to also return false.
An enumerator remains valid as long as the collection remains unchanged. If changes are made to the
collection, such as adding, modifying, or deleting elements, the enumerator is irrecoverably invalidated
and the next call to or throws an
.
Gets the element at the current position of the enumerator.
is undefined under any of the following conditions:
-
The enumerator is positioned before the first element of the collection. That happens after an
enumerator is created or after the method is called. The
method must be called to advance the enumerator to the first element of the collection before reading the value of
the property.
-
The last call to returned false, which indicates the end of the collection and that the
enumerator is positioned after the last element of the collection.
-
The enumerator is invalidated due to changes made in the collection, such as adding, modifying, or deleting elements.
does not move the position of the enumerator, and consecutive calls to return
the same object until either or is called.
LUCENENET specific interface used so
can hold a reference to without
knowing its generic closing type for TValue.
LUCENENET specific interface used so can
iterate the keys of without
knowing its generic closing type for TValue.
Returns a copy of the given dictionary as a . If the given dictionary
is a the ignoreCase property will be preserved.
Note: If you intend to create a copy of another where
the of the source dictionary differs from its copy
should be used instead.
The will preserve the of the
source dictionary if it is an instance of .
compatibility match version see Version
note above for details. This argument will be ignored if the
given dictionary is a .
a dictionary to copy
a copy of the given dictionary as a . If the given dictionary
is a , the ignoreCase property as well as the
of the given dictionary will be preserved.
Used by to copy without knowing
its generic type.
Returns an unmodifiable . This allows to provide
unmodifiable views of internal dictionary for "read-only" use.
a dictionary for which the unmodifiable dictionary is returned.
a new unmodifiable .
if the given dictionary is null.
Used by to create an instance
without knowing the type of .
Empty optimized for speed.
Contains checks will always return false or throw
NPE if necessary.
Extensions to for .
Returns a copy of the current as a
using the specified value.
The type of dictionary value.
A to copy.
compatibility match version see Version
note above for details.
A copy of the current dictionary as a .
is null.
Returns a copy of the current as a
using the specified and values.
The type of dictionary value.
A to copy.
compatibility match version see Version
note above for details.
false if and only if the set should be case sensitive otherwise true.
A copy of the current dictionary as a .
is null.
LUCENENET specific. Just a class to make error messages easier to manage in one place.
Ideally, these would be in resources so they can be localized (eventually), but at least
this half-measure will make that somewhat easier to do and is guaranteed not to cause
performance issues.
A simple class that stores s as 's in a
hash table. Note that this is not a general purpose
class. For example, it cannot remove items from the
set, nor does it resize its hash table to be smaller,
etc. It is designed to be quick to test if a
is in the set without the necessity of converting it
to a first.
You must specify the required
compatibility when creating :
- As of 3.1, supplementary characters are
properly lowercased.
Before 3.1 supplementary characters could not be
lowercased correctly due to the lack of Unicode 4
support in JDK 1.4. To use instances of
with the behavior before Lucene
3.1 pass a to the constructors.
Please note: This class implements but
does not behave like it should in all cases. The generic type is
, because you can add any object that
has a string representation (which is converted to a string). The add methods will use
and store the result using a
buffer. The methods have the same behavior.
The returns an
Create set with enough capacity to hold terms
compatibility match version see for details.
the initial capacity
false if and only if the set should be case sensitive
otherwise true.
is less than zero.
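For orientation, a minimal sketch of creating and querying a case-insensitive set with this constructor (match version, initial capacity, ignoreCase):
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

var stopSet = new CharArraySet(LuceneVersion.LUCENE_48, 8, true); // ignoreCase: true
stopSet.Add("the");
stopSet.Add("and");

// Contains can test a char[] slice without building a string first.
char[] text = "The quick fox".ToCharArray();
bool isStop = stopSet.Contains(text, 0, 3); // true: "The" matches "the" case-insensitively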
Creates a set from a collection of s.
Compatibility match version see for details.
A collection whose elements are to be placed into the set.
false if and only if the set should be case sensitive
otherwise true.
is null.
-or-
A given element within the is null.
Creates a set from a collection of s.
NOTE: If is true, the text arrays will be directly modified.
The user should never modify these text arrays after calling this method.
Compatibility match version see for details.
A collection whose elements are to be placed into the set.
false if and only if the set should be case sensitive
otherwise true.
is null.
-or-
A given element within the is null.
Creates a set from a collection of s.
Compatibility match version see for details.
A collection whose elements are to be placed into the set.
false if and only if the set should be case sensitive
otherwise true.
is null.
-or-
A given element within the is null.
-or-
The property for a given element in the returns false.
Create set from the specified map (internal only), used also by
Clears all entries in this set. This method is supported for reusing, but not .
true if the chars of starting at
are in the set.
is null.
or is less than zero.
and refer to a position outside of .
true if the s
are in the set
is null.
true if the is in the set.
is null.
-or-
The 's property returns false.
true if the is in the set.
is null.
true if the representation of is in the set.
is null.
Adds the representation of into the set.
The method is called after setting the thread to .
If the type of is a value type, it will be converted using the
.
A string-able object.
true if was added to the set; false if it already existed prior to this call.
is null.
Adds a into the set
The text to be added to the set.
true if was added to the set; false if it already existed prior to this call.
is null.
Adds a into the set
The text to be added to the set.
true if was added to the set; false if it already existed prior to this call.
is null.
Adds a directly to the set.
NOTE: If ignoreCase is true for this , the text array will be directly modified.
The user should never modify this text array after calling this method.
The text to be added to the set.
true if was added to the set; false if it already existed prior to this call.
is null.
Adds a to the set using the specified and .
NOTE: If ignoreCase is true for this , the text array will be directly modified.
The text to be added to the set.
The position of the where the target text begins.
The total length of the .
true if was added to the set; false if it already existed prior to this call.
is null.
or is less than zero.
and refer to a position outside of .
LUCENENET specific for supporting .
Gets the number of elements contained in the .
true if the is read-only; otherwise false.
Returns an unmodifiable . This allows to provide
unmodifiable views of internal sets for "read-only" use.
a set for which the unmodifiable set is returned.
a new unmodifiable .
if the given set is null.
Returns an unmodifiable . This allows to provide
unmodifiable views of internal sets for "read-only" use.
A new unmodifiable .
Returns a copy of this set as a new instance .
The and ignoreCase property will be preserved.
A copy of this set as a new instance of .
The field as well as the
will be preserved.
Returns a copy of this set as a new instance
with the provided .
The ignoreCase property will be preserved from this .
A copy of this set as a new instance of .
The field will be preserved.
Returns a copy of this set as a new instance
with the provided and values.
A copy of this set as a new instance of .
Returns a copy of the given set as a . If the given set
is a the ignoreCase property will be preserved.
Note: If you intend to create a copy of another where
the of the source set differs from its copy
should be used instead.
The method will preserve the of the
source set if it is an instance of .
compatibility match version. This argument will be ignored if the
given set is a .
a set to copy
A copy of the given set as a . If the given set
is a the field as well as the
will be preserved.
is null.
-or-
A given element within the is null.
-or-
The property for a given element in the returns false.
Returns an enumerator that iterates through the .
An enumerator that iterates through the .
An enumerator remains valid as long as the collection remains unchanged. If changes are made to
the collection, such as adding, modifying, or deleting elements, the enumerator is irrecoverably
invalidated and the next call to or
throws an .
This method is an O(log n) operation.
Enumerates the elements of a object.
This implementation provides direct access to the array of the underlying collection
as well as convenience properties for converting to and .
The foreach statement of the C# language (for each in C++, For Each in Visual Basic)
hides the complexity of enumerators. Therefore, using foreach is recommended instead of directly manipulating the enumerator.
Enumerators can be used to read the data in the collection, but they cannot be used to modify the underlying collection.
Initially, the enumerator is positioned before the first element in the collection. At this position, the
property is undefined. Therefore, you must call the
method to advance the enumerator to the first element
of the collection before reading the value of .
The property returns the same object until
is called.
sets to the next element.
If passes the end of the collection, the enumerator is
positioned after the last element in the collection and
returns false. When the enumerator is at this position, subsequent calls to
also return false. If the last call to returned false,
is undefined. You cannot set
to the first element of the collection again; you must create a new enumerator object instead.
An enumerator remains valid as long as the collection remains unchanged. If changes are made to the collection,
such as adding, modifying, or deleting elements, the enumerator is irrecoverably invalidated and the next call
to or throws an
.
The enumerator does not have exclusive access to the collection; therefore, enumerating through a collection is
intrinsically not a thread-safe procedure. To guarantee thread safety during enumeration, you can lock the
collection during the entire enumeration. To allow the collection to be accessed by multiple threads for
reading and writing, you must implement your own synchronization.
This method is an O(1) operation.
Gets the current value as a .
Gets the current value... do not modify the returned char[].
Gets the current value as a newly created object.
Releases all resources used by the .
Advances the enumerator to the next element of the .
true if the enumerator was successfully advanced to the next element;
false if the enumerator has passed the end of the collection.
The collection was modified after the enumerator was created.
After an enumerator is created, the enumerator is positioned before the first element in the collection,
and the first call to the method advances the enumerator to the first element
of the collection.
If passes the end of the collection, the enumerator is positioned after the last element in the
collection and returns false. When the enumerator is at this position,
subsequent calls to also return false.
An enumerator remains valid as long as the collection remains unchanged. If changes are made to the
collection, such as adding, modifying, or deleting elements, the enumerator is irrecoverably invalidated
and the next call to or throws an
.
Returns a string that represents the current collection.
The presentation has a specific format. It is enclosed by curly
brackets ("{}"). Keys and values are separated by '=',
KeyValuePairs are separated by ', ' (comma and space).
null values are represented as the string "null".
A string that represents the current collection.
Compares the specified object with this set for equality. Returns true if the
given object is also a set, the two sets have the same size, and every member of the
given set is contained in this set. This ensures that the equals method works properly
across different implementations of the interface.
This implementation first checks if the specified object is this set; if so it
returns true. Then, it checks if the specified object is a set whose
size is identical to the size of this set; if not, it returns false. If so,
it uses the enumerator of this set and the specified object to determine if all of the
contained values are present (using ).
object to be compared for equality with this set
true if the specified object is equal to this set
Returns the hash code value for this set. The hash code of a set
is defined to be the sum of the hash codes of the elements in the
set, where the hash code of a null element is defined to be zero.
This ensures that s1.Equals(s2) implies that
s1.GetHashCode()==s2.GetHashCode() for any two sets s1 and s2.
This implementation iterates over the set, calling the GetHashCode()
method on each element in the set, and adding up the results.
the hash code value for this set
Copies the entire to a one-dimensional array,
starting at the specified index of the target array.
The one-dimensional Array that is the destination of the
elements copied from . The Array must have zero-based indexing.
is null.
The number of elements in the source is greater
than the available space in the destination array.
Copies the entire to a one-dimensional array,
starting at the specified index of the target array.
The one-dimensional Array that is the destination of the
elements copied from . The Array must have zero-based indexing.
The zero-based index in array at which copying begins.
is null.
is less than zero.
The number of elements in the source is greater
than the available space from to the end of the destination array.
Copies the entire to a one-dimensional array,
starting at the specified index of the target array.
The one-dimensional Array that is the destination of the
elements copied from . The Array must have zero-based indexing.
The zero-based index in array at which copying begins.
is null.
or is less than zero.
is greater than the length of the destination .
-or-
is greater than the available space from the
to the end of the destination .
Copies the entire to a jagged array or of type char[],
starting at the specified index of the target array.
The jagged array or of type char[] that is the destination of the
elements copied from . The Array must have zero-based indexing.
is null.
The number of elements in the source is greater
than the available space in the destination array.
Copies the entire to a jagged array or of type char[]
starting at the specified index of the target array.
The jagged array or of type char[] that is the destination of the
elements copied from . The Array must have zero-based indexing.
The zero-based index in array at which copying begins.
is null.
is less than zero.
The number of elements in the source is greater
than the available space from to the end of the destination array.
Copies the entire to a jagged array or of type char[]
starting at the specified index of the target array.
The jagged array or of type char[] that is the destination of the
elements copied from . The Array must have zero-based indexing.
The zero-based index in array at which copying begins.
is null.
or is less than zero.
is greater than the length of the destination .
-or-
is greater than the available space from the
to the end of the destination .
Copies the entire to a one-dimensional array,
starting at the specified index of the target array.
The one-dimensional Array that is the destination of the
elements copied from . The Array must have zero-based indexing.
is null.
The number of elements in the source is greater
than the available space in the destination array.
Copies the entire to a one-dimensional array,
starting at the specified index of the target array.
The one-dimensional Array that is the destination of the
elements copied from . The Array must have zero-based indexing.
The zero-based index in array at which copying begins.
is null.
is less than zero.
The number of elements in the source is greater
than the available space from to the end of the destination array.
Copies the entire to a one-dimensional array,
starting at the specified index of the target array.
The one-dimensional Array that is the destination of the
elements copied from . The Array must have zero-based indexing.
The zero-based index in array at which copying begins.
is null.
or is less than zero.
is greater than the length of the destination .
-or-
is greater than the available space from the
to the end of the destination .
Determines whether the current set and the specified collection contain the same elements.
The collection to compare to the current set.
true if the current set is equal to other; otherwise, false.
is null.
Determines whether the current set and the specified collection contain the same elements.
The collection to compare to the current set.
true if the current set is equal to other; otherwise, false.
is null.
Determines whether the current set and the specified collection contain the same elements.
The collection to compare to the current set.
true if the current set is equal to other; otherwise, false.
is null.
Determines whether the current set and the specified collection contain the same elements.
The collection to compare to the current set.
true if the current set is equal to other; otherwise, false.
is null.
Modifies the current to contain all elements that are present
in itself, the specified collection, or both.
NOTE: If ignoreCase is true for this , the text arrays will be directly modified.
The user should never modify these text arrays after calling this method.
The collection whose elements should be merged into the .
true if this changed as a result of the call.
is null.
This set instance is read-only.
Modifies the current to contain all elements that are present
in itself, the specified collection, or both.
The collection whose elements should be merged into the .
true if this changed as a result of the call.
is null.
-or-
A given element within the collection is null.
-or-
The property for a given element in the collection returns false.
This set instance is read-only.
Modifies the current to contain all elements that are present
in itself, the specified collection, or both.
The collection whose elements should be merged into the .
true if this changed as a result of the call.
is null.
This set instance is read-only.
Modifies the current to contain all elements that are present
in itself, the specified collection, or both.
The collection whose elements should be merged into the .
true if this changed as a result of the call.
is null.
This set instance is read-only.
Determines whether a object is a subset of the specified collection.
The collection to compare to the current object.
true if this object is a subset of ; otherwise, false.
is null.
Determines whether a object is a subset of the specified collection.
The collection to compare to the current object.
true if this object is a subset of ; otherwise, false.
is null.
Determines whether a object is a subset of the specified collection.
The collection to compare to the current object.
true if this object is a subset of ; otherwise, false.
is null.
Determines whether a object is a subset of the specified collection.
The collection to compare to the current object.
true if this object is a subset of ; otherwise, false.
is null.
Determines whether a object is a superset of the specified collection.
The collection to compare to the current object.
true if this object is a superset of ; otherwise, false.
is null.
Determines whether a object is a superset of the specified collection.
The collection to compare to the current object.
true if this object is a superset of ; otherwise, false.
is null.
Determines whether a object is a superset of the specified collection.
The collection to compare to the current object.
true if this object is a superset of ; otherwise, false.
is null.
Determines whether a object is a superset of the specified collection.
The collection to compare to the current object.
true if this object is a superset of ; otherwise, false.
is null.
Determines whether a object is a proper subset of the specified collection.
The collection to compare to the current object.
true if this object is a proper subset of ; otherwise, false.
is null.
Determines whether a object is a proper subset of the specified collection.
The collection to compare to the current object.
true if this object is a proper subset of ; otherwise, false.
is null.
Determines whether a object is a proper subset of the specified collection.
The collection to compare to the current object.
true if this object is a proper subset of ; otherwise, false.
is null.
Determines whether a object is a proper subset of the specified collection.
The collection to compare to the current object.
true if this object is a proper subset of ; otherwise, false.
is null.
Determines whether a object is a proper superset of the specified collection.
The collection to compare to the current object.
true if this object is a proper superset of ; otherwise, false.
is null.
Determines whether a object is a proper superset of the specified collection.
The collection to compare to the current object.
true if this object is a proper superset of ; otherwise, false.
is null.
Determines whether a object is a proper superset of the specified collection.
The collection to compare to the current object.
true if this object is a proper superset of ; otherwise, false.
is null.
Determines whether a object is a proper superset of the specified collection.
The collection to compare to the current object.
true if this object is a proper superset of ; otherwise, false.
is null.
Determines whether the current object and a specified collection share common elements.
The collection to compare to the current object.
true if the object and share at least one common element; otherwise, false.
is null.
Determines whether the current object and a specified collection share common elements.
The collection to compare to the current object.
true if the object and share at least one common element; otherwise, false.
is null.
Determines whether the current object and a specified collection share common elements.
The collection to compare to the current object.
true if the object and share at least one common element; otherwise, false.
is null.
Determines whether the current object and a specified collection share common elements.
The collection to compare to the current object.
true if this object and share at least one common element; otherwise, false.
is null.
Returns true if this collection contains all of the elements
in the specified collection.
collection to be checked for containment in this collection
true if this contains all of the elements in the specified collection; otherwise, false.
Returns true if this collection contains all of the elements
in the specified collection.
collection to be checked for containment in this collection
true if this contains all of the elements in the specified collection; otherwise, false.
Returns true if this collection contains all of the elements
in the specified collection.
collection to be checked for containment in this collection
true if this contains all of the elements in the specified collection; otherwise, false.
Returns true if this collection contains all of the elements
in the specified collection.
collection to be checked for containment in this collection
true if this contains all of the elements in the specified collection; otherwise, false.
Returns true if this collection contains all of the elements
in the specified collection.
collection to be checked for containment in this collection
true if this contains all of the elements in the specified collection; otherwise, false.
Returns true if this collection contains all of the elements
in the specified collection.
collection to be checked for containment in this collection
true if this contains all of the elements in the specified collection; otherwise, false.
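The set-comparison and union members above mirror the ISet<string> operations; a short sketch, assuming the overloads that accept a plain string collection:
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

var set = new CharArraySet(LuceneVersion.LUCENE_48, 8, true); // ignoreCase: true
set.UnionWith(new[] { "foo", "bar" });

bool subset = set.IsSubsetOf(new[] { "foo", "bar", "baz" }); // true
bool overlap = set.Overlaps(new[] { "FOO" });                // true, the set ignores case
bool equal = set.SetEquals(new[] { "foo", "bar" });          // true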
Extensions to for .
Returns a copy of this as a new instance of with the
specified and ignoreCase set to false.
The type of collection. Typically a or .
This collection.
Compatibility match version.
A copy of this as a .
is null.
Returns a copy of this as a new instance of with the
specified and .
The type of collection. Typically a or .
This collection.
Compatibility match version.
false if and only if the set should be case sensitive otherwise true.
A copy of this as a .
is null.
Abstract parent class for analysis factories that create
instances.
looks up a charfilter by name from the host project's dependent assemblies
looks up a charfilter class by name from the host project's dependent assemblies
returns a list of all available charfilter names
Reloads the factory list.
Changes to the factories are visible after the method ends; all
iterators (,...) stay consistent.
NOTE: Only new factories are added, existing ones are
never removed or replaced.
This method is expensive and should only be called for discovery
of new factories on the given classpath/classloader!
Initialize this factory via a set of key-value pairs.
Wraps the given with a .
An abstract base class for simple, character-oriented tokenizers.
You must specify the required compatibility
when creating :
- As of 3.1, uses an int based API to normalize and
detect token codepoints. See and
for details.
A new API has been introduced with Lucene 3.1. This API
moved from UTF-16 code units to UTF-32 codepoints to eventually add support
for supplementary characters. The old char based API has been
deprecated and should be replaced with the int based methods
and .
As of Lucene 3.1 each - constructor expects a
argument. Based on the given either the new
API or a backwards compatibility layer is used at runtime. For
< 3.1 the backwards compatibility layer ensures correct
behavior even for indexes built with previous versions of Lucene. If a
>= 3.1 is used requires the new API to
be implemented by the instantiated class. Yet, the old char based API
is not required anymore even if backwards compatibility must be preserved.
subclasses implementing the new API are fully backwards
compatible if instantiated with < 3.1.
Note: If you use a subclass of with >=
3.1 on an index built with a version < 3.1, created tokens might not be
compatible with the terms in your index.
Creates a new instance
Lucene version to match
the input to split up into tokens
Creates a new instance
Lucene version to match
the attribute factory to use for this
the input to split up into tokens
LUCENENET specific - Added in the .NET version to assist with setting the attributes
from multiple constructors.
Returns true iff a codepoint should be included in a token. This tokenizer
generates as tokens adjacent sequences of codepoints which satisfy this
predicate. Codepoints for which this is false are used to define token
boundaries and are not included in tokens.
Called on each token character to normalize it before it is added to the
token. The default implementation does nothing. Subclasses may use this to,
e.g., lowercase tokens.
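As a concrete illustration, a whitespace-splitting subclass only needs the codepoint predicate. This is a sketch against the assumed 4.8 signatures (a constructor taking the match version and a TextReader, and the int-based IsTokenChar); casting the codepoint to char is a simplification that ignores supplementary characters.
using System.IO;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

public sealed class WhitespaceOnlyTokenizer : CharTokenizer
{
    public WhitespaceOnlyTokenizer(LuceneVersion matchVersion, TextReader input)
        : base(matchVersion, input)
    {
    }

    // Keep every codepoint that is not whitespace; whitespace delimits tokens.
    // NOTE: the cast only covers BMP codepoints -- a simplification for this sketch.
    protected override bool IsTokenChar(int c)
    {
        return !char.IsWhiteSpace((char)c);
    }
}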
Simple that uses
and to open resources and
s, respectively.
Creates an instance using the System.Assembly of the given class to load Resources and classes
Resource paths must be absolute.
Removes elisions from a . For example, "l'avion" (the plane) will be
tokenized as "avion" (plane).
Elision in Wikipedia
Constructs an elision filter with a of stop words
the source
a set of stopword articles
Increments the with a without elisioned start
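A hedged usage sketch, assuming the 4.8-style constructor that takes the input stream and a CharArraySet of elision articles:
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

var articles = new CharArraySet(LuceneVersion.LUCENE_48, 2, true); // ignoreCase: true
articles.Add("l");
articles.Add("d");

Tokenizer source = new WhitespaceTokenizer(LuceneVersion.LUCENE_48,
    new StringReader("l'avion d'argent"));
TokenStream stream = new ElisionFilter(source, articles);
// The stream now emits "avion" and "argent" instead of "l'avion" and "d'argent".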
Factory for .
<fieldType name="text_elsn" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ElisionFilterFactory"
articles="stopwordarticles.txt" ignoreCase="true"/>
</analyzer>
</fieldType>
Creates a new
Simple that opens resource files
from the local file system, optionally resolving against
a base directory.
This loader wraps a delegate
that is used to resolve all files that the current base directory
does not contain. is always resolved
against the delegate, as an is needed.
You can chain several s
to allow lookup of files in more than one base directory.
Creates a resource loader that requires filenames that are absolute or relative to the CWD
to resolve resources. Files not found in file system and class lookups
are delegated to context classloader.
Creates a resource loader that resolves resources against the given
base directory (may be null to refer to CWD).
Files not found in file system and class lookups are delegated to context
classloader.
Creates a resource loader that resolves resources against the given
base directory (may be null to refer to CWD).
Files not found in file system and class lookups are delegated
to the given delegate .
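A minimal sketch, assuming the .NET port takes a DirectoryInfo base directory and that OpenResource returns a readable Stream (the directory path below is purely illustrative):
using System.IO;
using Lucene.Net.Analysis.Util;

// Hypothetical base directory; resources not found under it fall back to the delegate loader.
var loader = new FilesystemResourceLoader(new DirectoryInfo("/etc/myapp/analysis"));

using (Stream stream = loader.OpenResource("stopwords.txt"))
using (var reader = new StreamReader(stream))
{
    string firstLine = reader.ReadLine();
}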
Abstract base class for TokenFilters that may remove tokens.
You have to implement and return a boolean indicating whether the current
token should be preserved. uses this method
to decide if a token should be passed to the caller.
As of Lucene 4.4, an
is thrown when trying to disable position
increments when filtering terms.
Create a new .
the Lucene match version
whether to increment position increments when filtering out terms
the input to consume
@deprecated enablePositionIncrements=false is not supported anymore as of Lucene 4.4
Create a new .
the Lucene match version
the to consume
Override this method and return if the current input token should be returned by .
If true, this will preserve
positions of the incoming tokens (ie, accumulate and
set position increments of the removed tokens).
Generally, true is best as it does not
lose information (positions of the original tokens)
during indexing.
When set, when a token is stopped
(omitted), the position increment of the following
token is incremented.
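For example, a filter that drops very short tokens only needs to implement the predicate. This sketch assumes the post-4.4 constructor (match version plus input) and a parameterless Accept override:
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

public sealed class MinLengthFilter : FilteringTokenFilter
{
    private readonly ICharTermAttribute termAtt;
    private readonly int minLength;

    public MinLengthFilter(LuceneVersion version, TokenStream input, int minLength)
        : base(version, input)
    {
        this.minLength = minLength;
        this.termAtt = AddAttribute<ICharTermAttribute>();
    }

    // Return true to keep the current token, false to drop it.
    protected override bool Accept()
    {
        return termAtt.Length >= minLength;
    }
}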
Add to any analysis factory component to allow returning an
analysis component factory for use with partial terms in prefix queries,
wildcard queries, range query endpoints, regex queries, etc.
@lucene.experimental
Returns an analysis component to handle analysis of multi-term queries.
The returned component must be a , or .
A StringBuilder that allows one to access the array.
Abstraction for loading resources (streams, files, and classes).
Opens a named resource
Finds class of the name
NOTE: This was findClass() in Lucene
Creates an instance of the name and expected type
Interface for a component that needs to be initialized by
an implementation of .
Initializes this component with the provided
(used for loading types, embedded resources, files, etc).
Acts like a forever growing as you read
characters into it from the provided reader, but
internally it uses a circular buffer to only hold the
characters that haven't been freed yet. This is like a
PushbackReader, except you don't have to specify
up-front the max size of the buffer, but you do have to
periodically call .
Clear array and switch to new reader.
Absolute position read. NOTE: pos must not jump
ahead by more than 1! I.e., it's OK to read arbitrarily
far back (just not prior to the last ),
but NOT ok to read arbitrarily far
ahead. Returns -1 if you hit EOF.
Call this to notify us that no chars before this
absolute position are needed anymore.
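A sketch of the expected call pattern, with the member names (Reset, Get, FreeBefore) assumed from the descriptions above:
using System.IO;
using Lucene.Net.Analysis.Util;

var buffer = new RollingCharBuffer();
buffer.Reset(new StringReader("some long input"));

int pos = 0;
int ch;
while ((ch = buffer.Get(pos)) != -1) // -1 signals EOF
{
    pos++;
    if (pos % 64 == 0)
    {
        buffer.FreeBefore(pos); // release characters we will never look back at
    }
}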
Some commonly-used stemming functions
@lucene.internal
Returns true if the character array starts with the prefix.
Input Buffer
length of input buffer
Prefix string to test
true if starts with
Returns true if the character array ends with the suffix.
Input Buffer
length of input buffer
Suffix string to test
true if ends with
Returns true if the character array ends with the suffix.
Input Buffer
length of input buffer
Suffix string to test
true if ends with
Delete a character in-place
Input Buffer
Position of character to delete
length of input buffer
length of input buffer after deletion
Delete n characters in-place
Input Buffer
Position of character to delete
Length of input buffer
number of characters to delete
length of input buffer after deletion
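A sketch of how a stemmer typically combines these helpers; it assumes the static EndsWith and DeleteN members described above, and the suffix rule itself is purely illustrative:
using Lucene.Net.Analysis.Util;

// Strip a trailing "es" in place and return the new logical length of the buffer.
static int StripEs(char[] buffer, int len)
{
    if (len > 4 && StemmerUtil.EndsWith(buffer, len, "es"))
    {
        // Delete the last two characters; only the first len - 2 chars remain valid.
        return StemmerUtil.DeleteN(buffer, len - 2, len, 2);
    }
    return len;
}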
Base class for s that need to make use of stopword sets.
An immutable stopword set
Returns the analyzer's stopword set or an empty set if the analyzer has no
stopwords
the analyzer's stopword set or an empty set if the analyzer has no
stopwords
Creates a new instance initialized with the given stopword set
the Lucene version for cross version compatibility
the analyzer's stopword set
Creates a new with an empty stopword set
the Lucene version for cross version compatibility
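To make the pattern concrete, here is a minimal subclass sketch. It assumes the 4.8 component classes (StandardTokenizer, LowerCaseFilter, StopFilter), a TokenStreamComponents type reachable from Lucene.Net.Analysis, and a protected CreateComponents(fieldName, reader) override; the base class also exposes the stopword set through a protected field, but this sketch keeps its own copy to avoid depending on that field's exact name.
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

public sealed class SimpleStopAnalyzer : StopwordAnalyzerBase
{
    private readonly LuceneVersion version;
    private readonly CharArraySet stopWords;

    public SimpleStopAnalyzer(LuceneVersion version, CharArraySet stopWords)
        : base(version, stopWords)
    {
        this.version = version;
        this.stopWords = stopWords;
    }

    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        Tokenizer source = new StandardTokenizer(version, reader);
        TokenStream result = new LowerCaseFilter(version, source);
        result = new StopFilter(version, result, stopWords);
        return new TokenStreamComponents(source, result);
    }
}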
Creates a from an embedded resource associated with a class. (See
).
true if the set should ignore the case of the
stopwords, otherwise false
a class that is associated with the given stopwordResource
name of the resource file associated with the given class
comment string to ignore in the stopword file
a containing the distinct stopwords from the given
file
if loading the stopwords throws an
Creates a from a file.
the stopwords file to load
the Lucene version for cross version compatibility
a containing the distinct stopwords from the given
file
if loading the stopwords throws an
Creates a from a file.
the stopwords reader to load
the Lucene version for cross version compatibility
a containing the distinct stopwords from the given
reader
if loading the stopwords throws an
Abstract parent class for analysis factories that create
instances.
looks up a tokenfilter by name from the host project's referenced assemblies
looks up a tokenfilter class by name from the host project's referenced assemblies
returns a list of all available tokenfilter names from the host project's referenced assemblies
Reloads the factory list.
Changes to the factories are visible after the method ends; all
iterators (,...) stay consistent.
NOTE: Only new factories are added, existing ones are
never removed or replaced.
This method is expensive and should only be called for discovery
of new factories on the given classpath/classloader!
Initialize this factory via a set of key-value pairs.
Transform the specified input
Abstract parent class for analysis factories that create
instances.
looks up a tokenizer by name from the host project's referenced assemblies
looks up a tokenizer class by name from the host project's referenced assemblies
returns a list of all available tokenizer names from the host project's referenced assemblies
Reloads the factory list.
Changes to the factories are visible after the method ends; all
iterators (,...) stay consistent.
NOTE: Only new factories are added, existing ones are
never removed or replaced.
This method is expensive and should only be called for discovery
of new factories on the given classpath/classloader!
Initialize this factory via a set of key-value pairs.
Creates a of the specified input using the default attribute factory.
Creates a of the specified input using the given
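The lookup-by-name members above are usually driven by configuration. A hedged sketch, assuming a static ForName(name, args) lookup and the Create(TextReader) overload documented above; the "luceneMatchVersion" argument name is the one most factories require:
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Util;

var args = new Dictionary<string, string>
{
    ["luceneMatchVersion"] = "LUCENE_48" // most factories require a match version argument
};

TokenizerFactory factory = TokenizerFactory.ForName("standard", args);
Tokenizer tokenizer = factory.Create(new StringReader("hello world"));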
Loader for text files that represent a list of stopwords.
to obtain instances.
@lucene.internal
Reads lines from a and adds every line as an entry to a (omitting
leading and trailing whitespace). Every line of the should contain only
one word. The words need to be in lowercase if you make use of an
which uses (like ).
containing the wordlist
the to fill with the reader's words
the given with the reader's words
Reads lines from a and adds every line as an entry to a (omitting
leading and trailing whitespace). Every line of the should contain only
one word. The words need to be in lowercase if you make use of an
which uses (like ).
containing the wordlist
the
A with the reader's words
Reads lines from a and adds every non-comment line as an entry to a (omitting
leading and trailing whitespace). Every line of the should contain only
one word. The words need to be in lowercase if you make use of an
which uses (like ).
containing the wordlist
The string representing a comment.
the
A CharArraySet with the reader's words
Reads lines from a and adds every non-comment line as an entry to a (omitting
leading and trailing whitespace). Every line of the should contain only
one word. The words need to be in lowercase if you make use of an
which uses (like ).
containing the wordlist
The string representing a comment.
the to fill with the reader's words
the given with the reader's words
Reads stopwords from a stopword list in Snowball format.
The snowball format is the following:
- Lines may contain multiple words separated by whitespace.
- The comment character is the vertical line (|).
- Lines may contain trailing comments.
containing a Snowball stopword list
the to fill with the reader's words
the given with the reader's words
Reads stopwords from a stopword list in Snowball format.
The snowball format is the following:
- Lines may contain multiple words separated by whitespace.
- The comment character is the vertical line (|).
- Lines may contain trailing comments.
containing a Snowball stopword list
the Lucene
A with the reader's words
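A short sketch of reading a plain word list and a Snowball-format list, assuming the GetWordSet/GetSnowballWordSet overloads that take a TextReader and a match version:
using System.IO;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Plain format: one word per line.
CharArraySet words = WordlistLoader.GetWordSet(
    new StringReader("the\nand\nof"), LuceneVersion.LUCENE_48);

// Snowball format: several words per line, '|' starts a comment.
CharArraySet snowball = WordlistLoader.GetSnowballWordSet(
    new StringReader("the and of | common English stopwords"), LuceneVersion.LUCENE_48);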
Reads a stem dictionary. Each line contains:
word\tstem
(i.e. two tab separated words)
stem dictionary that overrules the stemming algorithm
If there is a low-level I/O error.
Accesses a resource by name and returns the (non-comment) lines containing
data using the given character encoding.
A comment line is any line that starts with the character "#"
a list of non-blank non-comment lines with whitespace trimmed
If there is a low-level I/O error.
Extension of that is aware of Wikipedia syntax. It is based on the
Wikipedia tutorial available at http://en.wikipedia.org/wiki/Wikipedia:Tutorial, but it may not be complete.
@lucene.experimental
String token types that correspond to token type int constants
Only output tokens
Only output untokenized tokens, which are tokens that would normally be split into several tokens
Output both the untokenized token and the splits
This flag is used to indicate that the produced "Token" would, if was used, produce multiple tokens.
A private instance of the JFlex-constructed scanner
Creates a new instance of the . Attaches the
to a newly created JFlex scanner.
The Input
Creates a new instance of the . Attaches the
to the newly created JFlex scanner.
The input
One of , ,
Untokenized types
Creates a new instance of the . Attaches the
to the newly created JFlex scanner. Uses the given .
The
The input
One of , ,
Untokenized types
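A hedged consumption sketch, assuming the single-TextReader constructor and the usual attribute/IncrementToken pattern; the type attribute surfaces the wiki-specific token types listed above:
using System;
using System.IO;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Wikipedia;

using (var tokenizer = new WikipediaTokenizer(
    new StringReader("[[Category:Search]] Some '''bold''' text.")))
{
    var termAtt = tokenizer.AddAttribute<ICharTermAttribute>();
    var typeAtt = tokenizer.AddAttribute<ITypeAttribute>();

    tokenizer.Reset();
    while (tokenizer.IncrementToken())
    {
        Console.WriteLine($"{termAtt} [{typeAtt.Type}]");
    }
    tokenizer.End();
}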
Factory for .
<fieldType name="text_wiki" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WikipediaTokenizerFactory"/>
</analyzer>
</fieldType>
Creates a new
JFlex-generated tokenizer that is aware of Wikipedia syntax.
This character denotes the end of file
initial size of the lookahead buffer
lexical states
ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l
ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l
at the beginning of a line
l is of the form l = 2*k, k a non-negative integer
Translates characters to character classes
Translates characters to character classes
Translates DFA states to action switch labels.
Translates a state to a row index in the transition table
The transition table of the DFA
ZZ_ATTRIBUTE[aState] contains the attributes of state aState
the input device
the current state of the DFA
the current lexical state
this buffer contains the current text to be matched and is
the source of the YyText string
the text position at the last accepting state
the current text position in the buffer
startRead marks the beginning of the YyText string in the buffer
endRead marks the last character in the buffer that has been read
from input
the number of characters up to the start of the matched text
zzAtEOF == true <=> the scanner is at the EOF
Returns the number of tokens seen inside a category or link, etc.
the number of tokens seen inside the context of wiki syntax.
Fills Lucene token with the current token text.
Creates a new scanner
the TextReader to read input from.
Unpacks the compressed character translation table.
the packed character translation table
the unpacked character translation table
Refills the input buffer.
false, iff there was new input.
if any I/O-Error occurs
Disposes the input stream.
Resets the scanner to read from a new input stream.
Does not close the old reader.
All internal variables are reset, the old input stream
cannot be reused (internal buffer is discarded and lost).
Lexical state is set to .
Internal scan buffer is resized down to its initial length, if it has grown.
the new input stream
Returns the current lexical state.
Enters a new lexical state
the new lexical state
Returns the text matched by the current regular expression.
Returns the character at position from the
matched text.
It is equivalent to YyText[pos], but faster
the position of the character to fetch.
A value from 0 to YyLength-1.
the character at position pos
Returns the length of the matched text region.
Reports an error that occurred while scanning.
In a well-formed scanner (no or only correct usage of
YyPushBack(int) and a match-all fallback rule) this method
will only be called with things that "Can't Possibly Happen".
If this method is called, something is seriously wrong
(e.g. a JFlex bug producing a faulty scanner etc.).
Usual syntax/scanner level error handling should be done
in error fallback rules.
the code of the error message to display
Pushes the specified amount of characters back into the input stream.
They will be read again by the next call of the scanning method
the number of characters to be read again.
This number must not be greater than YyLength!
Resumes scanning until the next regular expression is matched,
the end of input is encountered or an I/O-Error occurs.
the next token
if any I/O-Error occurs
Snowball's among construction.
Search string.
Index to longest matching substring.
Result of the lookup.
Action to be invoked.
Initializes a new instance of the class.
The search string.
The index to the longest matching substring.
The result of the lookup.
Initializes a new instance of the class.
The search string.
The index to the longest matching substring.
The result of the lookup.
The action to be performed, if any.
Returns a that represents this instance.
A that represents this instance.
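Illustrative only: a generated stemmer typically declares its among tables as static arrays, where each entry carries the search string, the index of the longest earlier entry that is a matching substring of it (or -1), and the result code. The exact constructor signature in your version may differ, so treat this as a sketch:
// -1 means no earlier entry is a matching substring; "ies" and "es" fall back to "s" at index 0
private static readonly Among[] a_0 = new Among[]
{
    new Among("s",   -1, 1),
    new Among("ies",  0, 2),
    new Among("es",   0, 3)
};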
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This class was automatically generated by a Snowball to Java compiler
It implements the stemming algorithm defined by a snowball script.
This is rev 502 of the Snowball SVN trunk, but modified:
- made abstract and introduced an abstract stem method to avoid expensive reflection in the filter class
- refactored StringBuffers to StringBuilder
- uses char[] as buffer instead of StringBuffer/StringBuilder
- eq_s, eq_s_b, insert, replace_s take CharSequence like eq_v and eq_v_b
- reflection calls (Lovins, etc.) use EMPTY_ARGS/EMPTY_PARAMS
Set the current string.
Set the current string.
character array containing input
valid length of text.
Get the current string.
Get the current buffer containing the stem.
NOTE: this may be a reference to a different character array than the
one originally provided with SetCurrent, in the exceptional case that
stemming produced a longer intermediate or result string.
It is necessary to use to determine
the valid length of the returned buffer. For example, many words are
stemmed simply by subtracting from the length to remove suffixes.
Get the valid length of the character array in
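A rough sketch of driving a generated stemmer directly, following the SetCurrent / Stem / GetCurrentBuffer contract described above. This mirrors what SnowballFilter does internally; the namespaces and the valid-length member name are assumptions, so adjust them to your version:
using Lucene.Net.Tartarus.Snowball;       // namespace assumed
using Lucene.Net.Tartarus.Snowball.Ext;
SnowballProgram stemmer = new EnglishStemmer();
char[] term = "running".ToCharArray();
stemmer.SetCurrent(term, term.Length);
stemmer.Stem();
// The returned buffer may not be the array passed to SetCurrent, so always pair
// it with the stemmer's reported valid length rather than term.Length.
char[] buffer = stemmer.GetCurrentBuffer();
int length = stemmer.CurrentBufferLength; // name assumed; may be a method in some versions
string stemmed = new string(buffer, 0, length);   // typically "run"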
to replace chars between and in current by the
chars in .
Copy the slice into the supplied