Lucene.Net.Analysis.Common for Arabic. This analyzer implements light-stemming as specified by: Light Stemming for Arabic Information Retrieval http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf The analysis package contains three primary components: : Arabic orthographic normalization. : Arabic light stemming Arabic stop words file: a set of default Arabic stop words. File containing default Arabic stopwords. Default stopword list is from http://members.unine.ch/jacques.savoy/clef/index.html The stopword list is BSD-Licensed. Returns an unmodifiable instance of the default stop-words set. an unmodifiable instance of the default stop-words set. Atomically loads the DEFAULT_STOP_SET in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . Builds an analyzer with the given stop words lucene compatibility version a stopword set Builds an analyzer with the given stop word. If a none-empty stem exclusion set is provided this analyzer will add a before . lucene compatibility version a stopword set a set of terms not to be stemmed Creates used to tokenize all the text in the provided . built from an filtered with , , , if a stem exclusion set is provided and . Tokenizer that breaks text into runs of letters and diacritics. The problem with the standard Letter tokenizer is that it fails on diacritics. Handling similar to this is necessary for Indic Scripts, Hebrew, Thaana, etc. You must specify the required compatibility when creating : As of 3.1, uses an int based API to normalize and detect token characters. See and for details. @deprecated (3.1) Use instead. Construct a new ArabicLetterTokenizer. to match the input to split up into tokens Construct a new using a given . Lucene version to match - See . the attribute factory to use for this Tokenizer the input to split up into tokens Allows for Letter category or NonspacingMark category Factory for @deprecated (3.1) Use StandardTokenizerFactory instead. Creates a new A that applies to normalize the orthography. Factory for . <fieldType name="text_arnormal" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.ArabicNormalizationFilterFactory"/> </analyzer> </fieldType> Creates a new Normalizer for Arabic. Normalization is done in-place for efficiency, operating on a termbuffer. Normalization is defined as: Normalization of hamza with alef seat to a bare alef. Normalization of teh marbuta to heh Normalization of dotless yeh (alef maksura) to yeh. Removal of Arabic diacritics (the harakat) Removal of tatweel (stretching character). Normalize an input buffer of Arabic text input buffer length of input buffer length of input buffer after normalization A that applies to stem Arabic words.. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_arstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.ArabicNormalizationFilterFactory"/> <filter class="solr.ArabicStemFilterFactory"/> </analyzer> </fieldType> Creates a new Stemmer for Arabic. Stemming is done in-place for efficiency, operating on a termbuffer. Stemming is defined as: Removal of attached definite article, conjunction, and prepositions. Stemming of common suffixes. Stem an input buffer of Arabic text. 
input buffer length of input buffer length of input buffer after normalization Stem a prefix off an Arabic word. input buffer length of input buffer new length of input buffer after stemming. Stem suffix(es) off an Arabic word. input buffer length of input buffer new length of input buffer after stemming Returns true if the prefix matches and can be stemmed input buffer length of input buffer prefix to check true if the prefix matches and can be stemmed Returns true if the suffix matches and can be stemmed input buffer length of input buffer suffix to check true if the suffix matches and can be stemmed for Bulgarian. This analyzer implements light-stemming as specified by: Searching Strategies for the Bulgarian Language http://members.unine.ch/jacques.savoy/Papers/BUIR.pdf File containing default Bulgarian stopwords. Default stopword list is from http://members.unine.ch/jacques.savoy/clef/index.html The stopword list is BSD-Licensed. Returns an unmodifiable instance of the default stop-words set. an unmodifiable instance of the default stop-words set. Atomically loads the DEFAULT_STOP_SET in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . Builds an analyzer with the given stop words. Builds an analyzer with the given stop words and a stem exclusion set. If a stem exclusion set is provided this analyzer will add a before . Creates a which tokenizes all the text in the provided . A built from an filtered with , , , if a stem exclusion set is provided and . A that applies to stem Bulgarian words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_bgstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.BulgarianStemFilterFactory"/> </analyzer> </fieldType> Creates a new Light Stemmer for Bulgarian. Implements the algorithm described in: Searching Strategies for the Bulgarian Language http://members.unine.ch/jacques.savoy/Papers/BUIR.pdf Stem an input buffer of Bulgarian text. input buffer length of input buffer length of input buffer after normalization Mainly remove the definite article input buffer length of input buffer new stemmed length for Brazilian Portuguese language. Supports an external list of stopwords (words that will not be indexed at all) and an external list of exclusions (words that will not be stemmed, but indexed). NOTE: This class uses the same dependent settings as . File containing default Brazilian Portuguese stopwords. Returns an unmodifiable instance of the default stop-words set. an unmodifiable instance of the default stop-words set. Contains words that should be indexed but not stemmed. Builds an analyzer with the default stop words (). Builds an analyzer with the given stop words lucene compatibility version a stopword set Builds an analyzer with the given stop words and stemming exclusion words lucene compatibility version a stopword set a set of terms not to be stemmed Creates used to tokenize all the text in the provided . built from a filtered with , , , and . A that applies . To prevent terms from being stemmed use an instance of or a custom that sets the before this . in use by this filter. Creates a new the source Factory for . 
<fieldType name="text_brstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.BrazilianStemFilterFactory"/> </analyzer> </fieldType> Creates a new A stemmer for Brazilian Portuguese words. Changed term Stems the given term to an unique discriminator. The term that should be stemmed. Discriminator for Checks a term if it can be processed correctly. true if, and only if, the given term consists in letters. Checks a term if it can be processed indexed. true if it can be indexed See if string is 'a','e','i','o','u' true if is vowel Gets R1 R1 - is the region after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel. null or a string representing R1 Gets RV RV - IF the second letter is a consonant, RV is the region after the next following vowel, OR if the first two letters are vowels, RV is the region after the next consonant, AND otherwise (consonant-vowel case) RV is the region after the third letter. BUT RV is the end of the word if this positions cannot be found. null or a string representing RV 1) Turn to lowercase 2) Remove accents 3) ã -> a ; õ -> o 4) ç -> c null or a string transformed Check if a string ends with a suffix true if the string ends with the specified suffix Replace a suffix by another the replaced Remove a suffix the without the suffix See if a suffix is preceded by a true if the suffix is preceded Creates CT (changed term) , substituting * 'ã' and 'õ' for 'a~' and 'o~'. Standard suffix removal. Search for the longest among the following suffixes, and perform the following actions: false if no ending was removed Verb suffixes. Search for the longest among the following suffixes in RV, and if found, delete. false if no ending was removed Delete suffix 'i' if in RV and preceded by 'c' Residual suffix If the word ends with one of the suffixes (os a i o á í ó) in RV, delete it If the word ends with one of ( e é ê) in RV,delete it, and if preceded by 'gu' (or 'ci') with the 'u' (or 'i') in RV, delete the 'u' (or 'i') Or if the word ends ç remove the cedilha For log and debug purpose TERM, CT, RV, R1 and R2 for Catalan. You must specify the required compatibility when creating CatalanAnalyzer: As of 3.6, with a set of Catalan contractions is used by default. File containing default Catalan stopwords. Returns an unmodifiable instance of the default stop words set. default stop words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . Builds an analyzer with the given stop words. lucene compatibility version a stopword set Builds an analyzer with the given stop words. If a non-empty stem exclusion set is provided this analyzer will add a before stemming. lucene compatibility version a stopword set a set of terms not to be stemmed Creates a which tokenizes all the text in the provided . A built from an filtered with , , , , if a stem exclusion set is provided and . Base utility class for implementing a . You subclass this, and then record mappings by calling , and then invoke the correct method to correct an offset. Retrieve the corrected offset. Adds an offset correction mapping at the given output stream offset. Assumption: the offset given with each successive call to this method will not be smaller than the offset given at the previous invocation. 
The output stream offset at which to apply the correction The input offset is given by adding this to the output offset A that wraps another and attempts to strip out HTML constructs. This character denotes the end of file initial size of the lookahead buffer ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line l is of the form l = 2*k, k a non negative integer Translates characters to character classes Translates characters to character classes Translates DFA states to action switch labels. Translates a state to a row index in the transition table The transition table of the DFA error codes error messages for the codes above ZZ_ATTRIBUTE[aState] contains the attributes of state aState the input device the current state of the DFA the current lexical state this buffer contains the current text to be matched and is the source of the YyText() string the textposition at the last accepting state the current text position in the buffer startRead marks the beginning of the YyText() string in the buffer endRead marks the last character in the buffer, that has been read from input the number of characters up to the start of the matched text zzAtEOF == true <=> the scanner is at the EOF denotes if the user-EOF-code has already been executed user code: Creates a new HTMLStripCharFilter over the provided TextReader. to strip html tags from. Creates a new over the provided with the specified start and end tags. to strip html tags from. Tags in this set (both start and end tags) will not be filtered out. LUCENENET: Copied this method from the WordlistLoader class - this class requires readers with a Reset() method (which .NET readers don't support). So, we use the Java BufferedReader as a wrapper for whatever reader the user passes (unless it is already a BufferedReader). The position from which the next char will be read. Wraps the given and sets this.len to the given . Allocates an internal buffer of the given size. Sets len = 0 and pos = 0. Sets pos = 0 Returns the next char in the segment. Returns true when all characters in the text segment have been read Unpacks the compressed character translation table. the packed character translation table the unpacked character translation table Refills the input buffer. false, iff there was new input. if any I/O-Error occurs Disposes the input stream. Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset, the old input stream cannot be reused (internal buffer is discarded and lost). Lexical state is set to . Internal scan buffer is resized down to its initial length, if it has grown. the new input stream Returns the current lexical state. Enters a new lexical state the new lexical state Returns the character at position pos from the matched text. It is equivalent to YyText[pos], but faster the position of the character to fetch. A value from 0 to YyLength()-1. the character at position pos Returns the length of the matched text region. Reports an error that occured while scanning. In a wellformed scanner (no or only correct usage of YyPushBack(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen". If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner etc.). Usual syntax/scanner level error handling should be done in error fallback rules. 
the code of the errormessage to display Pushes the specified amount of characters back into the input stream. They will be read again by then next call of the scanning method the number of characters to be read again. This number must not be greater than YyLength()! Contains user EOF-code, which will be executed exactly once, when the end of file is reached Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs. the next token if any I/O-Error occurs Factory for . <fieldType name="text_html" class="solr.TextField" positionIncrementGap="100"> <analyzer> <charFilter class="solr.HTMLStripCharFilterFactory" escapedTags="a, title" /> <tokenizer class="solr.WhitespaceTokenizerFactory"/> </analyzer> </fieldType> Creates a new Simplistic that applies the mappings contained in a to the character stream, and correcting the resulting changes to the offsets. Matching is greedy (longest pattern matching at a given point wins). Replacement is allowed to be the empty string. LUCENENET specific support to buffer the reader. Default constructor that takes a . LUCENENET: Copied this method from the class - this class requires readers with a Reset() method (which .NET readers don't support). So, we use the (which is similar to Java BufferedReader) as a wrapper for whatever reader the user passes (unless it is already a ). Factory for . <fieldType name="text_map" class="solr.TextField" positionIncrementGap="100"> <analyzer> <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/> <tokenizer class="solr.WhitespaceTokenizerFactory"/> </analyzer> </fieldType> @since Solr 1.4 Creates a new Holds a map of input to output, to be used with . Use the to create this. Builds an NormalizeCharMap. Call add() until you have added all the mappings, then call build() to get a NormalizeCharMap @lucene.experimental Records a replacement to be applied to the input stream. Whenever singleMatch occurs in the input, it will be replaced with replacement. input String to be replaced output String if match is the empty string, or was already previously added Builds the ; call this once you are done calling . An that tokenizes text with , normalizes content with , folds case with , forms bigrams of CJK with , and filters stopwords with File containing default CJK stopwords. Currently it contains some common English words that are not usually useful for searching and some double-byte interpunctions. Returns an unmodifiable instance of the default stop-words set. an unmodifiable instance of the default stop-words set. Builds an analyzer which removes words in . Builds an analyzer with the given stop words lucene compatibility version a stopword set bigram flag for Han Ideographs bigram flag for Hiragana bigram flag for Katakana bigram flag for Hangul bigram flag for all scripts Forms bigrams of CJK terms that are generated from or ICUTokenizer. CJK types are set by these tokenizers, but you can also use to explicitly control which of the CJK scripts are turned into bigrams. By default, when a CJK character has no adjacent characters to form a bigram, it is output in unigram form. If you want to always output both unigrams and bigrams, set the outputUnigrams flag in . This can be used for a combined unigram+bigram approach. In all cases, all non-CJK input is passed thru unmodified. 
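As a rough illustration of the script flags and the outputUnigrams option described above, here is a minimal sketch assuming a Lucene.NET 4.8-style API (the sample text and option values are illustrative only):

```csharp
// Sketch only: form bigrams for Han and Hangul runs, and also emit unigrams so
// isolated single-character queries can still match. Non-CJK tokens pass through.
using System;
using System.IO;
using Lucene.Net.Analysis.Cjk;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

var tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_48, new StringReader("一二三 abc"));
var bigrams = new CJKBigramFilter(tokenizer, CJKScript.HAN | CJKScript.HANGUL, true /* outputUnigrams */);

var term = bigrams.AddAttribute<ICharTermAttribute>();
bigrams.Reset();
while (bigrams.IncrementToken())
{
    Console.WriteLine(term.ToString());
}
bigrams.End();
bigrams.Dispose();
```

With outputUnigrams left at false (the default behavior described above), unigrams are emitted only for CJK characters that have no adjacent character to pair with.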
when we emit a bigram, its then marked as this type when we emit a unigram, its then marked as this type Calls CJKBigramFilter(@in, CJKScript.HAN | CJKScript.HIRAGANA | CJKScript.KATAKANA | CJKScript.HANGUL) Input Calls CJKBigramFilter(in, flags, false) Input OR'ed set from , , , Create a new , specifying which writing systems should be bigrammed, and whether or not unigrams should also be output. Input OR'ed set from , , , true if unigrams for the selected writing systems should also be output. when this is false, this is only done when there are no adjacent characters to form a bigram. looks at next input token, returning false is none is available refills buffers with new data from the current token. Flushes a bigram token to output from our buffer This is the normal case, e.g. ABC -> AB BC Flushes a unigram token to output from our buffer. This happens when we encounter isolated CJK characters, either the whole CJK string is a single character, or we encounter a CJK character surrounded by space, punctuation, english, etc, but not beside any other CJK. True if we have multiple codepoints sitting in our buffer True if we have a single codepoint sitting in our buffer, where its future (whether it is emitted as unigram or forms a bigram) depends upon not-yet-seen inputs. Factory for . <fieldType name="text_cjk" class="solr.TextField"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.CJKWidthFilterFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="false" /> </analyzer> </fieldType> Creates a new CJKTokenizer is designed for Chinese, Japanese, and Korean languages. The tokens returned are every two adjacent characters with overlap match. Example: "java C1C2C3C4" will be segmented to: "java" "C1C2" "C2C3" "C3C4". Additionally, the following is applied to Latin text (such as English): Text is converted to lowercase. Numeric digits, '+', '#', and '_' are tokenized as letters. Full-width forms are converted to half-width forms. For more info on Asian language (Chinese, Japanese, and Korean) text segmentation: please search google @deprecated Use StandardTokenizer, CJKWidthFilter, CJKBigramFilter, and LowerCaseFilter instead. Word token type Single byte token type Double byte token type Names for token types Max word length buffer size: Regular expression for testing Unicode character class \p{IsHalfwidthandFullwidthForms}. Regular expression for testing Unicode character class \p{IsBasicLatin}. word offset, used to imply which character(in ) is parsed the index used only for ioBuffer data length character buffer, store the characters which are used to compose the returned Token I/O buffer, used to store the content of the input(one of the members of Tokenizer) word type: single=>ASCII double=>non-ASCII word=>default tag: previous character is a cached double-byte character "C1C2C3C4" ----(set the C1 isTokened) C1C2 "C2C3C4" ----(set the C2 isTokened) C1C2 C2C3 "C3C4" ----(set the C3 isTokened) "C1C2 C2C3 C3C4" Construct a token stream processing the given input. I/O reader Returns true for the next token in the stream, or false at EOS. See http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.UnicodeBlock.html for detail. false for end of stream, true otherwise when read error happened in the InputStream Factory for . 
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.CJKTokenizerFactory"/> </analyzer> </fieldType> @deprecated Use instead. Creates a new A that normalizes CJK width differences: Folds fullwidth ASCII variants into the equivalent basic latin Folds halfwidth Katakana variants into the equivalent kana NOTE: this filter can be viewed as a (practical) subset of NFKC/NFKD Unicode normalization. See the normalization support in the ICU package for full normalization. halfwidth kana mappings: 0xFF65-0xFF9D note: 0xFF9C and 0xFF9D are only mapped to 0x3099 and 0x309A as a fallback when they cannot properly combine with a preceding character into a composed form. kana combining diffs: 0x30A6-0x30FD returns true if we successfully combined the voice mark Factory for . <fieldType name="text_cjk" class="solr.TextField"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.CJKWidthFilterFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.CJKBigramFilterFactory"/> </analyzer> </fieldType> Creates a new for Sorani Kurdish. File containing default Kurdish stopwords. Returns an unmodifiable instance of the default stop words set. default stop words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . Builds an analyzer with the given stop words. lucene compatibility version a stopword set Builds an analyzer with the given stop words. If a non-empty stem exclusion set is provided this analyzer will add a before stemming. lucene compatibility version a stopword set a set of terms not to be stemmed Creates a which tokenizes all the text in the provided . A built from an filtered with , , , , if a stem exclusion set is provided and . A that applies to normalize the orthography. Factory for . <fieldType name="text_ckbnormal" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.SoraniNormalizationFilterFactory"/> </analyzer> </fieldType> Creates a new Normalizes the Unicode representation of Sorani text. Normalization consists of: Alternate forms of 'y' (0064, 0649) are converted to 06CC (FARSI YEH) Alternate form of 'k' (0643) is converted to 06A9 (KEHEH) Alternate forms of vowel 'e' (0647+200C, word-final 0647, 0629) are converted to 06D5 (AE) Alternate (joining) form of 'h' (06BE) is converted to 0647 Alternate forms of 'rr' (0692, word-initial 0631) are converted to 0695 (REH WITH SMALL V BELOW) Harakat, tatweel, and formatting characters such as directional controls are removed. Normalize an input buffer of Sorani text input buffer length of input buffer length of input buffer after normalization A that applies to stem Sorani words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_ckbstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.SoraniNormalizationFilterFactory"/> <filter class="solr.SoraniStemFilterFactory"/> </analyzer> </fieldType> Creates a new Light stemmer for Sorani Stem an input buffer of Sorani text. input buffer length of input buffer length of input buffer after normalization An that tokenizes text with and filters with @deprecated (3.1) Use instead, which has the same functionality. 
This analyzer will be removed in Lucene 5.0 Creates used to tokenize all the text in the provided . built from a filtered with A with a stop word table. Numeric tokens are removed. English tokens must be larger than 1 character. One Chinese character as one Chinese word. TO DO: Add Chinese stop words, such as \ue400 Dictionary based Chinese word extraction Intelligent Chinese word extraction @deprecated (3.1) Use instead, which has the same functionality. This filter will be removed in Lucene 5.0 Factory for @deprecated Use instead. Creates a new Tokenize Chinese text as individual chinese characters. The difference between and is that they have different token parsing logic. For example, if the Chinese text "C1C2C3C4" is to be indexed: The tokens returned from ChineseTokenizer are C1, C2, C3, C4. The tokens returned from the CJKTokenizer are C1C2, C2C3, C3C4. Therefore the index created by is much larger. The problem is that when searching for C1, C1C2, C1C3, C4C2, C1C2C3 ... the works, but the will not work. @deprecated (3.1) Use instead, which has the same functionality. This filter will be removed in Lucene 5.0 Factory for @deprecated Use instead. Creates a new Construct bigrams for frequently occurring terms while indexing. Single terms are still indexed too, with bigrams overlaid. This is achieved through the use of . Bigrams have a type of Example: input:"the quick brown fox" output:|"the","the-quick"|"brown"|"fox"| "the-quick" has a position increment of 0 so it is in the same position as "the" "the-quick" has a term.type() of "gram" Construct a token stream filtering the given input using a Set of common words to create bigrams. Outputs both unigrams with position increment and bigrams with position increment 0 type=gram where one or both of the words in a potential bigram are in the set of common words . lucene compatibility version input in filter chain The set of common words. Inserts bigrams for common words into a token stream. For each input token, output the token. If the token and/or the following token are in the list of common words also output a bigram with position increment 0 and type="gram" TODO:Consider adding an option to not emit unigram stopwords as in CDL XTF BigramStopFilter, would need to be changed to work with this. TODO: Consider optimizing for the case of three commongrams i.e "man of the year" normally produces 3 bigrams: "man-of", "of-the", "the-year" but with proper management of positions we could eliminate the middle bigram "of-the"and save a disk seek and a whole set of position lookups. This method is called by a consumer before it begins consumption using . Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh. If you override this method, always call base.Reset(), otherwise some internal state will not be correctly reset (e.g., will throw on further usage). NOTE: The default implementation chains the call to the input , so be sure to call base.Reset() when overriding this method. Determines if the current token is a common term true if the current token is a common term, false otherwise Saves this information to form the left part of a gram Constructs a compound token. Constructs a . 
<fieldType name="text_cmmngrms" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.CommonGramsFilterFactory" words="commongramsstopwords.txt" ignoreCase="false"/> </analyzer> </fieldType> Creates a new Wrap a optimizing phrase queries by only returning single words when they are not a member of a bigram. Example: query input to CommonGramsFilter: "the rain in spain falls mainly" output of CommomGramsFilter/input to CommonGramsQueryFilter: |"the, "the-rain"|"rain" "rain-in"|"in, "in-spain"|"spain"|"falls"|"mainly" output of CommonGramsQueryFilter:"the-rain", "rain-in" ,"in-spain", "falls", "mainly" See:http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//all/org/apache/lucene/analysis/TokenStream.html and http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/analysis/package.html?revision=718798 Constructs a new CommonGramsQueryFilter based on the provided CommomGramsFilter CommonGramsFilter the QueryFilter will use This method is called by a consumer before it begins consumption using . Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh. If you override this method, always call base.Reset(), otherwise some internal state will not be correctly reset (e.g., will throw on further usage). NOTE: The default implementation chains the call to the input , so be sure to call base.Reset() when overriding this method. Output bigrams whenever possible to optimize queries. Only output unigrams when they are not a member of a bigram. Example: input: "the rain in spain falls mainly" output:"the-rain", "rain-in" ,"in-spain", "falls", "mainly" Convenience method to check if the current type is a gram type true if the current type is a gram type, false otherwise Construct . <fieldType name="text_cmmngrmsqry" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.CommonGramsQueryFilterFactory" words="commongramsquerystopwords.txt" ignoreCase="false"/> </analyzer> </fieldType> Creates a new Create a and wrap it with a Base class for decomposition token filters. You must specify the required compatibility when creating : As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 supplementary characters in strings and char arrays provided as compound word dictionaries. As of 4.4, doesn't update offsets. The default for minimal word length that gets decomposed The default for minimal length of subwords that get propagated to the output of this filter The default for maximal length of subwords that get propagated to the output of this filter Decomposes the current and places instances in the list. The original token may not be placed in the list, as it is automatically passed through this filter. Helper class to hold decompounded token information Construct the compound token based on a slice of the current . A that decomposes compound words found in many Germanic languages. "Donaudampfschiff" becomes Donau, dampf, schiff so that you can find "Donaudampfschiff" even when you only enter "schiff". It uses a brute-force algorithm to achieve this. You must specify the required compatibility when creating : As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 supplementary characters in strings and char arrays provided as compound word dictionaries. 
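A rough usage sketch of the dictionary-based decomposition described above, assuming a Lucene.NET 4.8-style API (the word list is purely illustrative):

```csharp
// Sketch only: decompose "Donaudampfschiff" against a tiny dictionary so that
// a search for "schiff" can also match the compound.
using System.IO;
using Lucene.Net.Analysis.Compound;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

var dictionary = new CharArraySet(LuceneVersion.LUCENE_48,
    new[] { "donau", "dampf", "schiff" }, true /* ignoreCase */);

var tokenizer = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, new StringReader("Donaudampfschiff"));
var compounds = new DictionaryCompoundWordTokenFilter(LuceneVersion.LUCENE_48, tokenizer, dictionary);
// The original compound token is passed through; matching subwords are added alongside it.
```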
Creates a new Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details. the to process the word dictionary to match against. Creates a new Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details. the to process the word dictionary to match against. only words longer than this get processed only subwords longer than this get to the output stream only subwords shorter than this get to the output stream Add only the longest matching subword to the stream Factory for . <fieldType name="text_dictcomp" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="true"/> </analyzer> </fieldType> Creates a new A that decomposes compound words found in many Germanic languages. "Donaudampfschiff" becomes Donau, dampf, schiff so that you can find "Donaudampfschiff" even when you only enter "schiff". It uses a hyphenation grammar and a word dictionary to achieve this. You must specify the required compatibility when creating : As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 supplementary characters in strings and char arrays provided as compound word dictionaries. Creates a new instance. Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details. the to process the hyphenation pattern tree to use for hyphenation the word dictionary to match against. Creates a new instance. Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details. the to process the hyphenation pattern tree to use for hyphenation the word dictionary to match against. only words longer than this get processed only subwords longer than this get to the output stream only subwords shorter than this get to the output stream Add only the longest matching subword to the stream Create a with no dictionary. Calls Create a with no dictionary. Calls Create a hyphenator tree the filename of the XML grammar to load An object representing the hyphenation patterns If there is a low-level I/O error. Create a hyphenator tree the filename of the XML grammar to load The character encoding to use An object representing the hyphenation patterns If there is a low-level I/O error. Create a hyphenator tree the file of the XML grammar to load An object representing the hyphenation patterns If there is a low-level I/O error. Create a hyphenator tree the file of the XML grammar to load The character encoding to use An object representing the hyphenation patterns If there is a low-level I/O error. Create a hyphenator tree the InputSource pointing to the XML grammar An object representing the hyphenation patterns If there is a low-level I/O error. Create a hyphenator tree the InputSource pointing to the XML grammar The character encoding to use An object representing the hyphenation patterns If there is a low-level I/O error. Factory for . This factory accepts the following parameters: hyphenator (mandatory): path to the FOP xml hyphenation pattern. See http://offo.sourceforge.net/hyphenation/. encoding (optional): encoding of the xml hyphenation file. defaults to UTF-8. dictionary (optional): dictionary of words. 
defaults to no dictionary. minWordSize (optional): minimal word length that gets decomposed. defaults to 5. minSubwordSize (optional): minimum length of subwords. defaults to 2. maxSubwordSize (optional): maximum length of subwords. defaults to 15. onlyLongestMatch (optional): if true, adds only the longest matching subword to the stream. defaults to false. <fieldType name="text_hyphncomp" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.HyphenationCompoundWordTokenFilterFactory" hyphenator="hyphenator.xml" encoding="UTF-8" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="false"/> </analyzer> </fieldType> Creates a new This class implements a simple byte vector with access to the underlying array. This class has been taken from the Apache FOP project (http://xmlgraphics.apache.org/fop/). They have been slightly modified. Capacity increment size The encapsulated array Points to next free item LUCENENET indexer for .NET return number of items in array returns current capacity of array This is to implement memory allocation in the array. Like malloc(). This class implements a simple char vector with access to the underlying array. This class has been taken from the Apache FOP project (http://xmlgraphics.apache.org/fop/). They have been slightly modified. Capacity increment size The encapsulated array Points to next free item Reset Vector but don't resize or clear elements LUCENENET indexer for .NET return number of items in array returns current capacity of array This class represents a hyphen. A 'full' hyphen is made of 3 parts: the pre-break text, post-break text and no-break. If no line-break is generated at this position, the no-break text is used, otherwise, pre-break and post-break are used. Typically, pre-break is equal to the hyphen character and the others are empty. However, this general scheme allows support for cases in some languages where words change spelling if they're split across lines, like german's 'backen' which hyphenates 'bak-ken'. BTW, this comes from TeX. This class has been taken from the Apache FOP project (http://xmlgraphics.apache.org/fop/). They have been slightly modified. This class represents a hyphenated word. This class has been taken from the Apache FOP project (http://xmlgraphics.apache.org/fop/). They have been slightly modified. rawWord as made of alternating strings and instances the number of hyphenation points in the word the hyphenation points This tree structure stores the hyphenation patterns in an efficient way for fast lookup. It provides the provides the method to hyphenate a word. This class has been taken from the Apache FOP project (http://xmlgraphics.apache.org/fop/). They have been slightly modified. Lucene.NET specific note: If you are going to extend this class by inheriting from it, you should be aware that the base class TernaryTree initializes its state in the constructor by calling its protected Init() method. If your subclass needs to initialize its own state, you add your own "Initialize()" method and call it both from the inside of your constructor and you will need to override the Balance() method and call "Initialize()" before the call to base.Balance(). Your class can use the data that is initialized in the base class after the call to base.Balance(). 
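The factory options above correspond roughly to the following programmatic setup. This is a sketch only, assuming a Lucene.NET 4.8-style API; the pattern file name is a placeholder (real FOP pattern files are available from the OFFO site referenced above):

```csharp
// Sketch only: load an FOP hyphenation grammar and combine it with a small
// dictionary to decompose compound words.
using System.IO;
using Lucene.Net.Analysis.Compound;
using Lucene.Net.Analysis.Compound.Hyphenation;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Placeholder file name; substitute a real hyphenation pattern file.
HyphenationTree hyphenator = HyphenationCompoundWordTokenFilter.GetHyphenationTree("de_hyphenation.xml");

var dictionary = new CharArraySet(LuceneVersion.LUCENE_48,
    new[] { "donau", "dampf", "schiff" }, true /* ignoreCase */);

var tokenizer = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, new StringReader("Donaudampfschiff"));
var compounds = new HyphenationCompoundWordTokenFilter(
    LuceneVersion.LUCENE_48, tokenizer, hyphenator, dictionary);
```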
value space: stores the interletter values This map stores hyphenation exceptions This map stores the character classes Temporary map to store interletter values on pattern loading. Packs the values by storing them in 4 bits, two values into a byte Values range is from 0 to 9. We use zero as terminator, so we'll add 1 to the value. a string of digits from '0' to '9' representing the interletter values. the index into the vspace array where the packed values are stored. Read hyphenation patterns from an XML file. the filename In case the parsing fails Read hyphenation patterns from an XML file. the filename The character encoding to use In case the parsing fails Read hyphenation patterns from an XML file. a object representing the file In case the parsing fails Read hyphenation patterns from an XML file. a object representing the file The character encoding to use In case the parsing fails Read hyphenation patterns from an XML file. input source for the file In case the parsing fails Read hyphenation patterns from an XML file. input source for the file The character encoding to use In case the parsing fails Read hyphenation patterns from an . input source for the file In case the parsing fails String compare, returns 0 if equal or t is a substring of s Search for all possible partial matches of word starting at index an update interletter values. In other words, it does something like: for (i=0; i<patterns.Length; i++) { if (word.Substring(index).StartsWith(patterns[i], StringComparison.Ordinal)) update_interletter_values(patterns[i]); } But it is done in an efficient way since the patterns are stored in a ternary tree. In fact, this is the whole purpose of having the tree: doing this search without having to test every single pattern. The number of patterns for languages such as English range from 4000 to 10000. Thus, doing thousands of string comparisons for each word to hyphenate would be really slow without the tree. The tradeoff is memory, but using a ternary tree instead of a trie, almost halves the the memory used by Lout or TeX. It's also faster than using a hash table null terminated word to match start index from word interletter values array to update Hyphenate word and return a object. the word to be hyphenated Minimum number of characters allowed before the hyphenation point. Minimum number of characters allowed after the hyphenation point. a object representing the hyphenated word or null if word is not hyphenated. Hyphenate word and return an array of hyphenation points. w = "****nnllllllnnn*****", where n is a non-letter, l is a letter, all n may be absent, the first n is at offset, the first l is at offset + iIgnoreAtBeginning; word = ".llllll.'\0'***", where all l in w are copied into word. In the first part of the routine len = w.length, in the second part of the routine len = word.length. Three indices are used: index(w), the index in w, index(word), the index in word, letterindex(word), the index in the letter part of word. The following relations exist: index(w) = offset + i - 1 index(word) = i - iIgnoreAtBeginning letterindex(word) = index(word) - 1 (see first loop). It follows that: index(w) - index(word) = offset - 1 + iIgnoreAtBeginning index(w) = letterindex(word) + offset + iIgnoreAtBeginning char array that contains the word Offset to first character in word Length of word Minimum number of characters allowed before the hyphenation point. Minimum number of characters allowed after the hyphenation point. 
a object representing the hyphenated word or null if word is not hyphenated. Add a character class to the tree. It is used by as callback to add character classes. Character classes define the valid word characters for hyphenation. If a word contains a character not defined in any of the classes, it is not hyphenated. It also defines a way to normalize the characters in order to compare them with the stored patterns. Usually pattern files use only lower case characters, in this case a class for letter 'a', for example, should be defined as "aA", the first character being the normalization char. Add an exception to the tree. It is used by class as callback to store the hyphenation exceptions. normalized word a vector of alternating strings and objects. Add a pattern to the tree. Mainly, to be used by class as callback to add a pattern to the tree. the hyphenation pattern interletter weight values indicating the desirability and priority of hyphenating at a given point within the pattern. It should contain only digit characters. (i.e. '0' to '9'). This interface is used to connect the XML pattern file parser to the hyphenation tree. This interface has been taken from the Apache FOP project (http://xmlgraphics.apache.org/fop/). They have been slightly modified. Add a character class. A character class defines characters that are considered equivalent for the purpose of hyphenation (e.g. "aA"). It usually means to ignore case. character group Add a hyphenation exception. An exception replaces the result obtained by the algorithm for cases for which this fails or the user wants to provide his own hyphenation. A hyphenatedword is a vector of alternating String's and instances Add hyphenation patterns. the pattern interletter values expressed as a string of digit characters. A XMLReader document handler to read and parse hyphenation patterns from a XML file. LUCENENET: This class has been refactored from its Java counterpart to use XmlReader rather than a SAX parser. Parses a hyphenation pattern file. The complete file path to be read. In case of an exception while parsing Parses a hyphenation pattern file. The complete file path to be read. The character encoding to use In case of an exception while parsing Parses a hyphenation pattern file. a object representing the file In case of an exception while parsing Parses a hyphenation pattern file. a object representing the file The character encoding to use In case of an exception while parsing Parses a hyphenation pattern file. The stream containing the XML data. The scans the first bytes of the stream looking for a byte order mark or other sign of encoding. When encoding is determined, the encoding is used to continue reading the stream, and processing continues parsing the input as a stream of (Unicode) characters. In case of an exception while parsing Parses a hyphenation pattern file. input source for the file In case of an exception while parsing LUCENENET specific helper class to force the DTD file to be read from the embedded resource rather than from the file system. Receive notification of the beginning of an element. The Parser will invoke this method at the beginning of every element in the XML document; there will be a corresponding event for every event (even when the element is empty). All of the element's content will be reported, in order, before the corresponding endElement event. 
the Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed the local name (without prefix), or the empty string if Namespace processing is not being performed the attributes attached to the element; if there are no attributes, it shall be an empty Attributes object. The value of this object after startElement returns is undefined. Receive notification of the end of an element. The parser will invoke this method at the end of every element in the XML document; there will be a corresponding startElement event for every endElement event (even when the element is empty). the Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed the local name (without prefix), or the empty string if Namespace processing is not being performed Receive notification of character data. The parser will call this method to report each chunk of character data. Parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information. The application must not attempt to read from the array outside of the specified range.
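The pattern-loading and hyphenation methods described above can also be used directly, outside of any token filter. A minimal sketch, assuming a Lucene.NET 4.8-style API and a placeholder pattern file name:

```csharp
// Sketch only: load an FOP-style pattern file (placeholder name) and hyphenate a word,
// requiring at least two characters before and after each break point.
using Lucene.Net.Analysis.Compound.Hyphenation;

var tree = new HyphenationTree();
tree.LoadPatterns("de_hyphenation.xml");

Hyphenation result = tree.Hyphenate("Dampfschiff", 2, 2);
if (result != null)
{
    // result describes the word's break points; null means the word was not hyphenated.
}
```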

Ternary Search Tree.

A ternary search tree is a hybrid between a binary tree and a digital search tree (trie). Keys are limited to strings. A data value of type char is stored in each leaf node; it can be used as an index (or pointer) into the data. Branches that contain only one key are compressed into a single node by storing a pointer to the trailing substring of the key. This class is intended to serve as a base or helper class for implementing dictionary collections and the like. Ternary trees have some nice properties: the tree can be traversed in sorted order, partial (wildcard) matches can be implemented, all keys within a given distance of a target can be retrieved, and so on. The storage requirements are higher than for a binary tree but far lower than for a trie. Performance is comparable with a hash table, and sometimes better, since most of the time a miss can be determined faster than a hash can be computed. The main purpose of this port is to serve as a base for implementing TeX's hyphenation algorithm (see The TeXBook, appendix H). Each language requires from 5000 to 15000 hyphenation patterns, which become keys in this tree. The pattern strings are usually small (from 2 to 5 characters), but each char in the tree is stored in a node, so memory usage is the main concern. We sacrifice 'elegance' to keep memory requirements to a minimum: by using the char type as a pointer, the node size is kept to just 8 bytes (3 pointers and the data char). This gives room for about 65000 nodes. In the original author's tests the English patterns took 7694 nodes and the German patterns 10055 nodes, so this limit is safe in practice. All told, this is a map with string keys and char values, which is admittedly limited. It can be extended to a general map by using the string representation of an object as the key and using the char value as an index into an array that holds the object values. This class has been taken from the Apache FOP project (http://xmlgraphics.apache.org/fop/) and has been slightly modified.
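As a rough illustration of the "map from string keys to char values" behavior described above, a sketch follows; the member names (Insert, Balance, Find) are assumed to mirror the Java/FOP original:

```csharp
// Sketch only: member names are assumed. The char value typically acts as an
// index into a separate table of payload objects.
using Lucene.Net.Analysis.Compound.Hyphenation;

var tree = new TernaryTree();
tree.Insert("backen", (char)1);
tree.Insert("schiff", (char)2);
tree.Balance();                  // improves lookup performance after bulk inserts

int value = tree.Find("backen"); // the stored char value for the key
```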
Pointer to low branch and to rest of the key when it is stored directly in this node, we don't have unions in java! Pointer to high branch. Pointer to equal branch and to data when this node is a string terminator. The character stored in this node: splitchar. Two special values are reserved: 0x0000 as string terminator 0xFFFF to indicate that the branch starting at this node is compressed This shouldn't be a problem if we give the usual semantics to strings since 0xFFFF is guaranteed not to be an Unicode character. This vector holds the trailing of the keys when the branch is compressed. Branches are initially compressed, needing one node per key plus the size of the string key. They are decompressed as needed when another key with same prefix is inserted. This saves a lot of space, specially for long keys. The actual insertion function, recursive version. Compares 2 null terminated char arrays Compares a string with null terminated char array Recursively insert the median first and then the median of the lower and upper halves, and so on in order to get a balanced tree. The array of keys is assumed to be sorted in ascending order. Balance the tree for best search performance Each node stores a character (splitchar) which is part of some key(s). In a compressed branch (one that only contain a single string key) the trailer of the key which is not already in nodes is stored externally in the kv array. As items are inserted, key substrings decrease. Some substrings may completely disappear when the whole branch is totally decompressed. The tree is traversed to find the key substrings actually used. In addition, duplicate substrings are removed using a map (implemented with a TernaryTree!). Gets an enumerator over the keys of this . NOTE: This was keys() in Lucene. An enumerator over the keys of this . Enumerator for TernaryTree LUCENENET NOTE: This differs a bit from its Java counterpart to adhere to .NET IEnumerator semantics. In Java, when the is instantiated, it is already positioned at the first element. However, to act like a .NET IEnumerator, the initial state is undefined and considered to be before the first element until is called, and if a move took place it will return true; current node index current key Node stack key stack implemented with a traverse upwards traverse the tree to find next key "Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names. Emits the entire input as a single token. Default read buffer size Factory for . <fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.KeywordTokenizerFactory"/> </analyzer> </fieldType> Creates a new A is a tokenizer that divides text at non-letters. That's to say, it defines tokens as maximal strings of adjacent letters, as defined by predicate. Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces. You must specify the required compatibility when creating : As of 3.1, uses an based API to normalize and detect token characters. See and for details. Construct a new . to match. the input to split up into tokens Construct a new using a given . to match the attribute factory to use for this the input to split up into tokens Collects only characters which satisfy . Factory for . 
<fieldType name="text_letter" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.LetterTokenizerFactory"/> </analyzer> </fieldType> Creates a new Normalizes token text to lower case. You must specify the required compatibility when creating LowerCaseFilter: As of 3.1, supplementary characters are properly lowercased. Create a new , that normalizes token text to lower case. See to filter Factory for . <fieldType name="text_lwrcase" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType> Creates a new performs the function of and together. It divides text at non-letters and converts them to lower case. While it is functionally equivalent to the combination of and , there is a performance advantage to doing the two tasks at once, hence this (redundant) implementation. Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces. You must specify the required compatibility when creating : As of 3.1, uses an int based API to normalize and detect token characters. See and for details. Construct a new . to match the input to split up into tokens Construct a new using a given . to match the attribute factory to use for this the input to split up into tokens Converts char to lower case in the invariant culture. Factory for . <fieldType name="text_lwrcase" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.LowerCaseTokenizerFactory"/> </analyzer> </fieldType> Creates a new An that filters with You must specify the required compatibility when creating : As of 3.1, uses an int based API to normalize and detect token codepoints. See and for details. Creates a new to match Filters with and . You must specify the required compatibility when creating : As of 3.1, StopFilter correctly handles Unicode 4.0 supplementary characters in stopwords As of 2.9, position increments are preserved An unmodifiable set containing some common English words that are not usually useful for searching. Builds an analyzer which removes words in . See Builds an analyzer with the stop words from the given set. See Set of stop words Builds an analyzer with the stop words from the given file. See File to load stop words from Builds an analyzer with the stop words from the given reader. See to load stop words from Creates used to tokenize all the text in the provided . built from a filtered with Removes stop words from a token stream. You must specify the required compatibility when creating : As of 3.1, StopFilter correctly handles Unicode 4.0 supplementary characters in stopwords and position increments are preserved Constructs a filter which removes words from the input that are named in the . Lucene version to enable correct Unicode 4.0 behavior in the stop set if Version > 3.0. See > for details. Input A representing the stopwords. Builds a from an array of stop words, appropriate for passing into the constructor. This permits this construction to be cached once when an is constructed. to enable correct Unicode 4.0 behavior in the returned set if Version > 3.0 An array of stopwords passing false to ignoreCase Builds a from an array of stop words, appropriate for passing into the constructor. This permits this construction to be cached once when an is constructed. 
to enable correct Unicode 4.0 behavior in the returned set if Version > 3.0 A List of s or or any other ToString()-able list representing the stopwords A Set () containing the words passing false to ignoreCase Creates a stopword set from the given stopword array. to enable correct Unicode 4.0 behavior in the returned set if Version > 3.0 An array of stopwords If true, all words are lower cased first. a Set () containing the words Creates a stopword set from the given stopword list. to enable correct Unicode 4.0 behavior in the returned set if Version > 3.0 A List of s or or any other ToString()-able list representing the stopwords if true, all words are lower cased first A Set () containing the words Creates a stopword set from the given stopword list. to enable correct Unicode 4.0 behavior in the returned set if Version > 3.0 A List of s or or any other ToString()-able list representing the stopwords if true, all words are lower cased first A Set () containing the words Returns the next input Token whose Term is not a stop word. Factory for . <fieldType name="text_stop" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" format="wordset" /> </analyzer> </fieldType> All attributes are optional: ignoreCase defaults to false words should be the name of a stopwords file to parse, if not specified the factory will use format defines how the words file will be parsed, and defaults to wordset. If words is not specified, then format must not be specified. The valid values for the format option are: wordset - This is the default format, which supports one word per line (including any intra-word whitespace) and allows whole line comments begining with the "#" character. Blank lines are ignored. See for details. snowball - This format allows for multiple words specified on each line, and trailing comments may be specified using the vertical line ("|"). Blank lines are ignored. See for details. Creates a new Removes tokens whose types appear in a set of blocked types from a token stream. @deprecated enablePositionIncrements=false is not supported anymore as of Lucene 4.4. @deprecated enablePositionIncrements=false is not supported anymore as of Lucene 4.4. Create a new . the match version the to consume the types to filter if true, then tokens whose type is in will be kept, otherwise they will be filtered out Create a new that filters tokens out (useWhiteList=false). By default accept the token if its type is not a stop type. When the parameter is set to true then accept the token if its type is contained in the Factory class for . <fieldType name="chars" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt" useWhitelist="false"/> </analyzer> </fieldType> Creates a new Normalizes token text to UPPER CASE. You must specify the required compatibility when creating NOTE: In Unicode, this transformation may lose information when the upper case character represents more than one lower case character. Use this filter when you Require uppercase tokens. Use the for general search matching Create a new , that normalizes token text to upper case. See to filter Factory for . 
<fieldType name="text_uppercase" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.UpperCaseFilterFactory"/> </analyzer> </fieldType> NOTE: In Unicode, this transformation may lose information when the upper case character represents more than one lower case character. Use this filter when you require uppercase tokens. Use the for general search matching Creates a new An that uses . You must specify the required compatibility when creating : As of 3.1, uses an int based API to normalize and detect token codepoints. See and for details. Creates a new to match A is a tokenizer that divides text at whitespace. Adjacent sequences of non-Whitespace characters form tokens. You must specify the required compatibility when creating : As of 3.1, uses an int based API to normalize and detect token characters. See and for details. Construct a new . to match the input to split up into tokens Construct a new using a given . to match the attribute factory to use for this the input to split up into tokens Collects only characters which do not satisfy . Factory for . <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> </analyzer> </fieldType> Creates a new for Czech language. Supports an external list of stopwords (words that will not be indexed at all). A default set of stopwords is used unless an alternative list is specified. You must specify the required compatibility when creating : As of 3.1, words are stemmed with As of 2.9, StopFilter preserves position increments As of 2.4, Tokens incorrectly identified as acronyms are corrected (see LUCENE-1068) File containing default Czech stopwords. Returns a set of default Czech-stopwords a set of default Czech-stopwords Builds an analyzer with the default stop words (). to match Builds an analyzer with the given stop words. to match a stopword set Builds an analyzer with the given stop words and a set of work to be excluded from the . to match a stopword set a stemming exclusion set Creates used to tokenize all the text in the provided . built from a filtered with , , , and (only if version is >= LUCENE_31). If a version is >= LUCENE_31 and a stem exclusion set is provided via a is added before . A that applies to stem Czech words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . NOTE: Input is expected to be in lowercase, but with diacritical marks Factory for . <fieldType name="text_czstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.CzechStemFilterFactory"/> </analyzer> </fieldType> Creates a new Light Stemmer for Czech. Implements the algorithm described in: Indexing and stemming approaches for the Czech language http://portal.acm.org/citation.cfm?id=1598600 Stem an input buffer of Czech text. NOTE: Input is expected to be in lowercase, but with diacritical marks input buffer length of input buffer length of input buffer after normalization for Danish. File containing default Danish stopwords. Returns an unmodifiable instance of the default stop words set. default stop words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . Builds an analyzer with the given stop words. 
lucene compatibility version a stopword set Builds an analyzer with the given stop words. If a non-empty stem exclusion set is provided this analyzer will add a before stemming. lucene compatibility version a stopword set a set of terms not to be stemmed Creates a which tokenizes all the text in the provided . A built from an filtered with , , , if a stem exclusion set is provided and . for German language. Supports an external list of stopwords (words that will not be indexed at all) and an external list of exclusions (word that will not be stemmed, but indexed). A default set of stopwords is used unless an alternative list is specified, but the exclusion list is empty by default. You must specify the required compatibility when creating GermanAnalyzer: As of 3.6, GermanLightStemFilter is used for less aggressive stemming. As of 3.1, Snowball stemming is done with SnowballFilter, and Snowball stopwords are used by default. As of 2.9, StopFilter preserves position increments NOTE: This class uses the same dependent settings as . @deprecated in 3.1, remove in Lucene 5.0 (index bw compat) File containing default German stopwords. Returns a set of default German-stopwords a set of default German-stopwords @deprecated in 3.1, remove in Lucene 5.0 (index bw compat) Contains words that should be indexed but not stemmed. Builds an analyzer with the default stop words: . Builds an analyzer with the given stop words lucene compatibility version a stopword set Builds an analyzer with the given stop words lucene compatibility version a stopword set a stemming exclusion set Creates used to tokenize all the text in the provided . built from a filtered with , , , if a stem exclusion set is provided, and A that applies to stem German words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_delgtstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.GermanLightStemFilterFactory"/> </analyzer> </fieldType> Creates a new Light Stemmer for German. This stemmer implements the "UniNE" algorithm in: Light Stemming Approaches for the French, Portuguese, German and Hungarian Languages Jacques Savoy A that applies to stem German words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_deminstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.GermanMinimalStemFilterFactory"/> </analyzer> </fieldType> Creates a new Minimal Stemmer for German. This stemmer implements the following algorithm: Morphologie et recherche d'information Jacques Savoy. Normalizes German characters according to the heuristics of the http://snowball.tartarus.org/algorithms/german2/stemmer.html German2 snowball algorithm. It allows for the fact that ä, ö and ü are sometimes written as ae, oe and ue. 'ß' is replaced by 'ss' 'ä', 'ö', 'ü' are replaced by 'a', 'o', 'u', respectively. 'ae' and 'oe' are replaced by 'a', and 'o', respectively. 'ue' is replaced by 'u', when not following a vowel or q. This is useful if you want this normalization without using the German2 stemmer, or perhaps no stemming at all. Factory for . 
<fieldType name="text_denorm" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.GermanNormalizationFilterFactory"/> </analyzer> </fieldType> Creates a new A that stems German words. It supports a table of words that should not be stemmed at all. The stemmer used can be changed at runtime after the filter object is created (as long as it is a ). To prevent terms from being stemmed use an instance of or a custom that sets the before this . The actual token in the input stream. Creates a instance the source Returns true for next token in the stream, or false at EOS Set a alternative/custom for this filter. Factory for . <fieldType name="text_destem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.GermanStemFilterFactory"/> </analyzer> </fieldType> Creates a new A stemmer for German words. The algorithm is based on the report "A Fast and Simple Stemming Algorithm for German Words" by Jörg Caumanns (joerg.caumanns at isst.fhg.de). Buffer for the terms while stemming them. Amount of characters that are removed with while stemming. Stemms the given term to an unique discriminator. The term that should be stemmed. Discriminator for Checks if a term could be stemmed. true if, and only if, the given term consists in letters. suffix stripping (stemming) on the current term. The stripping is reduced to the seven "base" suffixes "e", "s", "n", "t", "em", "er" and * "nd", from which all regular suffixes are build of. The simplification causes some overstemming, and way more irregular stems, but still provides unique. discriminators in the most of those cases. The algorithm is context free, except of the length restrictions. Does some optimizations on the term. This optimisations are contextual. Removes a particle denotion ("ge") from a term. Do some substitutions for the term to reduce overstemming: Substitute Umlauts with their corresponding vowel: äöü -> aou, "ß" is substituted by "ss" Substitute a second char of a pair of equal characters with an asterisk: ?? -> ?* Substitute some common character combinations with a token: sch/ch/ei/ie/ig/st -> $/§/%/&/#/! Undoes the changes made by . That are character pairs and character combinations. Umlauts will remain as their corresponding vowel, as "ß" remains as "ss". for the Greek language. Supports an external list of stopwords (words that will not be indexed at all). A default set of stopwords is used unless an alternative list is specified. You must specify the required compatibility when creating : As of 3.1, StandardFilter and GreekStemmer are used by default. As of 2.9, StopFilter preserves position increments NOTE: This class uses the same dependent settings as . File containing default Greek stopwords. Returns a set of default Greek-stopwords a set of default Greek-stopwords Builds an analyzer with the default stop words. Lucene compatibility version, See Builds an analyzer with the given stop words. NOTE: The stopwords set should be pre-processed with the logic of for best results. Lucene compatibility version, See a stopword set Creates used to tokenize all the text in the provided . built from a filtered with , , , and Normalizes token text to lower case, removes some Greek diacritics, and standardizes final sigma to sigma. 
You must specify the required compatibility when creating : As of 3.1, supplementary characters are properly lowercased. Create a that normalizes Greek token text. Lucene compatibility version, See to filter Factory for . <fieldType name="text_glc" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.GreekLowerCaseFilterFactory"/> </analyzer> </fieldType> Creates a new A that applies to stem Greek words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . NOTE: Input is expected to be casefolded for Greek (including folding of final sigma to sigma), and with diacritics removed. This can be achieved by using either or ICUFoldingFilter before . @lucene.experimental Factory for . <fieldType name="text_gstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.GreekLowerCaseFilterFactory"/> <filter class="solr.GreekStemFilterFactory"/> </analyzer> </fieldType> Creates a new A stemmer for Greek words, according to: Development of a Stemmer for the Greek Language. Georgios Ntais NOTE: Input is expected to be casefolded for Greek (including folding of final sigma to sigma), and with diacritics removed. This can be achieved with either or ICUFoldingFilter. @lucene.experimental Stems a word contained in a leading portion of a array. The word is passed through a number of rules that modify it's length. A array that contains the word to be stemmed. The length of the array. The new length of the stemmed word. Checks if the word contained in the leading portion of char[] array , ends with the suffix given as parameter. A char[] array that represents a word. The length of the char[] array. A object to check if the word given ends with these characters. True if the word ends with the suffix given , false otherwise. Checks if the word contained in the leading portion of array , ends with a Greek vowel. A array that represents a word. The length of the array. True if the word contained in the leading portion of array , ends with a vowel , false otherwise. Checks if the word contained in the leading portion of array , ends with a Greek vowel. A array that represents a word. The length of the array. True if the word contained in the leading portion of array , ends with a vowel , false otherwise. for English. Returns an unmodifiable instance of the default stop words set. default stop words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . lucene compatibility version Builds an analyzer with the given stop words. lucene compatibility version a stopword set Builds an analyzer with the given stop words. If a non-empty stem exclusion set is provided this analyzer will add a before stemming. lucene compatibility version a stopword set a set of terms not to be stemmed Creates a which tokenizes all the text in the provided . A built from an filtered with , , , , if a stem exclusion set is provided and . A that applies to stem English words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . 
<fieldType name="text_enminstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.EnglishMinimalStemFilterFactory"/> </analyzer> </fieldType> Creates a new Minimal plural stemmer for English. This stemmer implements the "S-Stemmer" from How Effective Is Suffixing? Donna Harman. TokenFilter that removes possessives (trailing 's) from words. You must specify the required compatibility when creating : As of 3.6, U+2019 RIGHT SINGLE QUOTATION MARK and U+FF07 FULLWIDTH APOSTROPHE are also treated as quotation marks. @deprecated Use instead. Factory for . <fieldType name="text_enpossessive" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.EnglishPossessiveFilterFactory"/> </analyzer> </fieldType> Creates a new A list of words used by Kstem A list of words used by Kstem A list of words used by Kstem A list of words used by Kstem A list of words used by Kstem A list of words used by Kstem A list of words used by Kstem A list of words used by Kstem A high-performance kstem filter for english. See "Viewing Morphology as an Inference Process" (Krovetz, R., Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 191-203, 1993). All terms must already be lowercased for this filter to work correctly. Note: This filter is aware of the . To prevent certain terms from being passed to the stemmer should be set to true in a previous . Note: For including the original term as well as the stemmed version, see Returns the next, stemmed, input Token. The stemmed form of a token. If there is a low-level I/O error. Factory for . <fieldType name="text_kstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.KStemFilterFactory"/> </analyzer> </fieldType> Creates a new This class implements the Kstem algorithm Title: Kstemmer Description: This is a java version of Bob Krovetz' kstem stemmer Copyright: Copyright 2008, Luicid Imagination, Inc. Copyright: Copyright 2003, CIIR University of Massachusetts Amherst (http://ciir.cs.umass.edu) INDEX of final letter in word. You must add 1 to k to get the current length of word. When you want the length of word, use the method wordLength, which returns (k+1). length of stem within word Convert plurals to singular form, and '-ies' to 'y' replace old suffix with s convert past tense (-ed) to present, and `-ied' to `y' return TRUE if word ends with a double consonant handle `-ing' endings this routine deals with -ity endings. It accepts -ability, -ibility, and -ality, even without checking the dictionary because they are so productive. The first two are mapped to -ble, and the -ity is remove for the latter handle -ence and -ance handle -ness handle -ism this routine deals with -ment endings. this routine deals with -ize endings. handle -ency and -ancy handle -able and -ible handle -ic endings. This is fairly straightforward, but this is also the only place we try *expanding* an ending, -ic -> -ical. This is to handle cases like `canonic' -> `canonical' this routine deals with -ion, -ition, -ation, -ization, and -ication. The -ization ending is always converted to -ize this routine deals with -er, -or, -ier, and -eer. 
The -izer ending is always converted to -ize this routine deals with -ly endings. The -ally ending is always converted to -al Sometimes this will temporarily leave us with a non-word (e.g., heuristically maps to heuristical), but then the -al is removed in the next step. this routine deals with -al endings. Some of the endings from the previous routine are finished up here. this routine deals with -ive endings. It normalizes some of the -ative endings directly, and also maps some -ive endings to -ion. Returns the result of the stem (assuming the word was changed) as a . Stems the text in the token. Returns true if changed. Transforms the token stream as per the Porter stemming algorithm. Note: the input to the stemming filter must already be in lower case, so you will need to use LowerCaseFilter or LowerCaseTokenizer farther down the Tokenizer chain in order for this to work properly! To use this filter with other analyzers, you'll want to write an Analyzer class that sets up the TokenStream chain as you want it. To use this with LowerCaseTokenizer, for example, you'd write an analyzer like this: class MyAnalyzer : Analyzer { protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader) { Tokenizer source = new LowerCaseTokenizer(version, reader); return new TokenStreamComponents(source, new PorterStemFilter(source)); } } Note: This filter is aware of the . To prevent certain terms from being passed to the stemmer should be set to true in a previous . Note: For including the original term as well as the stemmed version, see Factory for . <fieldType name="text_porterstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.PorterStemFilterFactory"/> </analyzer> </fieldType> Creates a new Stemmer, implementing the Porter Stemming Algorithm The Stemmer class transforms a word into its root form. The input word can be provided a character at time (by calling ), or at once by calling one of the various Stem methods, such as . resets the stemmer so it can stem another word. If you invoke the stemmer by calling and then , you must call before starting another word. Add a character to the word being stemmed. When you are finished adding characters, you can call to process the word. After a word has been stemmed, it can be retrieved by , or a reference to the internal buffer can be retrieved by and (which is generally more efficient.) Returns the length of the word resulting from the stemming process. Returns a reference to a character buffer containing the results of the stemming process. You also need to consult to determine the length of the result. Stem a word provided as a . Returns the result as a . Stem a word contained in a . Returns true if the stemming process resulted in a word different from the input. You can retrieve the result with / or . Stem a word contained in a portion of a array. Returns true if the stemming process resulted in a word different from the input. You can retrieve the result with / or . Stem a word contained in a leading portion of a array. Returns true if the stemming process resulted in a word different from the input. You can retrieve the result with / or . Stem the word placed into the Stemmer buffer through calls to . Returns true if the stemming process resulted in a word different from the input. You can retrieve the result with / or . for Spanish. 
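Returning to the earlier note that terms can be protected from the stemmer by marking them as keywords, a minimal sketch of such a chain follows. It assumes Lucene.NET 4.8 namespaces (the keyword-marker filter is taken to be SetKeywordMarkerFilter in Lucene.Net.Analysis.Miscellaneous); the protected term "lucene" and the class name are purely illustrative.

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.En;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

sealed class ProtectedPorterAnalyzer : Analyzer
{
    // Terms in this set keep their surface form; everything else is Porter-stemmed.
    private static readonly CharArraySet NoStem = CreateNoStemSet();

    private static CharArraySet CreateNoStemSet()
    {
        var set = new CharArraySet(LuceneVersion.LUCENE_48, 1, false);
        set.Add("lucene");   // illustrative entry only
        return set;
    }

    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        Tokenizer source = new LowerCaseTokenizer(LuceneVersion.LUCENE_48, reader);
        TokenStream result = new SetKeywordMarkerFilter(source, NoStem);   // marks "lucene" as a keyword
        result = new PorterStemFilter(result);                             // keyword-marked tokens pass through unstemmed
        return new TokenStreamComponents(source, result);
    }
}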
You must specify the required compatibility when creating : As of 3.6, is used for less aggressive stemming. File containing default Spanish stopwords. Returns an unmodifiable instance of the default stop words set. default stop words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . Builds an analyzer with the given stop words. lucene compatibility version a stopword set Builds an analyzer with the given stop words. If a non-empty stem exclusion set is provided this analyzer will add a before stemming. lucene compatibility version a stopword set a set of terms not to be stemmed Creates a which tokenizes all the text in the provided . A built from an filtered with , , , if a stem exclusion set is provided and . A that applies to stem Spanish words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_eslgtstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.SpanishLightStemFilterFactory"/> </analyzer> </fieldType> Creates a new Light Stemmer for Spanish This stemmer implements the algorithm described in: Report on CLEF-2001 Experiments Jacques Savoy for Basque. File containing default Basque stopwords. Returns an unmodifiable instance of the default stop words set. default stop words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . Builds an analyzer with the given stop words. lucene compatibility version a stopword set Builds an analyzer with the given stop words. If a non-empty stem exclusion set is provided this analyzer will add a before stemming. lucene compatibility version a stopword set a set of terms not to be stemmed Creates a which tokenizes all the text in the provided . A built from an filtered with , , , if a stem exclusion set is provided and . for Persian. This Analyzer uses which implies tokenizing around zero-width non-joiner in addition to whitespace. Some persian-specific variant forms (such as farsi yeh and keheh) are standardized. "Stemming" is accomplished via stopwords. File containing default Persian stopwords. Default stopword list is from http://members.unine.ch/jacques.savoy/clef/index.html. The stopword list is BSD-Licensed. The comment character in the stopwords file. All lines prefixed with this will be ignored Returns an unmodifiable instance of the default stop-words set. an unmodifiable instance of the default stop-words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . Builds an analyzer with the given stop words lucene compatibility version a stopword set Creates used to tokenize all the text in the provided . built from a filtered with , , and Persian Stop words Wraps the with that replaces instances of Zero-width non-joiner with an ordinary space. Factory for . <fieldType name="text_fa" class="solr.TextField" positionIncrementGap="100"> <analyzer> <charFilter class="solr.PersianCharFilterFactory"/> <tokenizer class="solr.StandardTokenizerFactory"/> </analyzer> </fieldType> Creates a new A that applies to normalize the orthography. Factory for . 
<fieldType name="text_fanormal" class="solr.TextField" positionIncrementGap="100"> <analyzer> <charFilter class="solr.PersianCharFilterFactory"/> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.PersianNormalizationFilterFactory"/> </analyzer> </fieldType> Creates a new Normalizer for Persian. Normalization is done in-place for efficiency, operating on a termbuffer. Normalization is defined as: Normalization of various heh + hamza forms and heh goal to heh. Normalization of farsi yeh and yeh barree to arabic yeh Normalization of persian keheh to arabic kaf Normalize an input buffer of Persian text input buffer length of input buffer length of input buffer after normalization A that applies to stem Arabic words.. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_arstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.PersianNormalizationFilterFactory"/> <filter class="solr.PersianStemFilterFactory"/> </analyzer> </fieldType> Creates a new Stemmer for Persian. Stemming is done in-place for efficiency, operating on a termbuffer. Stemming is defined as: Removal of attached definite article, conjunction, and prepositions. Stemming of common suffixes. Stem an input buffer of Persian text. input buffer length of input buffer length of input buffer after normalization Stem suffix(es) off an Persian word. input buffer length of input buffer new length of input buffer after stemming Returns true if the suffix matches and can be stemmed input buffer length of input buffer suffix to check true if the suffix matches and can be stemmed for Finnish. File containing default Italian stopwords. Returns an unmodifiable instance of the default stop words set. default stop words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . Builds an analyzer with the given stop words. lucene compatibility version a stopword set Builds an analyzer with the given stop words. If a non-empty stem exclusion set is provided this analyzer will add a before stemming. lucene compatibility version a stopword set a set of terms not to be stemmed Creates a which tokenizes all the text in the provided . A built from an filtered with , , , if a stem exclusion set is provided and . A that applies to stem Finnish words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_filgtstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.FinnishLightStemFilterFactory"/> </analyzer> </fieldType> Creates a new Light Stemmer for Finnish. This stemmer implements the algorithm described in: Report on CLEF-2003 Monolingual Tracks Jacques Savoy for French language. Supports an external list of stopwords (words that will not be indexed at all) and an external list of exclusions (word that will not be stemmed, but indexed). A default set of stopwords is used unless an alternative list is specified, but the exclusion list is empty by default. You must specify the required compatibility when creating FrenchAnalyzer: As of 3.6, is used for less aggressive stemming. 
As of 3.1, Snowball stemming is done with , is used prior to , and and Snowball stopwords are used by default. As of 2.9, preserves position increments NOTE: This class uses the same dependent settings as . Extended list of typical French stopwords. @deprecated (3.1) remove in Lucene 5.0 (index bw compat) File containing default French stopwords. Default set of articles for Contains words that should be indexed but not stemmed. Returns an unmodifiable instance of the default stop-words set. an unmodifiable instance of the default stop-words set. @deprecated (3.1) remove this in Lucene 5.0, index bw compat Builds an analyzer with the default stop words (). Builds an analyzer with the given stop words lucene compatibility version a stopword set Builds an analyzer with the given stop words lucene compatibility version a stopword set a stemming exclusion set Creates used to tokenize all the text in the provided . built from a filtered with , , , , if a stem exclusion set is provided, and A that applies to stem French words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_frlgtstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.ElisionFilterFactory"/> <filter class="solr.FrenchLightStemFilterFactory"/> </analyzer> </fieldType> Creates a new Light Stemmer for French. This stemmer implements the "UniNE" algorithm in: Light Stemming Approaches for the French, Portuguese, German and Hungarian Languages Jacques Savoy A that applies to stem French words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_frminstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.ElisionFilterFactory"/> <filter class="solr.FrenchMinimalStemFilterFactory"/> </analyzer> </fieldType> Creates a new Light Stemmer for French. This stemmer implements the following algorithm: A Stemming procedure and stopword list for general French corpora. Jacques Savoy. A that stems french words. The used stemmer can be changed at runtime after the filter object is created (as long as it is a ). To prevent terms from being stemmed use an instance of or a custom that sets the before this . @deprecated (3.1) Use with instead, which has the same functionality. This filter will be removed in Lucene 5.0 The actual token in the input stream. Returns true for the next token in the stream, or false at EOS Set a alternative/custom for this filter. A stemmer for French words. The algorithm is based on the work of Dr Martin Porter on his snowball project refer to http://snowball.sourceforge.net/french/stemmer.html (French stemming algorithm) for details @deprecated Use instead, which has the same functionality. This filter will be removed in Lucene 4.0 Buffer for the terms while stemming them. A temporary buffer, used to reconstruct R2 Region R0 is equal to the whole buffer Region RV "If the word begins with two vowels, RV is the region after the third letter, otherwise the region after the first vowel not at the beginning of the word, or the end of the word if these positions cannot be found." 
Region R1 "R1 is the region after the first non-vowel following a vowel or is the null region at the end of the word if there is no such non-vowel" Region R2 "R2 is the region after the first non-vowel in R1 following a vowel or is the null region at the end of the word if there is no such non-vowel" Set to true if we need to perform step 2 Set to true if the buffer was modified Stems the given term to a unique discriminator. The term that should be stemmed Discriminator for Sets the search region strings it needs to be done each time the buffer was modified First step of the Porter Algorithm refer to http://snowball.sourceforge.net/french/stemmer.html for an explanation Second step (A) of the Porter Algorithm Will be performed if nothing changed from the first step or changed were done in the amment, emment, ments or ment suffixes refer to http://snowball.sourceforge.net/french/stemmer.html for an explanation true if something changed in the Second step (B) of the Porter Algorithm Will be performed if step 2 A was performed unsuccessfully refer to http://snowball.sourceforge.net/french/stemmer.html for an explanation Third step of the Porter Algorithm refer to http://snowball.sourceforge.net/french/stemmer.html for an explanation Fourth step of the Porter Algorithm refer to http://snowball.sourceforge.net/french/stemmer.html for an explanation Fifth step of the Porter Algorithm refer to http://snowball.sourceforge.net/french/stemmer.html for an explanation Sixth (and last!) step of the Porter Algorithm refer to http://snowball.sourceforge.net/french/stemmer.html for an explanation Delete a suffix searched in zone "source" if zone "from" contains prefix + search string the primary source zone for search the strings to search for suppression the secondary source zone for search the prefix to add to the search string to test true if modified Delete a suffix searched in zone "source" if the preceding letter is (or isn't) a vowel the primary source zone for search the strings to search for suppression true if we need a vowel before the search string the secondary source zone for search (where vowel could be) true if modified Delete a suffix searched in zone "source" if preceded by the prefix the primary source zone for search the strings to search for suppression the prefix to add to the search string to test true if it will be deleted even without prefix found Delete a suffix searched in zone "source" if preceded by prefix or replace it with the replace string if preceded by the prefix in the zone "from" or delete the suffix if specified the primary source zone for search the strings to search for suppression the prefix to add to the search string to test true if it will be deleted even without prefix found the secondary source zone for search the replacement string Replace a search string with another within the source zone the source zone for search the strings to search for replacement the replacement string Delete a search string within the source zone the source zone for search the strings to search for suppression Test if a char is a french vowel, including accentuated ones the char to test true if the char is a vowel Retrieve the "R zone" (1 or 2 depending on the buffer) and return the corresponding string "R is the region after the first non-vowel following a vowel or is the null region at the end of the word if there is no such non-vowel" the in buffer the resulting string Retrieve the "RV zone" from a buffer an return the corresponding string "If the word begins with two vowels, RV 
is the region after the third letter, otherwise the region after the first vowel not at the beginning of the word, or the end of the word if these positions cannot be found." the in buffer the resulting string Turns u and i preceded AND followed by a vowel to UpperCase Turns y preceded OR followed by a vowel to UpperCase Turns u preceded by q to UpperCase the buffer to treat the treated buffer Checks a term if it can be processed correctly. true if, and only if, the given term consists in letters. for Irish. File containing default Irish stopwords. When StandardTokenizer splits t‑athair into {t, athair}, we don't want to cause a position increment, otherwise there will be problems with phrase queries versus tAthair (which would not have a gap). Returns an unmodifiable instance of the default stop words set. default stop words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . Builds an analyzer with the given stop words. lucene compatibility version a stopword set Builds an analyzer with the given stop words. If a non-empty stem exclusion set is provided this analyzer will add a before stemming. lucene compatibility version a stopword set a set of terms not to be stemmed Creates a which tokenizes all the text in the provided . A built from an filtered with , , , if a stem exclusion set is provided and . Normalises token text to lower case, handling t-prothesis and n-eclipsis (i.e., that 'nAthair' should become 'n-athair') Create an that normalises Irish token text. Factory for . <fieldType name="text_ga" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.IrishLowerCaseFilterFactory"/> </analyzer> </fieldType> Creates a new for Galician. File containing default Galician stopwords. Returns an unmodifiable instance of the default stop words set. default stop words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . Builds an analyzer with the given stop words. lucene compatibility version a stopword set Builds an analyzer with the given stop words. If a non-empty stem exclusion set is provided this analyzer will add a before stemming. lucene compatibility version a stopword set a set of terms not to be stemmed Creates a which tokenizes all the text in the provided . A built from an filtered with , , , if a stem exclusion set is provided and . A that applies to stem Galician words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_glplural" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.GalicianMinimalStemFilterFactory"/> </analyzer> </fieldType> Creates a new Minimal Stemmer for Galician This follows the "RSLP-S" algorithm, but modified for Galician. Hence this stemmer only applies the plural reduction step of: "Regras do lematizador para o galego" A that applies to stem Galician words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . 
<fieldType name="text_glstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.GalicianStemFilterFactory"/> </analyzer> </fieldType> Creates a new Galician stemmer implementing "Regras do lematizador para o galego". Description of rules buffer, oversized to at least len+1 initial valid length of buffer new valid length, stemmed Analyzer for Hindi. You must specify the required compatibility when creating HindiAnalyzer: As of 3.6, StandardTokenizer is used for tokenization File containing default Hindi stopwords. Default stopword list is from http://members.unine.ch/jacques.savoy/clef/index.html The stopword list is BSD-Licensed. Returns an unmodifiable instance of the default stop-words set. an unmodifiable instance of the default stop-words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the given stop words lucene compatibility version a stopword set a stemming exclusion set Builds an analyzer with the given stop words lucene compatibility version a stopword set Builds an analyzer with the default stop words: . Creates used to tokenize all the text in the provided . built from a filtered with , , , if a stem exclusion set is provided, , and Hindi Stop words A that applies to normalize the orthography. In some cases the normalization may cause unrelated terms to conflate, so to prevent terms from being normalized use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_hinormal" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.HindiNormalizationFilterFactory"/> </analyzer> </fieldType> Creates a new Normalizer for Hindi. Normalizes text to remove some differences in spelling variations. Implements the Hindi-language specific algorithm specified in: Word normalization in Indian languages Prasad Pingali and Vasudeva Varma. http://web2py.iiit.ac.in/publications/default/download/inproceedings.pdf.3fe5b38c-02ee-41ce-9a8f-3e745670be32.pdf with the following additions from Hindi CLIR in Thirty Days Leah S. Larkey, Margaret E. Connell, and Nasreen AbdulJaleel. http://maroo.cs.umass.edu/pub/web/getpdf.php?id=454: Internal Zero-width joiner and Zero-width non-joiners are removed In addition to chandrabindu, NA+halant is normalized to anusvara Normalize an input buffer of Hindi text input buffer length of input buffer length of input buffer after normalization A that applies to stem Hindi words. Factory for . <fieldType name="text_histem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.HindiStemFilterFactory"/> </analyzer> </fieldType> Creates a new Light Stemmer for Hindi. Implements the algorithm specified in: A Lightweight Stemmer for Hindi Ananthakrishnan Ramanathan and Durgesh D Rao. http://computing.open.ac.uk/Sites/EACLSouthAsia/Papers/p6-Ramanathan.pdf In-memory structure for the dictionary (.dic) and affix (.aff) data of a hunspell dictionary. Creates a new containing the information read from the provided s to hunspell affix and dictionary files. You have to dispose the provided s yourself. for reading the hunspell affix file (won't be disposed). for reading the hunspell dictionary file (won't be disposed). 
Can be thrown while reading from the s Can be thrown if the content of the files does not meet expected formats Creates a new containing the information read from the provided s to hunspell affix and dictionary files. You have to dispose the provided s yourself. for reading the hunspell affix file (won't be disposed). for reading the hunspell dictionary files (won't be disposed). ignore case? Can be thrown while reading from the s Can be thrown if the content of the files does not meet expected formats Looks up HunspellAffix suffixes that have an append that matches the created from the given array, offset and length array to generate the from Offset in the char array that the starts at Length from the offset that the is List of HunspellAffix suffixes with an append that matches the , or null if none are found Reads the affix file through the provided , building up the prefix and suffix maps to read the content of the affix file from to decode the content of the file Can be thrown while reading from the InputStream Parses a specific affix rule putting the result into the provided affix map where the result of the parsing will be put Header line of the affix rule to read the content of the rule from pattern to be used to generate the condition regex pattern map from condition -> index of patterns, for deduplication. Can be thrown while reading the rule pattern accepts optional BOM + SET + any whitespace Parses the encoding specified in the affix file readable through the provided for reading the affix file Encoding specified in the affix file Can be thrown while reading from the Thrown if the first non-empty non-comment line read from the file does not adhere to the format SET <encoding> Retrieves the for the given encoding. Note, This isn't perfect as I think ISCII-DEVANAGARI and MICROSOFT-CP1251 etc are allowed... Encoding to retrieve the instance for for the given encoding Determines the appropriate based on the FLAG definition line taken from the affix file Line containing the flag information that handles parsing flags in the way specified in the FLAG definition Reads the dictionary file through the provided s, building up the words map s to read the dictionary file through used to decode the contents of the file Can be thrown while reading from the file Abstraction of the process of parsing flags taken from the affix and dic files Parses the given into a single flag to parse into a flag Parsed flag Parses the given into multiple flags to parse into flags Parsed flags Simple implementation of that treats the chars in each as a individual flags. Can be used with both the ASCII and UTF-8 flag types. Implementation of that assumes each flag is encoded in its numerical form. In the case of multiple flags, each number is separated by a comma. Implementation of that assumes each flag is encoded as two ASCII characters whose codes must be combined into a single character. that uses hunspell affix rules and words to stem tokens. Since hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token Note: This filter is aware of the . To prevent certain terms from being passed to the stemmer should be set to true in a previous . Note: For including the original term as well as the stemmed version, see @lucene.experimental Create a outputting all possible stems. Create a outputting all possible stems. 
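To show how the hunspell pieces above are typically wired together, here is a minimal sketch that loads a dictionary from affix/word-list streams and applies the stem filter in an analyzer chain. It is a sketch under assumptions: Lucene.NET 4.8 namespaces (Lucene.Net.Analysis.Hunspell), a Dictionary constructor taking the two streams, and placeholder file names; the wrapper class itself is not part of the library.

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Hunspell;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

sealed class HunspellStemmingAnalyzer : Analyzer
{
    private readonly Dictionary dictionary;

    public HunspellStemmingAnalyzer(Dictionary dictionary)
    {
        this.dictionary = dictionary;
    }

    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        Tokenizer source = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
        TokenStream result = new LowerCaseFilter(LuceneVersion.LUCENE_48, source);
        // A single token may still yield several stems; duplicate stems are removed by default.
        result = new HunspellStemFilter(result, dictionary);
        return new TokenStreamComponents(source, result);
    }
}

// The dictionary is parsed once from the affix and word-list streams; per the remarks above,
// the caller disposes the streams itself (placeholder file names shown):
// Dictionary dict;
// using (Stream aff = File.OpenRead("en_GB.aff"))
// using (Stream dic = File.OpenRead("en_GB.dic"))
//     dict = new Dictionary(aff, dic);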
Creates a new HunspellStemFilter that will stem tokens from the given using affix rules in the provided Dictionary whose tokens will be stemmed Hunspell containing the affix rules and words that will be used to stem the tokens remove duplicates true if only the longest term should be output. that creates instances of . Example config for British English: <filter class="solr.HunspellStemFilterFactory" dictionary="en_GB.dic,my_custom.dic" affix="en_GB.aff" ignoreCase="false" longestOnly="false" /> Both parameters dictionary and affix are mandatory. Dictionaries for many languages are available through the OpenOffice project. See http://wiki.apache.org/solr/Hunspell @lucene.experimental Creates a new Stemmer uses the affix rules declared in the to generate one or more stems for a word. It conforms to the algorithm in the original hunspell algorithm, including recursive suffix stripping. Constructs a new Stemmer which will use the provided to create its stems. that will be used to create the stems Find the stem(s) of the provided word. Word to find the stems for of stems for the word Find the stem(s) of the provided word Word to find the stems for length of stems for the word Find the unique stem(s) of the provided word Word to find the stems for length of stems for the word Generates a list of stems for the provided word Word to generate the stems for length previous affix that was removed (so we dont remove same one twice) Flag from a previous stemming step that need to be cross-checked with any affixes in this recursive step flag of the most inner removed prefix, so that when removing a suffix, its also checked against the word current recursiondepth true if we should remove prefixes true if we should remove suffixes true if the previous removal was a prefix: if we are removing a suffix, and it has no continuation requirements, its ok. but two prefixes (COMPLEXPREFIXES) or two suffixes must have continuation requirements to recurse. true if the previous prefix removal was signed as a circumfix this means inner most suffix must also contain circumfix flag. true if we are searching for a case variant. if the word has KEEPCASE flag it cannot succeed. of stems, or empty list if no stems are found checks condition of the concatenation of two strings Applies the affix rule to the given word, producing a list of stems if any are found Word the affix has been removed and the strip added valid length of stripped word HunspellAffix representing the affix rule itself when we already stripped a prefix, we cant simply recurse and check the suffix, unless both are compatible so we must check dictionary form against both to add it as a stem! current recursion depth true if we are removing a prefix (false if its a suffix) true if the previous prefix removal was signed as a circumfix this means inner most suffix must also contain circumfix flag. true if we are searching for a case variant. if the word has KEEPCASE flag it cannot succeed. of stems for the word, or an empty list if none are found Checks if the given flag cross checks with the given array of flags Flag to cross check with the array of flags Array of flags to cross check against. Can be null If true, will match a zero length flags array. true if the flag is found in the array or the array is null, false otherwise for Hungarian. File containing default Hungarian stopwords. Returns an unmodifiable instance of the default stop words set. default stop words set. 
Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . lucene compatibility version Builds an analyzer with the given stop words. lucene compatibility version a stopword set Builds an analyzer with the given stop words. If a non-empty stem exclusion set is provided this analyzer will add a before stemming. lucene compatibility version a stopword set a set of terms not to be stemmed Creates a which tokenizes all the text in the provided . A built from an filtered with , , , if a stem exclusion set is provided and . A that applies to stem Hungarian words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_hulgtstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.HungarianLightStemFilterFactory"/> </analyzer> </fieldType> Creates a new Light Stemmer for Hungarian. This stemmer implements the "UniNE" algorithm in: Light Stemming Approaches for the French, Portuguese, German and Hungarian Languages Jacques Savoy for Armenian. File containing default Armenian stopwords. Returns an unmodifiable instance of the default stop words set. default stop words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . lucene compatibility version Builds an analyzer with the given stop words. lucene compatibility version a stopword set Builds an analyzer with the given stop words. If a non-empty stem exclusion set is provided this analyzer will add a before stemming. lucene compatibility version a stopword set a set of terms not to be stemmed Creates a which tokenizes all the text in the provided . A built from an filtered with , , , if a stem exclusion set is provided and . for Indonesian (Bahasa) File containing default Indonesian stopwords. Returns an unmodifiable instance of the default stop-words set. an unmodifiable instance of the default stop-words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . Builds an analyzer with the given stop words lucene compatibility version a stopword set Builds an analyzer with the given stop word. If a none-empty stem exclusion set is provided this analyzer will add a before . lucene compatibility version a stopword set a set of terms not to be stemmed Creates used to tokenize all the text in the provided . built from an filtered with , , , if a stem exclusion set is provided and . A that applies to stem Indonesian words. Calls IndonesianStemFilter(input, true) Create a new . If is false, only inflectional suffixes (particles and possessive pronouns) are stemmed. Factory for . <fieldType name="text_idstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.IndonesianStemFilterFactory" stemDerivational="true"/> </analyzer> </fieldType> Creates a new Stemmer for Indonesian. Stems Indonesian words with the algorithm presented in: A Study of Stemming Effects on Information Retrieval in Bahasa Indonesia, Fadillah Z Tala. 
http://www.illc.uva.nl/Publications/ResearchReports/MoL-2003-02.text.pdf Stem a term (returning its new length). Use to control whether full stemming or only light inflectional stemming is done. A that applies to normalize text in Indian Languages. Factory for . <fieldType name="text_innormal" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.IndicNormalizationFilterFactory"/> </analyzer> </fieldType> Creates a new Normalizes the Unicode representation of text in Indian languages. Follows guidelines from Unicode 5.2, chapter 6, South Asian Scripts I and graphical decompositions from http://ldc.upenn.edu/myl/IndianScriptsUnicode.html Decompositions according to Unicode 5.2, and http://ldc.upenn.edu/myl/IndianScriptsUnicode.html Most of these are not handled by unicode normalization anyway. The numbers here represent offsets into the respective codepages, with -1 representing null and 0xFF representing zero-width joiner. the columns are: ch1, ch2, ch3, res, flags ch1, ch2, and ch3 are the decomposition res is the composition, and flags are the scripts to which it applies. Normalizes input text, and returns the new length. The length will always be less than or equal to the existing length. input text valid length normalized length Compose into standard form any compositions in the decompositions table. LUCENENET: Returns the unicode block for the specified character. Caches the last script and script data used on the current thread to optimize performance when not switching between scripts. Simple Tokenizer for text in Indian Languages. @deprecated (3.6) Use instead. for Italian. You must specify the required compatibility when creating : As of 3.6, is used for less aggressive stemming. As of 3.2, with a set of Italian contractions is used by default. File containing default Italian stopwords. Returns an unmodifiable instance of the default stop words set. default stop words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . lucene compatibility version Builds an analyzer with the given stop words. lucene compatibility version a stopword set Builds an analyzer with the given stop words. If a non-empty stem exclusion set is provided this analyzer will add a before stemming. lucene compatibility version a stopword set a set of terms not to be stemmed Creates a which tokenizes all the text in the provided . A built from an filtered with , , , , if a stem exclusion set is provided and . A that applies to stem Italian words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_itlgtstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.ItalianLightStemFilterFactory"/> </analyzer> </fieldType> Creates a new Light Stemmer for Italian. This stemmer implements the algorithm described in: Report on CLEF-2001 Experiments Jacques Savoy for Latvian. File containing default Latvian stopwords. Returns an unmodifiable instance of the default stop words set. default stop words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . lucene compatibility version Builds an analyzer with the given stop words. 
lucene compatibility version a stopword set Builds an analyzer with the given stop words. If a non-empty stem exclusion set is provided this analyzer will add a before stemming. lucene compatibility version a stopword set a set of terms not to be stemmed Creates a which tokenizes all the text in the provided . A built from an filtered with , , , if a stem exclusion set is provided and . A that applies to stem Latvian words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_lvstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.LatvianStemFilterFactory"/> </analyzer> </fieldType> Creates a new Light stemmer for Latvian. This is a light version of the algorithm in Karlis Kreslin's PhD thesis A stemming algorithm for Latvian with the following modifications: Only explicitly stems noun and adjective morphology Stricter length/vowel checks for the resulting stems (verb etc suffix stripping is removed) Removes only the primary inflectional suffixes: case and number for nouns ; case, number, gender, and definitiveness for adjectives. Palatalization is only handled when a declension II,V,VI noun suffix is removed. Stem a latvian word. returns the new adjusted length. Most cases are handled except for the ambiguous ones: s -> š t -> š d -> ž z -> ž Count the vowels in the string, we always require at least one in the remaining stem to accept it. This class converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists. Characters from the following Unicode blocks are converted; however, only those characters with reasonable ASCII alternatives are converted: See: http://en.wikipedia.org/wiki/Latin_characters_in_Unicode For example, '&agrave;' will be replaced by 'a'. Create a new . TokenStream to filter should the original tokens be kept on the input stream with a 0 position increment from the folded tokens? Does the filter preserve the original tokens? Converts characters above ASCII to their ASCII equivalents. For example, accents are removed from accented characters. The string to fold The number of characters in the input string Converts characters above ASCII to their ASCII equivalents. For example, accents are removed from accented characters. @lucene.internal The characters to fold Index of the first character to fold The result of the folding. Should be of size >= length * 4. Index of output where to put the result of the folding The number of characters to fold length of output Factory for . <fieldType name="text_ascii" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/> </analyzer> </fieldType> Creates a new A filter to apply normal capitalization rules to Tokens. It will make the first letter capital and the rest lower case. This filter is particularly useful to build nice looking facet parameters. This filter is not appropriate if you intend to use a prefix query. Creates a with the default parameters using the invariant culture. Calls CapitalizationFilter(in, true, null, true, null, 0, DEFAULT_MAX_WORD_COUNT, DEFAULT_MAX_TOKEN_LENGTH, null) Creates a with the default parameters and the specified . 
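As a concrete illustration of the folding behaviour described above, a short sketch using the static folding routine documented here; it assumes the FoldToASCII overload keeps the signature described (input buffer, input offset, output buffer, output offset, length, returning the folded length) and that the Lucene.Net.Analysis.Miscellaneous namespace applies. The sample text and class name are illustrative only.

using System;
using Lucene.Net.Analysis.Miscellaneous;

static class AsciiFoldingDemo
{
    static void Main()
    {
        char[] input = "déjà vu".ToCharArray();
        // Per the remarks above, the output buffer should allow for up to 4x expansion.
        char[] output = new char[input.Length * 4];
        int folded = ASCIIFoldingFilter.FoldToASCII(input, 0, output, 0, input.Length);
        Console.WriteLine(new string(output, 0, folded));   // expected: deja vu
    }
}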
Calls CapitalizationFilter(in, true, null, true, null, 0, DEFAULT_MAX_WORD_COUNT, DEFAULT_MAX_TOKEN_LENGTH) input tokenstream The culture to use for the casing operation. If null, will be used. Creates a with the specified parameters using the invariant culture. input tokenstream should each word be capitalized or all of the words? a keep word list. Each word that should be kept separated by whitespace. Force the first letter to be capitalized even if it is in the keep list. do not change word capitalization if a word begins with something in this list. how long the word needs to be to get capitalization applied. If the minWordLength is 3, "and" > "And" but "or" stays "or". if the token contains more then maxWordCount words, the capitalization is assumed to be correct. The maximum length for an individual token. Tokens that exceed this length will not have the capitalization operation performed. Creates a with the specified parameters and the specified . input tokenstream should each word be capitalized or all of the words? a keep word list. Each word that should be kept separated by whitespace. Force the first letter to be capitalized even if it is in the keep list. do not change word capitalization if a word begins with something in this list. how long the word needs to be to get capitalization applied. If the minWordLength is 3, "and" > "And" but "or" stays "or". if the token contains more then maxWordCount words, the capitalization is assumed to be correct. The maximum length for an individual token. Tokens that exceed this length will not have the capitalization operation performed. The culture to use for the casing operation. If null, will be used. Factory for . The factory takes parameters: "onlyFirstWord" - should each word be capitalized or all of the words? "keep" - a keep word list. Each word that should be kept separated by whitespace. "keepIgnoreCase - true or false. If true, the keep list will be considered case-insensitive. "forceFirstLetter" - Force the first letter to be capitalized even if it is in the keep list "okPrefix" - do not change word capitalization if a word begins with something in this list. for example if "McK" is on the okPrefix list, the word "McKinley" should not be changed to "Mckinley" "minWordLength" - how long the word needs to be to get capitalization applied. If the minWordLength is 3, "and" > "And" but "or" stays "or" "maxWordCount" - if the token contains more then maxWordCount words, the capitalization is assumed to be correct. "culture" - the culture to use to apply the capitalization rules. If not supplied or the string "invariant" is supplied, the invariant culture is used. <fieldType name="text_cptlztn" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.CapitalizationFilterFactory" onlyFirstWord="true" keep="java solr lucene" keepIgnoreCase="false" okPrefix="McK McD McA"/> </analyzer> </fieldType> @since solr 1.3 Creates a new Removes words that are too long or too short from the stream. Note: Length is calculated as the number of Unicode codepoints. Create a new . This will filter out tokens whose is either too short ( < min) or too long ( > max). the Lucene match version the to consume the minimum length the maximum length Factory for . 
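Ahead of the factory example that follows, a minimal sketch of assumed usage for the CodepointCountFilter described above (the namespaces, the 'reader' TextReader, and the 2-20 bounds are illustrative assumptions):

    // Assumed namespaces: Lucene.Net.Analysis.Core, Lucene.Net.Analysis.Miscellaneous, Lucene.Net.Util.
    TokenStream ts = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, reader);
    // Keep only tokens whose length, counted in Unicode code points, is between 2 and 20 inclusive.
    ts = new CodepointCountFilter(LuceneVersion.LUCENE_48, ts, 2, 20);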
<fieldType name="text_lngth" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.CodepointCountFilterFactory" min="0" max="1" /> </analyzer> </fieldType> Creates a new An always exhausted token stream. When the plain text is extracted from documents, we will often have many words hyphenated and broken into two lines. This is often the case with documents where narrow text columns are used, such as newsletters. In order to increase search efficiency, this filter puts hyphenated words broken into two lines back together. This filter should be used on indexing time only. Example field definition in schema.xml: <fieldtype name="text" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/> <filter class="solr.StopFilterFactory" ignoreCase="true"/> <filter class="solr.HyphenatedWordsFilterFactory"/> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> <filter class="solr.StopFilterFactory" ignoreCase="true"/> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> </analyzer> </fieldtype> Creates a new that will be filtered Consumers (i.e., ) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate s with the attributes of the next token. The producer must make no assumptions about the attributes after the method has been returned: the caller may arbitrarily change it. If the producer needs to preserve the state for subsequent calls, it can use to create a copy of the current attribute state. this method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid calls to and , references to all s that this stream uses should be retrieved during instantiation. To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in . false for end of stream; true otherwise This method is called by a consumer before it begins consumption using . Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh. If you override this method, always call base.Reset(), otherwise some internal state will not be correctly reset (e.g., will throw on further usage). NOTE: The default implementation chains the call to the input , so be sure to call base.Reset() when overriding this method. Writes the joined unhyphenated term Factory for . 
<fieldType name="text_hyphn" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.HyphenatedWordsFilterFactory"/> </analyzer> </fieldType> Creates a new A that only keeps tokens with text contained in the required words. This filter behaves like the inverse of . @since solr 1.3 @deprecated enablePositionIncrements=false is not supported anymore as of Lucene 4.4. Create a new . NOTE: The words set passed to this constructor will be directly used by this filter and should not be modified. the Lucene match version the to consume the words to keep Factory for . <fieldType name="text_keepword" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="false"/> </analyzer> </fieldType> Creates a new Marks terms as keywords via the . Creates a new the input stream Factory for . <fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.KeywordMarkerFilterFactory" protected="protectedkeyword.txt" pattern="^.+er$" ignoreCase="false"/> </analyzer> </fieldType> Creates a new This TokenFilter emits each incoming token twice once as keyword and once non-keyword, in other words once with set to true and once set to false. This is useful if used with a stem filter that respects the to index the stemmed and the un-stemmed version of a term into the same field. Construct a token stream filtering the given input. Factory for . Since emits two tokens for every input token, and any tokens that aren't transformed later in the analysis chain will be in the document twice. Therefore, consider adding later in the analysis chain. Creates a new Removes words that are too long or too short from the stream. Note: Length is calculated as the number of UTF-16 code units. Create a new . This will filter out tokens whose is either too short ( < min) or too long ( > max). the Lucene match version the to consume the minimum length the maximum length Factory for . <fieldType name="text_lngth" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.LengthFilterFactory" min="0" max="1" /> </analyzer> </fieldType> Creates a new This limits the number of tokens while indexing. It is a replacement for the maximum field length setting inside . Build an analyzer that limits the maximum number of tokens per field. This analyzer will not consume any tokens beyond the maxTokenCount limit Build an analyzer that limits the maximum number of tokens per field. the analyzer to wrap max number of tokens to produce whether all tokens from the delegate should be consumed even if maxTokenCount is reached. This limits the number of tokens while indexing. It is a replacement for the maximum field length setting inside . By default, this filter ignores any tokens in the wrapped once the limit has been reached, which can result in being called prior to returning false. For most implementations this should be acceptable, and faster then consuming the full stream. If you are wrapping a which requires that the full stream of tokens be exhausted in order to function properly, use the consumeAllTokens option. Build a filter that only accepts tokens up to a maximum number. 
This filter will not consume any tokens beyond the limit the stream to wrap max number of tokens to produce Build an filter that limits the maximum number of tokens per field. the stream to wrap max number of tokens to produce whether all tokens from the input must be consumed even if is reached. Factory for . <fieldType name="text_lngthcnt" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10" consumeAllTokens="false" /> </analyzer> </fieldType> The property is optional and defaults to false. See for an explanation of it's use. Creates a new This limits its emitted tokens to those with positions that are not greater than the configured limit. By default, this filter ignores any tokens in the wrapped once the limit has been exceeded, which can result in being called prior to returning false. For most implementations this should be acceptable, and faster then consuming the full stream. If you are wrapping a which requires that the full stream of tokens be exhausted in order to function properly, use the consumeAllTokens option. Build a filter that only accepts tokens up to and including the given maximum position. This filter will not consume any tokens with position greater than the limit. the stream to wrap max position of tokens to produce (1st token always has position 1) Build a filter that limits the maximum position of tokens to emit. the stream to wrap max position of tokens to produce (1st token always has position 1) whether all tokens from the wrapped input stream must be consumed even if maxTokenPosition is exceeded. Factory for . <fieldType name="text_limit_pos" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.LimitTokenPositionFilterFactory" maxTokenPosition="3" consumeAllTokens="false" /> </analyzer> </fieldType> The property is optional and defaults to false. See for an explanation of its use. Creates a new Old Broken version of If not null is the set of tokens to protect from being delimited Creates a new to be filtered table containing character types Flags configuring the filter If not null is the set of tokens to protect from being delimited Creates a new using as its charTypeTable to be filtered Flags configuring the filter If not null is the set of tokens to protect from being delimited This method is called by a consumer before it begins consumption using . Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh. If you override this method, always call base.Reset(), otherwise some internal state will not be correctly reset (e.g., will throw on further usage). NOTE: The default implementation chains the call to the input , so be sure to call base.Reset() when overriding this method. Saves the existing attribute states Flushes the given by either writing its concat and then clearing, or just clearing. 
that will be flushed true if the concatenation was written before it was cleared, false otherwise Determines whether to concatenate a word or number if the current word is the given type Type of the current word used to determine if it should be concatenated true if concatenation should occur, false otherwise Determines whether a word/number part should be generated for a word of the given type Type of the word used to determine if a word/number part should be generated true if a word/number part should be generated, false otherwise Concatenates the saved buffer to the given WordDelimiterConcatenation WordDelimiterConcatenation to concatenate the buffer to Generates a word/number part, updating the appropriate attributes true if the generation is occurring from a single word, false otherwise Get the position increment gap for a subword or concatenation true if this token wants to be injected position increment gap Checks if the given word type includes Word type to check true if the type contains , false otherwise Checks if the given word type includes Word type to check true if the type contains , false otherwise Checks if the given word type includes Word type to check true if the type contains , false otherwise Checks if the given word type includes Word type to check true if the type contains , false otherwise Determines whether the given flag is set Flag to see if set true if flag is set A WDF concatenated 'run' Appends the given text of the given length, to the concetenation at the given offset Text to append Offset in the concetenation to add the text Length of the text to append Writes the concatenation to the attributes Determines if the concatenation is empty true if the concatenation is empty, false otherwise Clears the concatenation and resets its state Convenience method for the common scenario of having to write the concetenation and then clearing its state Efficient Lucene analyzer/tokenizer that preferably operates on a rather than a , that can flexibly separate text into terms via a regular expression (with behaviour similar to ), and that combines the functionality of , , , into a single efficient multi-purpose class. If you are unsure how exactly a regular expression should look like, consider prototyping by simply trying various expressions on some test texts via . Once you are satisfied, give that regex to . Also see Regular Expression Tutorial. This class can be considerably faster than the "normal" Lucene tokenizers. It can also serve as a building block in a compound Lucene chain. For example as in this stemming example: PatternAnalyzer pat = ... TokenStream tokenStream = new SnowballFilter( pat.GetTokenStream("content", "James is running round in the woods"), "English")); @deprecated (4.0) use the pattern-based analysis in the analysis/pattern package instead. "\\W+"; Divides text at non-letters (NOT Character.isLetter(c)) "\\s+"; Divides text at whitespaces (Character.isWhitespace(c)) A lower-casing word analyzer with English stop words (can be shared freely across threads without harm); global per class loader. A lower-casing word analyzer with extended English stop words (can be shared freely across threads without harm); global per class loader. The stop words are borrowed from http://thomas.loc.gov/home/stopwords.html, see http://thomas.loc.gov/home/all.about.inquery.html Constructs a new instance with the given parameters. 
currently does nothing a regular expression delimiting tokens if true returns tokens after applying String.toLowerCase() if non-null, ignores all tokens that are contained in the given stop set (after previously having applied toLowerCase() if applicable). For example, created via and/or as in WordlistLoader.getWordSet(new File("samples/fulltext/stopwords.txt") or other stop words lists . Creates a token stream that tokenizes the given string into token terms (aka words). the name of the field to tokenize (currently ignored). reader (e.g. charfilter) of the original text. can be null. the string to tokenize a new token stream Creates a token stream that tokenizes all the text in the given SetReader; This implementation forwards to and is less efficient than . the name of the field to tokenize (currently ignored). the reader delivering the text a new token stream Indicates whether some other object is "equal to" this one. the reference object with which to compare. true if equal, false otherwise Returns a hash code value for the object. the hash code. equality where o1 and/or o2 can be null assumes p1 and p2 are not null Reads until end-of-stream and returns all read chars, finally closes the stream. the input stream if an I/O error occurs while reading the stream The work horse; performance isn't fantastic, but it's not nearly as bad as one might think - kudos to the Sun regex developers. Special-case class for best performance in common cases; this class is otherwise unnecessary. A that exposes it's contained string for fast direct access. Might make sense to generalize this to ICharSequence and make it public? Marks terms as keywords via the . Each token that matches the provided pattern is marked as a keyword by setting to true. Create a new , that marks the current token as a keyword if the tokens term buffer matches the provided via the . to filter the pattern to apply to the incoming term buffer This analyzer is used to facilitate scenarios where different fields Require different analysis techniques. Use the Map argument in to add non-default analyzers for fields. Example usage: IDictionary<string, Analyzer> analyzerPerField = new Dictionary<string, Analyzer>(); analyzerPerField["firstname"] = new KeywordAnalyzer(); analyzerPerField["lastname"] = new KeywordAnalyzer(); PerFieldAnalyzerWrapper aWrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer(version), analyzerPerField); In this example, will be used for all fields except "firstname" and "lastname", for which will be used. A can be used like any other analyzer, for both indexing and query parsing. Constructs with default analyzer. Any fields not specifically defined to use a different analyzer will use the one provided here. Constructs with default analyzer and a map of analyzers to use for specific fields. The type of supplied will determine the type of behavior. General use. null keys are not supported. Use when sorted keys are required. null keys are not supported. Similar behavior as . null keys are supported. Use when sorted keys are required. null keys are supported. Use when insertion order must be preserved ( preserves insertion order only until items are removed). null keys are supported. Or, use a 3rd party or custom if other behavior is desired. Any fields not specifically defined to use a different analyzer will use the one provided here. A (String field name to the Analyzer) to be used for those fields. Links two . NOTE: This filter might not behave correctly if used with custom s, i.e. 
s other than the ones located in Lucene.Net.Analysis.TokenAttributes. Joins two token streams and leaves the last token of the first stream available to be used when updating the token values in the second stream based on that token. The default implementation adds last prefix token end offset to the suffix token start and end offsets. NOTE: This filter might not behave correctly if used with custom s, i.e. s other than the ones located in Lucene.Net.Analysis.TokenAttributes. The default implementation adds last prefix token end offset to the suffix token start and end offsets. a token from the suffix stream the last token from the prefix stream consumer token A which filters out s at the same position and Term text as the previous token in the stream. Creates a new RemoveDuplicatesTokenFilter TokenStream that will be filtered Consumers (i.e., ) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate s with the attributes of the next token. The producer must make no assumptions about the attributes after the method has been returned: the caller may arbitrarily change it. If the producer needs to preserve the state for subsequent calls, it can use to create a copy of the current attribute state. this method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid calls to and , references to all s that this stream uses should be retrieved during instantiation. To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in . false for end of stream; true otherwise This method is called by a consumer before it begins consumption using . Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh. If you override this method, always call base.Reset(), otherwise some internal state will not be correctly reset (e.g., will throw on further usage). Factory for . <fieldType name="text_rmdup" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> </analyzer> </fieldType> Creates a new This filter folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o. It also discriminates against the use of double vowels aa, ae, ao, oe and oo, leaving just the first one. It is a semantically more destructive solution than but can in addition help with matching raksmorgas as räksmörgås. blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej == blabarsyltetoj räksmörgås == ræksmørgås == ræksmörgaos == raeksmoergaas == raksmorgas Background: Swedish åäö are in fact the same letters as Norwegian and Danish åæø and thus interchangeable when used between these languages. They are, however, folded differently when people type them on a keyboard lacking these characters. In that situation almost all Swedish people use a, a, o instead of å, ä, ö. Norwegians and Danes, on the other hand, usually type aa, ae and oe instead of å, æ and ø. Some do, however, use a, a, o, oo, ao and sometimes permutations of everything above. This filter solves that mismatch problem, but might also cause new ones. Factory for . 
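For the Scandinavian folding filter just described, a minimal sketch of assumed usage (the namespaces and the 'reader' TextReader are assumptions):

    // Assumed namespaces: Lucene.Net.Analysis.Core, Lucene.Net.Analysis.Miscellaneous, Lucene.Net.Util.
    TokenStream ts = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, reader);
    ts = new LowerCaseFilter(LuceneVersion.LUCENE_48, ts);
    // After folding, "räksmörgås", "ræksmørgås" and "raksmorgas" all reduce to the same term.
    ts = new ScandinavianFoldingFilter(ts);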
<fieldType name="text_scandfold" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.ScandinavianFoldingFilterFactory"/> </analyzer> </fieldType> Creates a new This filter normalize use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ. It's a semantically less destructive solution than , most useful when a person with a Norwegian or Danish keyboard queries a Swedish index and vice versa. This filter does not the common Swedish folds of å and ä to a nor ö to o. blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej but not blabarsyltetoj räksmörgås == ræksmørgås == ræksmörgaos == raeksmoergaas but not raksmorgas Factory for . <fieldType name="text_scandnorm" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.ScandinavianNormalizationFilterFactory"/> </analyzer> </fieldType> Creates a new Marks terms as keywords via the . Each token contained in the provided set is marked as a keyword by setting to true. Create a new , that marks the current token as a keyword if the tokens term buffer is contained in the given set via the . to filter the keywords set to lookup the current termbuffer A containing a single token. Provides the ability to override any aware stemmer with custom dictionary-based stemming. Create a new , performing dictionary-based stemming with the provided dictionary (). Any dictionary-stemmed terms will be marked with so that they will not be stemmed with stemmers down the chain. A read-only 4-byte FST backed map that allows fast case-insensitive key value lookups for Creates a new the fst to lookup the overrides if the keys case should be ingored Returns a to pass to the method. Returns the value mapped to the given key or null if the key is not in the FST dictionary. This builder builds an for the Creates a new with set to false Creates a new if the input case should be ignored. Adds an input string and it's stemmer override output to this builder. the input char sequence the stemmer override output char sequence false if the input has already been added to this builder otherwise true. or is null. Adds an input string and it's stemmer override output to this builder. the input char sequence the stemmer override output char sequence false if the input has already been added to this builder otherwise true. or is null. Adds an input string and it's stemmer override output to this builder. the input char sequence the stemmer override output char sequence false if the input has already been added to this builder otherwise true. or is null. Returns a to be used with the a to be used with the if an occurs; Factory for . <fieldType name="text_dicstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.StemmerOverrideFilterFactory" dictionary="dictionary.txt" ignoreCase="false"/> </analyzer> </fieldType> Creates a new Trims leading and trailing whitespace from Tokens in the stream. As of Lucene 4.4, this filter does not support updateOffsets=true anymore as it can lead to broken token streams. Create a new . the Lucene match version the stream to consume whether to update offsets @deprecated Offset updates are not supported anymore as of Lucene 4.4. Create a new on top of . Factory for . 
<fieldType name="text_trm" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.NGramTokenizerFactory"/> <filter class="solr.TrimFilterFactory" /> </analyzer> </fieldType> Creates a new A token filter for truncating the terms into a specific length. Fixed prefix truncation, as a stemming method, produces good results on Turkish language. It is reported that F5, using first 5 characters, produced best results in Information Retrieval on Turkish Texts Factory for . The following type is recommended for "diacritics-insensitive search" for Turkish. <fieldType name="text_tr_ascii_f5" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.ApostropheFilterFactory"/> <filter class="solr.TurkishLowerCaseFilterFactory"/> <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/> <filter class="solr.KeywordRepeatFilterFactory"/> <filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> </analyzer> </fieldType> Creates a new Adds the as a synonym, i.e. another token at the same position, optionally with a specified prefix prepended. Initializes a new instance of with the specified token stream. Input token stream. Initializes a new instance of with the specified token stream and prefix. Input token stream. Prepend this string to every token type emitted as token text. If null, nothing will be prepended. Factory for . <fieldType name="text_type_as_synonym" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/> <filter class="solr.TypeAsSynonymFilterFactory" prefix="_type_" /> </analyzer> </fieldType> If the optional prefix parameter is used, the specified value will be prepended to the type, e.g.with prefix = "_type_", for a token "example.com" with type "<URL>", the emitted synonym will have text "_type_<URL>". Configuration options for the . LUCENENET specific - these options were passed as int constant flags in Lucene. Causes parts of words to be generated: "PowerShot" => "Power" "Shot" Causes number subwords to be generated: "500-42" => "500" "42" Causes maximum runs of word parts to be catenated: "wi-fi" => "wifi" Causes maximum runs of word parts to be catenated: "wi-fi" => "wifi" Causes all subword parts to be catenated: "wi-fi-4000" => "wifi4000" Causes original words are preserved and added to the subword list (Defaults to false) "500-42" => "500" "42" "500-42" If not set, causes case changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens) If not set, causes numeric changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens). Causes trailing "'s" to be removed for each subword "O'Neil's" => "O", "Neil" Splits words into subwords and performs optional transformations on subword groups. Words are split into subwords with the following rules: split on intra-word delimiters (by default, all non alpha-numeric characters): "Wi-Fi""Wi", "Fi" split on case transitions: "PowerShot""Power", "Shot" split on letter-number transitions: "SD500""SD", "500" leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, 'dude'""hello", "there", "dude" trailing "'s" are removed for each subword: "O'Neil's""O", "Neil"
    Note: this step isn't performed in a separate filter because of possible subword combinations.
The combinations parameter affects how subwords are combined: combinations="0" causes no subword combinations: "PowerShot" => 0:"Power", 1:"Shot" (0 and 1 are the token positions) combinations="1" means that in addition to the subwords, maximum runs of non-numeric subwords are catenated and produced at the same position of the last subword in the run:
    "PowerShot" => 0:"Power", 1:"Shot", 1:"PowerShot" "A's+B's&C's" => 0:"A", 1:"B", 2:"C", 2:"ABC" "Super-Duper-XL500-42-AutoCoder!" => 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"
One use for this filter is to help match words with different subword delimiters. For example, if the source text contained "wi-fi", one may want "wifi" "WiFi" "wi-fi" "wi+fi" queries to all match. One way of doing so is to specify combinations="1" in the analyzer used for indexing, and combinations="0" (the default) in the analyzer used for querying; see the sketch that follows. Given that the current immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that does not do this (such as ).
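Continuing the indexing-versus-querying suggestion above, here is a hedged sketch of an index-time chain in C#. The WordDelimiterFlags member names follow the original Lucene constant names and, together with the namespaces and the null protected-words set, are assumptions for illustration:

    // Assumed namespaces: Lucene.Net.Analysis.Core, Lucene.Net.Analysis.Miscellaneous, Lucene.Net.Util.
    var version = LuceneVersion.LUCENE_48;
    TokenStream ts = new WhitespaceTokenizer(version, reader);    // keeps intra-word delimiters intact
    ts = new WordDelimiterFilter(version, ts,
        WordDelimiterFlags.GENERATE_WORD_PARTS
        | WordDelimiterFlags.GENERATE_NUMBER_PARTS
        | WordDelimiterFlags.SPLIT_ON_CASE_CHANGE
        | WordDelimiterFlags.CATENATE_WORDS,                      // index-time: "wi-fi" also yields "wifi"
        null);                                                    // no protected-word set
    ts = new LowerCaseFilter(version, ts);

A query-time chain would omit CATENATE_WORDS so that queries produce only the word parts.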
If not null is the set of tokens to protect from being delimited Creates a new WordDelimiterFilter lucene compatibility version TokenStream to be filtered table containing character types Flags configuring the filter If not null is the set of tokens to protect from being delimited Creates a new WordDelimiterFilter using as its charTypeTable lucene compatibility version to be filtered Flags configuring the filter If not null is the set of tokens to protect from being delimited Saves the existing attribute states Flushes the given by either writing its concat and then clearing, or just clearing. that will be flushed true if the concatenation was written before it was cleared, false otherwise Determines whether to concatenate a word or number if the current word is the given type Type of the current word used to determine if it should be concatenated true if concatenation should occur, false otherwise Determines whether a word/number part should be generated for a word of the given type Type of the word used to determine if a word/number part should be generated true if a word/number part should be generated, false otherwise Concatenates the saved buffer to the given to concatenate the buffer to Generates a word/number part, updating the appropriate attributes true if the generation is occurring from a single word, false otherwise Get the position increment gap for a subword or concatenation true if this token wants to be injected position increment gap Checks if the given word type includes Word type to check true if the type contains , false otherwise Checks if the given word type includes Word type to check true if the type contains , false otherwise Checks if the given word type includes Word type to check true if the type contains , false otherwise Checks if the given word type includes Word type to check true if the type contains , false otherwise Determines whether the given flag is set Flag to see if set true if flag is set A WDF concatenated 'run' Appends the given text of the given length, to the concetenation at the given offset Text to append Offset in the concetenation to add the text Length of the text to append Writes the concatenation to the attributes Determines if the concatenation is empty true if the concatenation is empty, false otherwise Clears the concatenation and resets its state Convenience method for the common scenario of having to write the concetenation and then clearing its state Factory for . <fieldType name="text_wd" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.WordDelimiterFilterFactory" protected="protectedword.txt" preserveOriginal="0" splitOnNumerics="1" splitOnCaseChange="1" catenateWords="0" catenateNumbers="0" catenateAll="0" generateWordParts="1" generateNumberParts="1" stemEnglishPossessive="1" types="wdfftypes.txt" /> </analyzer> </fieldType> Creates a new A BreakIterator-like API for iterating over subwords in text, according to rules. @lucene.internal Indicates the end of iteration start position of text, excluding leading delimiters end position of text, excluding trailing delimiters Beginning of subword End of subword does this string end with a possessive such as 's If false, causes case changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens). (Defaults to true) If false, causes numeric changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens). 
(Defaults to true) If true, causes trailing "'s" to be removed for each subword: "O'Neil's" => "O", "Neil" (Defaults to true)
if true, need to skip over a possessive found in the last call to next() Create a new operating with the supplied rules. table containing character types if true, causes "PowerShot" to be two tokens; ("Power-Shot" remains two parts regards) if true, causes "j2se" to be three tokens; "j" "2" "se" if true, causes trailing "'s" to be removed for each subword: "O'Neil's" => "O", "Neil" Advance to the next subword in the string. index of the next subword, or if all subwords have been returned Return the type of the current subword. This currently uses the type of the first character in the subword. type of the current word Reset the text to a new value, and reset all state New text length of the text Determines whether the transition from lastType to type indicates a break Last subword type Current subword type true if the transition indicates a break, false otherwise Determines if the current word contains only one subword. Note, it could be potentially surrounded by delimiters true if the current word contains only one subword, false otherwise Set the internal word bounds (remove leading and trailing delimiters). Note, if a possessive is found, don't remove it yet, simply note it. Determines if the text at the given position indicates an English possessive which should be removed Position in the text to check if it indicates an English possessive true if the text at the position indicates an English posessive, false otherwise Determines the type of the given character Character whose type is to be determined Type of the character Computes the type of the given character Character whose type is to be determined Type of the character Creates new instances of . <fieldType name="text_edgngrm" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="1"/> </analyzer> </fieldType> Creates a new Tokenizes the given token into n-grams of given size(s). This create n-grams from the beginning edge or ending edge of a input token. As of Lucene 4.4, this filter does not support (you can use up-front and afterward to get the same behavior), handles supplementary characters correctly and does not update offsets anymore. Specifies which side of the input the n-gram should be generated from Get the n-gram from the front of the input Get the n-gram from the end of the input Get the appropriate from a string Creates that can generate n-grams in the sizes of the given range the Lucene match version - See holding the input to be tokenized the from which to chop off an n-gram the smallest n-gram to generate the largest n-gram to generate Creates that can generate n-grams in the sizes of the given range the Lucene match version - See holding the input to be tokenized the name of the from which to chop off an n-gram the smallest n-gram to generate the largest n-gram to generate Creates that can generate n-grams in the sizes of the given range the Lucene match version - See holding the input to be tokenized the smallest n-gram to generate the largest n-gram to generate Tokenizes the input from an edge into n-grams of given size(s). This create n-grams from the beginning edge or ending edge of a input token. 
As of Lucene 4.4, this tokenizer can handle maxGram larger than 1024 chars, but beware that this will result in increased memory usage doesn't trim the input, sets position increments equal to 1 instead of 1 for the first token and 0 for all other ones doesn't support backward n-grams anymore. supports pre-tokenization, correctly handles supplementary characters. Although highly discouraged, it is still possible to use the old behavior through . Creates that can generate n-grams in the sizes of the given range the Lucene match version - See holding the input to be tokenized the smallest n-gram to generate the largest n-gram to generate Creates EdgeNGramTokenizer that can generate n-grams in the sizes of the given range the Lucene match version - See to use holding the input to be tokenized the smallest n-gram to generate the largest n-gram to generate Creates new instances of . <fieldType name="text_edgngrm" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="1" maxGramSize="1"/> </analyzer> </fieldType> Creates a new Old version of which doesn't handle correctly supplementary characters. Specifies which side of the input the n-gram should be generated from Get the n-gram from the front of the input Get the n-gram from the end of the input Creates that can generate n-grams in the sizes of the given range the Lucene match version - See holding the input to be tokenized the from which to chop off an n-gram the smallest n-gram to generate the largest n-gram to generate Creates that can generate n-grams in the sizes of the given range the Lucene match version - See to use holding the input to be tokenized the from which to chop off an n-gram the smallest n-gram to generate the largest n-gram to generate Creates that can generate n-grams in the sizes of the given range the Lucene match version - See holding the input to be tokenized the name of the from which to chop off an n-gram the smallest n-gram to generate the largest n-gram to generate Creates that can generate n-grams in the sizes of the given range the Lucene match version - See to use holding the input to be tokenized the name of the from which to chop off an n-gram the smallest n-gram to generate the largest n-gram to generate Creates that can generate n-grams in the sizes of the given range the Lucene match version - See holding the input to be tokenized the smallest n-gram to generate the largest n-gram to generate Creates that can generate n-grams in the sizes of the given range the Lucene match version - See to use holding the input to be tokenized the smallest n-gram to generate the largest n-gram to generate Returns the next token in the stream, or null at EOS. Old broken version of . Creates with given min and max n-grams. holding the input to be tokenized the smallest n-gram to generate the largest n-gram to generate Creates with given min and max n-grams. to use holding the input to be tokenized the smallest n-gram to generate the largest n-gram to generate Creates with default min and max n-grams. holding the input to be tokenized Returns the next token in the stream, or null at EOS. Factory for . <fieldType name="text_ngrm" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="2"/> </analyzer> </fieldType> Creates a new Tokenizes the input into n-grams of the given size(s). 
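For the n-gram token filter introduced here, a minimal sketch of assumed usage (the namespaces, the 'reader' TextReader, and the 2-3 gram sizes are illustrative assumptions):

    // Assumed namespaces: Lucene.Net.Analysis.Core, Lucene.Net.Analysis.NGram,
    // Lucene.Net.Analysis.Standard, Lucene.Net.Util.
    TokenStream ts = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
    ts = new LowerCaseFilter(LuceneVersion.LUCENE_48, ts);
    // With minGram=2 and maxGram=3, the token "abc" yields "ab", "abc", "bc" (see the ordering notes below).
    ts = new NGramTokenFilter(LuceneVersion.LUCENE_48, ts, 2, 3);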
You must specify the required compatibility when creating a . As of Lucene 4.4, this token filters: handles supplementary characters correctly, emits all n-grams for the same token at the same position, does not modify offsets, sorts n-grams by their offset in the original token first, then increasing length (meaning that "abc" will give "a", "ab", "abc", "b", "bc", "c"). You can make this filter use the old behavior by providing a version < in the constructor but this is not recommended as it will lead to broken s that will cause highlighting bugs. If you were using this to perform partial highlighting, this won't work anymore since this filter doesn't update offsets. You should modify your analysis chain to use , and potentially override to perform pre-tokenization. Creates with given min and max n-grams. Lucene version to enable correct position increments. See for details. holding the input to be tokenized the smallest n-gram to generate the largest n-gram to generate Creates with default min and max n-grams. Lucene version to enable correct position increments. See for details. holding the input to be tokenized Returns the next token in the stream, or null at EOS. Tokenizes the input into n-grams of the given size(s). On the contrary to , this class sets offsets so that characters between startOffset and endOffset in the original stream are the same as the term chars. For example, "abcde" would be tokenized as (minGram=2, maxGram=3): Term Position increment Position length Offsets ab 1 1 [0,2[ abc 1 1 [0,3[ bc 1 1 [1,3[ bcd 1 1 [1,4[ cd 1 1 [2,4[ cde 1 1 [2,5[ de 1 1 [3,5[ This tokenizer changed a lot in Lucene 4.4 in order to: tokenize in a streaming fashion to support streams which are larger than 1024 chars (limit of the previous version), count grams based on unicode code points instead of java chars (and never split in the middle of surrogate pairs), give the ability to pre-tokenize the stream () before computing n-grams. Additionally, this class doesn't trim trailing whitespaces and emits tokens in a different order, tokens are now emitted by increasing start offsets while they used to be emitted by increasing lengths (which prevented from supporting large input streams). Although highly discouraged, it is still possible to use the old behavior through . Creates with given min and max n-grams. the lucene compatibility version holding the input to be tokenized the smallest n-gram to generate the largest n-gram to generate Creates with given min and max n-grams. the lucene compatibility version to use holding the input to be tokenized the smallest n-gram to generate the largest n-gram to generate Creates with default min and max n-grams. the lucene compatibility version holding the input to be tokenized Consume one code point. Only collect characters which satisfy this condition. Factory for . <fieldType name="text_ngrm" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="2"/> </analyzer> </fieldType> Creates a new Creates the of n-grams from the given and . for Dutch language. Supports an external list of stopwords (words that will not be indexed at all), an external list of exclusions (word that will not be stemmed, but indexed) and an external list of word-stem pairs that overrule the algorithm (dictionary stemming). A default set of stopwords is used unless an alternative list is specified, but the exclusion list is empty by default. 
You must specify the required compatibility when creating : As of 3.6, and also populate the default entries for the stem override dictionary As of 3.1, Snowball stemming is done with SnowballFilter, LowerCaseFilter is used prior to StopFilter, and Snowball stopwords are used by default. As of 2.9, StopFilter preserves position increments NOTE: This class uses the same dependent settings as . File containing default Dutch stopwords. Returns an unmodifiable instance of the default stop-words set. an unmodifiable instance of the default stop-words set. Contains the stopwords used with the . Contains words that should be indexed but not stemmed. Builds an analyzer with the default stop words () and a few default entries for the stem exclusion table. Returns a (possibly reused) which tokenizes all the text in the provided . A built from a filtered with , , , if a stem exclusion set is provided, , and A that stems Dutch words. It supports a table of words that should not be stemmed at all. The stemmer used can be changed at runtime after the filter object is created (as long as it is a ). To prevent terms from being stemmed use an instance of or a custom that sets the before this . @deprecated (3.1) Use with instead, which has the same functionality. This filter will be removed in Lucene 5.0 The actual token in the input stream. Input Input Dictionary of word stem pairs, that overrule the algorithm Returns the next token in the stream, or null at EOS Set a alternative/custom for this filter. Set dictionary for stemming, this dictionary overrules the algorithm, so you can correct for a particular unwanted word-stem pair. A stemmer for Dutch words. The algorithm is an implementation of the dutch stemming algorithm in Martin Porter's snowball project. @deprecated (3.1) Use instead, which has the same functionality. This filter will be removed in Lucene 5.0 Buffer for the terms while stemming them. Stems the given term to an unique discriminator. The term that should be stemmed. Discriminator for Delete suffix e if in R1 and preceded by a non-vowel, and then undouble the ending String being stemmed Delete "heid" String being stemmed A d-suffix, or derivational suffix, enables a new word, often with a different grammatical category, or with a different sense, to be built from another word. Whether a d-suffix can be attached is discovered not from the rules of grammar, but by referring to a dictionary. So in English, ness can be added to certain adjectives to form corresponding nouns (littleness, kindness, foolishness ...) but not to all adjectives (not for example, to big, cruel, wise ...) d-suffixes can be used to change meaning, often in rather exotic ways. Remove "ing", "end", "ig", "lijk", "baar" and "bar" String being stemmed undouble vowel If the words ends CVD, where C is a non-vowel, D is a non-vowel other than I, and V is double a, e, o or u, remove one of the vowels from V (for example, maan -> man, brood -> brod). String being stemmed Checks if a term could be stemmed. true if, and only if, the given term consists in letters. Substitute ä, ë, ï, ö, ü, á , é, í, ó, ú for Norwegian. File containing default Norwegian stopwords. Returns an unmodifiable instance of the default stop words set. default stop words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . Builds an analyzer with the given stop words. 
lucene compatibility version a stopword set Builds an analyzer with the given stop words. If a non-empty stem exclusion set is provided this analyzer will add a before stemming. lucene compatibility version a stopword set a set of terms not to be stemmed Creates a which tokenizes all the text in the provided . A built from an filtered with , , , if a stem exclusion set is provided and . A that applies to stem Norwegian words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Calls - NorwegianLightStemFilter(input, BOKMAAL) the source to filter Creates a new the source to filter set to , , or both. Factory for . <fieldType name="text_svlgtstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.NorwegianLightStemFilterFactory" variant="nb"/> </analyzer> </fieldType> Creates a new Constant to remove Bokmål-specific endings Constant to remove Nynorsk-specific endings Light Stemmer for Norwegian. Parts of this stemmer is adapted from , except that while the Swedish one has a pre-defined rule set and a corresponding corpus to validate against whereas the Norwegian one is hand crafted. Creates a new set to , , or both. A that applies to stem Norwegian words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Calls - NorwegianMinimalStemFilter(input, BOKMAAL) Creates a new the source to filter set to , , or both. Factory for . <fieldType name="text_svlgtstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.NorwegianMinimalStemFilterFactory" variant="nb"/> </analyzer> </fieldType> Creates a new Minimal Stemmer for Norwegian Bokmål (no-nb) and Nynorsk (no-nn) Stems known plural forms for Norwegian nouns only, together with genitiv -s Creates a new set to , , or both. Tokenizer for path-like hierarchies. Take something like: /something/something/else and make: /something /something/something /something/something/else Factory for . This factory is typically configured for use only in the index Analyzer (or only in the query Analyzer, but never both). For example, in the configuration below a query for Books/NonFic will match documents indexed with values like Books/NonFic, Books/NonFic/Law, Books/NonFic/Science/Physics, etc. But it will not match documents indexed with values like Books, or Books/Fic... <fieldType name="descendent_path" class="solr.TextField"> <analyzer type="index"> <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/" /> </analyzer> <analyzer type="query"> <tokenizer class="solr.KeywordTokenizerFactory" /> </analyzer> </fieldType> In this example however we see the oposite configuration, so that a query for Books/NonFic/Science/Physics would match documents containing Books/NonFic, Books/NonFic/Science, or Books/NonFic/Science/Physics, but not Books/NonFic/Science/Physics/Theory or Books/NonFic/Law. <fieldType name="descendent_path" class="solr.TextField"> <analyzer type="index"> <tokenizer class="solr.KeywordTokenizerFactory" /> </analyzer> <analyzer type="query"> <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/" /> </analyzer> </fieldType> Creates a new Tokenizer for domain-like hierarchies. Take something like: www.site.co.uk and make: www.site.co.uk site.co.uk co.uk uk Factory for . 
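Ahead of the factory entry below, a hedged sketch of constructing the PatternCaptureGroupTokenFilter that the following paragraphs describe, using the camelCase pattern set given there (the namespaces and the exact constructor shape are assumptions):

    // Assumed namespaces: Lucene.Net.Analysis.Core, Lucene.Net.Analysis.Pattern,
    // Lucene.Net.Util, System.Text.RegularExpressions.
    TokenStream ts = new KeywordTokenizer(reader);
    ts = new PatternCaptureGroupTokenFilter(ts, true,   // true: also keep the original token
        new Regex[]
        {
            new Regex("([A-Z]{2,})", RegexOptions.Compiled),
            new Regex("(?<![A-Z])([A-Z][a-z]+)", RegexOptions.Compiled),
            new Regex("(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)", RegexOptions.Compiled),
            new Regex("([0-9]+)", RegexOptions.Compiled),
        });
    // "camelCaseFilter" then yields "camel", "Case", "Filter" plus the original token.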
<fieldType name="text_ptncapturegroup" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.KeywordTokenizerFactory"/> <filter class="solr.PatternCaptureGroupFilterFactory" pattern="([^a-z])" preserve_original="true"/> </analyzer> </fieldType> CaptureGroup uses .NET regexes to emit multiple tokens - one for each capture group in one or more patterns. For example, a pattern like: "(https?://([a-zA-Z\-_0-9.]+))" when matched against the string "http://www.foo.com/index" would return the tokens "https://www.foo.com" and "www.foo.com". If none of the patterns match, or if preserveOriginal is true, the original token will be preserved. Each pattern is matched as often as it can be, so the pattern "(...)", when matched against "abcdefghi" would produce ["abc","def","ghi"] A camelCaseFilter could be written as: "([A-Z]{2,})", "(?<![A-Z])([A-Z][a-z]+)", "(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)", "([0-9]+)" plus if is true, it would also return camelCaseFilter Creates a new the input set to true to return the original token even if one of the patterns matches an array of objects to match against each token that uses a regular expression for the target of replace string. The pattern match will be done in each "block" in char stream. ex1) source="aa bb aa bb", pattern="(aa)\\s+(bb)" replacement="$1#$2" output="aa#bb aa#bb" NOTE: If you produce a phrase that has different length to source string and the field is used for highlighting for a term of the phrase, you will face a trouble. ex2) source="aa123bb", pattern="(aa)\\d+(bb)" replacement="$1 $2" output="aa bb" and you want to search bb and highlight it, you will get highlight snippet="aa1<em>23bb</em>" @since Solr 1.5 Replace pattern in input and mark correction offsets. Factory for . <fieldType name="text_ptnreplace" class="solr.TextField" positionIncrementGap="100"> <analyzer> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^a-z])" replacement=""/> <tokenizer class="solr.KeywordTokenizerFactory"/> </analyzer> </fieldType> @since Solr 3.1 Creates a new A TokenFilter which applies a to each token in the stream, replacing match occurances with the specified replacement string. Note: Depending on the input and the pattern used and the input , this may produce s whose text is the empty string. Constructs an instance to replace either the first, or all occurances the to process the pattern (a object) to apply to each the "replacement string" to substitute, if null a blank string will be used. Note that this is not the literal string that will be used, '$' and '\' have special meaning. if true, all matches will be replaced otherwise just the first match. Factory for . <fieldType name="text_ptnreplace" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.KeywordTokenizerFactory"/> <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/> </analyzer> </fieldType> Creates a new This tokenizer uses regex pattern matching to construct distinct tokens for the input stream. It takes two arguments: "pattern" and "group". "pattern" is the regular expression. "group" says which group to extract into tokens. group=-1 (the default) is equivalent to "split". In this case, the tokens will be equivalent to the output from (without empty tokens): Using group >= 0 selects the matching group as the token. For example, if you have:
pattern = \'([^\']+)\' group = 0 input = aaa 'bbb' 'ccc' the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: bbb and ccc (no ' marks)
NOTE: This does not output tokens that are of zero length.
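A minimal sketch of assumed usage matching the group=1 case above (the namespaces and the 'reader' TextReader are assumptions):

    // Assumed namespaces: Lucene.Net.Analysis, Lucene.Net.Analysis.Pattern, System.Text.RegularExpressions.
    var pattern = new Regex("'([^']+)'", RegexOptions.Compiled);
    // group = 1 extracts the text between the quotes, so "aaa 'bbb' 'ccc'" yields "bbb" and "ccc".
    Tokenizer tokenizer = new PatternTokenizer(reader, pattern, 1);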
creates a new returning tokens from group (-1 for split functionality) creates a new returning tokens from group (-1 for split functionality) Factory for . This tokenizer uses regex pattern matching to construct distinct tokens for the input stream. It takes two arguments: "pattern" and "group". "pattern" is the regular expression. "group" says which group to extract into tokens. group=-1 (the default) is equivalent to "split". In this case, the tokens will be equivalent to the output from (without empty tokens): Using group >= 0 selects the matching group as the token. For example, if you have:
pattern = \'([^\']+)\' group = 0 input = aaa 'bbb' 'ccc' the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: bbb and ccc (no ' marks)
NOTE: This Tokenizer does not output tokens that are of zero length. <fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.PatternTokenizerFactory" pattern="\'([^\']+)\'" group="1"/> </analyzer> </fieldType> @since solr1.2
Creates a new Split the input using configured pattern Base class for payload encoders. Characters before the delimiter are the "token", those after are the payload. For example, if the delimiter is '|', then for the string "foo|bar", foo is the token and "bar" is a payload. Note, you can also include a to convert the payload in an appropriate way (from characters to bytes). Note make sure your doesn't split on the delimiter, or this won't work Factory for . <fieldType name="text_dlmtd" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float" delimiter="|"/> </analyzer> </fieldType> Creates a new Encode a character array as a . NOTE: This was FloatEncoder in Lucene Does nothing other than convert the char array to a byte array using the specified encoding. Encode a character array as a . See . Assigns a payload to a token based on the Factory for . <fieldType name="text_numpayload" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.NumericPayloadTokenFilterFactory" payload="24" typeMatch="word"/> </analyzer> </fieldType> Creates a new Mainly for use with the , converts char buffers to . NOTE: This interface is subject to change Convert a char array to a encoded Utility methods for encoding payloads. NOTE: This was encodeFloat() in Lucene NOTE: This was encodeFloat() in Lucene NOTE: This was encodeInt() in Lucene NOTE: This was encodeInt() in Lucene NOTE: This was decodeFloat() in Lucene the decoded float Decode the payload that was encoded using . NOTE: the length of the array must be at least offset + 4 long. NOTE: This was decodeFloat() in Lucene The bytes to decode The offset into the array. The float that was encoded NOTE: This was decodeInt() in Lucene Adds the and First 4 bytes are the start Factory for . <fieldType name="text_tokenoffset" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.TokenOffsetPayloadTokenFilterFactory"/> </analyzer> </fieldType> Creates a new Makes the a payload. Encodes the type using System.Text.Encoding.UTF8.GetBytes(string) Factory for . <fieldType name="text_typeaspayload" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.TypeAsPayloadTokenFilterFactory"/> </analyzer> </fieldType> Creates a new Set the positionIncrement of all tokens to the "positionIncrement", except the first return token which retains its original positionIncrement value. The default positionIncrement value is zero. @deprecated (4.4) makes graphs inconsistent which can cause highlighting bugs. Its main use-case being to make QueryParser generate boolean queries instead of phrase queries, it is now advised to use QueryParser.AutoGeneratePhraseQueries = true (for simple cases) or to override QueryParser.NewFieldQuery. Position increment to assign to all but the first token - default = 0 The first token must have non-zero positionIncrement * Constructs a that assigns a position increment of zero to all but the first token from the given input stream. the input stream Constructs a that assigns the given position increment to all but the first token from the given input stream. the input stream position increment to assign to all but the first token from the input stream Factory for . 
Set the positionIncrement of all tokens to the "positionIncrement", except the first return token which retains its original positionIncrement value. The default positionIncrement value is zero. <fieldType name="text_position" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.PositionFilterFactory" positionIncrement="0"/> </analyzer> </fieldType> Creates a new for Portuguese. You must specify the required compatibility when creating : As of 3.6, PortugueseLightStemFilter is used for less aggressive stemming. File containing default Portuguese stopwords. Returns an unmodifiable instance of the default stop words set. default stop words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . Builds an analyzer with the given stop words. lucene compatibility version a stopword set Builds an analyzer with the given stop words. If a non-empty stem exclusion set is provided this analyzer will add a before stemming. lucene compatibility version a stopword set a set of terms not to be stemmed Creates a which tokenizes all the text in the provided . A built from an filtered with , , , if a stem exclusion set is provided and . A that applies to stem Portuguese words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_ptlgtstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.PortugueseLightStemFilterFactory"/> </analyzer> </fieldType> Creates a new Light Stemmer for Portuguese This stemmer implements the "UniNE" algorithm in: Light Stemming Approaches for the French, Portuguese, German and Hungarian Languages Jacques Savoy A that applies to stem Portuguese words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_ptminstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.PortugueseMinimalStemFilterFactory"/> </analyzer> </fieldType> Creates a new Minimal Stemmer for Portuguese This follows the "RSLP-S" algorithm presented in: A study on the Use of Stemming for Monolingual Ad-Hoc Portuguese Information Retrieval (Orengo, et al) which is just the plural reduction step of the RSLP algorithm from A Stemming Algorithm for the Portuguese Language, Orengo et al. A that applies to stem Portuguese words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_ptstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.PortugueseStemFilterFactory"/> </analyzer> </fieldType> Creates a new Portuguese stemmer implementing the RSLP (Removedor de Sufixos da Lingua Portuguesa) algorithm. This is sometimes also referred to as the Orengo stemmer. buffer, oversized to at least len+1 initial valid length of buffer new valid length, stemmed Base class for stemmers that use a set of RSLP-like stemming steps. 
RSLP (Removedor de Sufixos da Lingua Portuguesa) is an algorithm designed originally for stemming the Portuguese language, described in the paper A Stemming Algorithm for the Portuguese Language, Orengo et al. Since then, a plural-only modification (RSLP-S) as well as a modification for the Galician language have been implemented. This class parses a configuration file that describes s, where each contains a set of s. The general rule format is: { "suffix", N, "replacement", { "exception1", "exception2", ...}} where: suffix is the suffix to be removed (such as "inho"). N is the min stem size, where stem is defined as the candidate stem after removing the suffix (but before appending the replacement!) replacement is an optional string to append after removing the suffix. This can be the empty string. exceptions is an optional list of exceptions, patterns that should not be stemmed. These patterns can be specified as whole word or suffix (ends-with) patterns, depending upon the exceptions format flag in the step header. A step is an ordered list of rules, with a structure in this format:
{ "name", N, B, { "cond1", "cond2", ... } ... rules ... };
where: name is a name for the step (such as "Plural"). N is the min word size. Words that are less than this length bypass the step completely, as an optimization. Note: N can be zero, in this case this implementation will automatically calculate the appropriate value from the underlying rules. B is a "boolean" flag specifying how exceptions in the rules are matched. A value of 1 indicates whole-word pattern matching, a value of 0 indicates that exceptions are actually suffixes and should be matched with ends-with. conds are an optional list of conditions to enter the step at all. If the list is non-empty, then a word must end with one of these conditions or it will bypass the step completely as an optimization.
RSLP description @lucene.internal
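To make the format concrete, here is a made-up step fragment written in the format described above (it is only illustrative and is not copied from the Portuguese rules that ship with the stemmer):
{ "Plural", 3, 1, { "s" },
  { "ns", 1, "m" },
  { "s", 2, "", { "pires", "lapis" } }
};
Read with the definitions above: the step applies only to words of at least 3 characters ending in "s", its exceptions are matched as whole words (B = 1), the first rule rewrites a trailing "ns" to "m" when at least 1 character of stem remains, and the second rule strips a final "s" from stems of at least 2 characters unless the word is one of the listed exceptions.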
A basic rule, with no exceptions. Create a rule. suffix to remove minimum stem length replacement string true if the word matches this rule. new valid length of the string after firing this rule. A rule with a set of whole-word exceptions. A rule with a set of exceptional suffixes. A step containing a list of rules. Create a new step Step's name. an ordered list of rules. minimum word size. if this is 0 it is automatically calculated. optional list of conditional suffixes. may be null. new valid length of the string after applying the entire step. Parse a resource file into an RSLP stemmer description. a Map containing the named s in this description. An used primarily at query time to wrap another analyzer and provide a layer of protection which prevents very common words from being passed into queries. For very large indexes the cost of reading TermDocs for a very common word can be high. This analyzer was created after experience with a 38 million doc index which had a term in around 50% of docs and was causing TermQueries for this term to take 2 seconds. Creates a new with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than Version to be used in whose will be filtered to identify the stopwords from Can be thrown while reading from the Creates a new with stopwords calculated for all indexed fields from terms with a document frequency greater than the given Version to be used in whose will be filtered to identify the stopwords from Document frequency terms should be above in order to be stopwords Can be thrown while reading from the Creates a new with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than the given Version to be used in whose will be filtered to identify the stopwords from The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word Can be thrown while reading from the Creates a new with stopwords calculated for the given selection of fields from terms with a document frequency percentage greater than the given Version to be used in whose will be filtered to identify the stopwords from Selection of fields to calculate stopwords for The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word Can be thrown while reading from the Creates a new with stopwords calculated for the given selection of fields from terms with a document frequency greater than the given Version to be used in Analyzer whose TokenStream will be filtered to identify the stopwords from Selection of fields to calculate stopwords for Document frequency terms should be above in order to be stopwords Can be thrown while reading from the Provides information on which stop words have been identified for a field The field for which stop words identified in "addStopWords" method calls will be returned the stop words identified for a field Provides information on which stop words have been identified for all fields the stop words (as terms) Reverse token string, for example "country" => "yrtnuoc". If is supplied, then tokens will be also prepended by that character. For example, with a marker of \u0001, "country" => "\u0001yrtnuoc". This is useful when implementing efficient leading wildcards search. 
You must specify the required compatibility when creating , or when using any of its static methods: As of 3.1, supplementary characters are handled correctly Example marker character: U+0001 (START OF HEADING) Example marker character: U+001F (INFORMATION SEPARATOR ONE) Example marker character: U+EC00 (PRIVATE USE AREA: EC00) Example marker character: U+200F (RIGHT-TO-LEFT MARK) Create a new that reverses all tokens in the supplied . The reversed tokens will not be marked. lucene compatibility version to filter Create a new that reverses and marks all tokens in the supplied . The reversed tokens will be prepended (marked) by the character. lucene compatibility version to filter A character used to mark reversed tokens Reverses the given input string lucene compatibility version the string to reverse the given input string in reversed order Reverses the given input buffer in-place lucene compatibility version the input char array to reverse Partially reverses the given input buffer in-place from offset 0 up to the given length. lucene compatibility version the input char array to reverse the length in the buffer up to where the buffer should be reversed @deprecated (3.1) Remove this when support for 3.0 indexes is no longer needed. Partially reverses the given input buffer in-place from the given offset up to the given length. lucene compatibility version the input char array to reverse the offset from where to reverse the buffer the length in the buffer up to where the buffer should be reversed Factory for . <fieldType name="text_rvsstr" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.ReverseStringFilterFactory"/> </analyzer> </fieldType> @since solr 1.4 Creates a new for Romanian. File containing default Romanian stopwords. The comment character in the stopwords file. All lines prefixed with this will be ignored. Returns an unmodifiable instance of the default stop words set. default stop words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . lucene compatibility version Builds an analyzer with the given stop words. lucene compatibility version a stopword set Builds an analyzer with the given stop words. If a non-empty stem exclusion set is provided this analyzer will add a before stemming. lucene compatibility version a stopword set a set of terms not to be stemmed Creates a which tokenizes all the text in the provided . A built from an filtered with , , , if a stem exclusion set is provided and . for Russian language. Supports an external list of stopwords (words that will not be indexed at all). A default set of stopwords is used unless an alternative list is specified. You must specify the required compatibility when creating : As of 3.1, is used, Snowball stemming is done with , and Snowball stopwords are used by default. List of typical Russian stopwords. (for backwards compatibility) @deprecated (3.1) Remove this for LUCENE 5.0 File containing default Russian stopwords. @deprecated (3.1) remove this for Lucene 5.0 Returns an unmodifiable instance of the default stop-words set. an unmodifiable instance of the default stop-words set. 
Builds an analyzer with the given stop words lucene compatibility version a stopword set Builds an analyzer with the given stop words lucene compatibility version a stopword set a set of words not to be stemmed Creates used to tokenize all the text in the provided . built from a filtered with , , , if a stem exclusion set is provided, and A is a that extends by also allowing the basic Latin digits 0-9. You must specify the required compatibility when creating :
  • As of 3.1, uses an int based API to normalize and detect token characters. See and for details.
@deprecated (3.1) Use instead, which has the same functionality. This filter will be removed in Lucene 5.0
Construct a new . lucene compatibility version the input to split up into tokens Construct a new RussianLetterTokenizer using a given . lucene compatibility version the attribute factory to use for this the input to split up into tokens Collects only characters which satisfy . @deprecated Use instead. This tokenizer has no Russian-specific functionality. Creates a new A that applies to stem Russian words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_rulgtstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.RussianLightStemFilterFactory"/> </analyzer> </fieldType> Creates a new Light Stemmer for Russian. This stemmer implements the following algorithm: Indexing and Searching Strategies for the Russian Language. Ljiljana Dolamic and Jacques Savoy. A ShingleAnalyzerWrapper wraps a around another . A shingle is another name for a token based n-gram. Creates a new whose is to be filtered Min shingle (token ngram) size Max shingle size Used to separate input stream tokens in output shingles Whether or not the filter shall pass the original tokens to the output stream Overrides the behavior of outputUnigrams==false for those times when no shingles are available (because there are fewer than minShingleSize tokens in the input stream)? Note that if outputUnigrams==true, then unigrams are always output, regardless of whether any shingles are available. filler token to use when positionIncrement is more than 1 Wraps . Wraps . The max shingle (token ngram) size The max shingle (token ngram) size The min shingle (token ngram) size The min shingle (token ngram) size A constructs shingles (token n-grams) from a token stream. In other words, it creates combinations of tokens as a single token. For example, the sentence "please divide this sentence into shingles" might be tokenized into shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles". This filter handles position increments > 1 by inserting filler tokens (tokens with termtext "_"). It does not handle a position increment of 0. filler token for when positionIncrement is more than 1 default maximum shingle size is 2. default minimum shingle size is 2. default token type attribute value is "shingle" The default string to use when joining adjacent tokens to form a shingle The sequence of input stream tokens (or filler tokens, if necessary) that will be composed to form output shingles. The number of input tokens in the next output token. This is the "n" in "token n-grams". Shingle and unigram text is composed here. The token type attribute value to use - default is "shingle" The string to use when joining adjacent tokens to form a shingle The string to insert for each position at which there is no token (i.e., when position increment is greater than one). By default, we output unigrams (individual tokens) as well as shingles (token n-grams). By default, we don't override behavior of outputUnigrams. maximum shingle size (number of tokens) minimum shingle size (number of tokens) The remaining number of filler tokens to be inserted into the input stream from which shingles are composed, to handle position increments greater than one. 
When the next input stream token has a position increment greater than one, it is stored in this field until sufficient filler tokens have been inserted to account for the position increment. Whether or not there is a next input stream token. Whether at least one unigram or shingle has been output at the current position. true if no shingles have been output yet (for outputUnigramsIfNoShingles). Holds the State after input.end() was called, so we can restore it in our end() impl. Constructs a with the specified shingle size from the input stream minimum shingle size produced by the filter. maximum shingle size produced by the filter. Constructs a with the specified shingle size from the input stream maximum shingle size produced by the filter. Construct a with default shingle size: 2. input stream Construct a with the specified token type for shingle tokens and the default shingle size: 2 input stream token type for shingle tokens Set the type of the shingle tokens produced by this filter. (default: "shingle") token tokenType Shall the output stream contain the input tokens (unigrams) as well as shingles? (default: true.) Whether or not the output stream shall contain the input tokens (unigrams) Shall we override the behavior of outputUnigrams==false for those times when no shingles are available (because there are fewer than minShingleSize tokens in the input stream)? (default: false.) Note that if outputUnigrams==true, then unigrams are always output, regardless of whether any shingles are available. Whether or not to output a single unigram when no shingles are available. Set the max shingle size (default: 2) max size of output shingles Set the min shingle size (default: 2). This method requires that the passed in minShingleSize is not greater than maxShingleSize, so make sure that maxShingleSize is set before calling this method. The unigram output option is independent of the min shingle size. min size of output shingles Sets the string to use when joining adjacent tokens to form a shingle used to separate input stream tokens in output shingles Sets the string to insert for each position at which there is no token (i.e., when position increment is greater than one). string to insert at each position where there is no token Get the next token from the input stream. If the next token has positionIncrement > 1, positionIncrement - 1 s are inserted first. Where to put the new token; if null, a new instance is created. On success, the populated token; null otherwise if the input stream has a problem Fills with input stream tokens, if available, shifting to the right if the window was previously full. Resets to its minimum value. if there's a problem getting the next token An instance of this class is used to maintain the number of input stream tokens that will be used to compose the next unigram or shingle: . gramSize will take on values from the circular sequence { [ 1, ] [ , ... , ] }. 1 is included in the circular sequence only if = true. the current value. Increments this circular number's value to the next member in the circular sequence gramSize will take on values from the circular sequence { [ 1, ] [ , ... , ] }. 1 is included in the circular sequence only if = true. Sets this circular number's value to the first member of the circular sequence gramSize will take on values from the circular sequence { [ 1, ] [ , ... , ] }. 1 is included in the circular sequence only if = true. Returns true if the current value is the first member of the circular sequence. 
If = true, the first member of the circular sequence will be 1; otherwise, it will be . true if the current value is the first member of the circular sequence; false otherwise the value this instance had before the last call Factory for . <fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2" outputUnigrams="true" outputUnigramsIfNoShingles="false" tokenSeparator=" " fillerToken="_"/> </analyzer> </fieldType> Creates a new Attempts to parse the as a Date using either the or methods. If a format is passed, will be used, and the format must strictly match one of the specified formats as specified in the MSDN documentation. If the value is a Date, it will add it to the sink. Creates a new instance of using the current culture and . Loosely matches standard DateTime formats using . Creates a new instance of using the supplied culture and . Loosely matches standard DateTime formats using . An object that supplies culture-specific format information Creates a new instance of using the current culture and . Strictly matches the supplied DateTime formats using . The allowable format of the . If supplied, it must match the format of the date exactly to get a match. Creates a new instance of using the current culture and . Strictly matches the supplied DateTime formats using . An array of allowable formats of the . If supplied, one of them must match the format of the date exactly to get a match. Creates a new instance of using the supplied culture and . Loosely matches standard DateTime formats using . An object that supplies culture-specific format information A bitwise combination of enumeration values that indicates the permitted format of s. A typical value to specify is Creates a new instance of using the supplied format, culture and . Strictly matches the supplied DateTime formats using . The allowable format of the . If supplied, it must match the format of the date exactly to get a match. An object that supplies culture-specific format information Creates a new instance of using the supplied formats, culture and . Strictly matches the supplied DateTime formats using . An array of allowable formats of the . If supplied, one of them must match the format of the date exactly to get a match. An object that supplies culture-specific format information Creates a new instance of using the supplied format, culture and . Strictly matches the supplied DateTime formats using . The allowable format of the . If supplied, it must match the format of the date exactly to get a match. An object that supplies culture-specific format information A bitwise combination of enumeration values that indicates the permitted format of s. A typical value to specify is Creates a new instance of using the supplied formats, culture and . Strictly matches the supplied DateTime formats using . An array of allowable formats of the . If supplied, one of them must match the format of the date exactly to get a match. An object that supplies culture-specific format information A bitwise combination of enumeration values that indicates the permitted format of s. A typical value to specify is This TokenFilter provides the ability to set aside attribute states that have already been analyzed. This is useful in situations where multiple fields share many common analysis steps and then go their separate ways. 
It is also useful for doing things like entity extraction or proper noun analysis as part of the analysis workflow and saving off those tokens for use in another field.
TeeSinkTokenFilter source1 = new TeeSinkTokenFilter(new WhitespaceTokenizer(version, reader1));
TeeSinkTokenFilter.SinkTokenStream sink1 = source1.NewSinkTokenStream();
TeeSinkTokenFilter.SinkTokenStream sink2 = source1.NewSinkTokenStream();
TeeSinkTokenFilter source2 = new TeeSinkTokenFilter(new WhitespaceTokenizer(version, reader2));
source2.AddSinkTokenStream(sink1);
source2.AddSinkTokenStream(sink2);
TokenStream final1 = new LowerCaseFilter(version, source1);
TokenStream final2 = source2;
TokenStream final3 = new EntityDetect(sink1);
TokenStream final4 = new URLDetect(sink2);
d.Add(new TextField("f1", final1, Field.Store.NO));
d.Add(new TextField("f2", final2, Field.Store.NO));
d.Add(new TextField("f3", final3, Field.Store.NO));
d.Add(new TextField("f4", final4, Field.Store.NO));
In this example, sink1 and sink2 will both get tokens from both reader1 and reader2 after the whitespace tokenizer, and now we can further wrap any of these in extra analysis, and more "sources" can be inserted if desired. It is important that tees are consumed before sinks (in the above example, the tee field names must be less than, i.e. come before, the sink field names). If you are not sure which stream is consumed first, you can simply add another sink and then pass all tokens to the sinks at once using . This filter is exhausted after this. In that case, change the example above to:
...
TokenStream final1 = new LowerCaseFilter(version, source1.NewSinkTokenStream());
TokenStream final2 = source2.NewSinkTokenStream();
sink1.ConsumeAllTokens();
sink2.ConsumeAllTokens();
...
In this case, the fields can be added in any order, because the sources are not used anymore and all sinks are ready. Note that the EntityDetect and URLDetect TokenStreams are for the example and do not currently exist in Lucene. Instantiates a new . Returns a new that receives all tokens consumed by this stream. Returns a new that receives all tokens consumed by this stream that pass the supplied filter. Adds a created by another to this one. The supplied stream will also receive all consumed tokens. This method can be used to pass tokens from two different tees to one sink. passes all tokens to the added sinks when it is itself consumed. To be sure that all tokens from the input stream are passed to the sinks, you can call this method. This instance is exhausted after this, but all sinks are immediately available. A filter that decides which states to store in the sink. Returns true, iff the current state of the passed-in shall be stored in the sink. Called by . This method does nothing by default and can optionally be overridden. output from a tee with optional filtering. Counts the tokens as they go by and saves to the internal list those between the range of lower and upper, exclusive of upper. Adds a token to the sink if it has a specific type. Filters with , , and . Available stemmers are listed in org.tartarus.snowball.ext. The name of a stemmer is the part of the class name before "Stemmer", e.g., the stemmer in is named "English". NOTE: This class uses the same dependent settings as , with the following addition: As of 3.1, uses for the Turkish language. @deprecated (3.1) Use the language-specific analyzer in modules/analysis instead. This analyzer will be removed in Lucene 5.0 Builds the named analyzer with no stop words. Builds the named analyzer with the given stop words.
Constructs a filtered by a , a , a , and a A filter that stems words using a Snowball-generated stemmer. Available stemmers are listed in Lucene.Net.Tartarus.Snowball.Ext. NOTE: expects lowercased text. For the Turkish language, see . For other languages, see . Note: This filter is aware of the . To prevent certain terms from being passed to the stemmer should be set to true in a previous . Note: For including the original term as well as the stemmed version, see Construct the named stemming filter. Available stemmers are listed in Lucene.Net.Tartarus.Snowball.Ext. The name of a stemmer is the part of the class name before "Stemmer", e.g., the stemmer in is named "English". the input tokens to stem the name of a stemmer Returns the next input , after being stemmed Factory for , with configurable language Note: Use of the "Lovins" stemmer is not recommended, as it is implemented with reflection. <fieldType name="text_snowballstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.SnowballPorterFilterFactory" protected="protectedkeyword.txt" language="English"/> </analyzer> </fieldType> Creates a new Filters with , and , using a list of English stop words. You must specify the required compatibility when creating : As of 3.1, correctly handles Unicode 4.0 supplementary characters in stopwords As of 2.9, preserves position increments As of 2.4, s incorrectly identified as acronyms are corrected (see LUCENE-1068) was named in Lucene versions prior to 3.1. As of 3.1, implements Unicode text segmentation, as specified by UAX#29. Default maximum allowed token length An unmodifiable set containing some common English words that are usually not useful for searching. Builds an analyzer with the given stop words. Lucene compatibility version - See stop words Builds an analyzer with the default stop words (). Lucene compatibility version - See Builds an analyzer with the stop words from the given reader. Lucene compatibility version - See to read stop words from Gets or sets maximum allowed token length. If a token is seen that exceeds this length then it is discarded. This setting only takes effect the next time tokenStream or tokenStream is called. Normalizes tokens extracted with . Construct filtering . Returns the next token in the stream, or null at EOS. Removes 's from the end of words. Removes dots from acronyms. Factory for . <fieldType name="text_clssc" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.ClassicTokenizerFactory"/> <filter class="solr.ClassicFilterFactory"/> </analyzer> </fieldType> Creates a new A grammar-based tokenizer constructed with JFlex (and then ported to .NET) This should be a good tokenizer for most European-language documents: Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token. Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split. Recognizes email addresses and internet hostnames as one token. Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer. was named in Lucene versions prior to 3.1. As of 3.1, implements Unicode text segmentation, as specified by UAX#29. 
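As a rough sketch of assembling such a chain by hand (assuming the Lucene.Net 4.8 constructors named below; the sample text and the choice of the "English" Snowball stemmer are only for illustration), the classic tokenizer and filter can be combined with lowercasing and Snowball stemming like this:
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Snowball;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

// Tokenize with the classic grammar, normalize acronyms/possessives, lowercase, then Snowball-stem.
var tokenizer = new ClassicTokenizer(LuceneVersion.LUCENE_48, new StringReader("The I.B.M. engineers' offices"));
TokenStream stream = new ClassicFilter(tokenizer);
stream = new LowerCaseFilter(LuceneVersion.LUCENE_48, stream);
stream = new SnowballFilter(stream, "English");
var term = stream.AddAttribute<ICharTermAttribute>();
stream.Reset();
while (stream.IncrementToken())
{
    Console.WriteLine(term.ToString());
}
stream.End();
stream.Dispose();
In practice the same chain is usually defined once inside an Analyzer's CreateComponents override (or via the Solr factories shown above) rather than wired up inline.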
A private instance of the JFlex-constructed scanner String token types that correspond to token type int constants Set the max allowed token length. Any token longer than this is skipped. Creates a new instance of the . Attaches the to the newly created JFlex scanner. lucene compatibility version The input reader See http://issues.apache.org/jira/browse/LUCENE-1068 Creates a new with a given Factory for . <fieldType name="text_clssc" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.ClassicTokenizerFactory" maxTokenLength="120"/> </analyzer> </fieldType> Creates a new This class implements the classic lucene up until 3.0 This character denotes the end of file initial size of the lookahead buffer lexical states ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line l is of the form l = 2*k, k a non negative integer Translates characters to character classes Translates characters to character classes Translates DFA states to action switch labels. Translates a state to a row index in the transition table The transition table of the DFA ZZ_ATTRIBUTE[aState] contains the attributes of state aState the input device the current state of the DFA the current lexical state this buffer contains the current text to be matched and is the source of the YyText string the textposition at the last accepting state the current text position in the buffer startRead marks the beginning of the YyText string in the buffer endRead marks the last character in the buffer, that has been read from input the number of characters up to the start of the matched text zzAtEOF == true <=> the scanner is at the EOF Fills ICharTermAttribute with the current token text. Creates a new scanner the TextReader to read input from. Unpacks the compressed character translation table. the packed character translation table the unpacked character translation table Refills the input buffer. false, iff there was new input. if any I/O-Error occurs Disposes the input stream. Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset, the old input stream cannot be reused (internal buffer is discarded and lost). Lexical state is set to . Internal scan buffer is resized down to its initial length, if it has grown. the new input stream Returns the current lexical state. Enters a new lexical state the new lexical state Returns the text matched by the current regular expression. Returns the character at position pos from the matched text. It is equivalent to YyText[pos], but faster the position of the character to fetch. A value from 0 to YyLength-1. the character at position pos Returns the length of the matched text region. Reports an error that occured while scanning. In a wellformed scanner (no or only correct usage of YyPushBack(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen". If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner etc.). Usual syntax/scanner level error handling should be done in error fallback rules. the code of the errormessage to display Pushes the specified amount of characters back into the input stream. They will be read again by then next call of the scanning method the number of characters to be read again. This number must not be greater than YyLength! 
Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs. the next token if any I/O-Error occurs Filters with , and , using a list of English stop words. You must specify the required compatibility when creating : As of 3.4, Hiragana and Han characters are no longer wrongly split from their combining characters. If you use a previous version number, you get the exact broken behavior for backwards compatibility. As of 3.1, implements Unicode text segmentation, and correctly handles Unicode 4.0 supplementary characters in stopwords. and are the pre-3.1 implementations of and . As of 2.9, preserves position increments As of 2.4, s incorrectly identified as acronyms are corrected (see LUCENE-1068) Default maximum allowed token length An unmodifiable set containing some common English words that are usually not useful for searching. Builds an analyzer with the given stop words. Lucene compatibility version - See stop words Builds an analyzer with the default stop words (). Lucene compatibility version - See Builds an analyzer with the stop words from the given reader. Lucene compatibility version - See to read stop words from Set maximum allowed token length. If a token is seen that exceeds this length then it is discarded. This setting only takes effect the next time tokenStream or tokenStream is called. Normalizes tokens extracted with . Factory for . <fieldType name="text_stndrd" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StandardFilterFactory"/> </analyzer> </fieldType> Creates a new A grammar-based tokenizer constructed with JFlex. As of Lucene version 3.1, this class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer. You must specify the required compatibility when creating : As of 3.4, Hiragana and Han characters are no longer wrongly split from their combining characters. If you use a previous version number, you get the exact broken behavior for backwards compatibility. As of 3.1, StandardTokenizer implements Unicode text segmentation. If you use a previous version number, you get the exact behavior of for backwards compatibility.
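As a minimal sketch (assuming the Lucene.Net 4.8 Analyzer API; the field name "body" and the sample text are placeholders), the tokenizer is normally consumed through an analyzer rather than instantiated directly:
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
using (TokenStream stream = analyzer.GetTokenStream("body", "Visit lucene.apache.org for the UAX#29 details."))
{
    var term = stream.AddAttribute<ICharTermAttribute>();
    stream.Reset();
    while (stream.IncrementToken())
    {
        Console.WriteLine(term.ToString());   // alphanumeric terms, lowercased and stop-filtered by StandardAnalyzer
    }
    stream.End();
}
The same consumption loop works with a bare StandardTokenizer if you construct one yourself; the analyzer simply adds the standard filter, lowercasing, and stop-word removal described above.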

A private instance of the JFlex-constructed scanner @deprecated (3.1) @deprecated (3.1) @deprecated (3.1) @deprecated (3.1) @deprecated (3.1) @deprecated (3.1) String token types that correspond to token type int constants Set the max allowed token length. Any token longer than this is skipped. Creates a new instance of the . Attaches the to the newly created JFlex-generated (then ported to .NET) scanner. Lucene compatibility version - See The input reader See http://issues.apache.org/jira/browse/LUCENE-1068 Creates a new with a given Factory for . <fieldType name="text_stndrd" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="255"/> </analyzer> </fieldType> Creates a new This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Tokens produced are of the following types: <ALPHANUM>: A sequence of alphabetic and numeric characters <NUM>: A number <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer <IDEOGRAPHIC>: A single CJKV ideographic character <HIRAGANA>: A single hiragana character <KATAKANA>: A sequence of katakana characters <HANGUL>: A sequence of Hangul characters This character denotes the end of file initial size of the lookahead buffer lexical states ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line l is of the form l = 2*k, k a non negative integer Translates characters to character classes Translates characters to character classes Translates DFA states to action switch labels. Translates a state to a row index in the transition table The transition table of the DFA ZZ_ATTRIBUTE[aState] contains the attributes of state aState the input device the current state of the DFA the current lexical state this buffer contains the current text to be matched and is the source of the YyText() string the textposition at the last accepting state the current text position in the buffer startRead marks the beginning of the YyText() string in the buffer endRead marks the last character in the buffer, that has been read from input the number of characters up to the start of the matched text zzAtEOF == true <=> the scanner is at the EOF Alphanumeric sequences Numbers Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29. See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA Fills with the current token text. Creates a new scanner the to read input from. Unpacks the compressed character translation table. the packed character translation table the unpacked character translation table Refills the input buffer. false, iff there was new input. if any I/O-Error occurs Disposes the input stream. Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset, the old input stream cannot be reused (internal buffer is discarded and lost). Lexical state is set to . Internal scan buffer is resized down to its initial length, if it has grown. the new input stream Returns the current lexical state. 
Enters a new lexical state the new lexical state Returns the text matched by the current regular expression. Returns the character at position from the matched text. It is equivalent to YyText[pos], but faster the position of the character to fetch. A value from 0 to YyLength-1. the character at position pos Returns the length of the matched text region. Reports an error that occured while scanning. In a wellformed scanner (no or only correct usage of YyPushBack(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen". If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner etc.). Usual syntax/scanner level error handling should be done in error fallback rules. the code of the errormessage to display Pushes the specified amount of characters back into the input stream. They will be read again by then next call of the scanning method the number of characters to be read again. This number must not be greater than YyLength! Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs. the next token if any I/O-Error occurs Internal interface for supporting versioned grammars. @lucene.internal Copies the matched text into the Returns the current position. Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset, the old input stream cannot be reused (internal buffer is discarded and lost). Lexical state is set to YYINITIAL. the new input stream Returns the length of the matched text region. Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs. the next token, on end of stream if any I/O-Error occurs This character denotes the end of file This class implements StandardTokenizer, except with a bug (https://issues.apache.org/jira/browse/LUCENE-3358) where Han and Hiragana characters would be split from combining characters: @deprecated This class is only for exact backwards compatibility This character denotes the end of file initial size of the lookahead buffer lexical states ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line l is of the form l = 2*k, k a non negative integer Translates characters to character classes Translates characters to character classes Translates DFA states to action switch labels. Translates a state to a row index in the transition table The transition table of the DFA ZZ_ATTRIBUTE[aState] contains the attributes of state aState the input device the current state of the DFA the current lexical state this buffer contains the current text to be matched and is the source of the YyText() string the textposition at the last accepting state the current text position in the buffer startRead marks the beginning of the YyText string in the buffer endRead marks the last character in the buffer, that has been read from input the number of characters up to the start of the matched text zzAtEOF == true <=> the scanner is at the EOF Alphanumeric sequences Numbers Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29. 
See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA Fills ICharTermAttribute with the current token text. Creates a new scanner the TextReader to read input from. Unpacks the compressed character translation table. the packed character translation table the unpacked character translation table Refills the input buffer. false, iff there was new input. if any I/O-Error occurs Disposes the input stream. Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset, the old input stream cannot be reused (internal buffer is discarded and lost). Lexical state is set to . Internal scan buffer is resized down to its initial length, if it has grown. the new input stream Returns the current lexical state. Enters a new lexical state the new lexical state Returns the text matched by the current regular expression. Returns the character at position from the matched text. It is equivalent to YyText[pos], but faster the position of the character to fetch. A value from 0 to YyLength-1. the character at position pos Returns the length of the matched text region. Reports an error that occured while scanning. In a wellformed scanner (no or only correct usage of YyPushBack(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen". If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner etc.). Usual syntax/scanner level error handling should be done in error fallback rules. the code of the errormessage to display Pushes the specified amount of characters back into the input stream. They will be read again by then next call of the scanning method the number of characters to be read again. This number must not be greater than YyLength! Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs. the next token if any I/O-Error occurs This class implements UAX29URLEmailTokenizer, except with a bug (https://issues.apache.org/jira/browse/LUCENE-3358) where Han and Hiragana characters would be split from combining characters: @deprecated This class is only for exact backwards compatibility This character denotes the end of file initial size of the lookahead buffer lexical states ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line l is of the form l = 2*k, k a non negative integer Translates characters to character classes Translates characters to character classes Translates DFA states to action switch labels. The transition table of the DFA error codes error messages for the codes above ZZ_ATTRIBUTE[aState] contains the attributes of state aState the input device the current state of the DFA the current lexical state this buffer contains the current text to be matched and is the source of the YyText string the textposition at the last accepting state the current text position in the buffer startRead marks the beginning of the YyText string in the buffer endRead marks the last character in the buffer, that has been read from input the number of characters up to the start of the matched text zzAtEOF == true <=> the scanner is at the EOF Alphanumeric sequences Numbers Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). 
Sequences of these are kept together as as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29. See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA Fills with the current token text. Creates a new scanner the to read input from. Unpacks the compressed character translation table. the packed character translation table the unpacked character translation table Refills the input buffer. false, iff there was new input. if any I/O-Error occurs Closes the input stream. Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset, the old input stream cannot be reused (internal buffer is discarded and lost). Lexical state is set to . Internal scan buffer is resized down to its initial length, if it has grown. the new input stream Returns the current lexical state. Enters a new lexical state the new lexical state Returns the text matched by the current regular expression. Returns the character at position pos from the matched text. It is equivalent to YyText[pos], but faster the position of the character to fetch. A value from 0 to YyLength-1. the character at position pos Returns the length of the matched text region. Reports an error that occured while scanning. In a wellformed scanner (no or only correct usage of YyPushBack(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen". If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner etc.). Usual syntax/scanner level error handling should be done in error fallback rules. the code of the errormessage to display Pushes the specified amount of characters back into the input stream. They will be read again by then next call of the scanning method the number of characters to be read again. This number must not be greater than YyLength! Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs. the next token if any I/O-Error occurs This class implements StandardTokenizer using Unicode 6.0.0. @deprecated This class is only for exact backwards compatibility This character denotes the end of file initial size of the lookahead buffer lexical states ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line l is of the form l = 2*k, k a non negative integer Translates characters to character classes Translates characters to character classes Translates DFA states to action switch labels. Translates a state to a row index in the transition table The transition table of the DFA ZZ_ATTRIBUTE[aState] contains the attributes of state aState the input device the current state of the DFA the current lexical state this buffer contains the current text to be matched and is the source of the YyText string the textposition at the last accepting state the current text position in the buffer startRead marks the beginning of the YyText string in the buffer endRead marks the last character in the buffer, that has been read from input the number of characters up to the start of the matched text zzAtEOF == true <=> the scanner is at the EOF Alphanumeric sequences Numbers Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). 
Sequences of these are kept together as as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29. See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA Fills ICharTermAttribute with the current token text. Creates a new scanner the TextReader to read input from. Unpacks the compressed character translation table. the packed character translation table the unpacked character translation table Refills the input buffer. false, iff there was new input. if any I/O-Error occurs Disposes the input stream. Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset, the old input stream cannot be reused (internal buffer is discarded and lost). Lexical state is set to . Internal scan buffer is resized down to its initial length, if it has grown. the new input stream Returns the current lexical state. Enters a new lexical state the new lexical state Returns the text matched by the current regular expression. Returns the character at position from the matched text. It is equivalent to YyText[pos], but faster the position of the character to fetch. A value from 0 to YyLength-1. the character at position pos Returns the length of the matched text region. Reports an error that occured while scanning. In a wellformed scanner (no or only correct usage of YyPushBack(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen". If this method is called, something is seriously wrong Usual syntax/scanner level error handling should be done in error fallback rules. the code of the errormessage to display Pushes the specified amount of characters back into the input stream. They will be read again by then next call of the scanning method the number of characters to be read again. This number must not be greater than YyLength! Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs. the next token if any I/O-Error occurs This class implements UAX29URLEmailTokenizer, except with a bug (https://issues.apache.org/jira/browse/LUCENE-3880) where "mailto:" URI scheme prepended to an email address will disrupt recognition of the email address. @deprecated This class is only for exact backwards compatibility This character denotes the end of file initial size of the lookahead buffer lexical states ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line l is of the form l = 2*k, k a non negative integer Translates characters to character classes Translates characters to character classes Translates DFA states to action switch labels. 
Translates a state to a row index in the transition table The transition table of the DFA ZZ_ATTRIBUTE[aState] contains the attributes of state aState the input device the current state of the DFA the current lexical state this buffer contains the current text to be matched and is the source of the YyText string the textposition at the last accepting state the current text position in the buffer startRead marks the beginning of the YyText string in the buffer endRead marks the last character in the buffer, that has been read from input the number of characters up to the start of the matched text zzAtEOF == true <=> the scanner is at the EOF Alphanumeric sequences Numbers Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29. See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA Fills ICharTermAttribute with the current token text. Creates a new scanner the TextReader to read input from. Unpacks the compressed character translation table. the packed character translation table the unpacked character translation table Refills the input buffer. false, iff there was new input. if any I/O-Error occurs Disposes the input stream. Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset, the old input stream cannot be reused (internal buffer is discarded and lost). Lexical state is set to . Internal scan buffer is resized down to its initial length, if it has grown. the new input stream Returns the current lexical state. Enters a new lexical state the new lexical state Returns the text matched by the current regular expression. Returns the character at position from the matched text. It is equivalent to YyText[pos], but faster the position of the character to fetch. A value from 0 to YyLength-1. the character at position pos Returns the length of the matched text region. Reports an error that occured while scanning. In a wellformed scanner (no or only correct usage of YyPushBack(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen". If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner etc.). Usual syntax/scanner level error handling should be done in error fallback rules. the code of the errormessage to display Pushes the specified amount of characters back into the input stream. They will be read again by then next call of the scanning method the number of characters to be read again. This number must not be greater than YyLength! Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs. the next token if any I/O-Error occurs This class implements UAX29URLEmailTokenizer using Unicode 6.0.0. @deprecated This class is only for exact backwards compatibility This character denotes the end of file initial size of the lookahead buffer lexical states ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line l is of the form l = 2*k, k a non negative integer Translates characters to character classes Translates characters to character classes Translates DFA states to action switch labels. 
Translates a state to a row index in the transition table The transition table of the DFA ZZ_ATTRIBUTE[aState] contains the attributes of state aState the input device the current state of the DFA the current lexical state this buffer contains the current text to be matched and is the source of the YyText string the textposition at the last accepting state the current text position in the buffer startRead marks the beginning of the YyText string in the buffer endRead marks the last character in the buffer, that has been read from input the number of characters up to the start of the matched text zzAtEOF == true <=> the scanner is at the EOF Alphanumeric sequences Numbers Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29. See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA Fills ICharTermAttribute with the current token text. Creates a new scanner the TextReader to read input from. Unpacks the compressed character translation table. the packed character translation table the unpacked character translation table Refills the input buffer. false, iff there was new input. if any I/O-Error occurs Disposes the input stream. Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset, the old input stream cannot be reused (internal buffer is discarded and lost). Lexical state is set to . Internal scan buffer is resized down to its initial length, if it has grown. the new input stream Returns the current lexical state. Enters a new lexical state the new lexical state Returns the text matched by the current regular expression. Returns the character at position from the matched text. It is equivalent to YyText[pos], but faster the position of the character to fetch. A value from 0 to YyLength-1. the character at position pos Returns the length of the matched text region. Reports an error that occured while scanning. In a wellformed scanner (no or only correct usage of YyPushBack(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen". If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner etc.). Usual syntax/scanner level error handling should be done in error fallback rules. the code of the errormessage to display Pushes the specified amount of characters back into the input stream. They will be read again by then next call of the scanning method the number of characters to be read again. This number must not be greater than YyLength! Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs. the next token if any I/O-Error occurs This class implements StandardTokenizer using Unicode 6.1.0. @deprecated This class is only for exact backwards compatibility This character denotes the end of file initial size of the lookahead buffer lexical states ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line l is of the form l = 2*k, k a non negative integer Translates characters to character classes Translates characters to character classes Translates DFA states to action switch labels. 
Translates a state to a row index in the transition table The transition table of the DFA ZZ_ATTRIBUTE[aState] contains the attributes of state aState the input device the current state of the DFA the current lexical state this buffer contains the current text to be matched and is the source of the YyText string the textposition at the last accepting state the current text position in the buffer startRead marks the beginning of the YyText string in the buffer endRead marks the last character in the buffer, that has been read from input the number of characters up to the start of the matched text zzAtEOF == true <=> the scanner is at the EOF Alphanumeric sequences Numbers Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29. See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA Fills ICharTermAttribute with the current token text. Creates a new scanner the TextReader to read input from. Unpacks the compressed character translation table. the packed character translation table the unpacked character translation table Refills the input buffer. false, iff there was new input. if any I/O-Error occurs Disposes the input stream. Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset, the old input stream cannot be reused (internal buffer is discarded and lost). Lexical state is set to . Internal scan buffer is resized down to its initial length, if it has grown. the new input stream Returns the current lexical state. Enters a new lexical state the new lexical state Returns the text matched by the current regular expression. Returns the character at position from the matched text. It is equivalent to YyText[pos], but faster the position of the character to fetch. A value from 0 to YyLength-1. the character at position pos Returns the length of the matched text region. Reports an error that occured while scanning. In a wellformed scanner (no or only correct usage of YyPushBack(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen". If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner etc.). Usual syntax/scanner level error handling should be done in error fallback rules. the code of the errormessage to display Pushes the specified amount of characters back into the input stream. They will be read again by then next call of the scanning method the number of characters to be read again. This number must not be greater than YyLength! Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs. the next token if any I/O-Error occurs This class implements using Unicode 6.1.0. @deprecated This class is only for exact backwards compatibility This character denotes the end of file initial size of the lookahead buffer lexical states ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line l is of the form l = 2*k, k a non negative integer Translates characters to character classes Translates characters to character classes Translates DFA states to action switch labels. 
Translates a state to a row index in the transition table The transition table of the DFA ZZ_ATTRIBUTE[aState] contains the attributes of state aState the input device the current state of the DFA the current lexical state this buffer contains the current text to be matched and is the source of the YyText string the textposition at the last accepting state the current text position in the buffer startRead marks the beginning of the YyText string in the buffer endRead marks the last character in the buffer, that has been read from input the number of characters up to the start of the matched text zzAtEOF == true <=> the scanner is at the EOF Alphanumeric sequences Numbers Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29. See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA Fills ICharTermAttribute with the current token text. Creates a new scanner the TextReader to read input from. Unpacks the compressed character translation table. the packed character translation table the unpacked character translation table Refills the input buffer. false, iff there was new input. if any I/O-Error occurs Disposes the input stream. Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset, the old input stream cannot be reused (internal buffer is discarded and lost). Lexical state is set to . Internal scan buffer is resized down to its initial length, if it has grown. the new input stream Returns the current lexical state. Enters a new lexical state the new lexical state Returns the text matched by the current regular expression. Returns the character at position from the matched text. It is equivalent to YyText[pos], but faster the position of the character to fetch. A value from 0 to YyLength-1. the character at position pos Returns the length of the matched text region. Reports an error that occured while scanning. In a wellformed scanner (no or only correct usage of YyPushBack(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen". If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner etc.). Usual syntax/scanner level error handling should be done in error fallback rules. the code of the errormessage to display Pushes the specified amount of characters back into the input stream. They will be read again by then next call of the scanning method the number of characters to be read again. This number must not be greater than YyLength! Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs. the next token if any I/O-Error occurs Filters with , and , using a list of English stop words. You must specify the required compatibility when creating Default maximum allowed token length An unmodifiable set containing some common English words that are usually not useful for searching. Builds an analyzer with the given stop words. Lucene version to match - See stop words Builds an analyzer with the default stop words (. Lucene version to match - See Builds an analyzer with the stop words from the given reader. Lucene version to match - See to read stop words from Set maximum allowed token length. 
If a token is seen that exceeds this length then it is discarded. This setting only takes effect the next time TokenStream is called. This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs. Tokens produced are of the following types: <ALPHANUM>: A sequence of alphabetic and numeric characters <NUM>: A number <URL>: A URL <EMAIL>: An email address <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer <IDEOGRAPHIC>: A single CJKV ideographic character <HIRAGANA>: A single hiragana character You must specify the required compatibility when creating : As of 3.4, Hiragana and Han characters are no longer wrongly split from their combining characters. If you use a previous version number, you get the exact broken behavior for backwards compatibility. A private instance of the JFlex-constructed scanner String token types that correspond to token type int constants Set the max allowed token length. Any token longer than this is skipped. Creates a new instance of the . Attaches the to the newly created JFlex scanner. Lucene compatibility version The input reader Creates a new with a given LUCENENET specific: This method was added in .NET to prevent having to repeat code in the constructors. Factory for . <fieldType name="text_urlemail" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.UAX29URLEmailTokenizerFactory" maxTokenLength="255"/> </analyzer> </fieldType> Creates a new This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs. Tokens produced are of the following types: <ALPHANUM>: A sequence of alphabetic and numeric characters <NUM>: A number <URL>: A URL <EMAIL>: An email address <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer <IDEOGRAPHIC>: A single CJKV ideographic character <HIRAGANA>: A single hiragana character <KATAKANA>: A sequence of katakana characters <HANGUL>: A sequence of Hangul characters This character denotes the end of file initial size of the lookahead buffer lexical states ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line l is of the form l = 2*k, k a non-negative integer Translates characters to character classes Translates characters to character classes Translates DFA states to action switch labels. 
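A minimal usage sketch of the UAX29URLEmailTokenizer described above; the sample text is illustrative and the MaxTokenLength property and attribute names are assumed from the descriptions here.

using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

// Tokenize text containing a URL and an email address and print each token with its type.
using (var reader = new StringReader("visit http://example.com or mail admin@example.com"))
using (var tokenizer = new UAX29URLEmailTokenizer(LuceneVersion.LUCENE_48, reader))
{
    tokenizer.MaxTokenLength = 255;                          // tokens longer than this are skipped
    var term = tokenizer.AddAttribute<ICharTermAttribute>();
    var type = tokenizer.AddAttribute<ITypeAttribute>();
    tokenizer.Reset();
    while (tokenizer.IncrementToken())
    {
        Console.WriteLine($"{term} [{type.Type}]");          // e.g. "http://example.com [<URL>]"
    }
    tokenizer.End();
}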
Translates a state to a row index in the transition table The transition table of the DFA ZZ_ATTRIBUTE[aState] contains the attributes of state aState the input device the current state of the DFA the current lexical state this buffer contains the current text to be matched and is the source of the YyText string the textposition at the last accepting state the current text position in the buffer startRead marks the beginning of the YyText string in the buffer endRead marks the last character in the buffer, that has been read from input the number of characters up to the start of the matched text zzAtEOF == true <=> the scanner is at the EOF Alphanumeric sequences Numbers Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29. See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA Fills ICharTermAttribute with the current token text. Creates a new scanner the TextReader to read input from. Unpacks the compressed character translation table. the packed character translation table the unpacked character translation table Refills the input buffer. false, iff there was new input. if any I/O-Error occurs Disposes the input stream. Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset, the old input stream cannot be reused (internal buffer is discarded and lost). Lexical state is set to . Internal scan buffer is resized down to its initial length, if it has grown. the new input stream Returns the current lexical state. Enters a new lexical state the new lexical state Returns the text matched by the current regular expression. Returns the character at position from the matched text. It is equivalent to YyText[pos], but faster the position of the character to fetch. A value from 0 to YyLength-1. the character at position pos Returns the length of the matched text region. Reports an error that occured while scanning. In a wellformed scanner (no or only correct usage of YyPushBack(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen". If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner etc.). Usual syntax/scanner level error handling should be done in error fallback rules. the code of the errormessage to display Pushes the specified amount of characters back into the input stream. They will be read again by then next call of the scanning method the number of characters to be read again. This number must not be greater than YyLength! Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs. the next token if any I/O-Error occurs for Swedish. File containing default Swedish stopwords. Returns an unmodifiable instance of the default stop words set. default stop words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . lucene compatibility version Builds an analyzer with the given stop words. lucene compatibility version a stopword set Builds an analyzer with the given stop words. If a non-empty stem exclusion set is provided this analyzer will add a before stemming. 
lucene compatibility version a stopword set a set of terms not to be stemmed Creates a which tokenizes all the text in the provided . A built from an filtered with , , , if a stem exclusion set is provided and . A that applies to stem Swedish words. To prevent terms from being stemmed use an instance of or a custom that sets the before this . Factory for . <fieldType name="text_svlgtstem" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.SwedishLightStemFilterFactory"/> </analyzer> </fieldType> Creates a new Light Stemmer for Swedish. This stemmer implements the algorithm described in: Report on CLEF-2003 Monolingual Tracks Jacques Savoy Load synonyms with the given class. handles multi-token synonyms with variable position increment offsets. The matched tokens from the input stream may be optionally passed through (includeOrig=true) or discarded. If the original tokens are included, the position increments may be modified to retain absolute positions after merging with the synonym tokenstream. Generated synonyms will start at the same position as the first matched source token. @deprecated (3.4) use SynonymFilterFactory instead. only for precise index backwards compatibility. this factory will be removed in Lucene 5.0 Factory for (only used with luceneMatchVersion < 3.4) <fieldType name="text_synonym" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true" tokenizerFactory="solr.WhitespaceTokenizerFactory"/> </analyzer> </fieldType> @deprecated (3.4) use SynonymFilterFactory instead. only for precise index backwards compatibility. this factory will be removed in Lucene 5.0 a list of all rules Splits a backslash escaped string on the separator. Current backslash escaping supported: \n \t \r \b \f are escaped the same as a .NET string Other characters following a backslash are produced verbatim (\c => c) the string to split the separator to split on decode backslash escaping Mapping rules for use with @deprecated (3.4) use instead. only for precise index backwards compatibility. this factory will be removed in Lucene 5.0 @lucene.internal @lucene.internal , the sequence of strings to match the list of tokens to use on a match sets a flag on this mapping signaling the generation of matched tokens in addition to the replacement tokens merge the replacement tokens with any other mappings that exist Produces a from a Merge two lists of tokens, producing a single list with manipulated positionIncrements so that the tokens end up at the same position. Example: [a b] merged with [c d] produces [a/b c/d] ('/' denotes tokens in the same position) Example: [a,5 b,2] merged with [c d,4 e,4] produces [c a,5/d b,2 e,2] (a,n means a has posInc=n) Parser for the Solr synonyms format. Blank lines and lines starting with '#' are comments. Explicit mappings match any token sequence on the LHS of "=>" and replace with all alternatives on the RHS. These types of mappings ignore the expand parameter in the constructor. Example: i-pod, i pod => ipod Equivalent synonyms may be separated with commas and give no explicit mapping. In this case the mapping behavior will be taken from the expand parameter in the constructor. This allows the same synonym file to be used in different synonym handling strategies. 
Example: ipod, i-pod, i pod Multiple synonym mapping entries are merged. Example: the two entries foo => foo bar and foo => baz are equivalent to foo => foo bar, baz @lucene.experimental Matches single or multi word synonyms in a token stream. This token stream cannot properly handle position increments != 1, i.e., you should place this filter before filtering out stop words. Note that with the current implementation, parsing is greedy, so whenever multiple parses would apply, the rule starting the earliest and parsing the most tokens wins. For example, if you have these rules: a -> x; a b -> y; b c d -> z. Then the input a b c d e parses to y b c d, i.e. the 2nd rule "wins" because it started earliest and matched the most input tokens of other rules starting at that point. A future improvement to this filter could allow non-greedy parsing, such that the 3rd rule would win, and also separately allow multiple parses, such that all 3 rules would match, perhaps even on a rule by rule basis. NOTE: when a match occurs, the output tokens associated with the matching rule are "stacked" on top of the input stream (if the rule had keepOrig=true) and also on top of another matched rule's output tokens. This is not a correct solution, as really the output should be an arbitrary graph/lattice. For example, with the above match, you would expect an exact "y b c" to match the parsed tokens, but it will fail to do so. This limitation is necessary because Lucene's (and index) cannot yet represent an arbitrary graph. NOTE: If multiple incoming tokens arrive on the same position, only the first token at that position is used for parsing. Subsequent tokens simply pass through and are not parsed. A future improvement would be to allow these tokens to also be matched. input tokenstream synonym map case-folds input for matching with in using . Note, if you set this to true, it's your responsibility to lowercase the input entries when you create the . Factory for . <fieldType name="text_synonym" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" format="solr" ignoreCase="false" expand="true" tokenizerFactory="solr.WhitespaceTokenizerFactory" [optional tokenizer factory parameters]/> </analyzer> </fieldType> An optional param name prefix of "tokenizerFactory." may be used for any init params that the needs to pass to the specified . If the expects an init parameter with the same name as an init param used by the , the prefix is mandatory. The optional format parameter controls how the synonyms will be parsed: It supports the short names of solr for and wordnet for and , or your own class name. The default is solr. A custom is expected to have a constructor taking: dedup - true if duplicates should be ignored, false otherwise; expand - true if conflation groups should be expanded, false if they are one-directional; analyzer - an analyzer used for each raw synonym Access to the delegator for test verification @deprecated Method exists only for testing 4x, will be removed in 5.0 @lucene.internal A map of synonyms, keys and values are phrases. @lucene.experimental for multiword support, you must separate words with this separator map<input word, list<ord>> map<ord, outputword> maxHorizontalContext: maximum context we need on the tokenstream Builds an FSTSynonymMap. 
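A minimal sketch tying together the Solr rule format and the SynonymFilter described above. The rule text is illustrative, and the parser constructor follows the dedup/expand/analyzer description given here; exact overloads may differ.

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Synonym;
using Lucene.Net.Util;

// Parse Solr-format rules into a SynonymMap, then apply SynonymFilter to a token stream.
const string rules = "i-pod, i pod => ipod\nipod, i-pod, i pod";
var parser = new SolrSynonymParser(true, true,                        // dedup, expand
    new WhitespaceAnalyzer(LuceneVersion.LUCENE_48));                 // analyzer used for each raw synonym
parser.Parse(new StringReader(rules));
SynonymMap map = parser.Build();

TokenStream stream = new WhitespaceTokenizer(LuceneVersion.LUCENE_48,
    new StringReader("i pod nano"));
stream = new SynonymFilter(stream, map, true);                        // ignoreCase; place before stop filtering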
Call until you have added all the mappings, then call to get an FSTSynonymMap @lucene.experimental If dedup is true then identical rules (same input, same output) will be added only once. Sugar: just joins the provided terms with . reuse and its chars must not be null. only used for asserting! Add a phrase->phrase synonym mapping. Phrases are character sequences where words are separated with character zero (U+0000). Empty words (two U+0000s in a row) are not allowed in either the input or the output! input phrase output phrase true if the original should be included Builds an and returns it. Abstraction for parsing synonym files. @lucene.experimental Parse the given input, adding synonyms to the inherited . The input to parse Sugar: analyzes the text with the analyzer and separates by . reuse and its chars must not be null. Parser for the wordnet prolog format. See http://wordnet.princeton.edu/man/prologdb.5WN.html for a description of the format. @lucene.experimental Strips all characters after an apostrophe (including the apostrophe itself). In Turkish, the apostrophe is used to separate suffixes from proper names (continent, sea, river, lake, mountain, upland, proper names related to religion and mythology). This filter is intended to be used before stem filters. For more information, see Role of Apostrophes in Turkish Information Retrieval Factory for . <fieldType name="text_tr_lower_apostrophes" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.ApostropheFilterFactory"/> <filter class="solr.TurkishLowerCaseFilterFactory"/> </analyzer> </fieldType> Creates a new for Turkish. File containing default Turkish stopwords. The comment character in the stopwords file. All lines prefixed with this will be ignored. Returns an unmodifiable instance of the default stop words set. default stop words set. Atomically loads the in a lazy fashion once the outer class accesses the static final set the first time.; Builds an analyzer with the default stop words: . Builds an analyzer with the given stop words. lucene compatibility version a stopword set Builds an analyzer with the given stop words. If a non-empty stem exclusion set is provided this analyzer will add a before stemming. lucene compatibility version a stopword set a set of terms not to be stemmed Creates a which tokenizes all the text in the provided . A built from an filtered with , , , if a stem exclusion set is provided and . Normalizes Turkish token text to lower case. Turkish and Azeri have unique casing behavior for some characters. This filter applies Turkish lowercase rules. For more information, see http://en.wikipedia.org/wiki/Turkish_dotted_and_dotless_I Create a new that normalizes Turkish token text to lower case. to filter Lookahead for a combining dot above; other NSMs may be in between. Delete a character in-place. This rarely happens; only if is found after an i Factory for . <fieldType name="text_trlwr" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.TurkishLowerCaseFilterFactory"/> </analyzer> </fieldType> Creates a new Abstract parent class for analysis factories , and . The typical lifecycle for a factory consumer is: (1) create the factory via its constructor (or via XXXFactory.ForName); (2) optionally, if the factory uses resources such as files, is called to initialize those resources; (3) the consumer calls create() to obtain instances. 
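A hedged sketch of that consumer lifecycle, assuming the stop-word filter factory registered under the name "stop" and a filesystem-based resource loader; the loader type and the argument names are illustrative.

using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// 1. Create the factory (here via ForName) with its init args.
var args = new Dictionary<string, string>
{
    { "luceneMatchVersion", "LUCENE_48" },
    { "words", "stopwords.txt" },
    { "ignoreCase", "true" }
};
TokenFilterFactory factory = TokenFilterFactory.ForName("stop", args);

// 2. Optional: let the factory load its resources (the stopword file).
if (factory is IResourceLoaderAware aware)
{
    aware.Inform(new FilesystemResourceLoader(new DirectoryInfo(".")));
}

// 3. Obtain instances from the factory.
TokenStream source = new WhitespaceTokenizer(LuceneVersion.LUCENE_48,
    new StringReader("a quick test"));
TokenStream filtered = factory.Create(source);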
The original args, before any processing the luceneVersion arg Initialize this factory via a set of key-value pairs. This method can be called in the or methods, to inform the user that for this factory a is required NOTE: This was requireInt() in Lucene NOTE: This was getInt() in Lucene NOTE: This was requireFloat() in Lucene NOTE: This was getFloat() in Lucene Returns a whitespace- and/or comma-separated set of values, or null if none are found Compiles a pattern for the value of the specified argument key Gets a value of the specified argument key . To specify the invariant culture, pass the string "invariant". LUCENENET specific Returns as from wordFiles, which can be a comma-separated list of filenames Returns the resource's lines (with content treated as UTF-8) Same as , except the input is in snowball format. Splits file names separated by a comma character. File names can contain comma characters escaped by backslash '\' the string containing file names a list of file names with the escaping backslashes removed the string used to specify the concrete class name in a serialized representation: the class arg. If the concrete class name was not specified via a class arg, returns GetType().Name. Helper class for loading named SPIs from the classpath (e.g. Tokenizers, TokenStreams). @lucene.internal Reloads the internal SPI list. Changes to the service list are visible after the method ends; all iterators (e.g., from ,...) stay consistent. NOTE: Only new service providers are added, existing ones are never removed or replaced. this method is expensive and should only be called for discovery of new service providers on the given classpath/classloader! LUCENENET specific class to mimic Java's BufferedReader (that is, a reader that is seekable) so it supports Mark() and Reset() (which are part of the Java Reader class), but also provides the Correct() method of BaseCharFilter. The object used to synchronize access to the reader. The characters that can be read and refilled in bulk. We maintain three indices into this buffer: mark, pos, and end, with mark <= pos <= end. Pos points to the next readable character. End is one greater than the last readable character. When pos == end, the buffer is empty and must be filled before characters can be read. Mark is the value pos will be set to on calls to . Its value is in the range [0...pos]. If the mark is -1, the buffer cannot be reset. MarkLimit limits the distance between the mark and the pos. When this limit is exceeded, is permitted (but not required) to throw an exception. For shorter distances, shall not throw (unless the reader is closed). LUCENENET specific to throw an exception if the user calls instead of Creates a buffering character-input stream that uses a default-sized input buffer. A TextReader Creates a buffering character-input stream that uses an input buffer of the specified size. A TextReader Input-buffer size Disposes this reader. This implementation closes the buffered source reader and releases the buffer. Nothing is done if this reader has already been disposed. if an error occurs while closing this reader. Populates the buffer with data. It is an error to call this method when the buffer still contains data; i.e. if pos < end. the number of characters read into the buffer, or -1 if the end of the source stream has been reached. Checks to make sure that the stream has not been closed Indicates whether or not this reader is closed. Sets a mark position in this reader. 
The parameter indicates how many characters can be read before the mark is invalidated. Calling will reposition the reader back to the marked position if has not been surpassed. the number of characters that can be read before the mark is invalidated. if markLimit < 0 if an error occurs while setting a mark in this reader. Indicates whether this reader supports the and methods. This implementation returns true. Reads a single character from this reader and returns it with the two higher-order bytes set to 0. If possible, returns a character from the buffer. If there are no characters available in the buffer, it fills the buffer and then returns a character. It returns -1 if there are no more characters in the source reader. The character read or -1 if the end of the source reader has been reached. If this reader is disposed or some other I/O error occurs. Reads at most characters from this reader and stores them at in the character array . Returns the number of characters actually read or -1 if the end of the source reader has been reached. If all the buffered characters have been used, a mark has not been set and the requested number of characters is larger than this readers buffer size, BufferedReader bypasses the buffer and simply places the results directly into . the character array to store the characters read. the initial position in to store the bytes read from this reader. the maximum number of characters to read, must be non-negative. number of characters read or -1 if the end of the source reader has been reached. if offset < 0 or length < 0, or if offset + length is greater than the size of . if this reader is disposed or some other I/O error occurs. Returns the next line of text available from this reader. A line is represented by zero or more characters followed by '\n', '\r', "\r\n" or the end of the reader. The string does not include the newline sequence. The contents of the line or null if no characters were read before the end of the reader has been reached. if this reader is disposed or some other I/O error occurs. Indicates whether this reader is ready to be read without blocking. true if this reader will not block when is called, false if unknown or blocking will occur. Resets this reader's position to the last location. Invocations of and will occur from this new location. If this reader is disposed or no mark has been set. Skips characters in this reader. Subsequent s will not return these characters unless is used. Skipping characters may invalidate a mark if is surpassed. the maximum number of characters to skip. the number of characters actually skipped. if amount < 0. If this reader is disposed or some other I/O error occurs. Reads a single character from this reader and returns it with the two higher-order bytes set to 0. If possible, returns a character from the buffer. If there are no characters available in the buffer, it fills the buffer and then returns a character. It returns -1 if there are no more characters in the source reader. Unlike , this method does not advance the current position. The character read or -1 if the end of the source reader has been reached. If this reader is disposed or some other I/O error occurs. Not supported. In all cases. Not supported. In all cases. Not supported. In all cases. Not supported. In all cases. Not supported. In all cases. Not supported. In all cases. Not supported. In all cases. Not supported. In all cases. Not supported. In all cases. Not supported. In all cases. Not supported. In all cases. Not supported. 
The call didn't originate from within . provides a unified interface to Character-related operations to implement backwards compatible character operations based on a instance. @lucene.internal Returns a implementation according to the given instance. a version instance a implementation according to the given instance. Return a instance compatible with Java 1.4. Returns the code point at the given index of the . Depending on the passed to this method mimics the behavior of Character.CodePointAt(char[], int) as it would have been available on a Java 1.4 JVM or on a later virtual machine version. a character sequence the offset to the char values in the chars array to be converted the Unicode code point at the given index - if the sequence is null. - if the value offset is negative or not less than the length of the character sequence. Returns the code point at the given index of the . Depending on the passed to this method mimics the behavior of Character.CodePointAt(char[], int) as it would have been available on a Java 1.4 JVM or on a later virtual machine version. a character sequence the offset to the char values in the chars array to be converted the Unicode code point at the given index - if the sequence is null. - if the value offset is negative or not less than the length of the character sequence. Returns the code point at the given index of the char array where only elements with index less than the limit are used. Depending on the passed to this method mimics the behavior of Character.CodePointAt(char[], int) as it would have been available on a Java 1.4 JVM or on a later virtual machine version. a character array the offset to the char values in the chars array to be converted the index afer the last element that should be used to calculate codepoint. the Unicode code point at the given index - if the array is null. - if the value offset is negative or not less than the length of the char array. Return the number of characters in . Return the number of characters in . Return the number of characters in . Return the number of characters in . Creates a new and allocates a of the given bufferSize. the internal char buffer size, must be >= 2 a new instance. Converts each unicode codepoint to lowerCase via in the invariant culture starting at the given offset. the char buffer to lowercase the offset to start at the number of characters in the buffer to lower case Converts each unicode codepoint to UpperCase via in the invariant culture starting at the given offset. the char buffer to UPPERCASE the offset to start at the number of characters in the buffer to lower case Converts a sequence of .NET characters to a sequence of unicode code points. The number of code points written to the destination buffer. Converts a sequence of unicode code points to a sequence of .NET characters. the number of chars written to the destination buffer Fills the with characters read from the given reader . This method tries to read numChars characters into the , each call to fill will start filling the buffer from offset 0 up to . In case code points can span across 2 java characters, this method may only fill numChars - 1 characters in order not to split in the middle of a surrogate pair, even if there are remaining characters in the . Depending on the passed to this method implements supplementary character awareness when filling the given buffer. 
For all > 3.0 guarantees that the given will never contain a high surrogate character as the last element in the buffer unless it is the last available character in the reader. In other words, high and low surrogate pairs will always be preserved across buffer borders. A return value of false means that this method call exhausted the reader, but there may be some bytes which have been read, which can be verified by checking whether buffer.Length > 0. the buffer to fill. the reader to read characters from. the number of chars to read false if and only if reader.read returned -1 while trying to fill the buffer if the reader throws an . Convenience method which calls Fill(buffer, reader, buffer.Buffer.Length). Return the index within buf[start:start+count] which is by code points from . A simple IO buffer to use with . Returns the internal buffer the buffer Returns the data offset in the internal buffer. the offset Return the length of the data in the internal buffer starting at the length Resets the CharacterBuffer. All internals are reset to their default values. A simple class that stores text s as 's in a hash table. Note that this is not a general purpose class. For example, it cannot remove items from the dictionary, nor does it resize its hash table to be smaller, etc. It is designed to be quick to retrieve items by keys without the necessity of converting to a first. You must specify the required compatibility when creating : As of 3.1, supplementary characters are properly lowercased. Before 3.1 supplementary characters could not be lowercased correctly due to the lack of Unicode 4 support in JDK 1.4. To use instances of with the behavior before Lucene 3.1 pass a < 3.1 to the constructors. Returns an empty, read-only dictionary. LUCENENET: Moved this from CharArraySet so it doesn't need to know the generic type of CharArrayDictionary LUCENENET SPECIFIC type used to act as a placeholder. Since null means that our value is not populated, we need an instance of something to indicate it is. Using an instance of would only work if we could constrain it with the new() constraint, which isn't possible because some types such as don't have a default constructor. So, this is a workaround that allows any type regardless of the type of constructor. Note also that we gain the ability to use value types for , but this also creates a difference in behavior from Java Lucene where the actual values returned could be null. Create a dictionary with enough capacity to hold terms. lucene compatibility version - see for details. the initial capacity false if and only if the set should be case sensitive; otherwise true. is less than zero. Creates a dictionary from the mappings in another dictionary. compatibility match version see for details. a dictionary () whose mappings are to be copied. false if and only if the set should be case sensitive; otherwise true. is null. Creates a dictionary from the mappings in another dictionary. compatibility match version see for details. a dictionary () whose mappings are to be copied. false if and only if the set should be case sensitive; otherwise true. is null. Creates a dictionary from the mappings in another dictionary. compatibility match version see for details. a dictionary () whose mappings are to be copied. false if and only if the set should be case sensitive; otherwise true. is null. Create set from the supplied dictionary (used internally for readonly maps...) Adds the for the passed in . Note that the instance is not added to the dictionary. 
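Returning to the CharacterUtils fill loop described above, a minimal sketch; the NewCharacterBuffer helper and the Buffer/Offset/Length properties are taken from the descriptions here and the exact member names are an assumption.

using System.IO;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Read a TextReader in fixed-size chunks without ever splitting a surrogate pair
// across buffer borders.
var charUtils = CharacterUtils.GetInstance(LuceneVersion.LUCENE_48);
var buffer = CharacterUtils.NewCharacterBuffer(1024);
using (TextReader reader = new StringReader("some text to analyze"))
{
    bool more = true;
    while (more)
    {
        more = charUtils.Fill(buffer, reader);
        // Process buffer.Buffer starting at buffer.Offset for buffer.Length chars.
        // When Fill returns false the reader is exhausted, but buffer.Length may still be > 0.
    }
}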
A whose will be added for the corresponding . Adds the for the passed in . The string-able type to be added/updated in the dictionary. The corresponding value for the given . is null. An element with already exists in the dictionary. Adds the for the passed in . The string-able type to be added/updated in the dictionary. The corresponding value for the given . is null. An element with already exists in the dictionary. Adds the for the passed in . The string-able type to be added/updated in the dictionary. The corresponding value for the given . is null. -or- The 's property returns false. An element with already exists in the dictionary. Adds the for the passed in . The string-able type to be added/updated in the dictionary. The corresponding value for the given . is null. An element with already exists in the dictionary. Returns an unmodifiable . This allows to provide unmodifiable views of internal dictionary for "read-only" use. an new unmodifiable . Clears all entries in this dictionary. This method is supported for reusing, but not . Not supported. Copies all items in the current dictionary the starting at the . The array is assumed to already be dimensioned to fit the elements in this dictionary; otherwise a will be thrown. The array to copy the items into. A 32-bit integer that represents the index in at which copying begins. is null. is less than zero. The number of elements in the source is greater than the available space from to the end of the destination array. Copies all items in the current dictionary the starting at the . The array is assumed to already be dimensioned to fit the elements in this dictionary; otherwise a will be thrown. The array to copy the items into. A 32-bit integer that represents the index in at which copying begins. is null. is less than zero. The number of elements in the source is greater than the available space from to the end of the destination array. Copies all items in the current dictionary the starting at the . The array is assumed to already be dimensioned to fit the elements in this dictionary; otherwise a will be thrown. The array to copy the items into. A 32-bit integer that represents the index in at which copying begins. is null. is less than zero. The number of elements in the source is greater than the available space from to the end of the destination array. true if the chars of starting at are in the is null. or is less than zero. and refer to a position outside of . true if the entire is the same as the being passed in; otherwise false. is null. true if the is in the ; otherwise false is null. true if the is in the ; otherwise false is null. -or- The 's property returns false. true if the (in the invariant culture) is in the ; otherwise false is null. Returns the value of the mapping of chars of starting at . is null. or is less than zero. and refer to a position outside of . The effective text is not found in the dictionary. Returns the value of the mapping of the chars inside this . is null. is not found in the dictionary. Returns the value of the mapping of the chars inside this . is null. -or- The 's property returns false. is not found in the dictionary. Returns the value of the mapping of the chars inside this . is null. is not found in the dictionary. Returns the value of the mapping of the chars inside this . is null. is not found in the dictionary. Returns true if the is in the set. is null. -or- The 's property returns false. Returns true if the is in the set. is null. Add the given mapping. 
If ignoreCase is true for this dictionary, the text array will be directly modified. Note: The setter is more efficient than this method if the is not required. A text with which the specified is associated. The position of the where the target text begins. The total length of the . The value to be associated with the specified . The previous value associated with the text, or the default for the type of parameter if there was no mapping for . true if the mapping was added, false if the text already existed. The will be populated if the result is false. is null. or is less than zero. and refer to a position outside of . Add the given mapping. If ignoreCase is true for this dictionary, the text array will be directly modified. The user should never modify this text array after calling this method. Note: The setter is more efficient than this method if the is not required. A text with which the specified is associated. The value to be associated with the specified . The previous value associated with the text, or the default for the type of parameter if there was no mapping for . true if the mapping was added, false if the text already existed. The will be populated if the result is false. is null. Add the given mapping. Note: The setter is more efficient than this method if the is not required. A text with which the specified is associated. The value to be associated with the specified . The previous value associated with the text, or the default for the type of parameter if there was no mapping for . true if the mapping was added, false if the text already existed. The will be populated if the result is false. is null. Add the given mapping. Note: The setter is more efficient than this method if the is not required. A text with which the specified is associated. The value to be associated with the specified . The previous value associated with the text, or the default for the type of parameter if there was no mapping for . true if the mapping was added, false if the text already existed. The will be populated if the result is false. is null. -or- The 's property returns false. Add the given mapping using the representation of in the . Note: The setter is more efficient than this method if the is not required. A text with which the specified is associated. The value to be associated with the specified . The previous value associated with the text, or the default for the type of parameter if there was no mapping for . true if the mapping was added, false if the text already existed. The will be populated if the result is false. is null. Add the given mapping. is null. -or- The 's property returns false. Add the given mapping. is null. Add the given mapping. is null. LUCENENET specific. Centralizes the logic between Put() implementations that accept a value and those that don't. This value is so we know whether or not the value was set, since we can't reliably do a check for null on the type. is null. Sets the value of the mapping of the chars inside this . is null. Sets the value of the mapping of the chars inside this . is null. Sets the value of the mapping of the chars inside this . is null. Sets the value of the mapping of the chars inside this . is null. Sets the value of the mapping of the chars inside this . is null. Sets the value of the mapping of chars of starting at . If ignoreCase is true for this dictionary, the text array will be directly modified. A text with which the specified is associated. The position of the where the target text begins. The total length of the . 
The value to be associated with the specified . is null. or is less than zero. and refer to a position outside of . Sets the value of the mapping of the chars inside this . If ignoreCase is true for this dictionary, the text array will be directly modified. The user should never modify this text array after calling this method. is null. Sets the value of the mapping of the chars inside this . is null. -or- The 's property returns false. Sets the value of the mapping of the chars inside this . is null. Sets the value of the mapping of the chars inside this . is null. LUCENENET specific. Like PutImpl, but doesn't have a return value or lookup to get the old value. is null. -or- The 's property returns false. LUCENENET specific. Like PutImpl, but doesn't have a return value or lookup to get the old value. is null. LUCENENET specific. Like PutImpl, but doesn't have a return value or lookup to get the old value. is null. LUCENENET specific. Like PutImpl, but doesn't have a return value or lookup to get the old value. is null. or is less than zero. and refer to a position outside of . This implementation enumerates over the specified 's entries, and calls this dictionary's operation once for each entry. If ignoreCase is true for this dictionary, the text arrays will be directly modified. The user should never modify the text arrays after calling this method. A dictionary of values to add/update in the current dictionary. is null. -or- An element in the collection is null. This implementation enumerates over the specified 's entries, and calls this dictionary's operation once for each entry. A dictionary of values to add/update in the current dictionary. is null. -or- An element in the collection is null. This implementation enumerates over the specified 's entries, and calls this dictionary's operation once for each entry. A dictionary of values to add/update in the current dictionary. is null. -or- An element in the collection has a null text. -or- The text's property for a given element in the collection returns false. This implementation enumerates over the specified 's entries, and calls this dictionary's operation once for each entry. A dictionary of values to add/update in the current dictionary. is null. -or- An element in the collection is null. This implementation enumerates over the specified 's entries, and calls this dictionary's operation once for each entry. The values to add/update in the current dictionary. is null. -or- An element in the collection is null. This implementation enumerates over the specified 's entries, and calls this dictionary's operation once for each entry. The values to add/update in the current dictionary. is null. -or- An element in the collection is null. This implementation enumerates over the specified 's entries, and calls this dictionary's operation once for each entry. The values to add/update in the current dictionary. is null. -or- An element in the collection has a null text. -or- The text's property for a given element in the collection returns false. This implementation enumerates over the specified 's entries, and calls this dictionary's operation once for each entry. The values to add/update in the current dictionary. is null. -or- An element in the collection is null. 
LUCENENET Specific - test for value equality similar to how it is done in Java Another dictionary to test the values of true if the given object is an that contains the same text value pairs as the current dictionary LUCENENET Specific - override required by .NET because we override Equals to simulate Java's value equality checking. The Lucene version corresponding to the compatibility behavior that this instance emulates Adds a placeholder with the given as the text. Primarily for internal use by . NOTE: If ignoreCase is true for this , the text array will be directly modified. A key with which the placeholder is associated. The position of the where the target text begins. The total length of the . true if the text was added, false if the text already existed. is null. or is less than zero. and refer to a position outside of . Adds a placeholder with the given as the text. Primarily for internal use by . NOTE: If ignoreCase is true for this , the text array will be directly modified. The user should never modify this text array after calling this method. true if the text was added, false if the text already existed. is null. Adds a placeholder with the given as the text. Primarily for internal use by . true if the text was added, false if the text already existed. is null. -or- The 's property returns false. Adds a placeholder with the given as the text. Primarily for internal use by . true if the text was added, false if the text already existed. is null. Adds a placeholder with the given as the text. Primarily for internal use by . true if the text was added, false if the text already existed. is null. Returns a copy of the current as a new instance of . Preserves the value of matchVersion and ignoreCase from the current instance. A copy of the current as a . Returns a copy of the current as a new instance of using the specified value. Preserves the value of ignoreCase from the current instance. compatibility match version see Version note above for details. A copy of the current as a . Returns a copy of the current as a new instance of using the specified and values. compatibility match version see Version note above for details. false if and only if the set should be case sensitive otherwise true. A copy of the current as a . Gets the value associated with the specified text. The text of the value to get. The position of the where the target text begins. The total length of the . When this method returns, contains the value associated with the specified text, if the text is found; otherwise, the default value for the type of the value parameter. This parameter is passed uninitialized. true if the contains an element with the specified text; otherwise, false. is null. or is less than zero. and refer to a position outside of . Gets the value associated with the specified text. The text of the value to get. When this method returns, contains the value associated with the specified text, if the text is found; otherwise, the default value for the type of the value parameter. This parameter is passed uninitialized. true if the contains an element with the specified text; otherwise, false. is null. Gets the value associated with the specified text. The text of the value to get. When this method returns, contains the value associated with the specified text, if the text is found; otherwise, the default value for the type of the value parameter. This parameter is passed uninitialized. true if the contains an element with the specified text; otherwise, false. is null. 
-or- The 's property returns false. Gets the value associated with the specified text. The text of the value to get. When this method returns, contains the value associated with the specified text, if the text is found; otherwise, the default value for the type of the value parameter. This parameter is passed uninitialized. true if the contains an element with the specified text; otherwise, false. is null. Gets the value associated with the specified text. The text of the value to get. When this method returns, contains the value associated with the specified text, if the text is found; otherwise, the default value for the type of the value parameter. This parameter is passed uninitialized. true if the contains an element with the specified text; otherwise, false. is null. Gets or sets the value associated with the specified text. Note: If ignoreCase is true for this dictionary, the text array will be directly modified. The text of the value to get or set. The position of the where the target text begins. The total length of the . is null. or is less than zero. and refer to a position outside of . Gets or sets the value associated with the specified text. Note: If ignoreCase is true for this dictionary, the text array will be directly modified. The user should never modify this text array after calling this setter. The text of the value to get or set. is null. Gets or sets the value associated with the specified text. The text of the value to get or set. is null. -or- The 's property returns false. Gets or sets the value associated with the specified text. The text of the value to get or set. is null. Gets or sets the value associated with the specified text. The text of the value to get or set. is null. Gets a collection containing the keys in the . Gets a collection containing the values in the . This specialized collection can be enumerated in order to read its values and overrides in order to display a string representation of the values. Class that represents the values in the . Initializes a new instance of for the provided . The dictionary to read the values from. is null. Gets the number of elements contained in the . Retrieving the value of this property is an O(1) operation. Determines whether the set contains a specific element. The element to locate in the set. true if the set contains item; otherwise, false. Copies the elements to an existing one-dimensional array, starting at the specified array index. The one-dimensional array that is the destination of the elements copied from the . The array must have zero-based indexing. The zero-based index in at which copying begins. is null. is less than 0. The number of elements in the source is greater than the available space from to the end of the destination . The elements are copied to the array in the same order in which the enumerator iterates through the . This method is an O(n) operation, where n is . Returns an enumerator that iterates through the . An enumerator that iterates through the . An enumerator remains valid as long as the collection remains unchanged. If changes are made to the collection, such as adding, modifying, or deleting elements, the enumerator is irrecoverably invalidated and the next call to or throws an . This method is an O(log n) operation. Returns a string that represents the current collection. The presentation has a specific format. It is enclosed by square brackets ("[]"). Elements are separated by ', ' (comma and space). null values are represented as the string "null". 
A string that represents the current collection. Enumerates the elements of a . The foreach statement of the C# language (for each in C++, For Each in Visual Basic) hides the complexity of enumerators. Therefore, using foreach is recommended instead of directly manipulating the enumerator. Enumerators can be used to read the data in the collection, but they cannot be used to modify the underlying collection. Initially, the enumerator is positioned before the first element in the collection. At this position, the property is undefined. Therefore, you must call the method to advance the enumerator to the first element of the collection before reading the value of . The property returns the same object until is called. sets to the next element. If passes the end of the collection, the enumerator is positioned after the last element in the collection and returns false. When the enumerator is at this position, subsequent calls to also return false. If the last call to returned false, is undefined. You cannot set to the first element of the collection again; you must create a new enumerator object instead. An enumerator remains valid as long as the collection remains unchanged. If changes are made to the collection, such as adding, modifying, or deleting elements, the enumerator is irrecoverably invalidated and the next call to or throws an . The enumerator does not have exclusive access to the collection; therefore, enumerating through a collection is intrinsically not a thread-safe procedure. To guarantee thread safety during enumeration, you can lock the collection during the entire enumeration. To allow the collection to be accessed by multiple threads for reading and writing, you must implement your own synchronization. Gets the element at the current position of the enumerator. is undefined under any of the following conditions: The enumerator is positioned before the first element of the collection. That happens after an enumerator is created or after the method is called. The method must be called to advance the enumerator to the first element of the collection before reading the value of the property. The last call to returned false, which indicates the end of the collection and that the enumerator is positioned after the last element of the collection. The enumerator is invalidated due to changes made in the collection, such as adding, modifying, or deleting elements. does not move the position of the enumerator, and consecutive calls to return the same object until either or is called. Releases all resources used by the . Advances the enumerator to the next element of the . true if the enumerator was successfully advanced to the next element; false if the enumerator has passed the end of the collection. The collection was modified after the enumerator was created. After an enumerator is created, the enumerator is positioned before the first element in the collection, and the first call to the method advances the enumerator to the first element of the collection. If MoveNext passes the end of the collection, the enumerator is positioned after the last element in the collection and returns false. When the enumerator is at this position, subsequent calls to also return false. An enumerator remains valid as long as the collection remains unchanged. If changes are made to the collection, such as adding, modifying, or deleting elements, the enumerator is irrecoverably invalidated and the next call to or throws an . true if the is read-only; otherwise false. 
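The members above are designed for allocation-free lookups against a slice of a token's char[] buffer. The following is a minimal sketch of that pattern; the type name (CharArrayDictionary&lt;TValue&gt;), the constructor parameter order, and the LuceneVersion value used are assumptions for illustration and may differ from the exact API in your version.

    using System;
    using Lucene.Net.Analysis.Util;
    using Lucene.Net.Util;

    // Sketch only: map a few terms to integer ids, then probe with a slice of a
    // larger character buffer without allocating an intermediate string.
    var map = new CharArrayDictionary<int>(LuceneVersion.LUCENE_48, 16, true /*ignoreCase*/);
    map["the"] = 1;
    map["quick"] = 2;
    map["fox"] = 3;

    char[] buffer = "a quick test".ToCharArray();
    if (map.TryGetValue(buffer, 2, 5, out int id))   // the 5 chars starting at index 2: "quick"
    {
        Console.WriteLine(id); // 2
    }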
Returns an enumerator that iterates through the . A for the . For purposes of enumeration, each item is a structure representing a value and its text. There are also properties allowing direct access to the array of each element and quick conversions to or . The foreach statement of the C# language (for each in C++, For Each in Visual Basic) hides the complexity of enumerators. Therefore, using foreach is recommended instead of directly manipulating the enumerator. This enumerator can be used to read the data in the collection, or modify the corresponding value at the current position. Initially, the enumerator is positioned before the first element in the collection. At this position, the property is undefined. Therefore, you must call the method to advance the enumerator to the first element of the collection before reading the value of . The property returns the same object until is called. sets to the next element. If passes the end of the collection, the enumerator is positioned after the last element in the collection and returns false. When the enumerator is at this position, subsequent calls to also return false. If the last call to returned false, is undefined. You cannot set to the first element of the collection again; you must create a new enumerator object instead. An enumerator remains valid as long as the collection remains unchanged. If changes are made to the collection, such as adding, modifying, or deleting elements (other than through the method), the enumerator is irrecoverably invalidated and the next call to or throws an . The enumerator does not have exclusive access to the collection; therefore, enumerating through a collection is intrinsically not a thread-safe procedure. To guarantee thread safety during enumeration, you can lock the collection during the entire enumeration. To allow the collection to be accessed by multiple threads for reading and writing, you must implement your own synchronization. Default implementations of collections in the namespace are not synchronized. This method is an O(1) operation. Gets the number of text/value pairs contained in the . Returns a string that represents the current object. (Inherited from .) Returns an view on the dictionary's keys. The set will use the same as this dictionary. Enumerates the elements of a . This enumerator exposes efficient access to the underlying . It also has , , and properties for convenience. The foreach statement of the C# language (for each in C++, For Each in Visual Basic) hides the complexity of enumerators. Therefore, using foreach is recommended instead of directly manipulating the enumerator. This enumerator can be used to read the data in the collection, or modify the corresponding value at the current position. Initially, the enumerator is positioned before the first element in the collection. At this position, the property is undefined. Therefore, you must call the method to advance the enumerator to the first element of the collection before reading the value of . The property returns the same object until is called. sets to the next element. If passes the end of the collection, the enumerator is positioned after the last element in the collection and returns false. When the enumerator is at this position, subsequent calls to also return false. If the last call to returned false, is undefined. You cannot set to the first element of the collection again; you must create a new enumerator object instead. An enumerator remains valid as long as the collection remains unchanged. 
If changes are made to the collection, such as adding, modifying, or deleting elements, the enumerator is irrecoverably invalidated and the next call to or throws an . The enumerator does not have exclusive access to the collection; therefore, enumerating through a collection is intrinsically not a thread-safe procedure. To guarantee thread safety during enumeration, you can lock the collection during the entire enumeration. To allow the collection to be accessed by multiple threads for reading and writing, you must implement your own synchronization. Gets the current text as a . Gets the current text... do not modify the returned char[]. Gets the current text as a newly created object. Gets the value associated with the current text. Sets the value associated with the current text. Returns the value prior to the update. Releases all resources used by the . Advances the enumerator to the next element of the . true if the enumerator was successfully advanced to the next element; false if the enumerator has passed the end of the collection. The collection was modified after the enumerator was created. After an enumerator is created, the enumerator is positioned before the first element in the collection, and the first call to the method advances the enumerator to the first element of the collection. If passes the end of the collection, the enumerator is positioned after the last element in the collection and returns false. When the enumerator is at this position, subsequent calls to also return false. An enumerator remains valid as long as the collection remains unchanged. If changes are made to the collection, such as adding, modifying, or deleting elements, the enumerator is irrecoverably invalidated and the next call to or throws an . Gets the element at the current position of the enumerator. is undefined under any of the following conditions: The enumerator is positioned before the first element of the collection. That happens after an enumerator is created or after the method is called. The method must be called to advance the enumerator to the first element of the collection before reading the value of the property. The last call to returned false, which indicates the end of the collection and that the enumerator is positioned after the last element of the collection. The enumerator is invalidated due to changes made in the collection, such as adding, modifying, or deleting elements. does not move the position of the enumerator, and consecutive calls to return the same object until either or is called. LUCENENET specific interface used so can hold a reference to without knowing its generic closing type for TValue. LUCENENET specific interface used so can iterate the keys of without knowing its generic closing type for TValue. Returns a copy of the given dictionary as a . If the given dictionary is a the ignoreCase property will be preserved. Note: If you intend to create a copy of another where the of the source dictionary differs from its copy should be used instead. The will preserve the of the source dictionary if it is an instance of . compatibility match version see Version note above for details. This argument will be ignored if the given dictionary is a . a dictionary to copy a copy of the given dictionary as a . If the given dictionary is a the ignoreCase property as well as the will be of the given dictionary will be preserved. Used by to copy without knowing its generic type. Returns an unmodifiable . 
This allows to provide unmodifiable views of internal dictionary for "read-only" use. a dictionary for which the unmodifiable dictionary is returned. an new unmodifiable . if the given dictionary is null. Used by to create an instance without knowing the type of . Empty optimized for speed. Contains checks will always return false or throw NPE if necessary. Extensions to for . Returns a copy of the current as a using the specified value. The type of dictionary value. A to copy. compatibility match version see Version note above for details. A copy of the current dictionary as a . is null. Returns a copy of the current as a using the specified and values. The type of dictionary value. A to copy. compatibility match version see Version note above for details. false if and only if the set should be case sensitive otherwise true. A copy of the current dictionary as a . is null. LUCENENET specific. Just a class to make error messages easier to manage in one place. Ideally, these would be in resources so they can be localized (eventually), but at least this half-measure will make that somewhat easier to do and is guaranteed not to cause performance issues. A simple class that stores s as 's in a hash table. Note that this is not a general purpose class. For example, it cannot remove items from the set, nor does it resize its hash table to be smaller, etc. It is designed to be quick to test if a is in the set without the necessity of converting it to a first. You must specify the required compatibility when creating :
  • As of 3.1, supplementary characters are properly lowercased.
Before 3.1 supplementary characters could not be lowercased correctly due to the lack of Unicode 4 support in JDK 1.4. To use instances of with the behavior before Lucene 3.1 pass a to the constructors.
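For illustration, a small sketch of how the match version interacts with ignoreCase for supplementary characters (behavior as described above; the constructor arguments are the match version, an initial capacity, and the ignoreCase flag):

    using Lucene.Net.Analysis.Util;
    using Lucene.Net.Util;

    // 3.1+ match version: ignoreCase lowercases supplementary code points correctly.
    var modern = new CharArraySet(LuceneVersion.LUCENE_48, 8, true /*ignoreCase*/);
    modern.Add("\U00010400");                    // DESERET CAPITAL LETTER LONG I
    bool hit = modern.Contains("\U00010428");    // its lowercase form; should be true

    // Pre-3.1 match version: the backwards-compatibility layer is used instead, and
    // supplementary characters may not be case-folded the same way.
    var legacy = new CharArraySet(LuceneVersion.LUCENE_30, 8, true /*ignoreCase*/);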
Please note: This class implements but does not behave like it should in all cases. The generic type is , because you can add any object to it that has a string representation (which is converted to a string). The add methods will use and store the result using a buffer. The methods have the same behavior. The returns an
Create set with enough capacity to hold terms compatibility match version see for details. the initial capacity false if and only if the set should be case sensitive otherwise true. is less than zero. Creates a set from a collection of s. Compatibility match version see for details. A collection whose elements to be placed into the set. false if and only if the set should be case sensitive otherwise true. is null. -or- A given element within the is null. Creates a set from a collection of s. NOTE: If is true, the text arrays will be directly modified. The user should never modify these text arrays after calling this method. Compatibility match version see for details. A collection whose elements to be placed into the set. false if and only if the set should be case sensitive otherwise true. is null. -or- A given element within the is null. Creates a set from a collection of s. Compatibility match version see for details. A collection whose elements to be placed into the set. false if and only if the set should be case sensitive otherwise true. is null. -or- A given element within the is null. -or- The property for a given element in the returns false. Create set from the specified map (internal only), used also by Clears all entries in this set. This method is supported for reusing, but not . true if the chars of starting at are in the set. is null. or is less than zero. and refer to a position outside of . true if the s are in the set is null. true if the is in the set. is null. -or- The 's property returns false. true if the is in the set. is null. true if the representation of is in the set. is null. Adds the representation of into the set. The method is called after setting the thread to . If the type of is a value type, it will be converted using the . A string-able object. true if was added to the set; false if it already existed prior to this call. is null. Adds a into the set The text to be added to the set. true if was added to the set; false if it already existed prior to this call. is null. Adds a into the set The text to be added to the set. true if was added to the set; false if it already existed prior to this call. is null. Adds a directly to the set. NOTE: If ignoreCase is true for this , the text array will be directly modified. The user should never modify this text array after calling this method. The text to be added to the set. true if was added to the set; false if it already existed prior to this call. is null. Adds a to the set using the specified and . NOTE: If ignoreCase is true for this , the text array will be directly modified. The text to be added to the set. The position of the where the target text begins. The total length of the . true if was added to the set; false if it already existed prior to this call. is null. or is less than zero. and refer to a position outside of . LUCENENET specific for supporting . Gets the number of elements contained in the . true if the is read-only; otherwise false. Returns an unmodifiable . This allows to provide unmodifiable views of internal sets for "read-only" use. a set for which the unmodifiable set is returned. an new unmodifiable . if the given set is null. Returns an unmodifiable . This allows to provide unmodifiable views of internal sets for "read-only" use. A new unmodifiable . Returns a copy of this set as a new instance . The and ignoreCase property will be preserved. A copy of this set as a new instance of . The field as well as the will be preserved. 
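A short usage sketch tying these members together (member names as described above; the values used are purely illustrative):

    using Lucene.Net.Analysis.Util;
    using Lucene.Net.Util;

    var stopWords = new CharArraySet(LuceneVersion.LUCENE_48, 16, true /*ignoreCase*/);
    stopWords.Add("and");
    stopWords.Add("the");

    // Membership test against a slice of a token buffer, with no string allocation.
    char[] token = "The".ToCharArray();
    bool isStopWord = stopWords.Contains(token, 0, token.Length);   // true, because ignoreCase is on

    // Publish a read-only view, e.g. for a DEFAULT_STOP_SET style static field.
    CharArraySet readOnly = CharArraySet.UnmodifiableSet(stopWords);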
Returns a copy of this set as a new instance with the provided . The ignoreCase property will be preserved from this . A copy of this set as a new instance of . The field will be preserved. Returns a copy of this set as a new instance with the provided and values. A copy of this set as a new instance of . Returns a copy of the given set as a . If the given set is a the ignoreCase property will be preserved. Note: If you intend to create a copy of another where the of the source set differs from its copy should be used instead. The method will preserve the of the source set it is an instance of . compatibility match version. This argument will be ignored if the given set is a . a set to copy A copy of the given set as a . If the given set is a the field as well as the will be preserved. is null. -or- A given element within the is null. -or- The property for a given element in the returns false. Returns an enumerator that iterates through the . An enumerator that iterates through the . An enumerator remains valid as long as the collection remains unchanged. If changes are made to the collection, such as adding, modifying, or deleting elements, the enumerator is irrecoverably invalidated and the next call to or throws an . This method is an O(log n) operation. Enumerates the elements of a object. This implementation provides direct access to the array of the underlying collection as well as convenience properties for converting to and . The foreach statement of the C# language (for each in C++, For Each in Visual Basic) hides the complexity of enumerators. Therefore, using foreach is recommended instead of directly manipulating the enumerator. Enumerators can be used to read the data in the collection, but they cannot be used to modify the underlying collection. Initially, the enumerator is positioned before the first element in the collection. At this position, the property is undefined. Therefore, you must call the method to advance the enumerator to the first element of the collection before reading the value of . The property returns the same object until is called. sets to the next element. If passes the end of the collection, the enumerator is positioned after the last element in the collection and returns false. When the enumerator is at this position, subsequent calls to also return false. If the last call to returned false, is undefined. You cannot set to the first element of the collection again; you must create a new enumerator object instead. An enumerator remains valid as long as the collection remains unchanged. If changes are made to the collection, such as adding, modifying, or deleting elements, the enumerator is irrecoverably invalidated and the next call to or throws an . The enumerator does not have exclusive access to the collection; therefore, enumerating through a collection is intrinsically not a thread-safe procedure. To guarantee thread safety during enumeration, you can lock the collection during the entire enumeration. To allow the collection to be accessed by multiple threads for reading and writing, you must implement your own synchronization. This method is an O(1) operation. Gets the current value as a . Gets the current value... do not modify the returned char[]. Gets the current value as a newly created object. Releases all resources used by the . Advances the enumerator to the next element of the . true if the enumerator was successfully advanced to the next element; false if the enumerator has passed the end of the collection. 
The collection was modified after the enumerator was created. After an enumerator is created, the enumerator is positioned before the first element in the collection, and the first call to the method advances the enumerator to the first element of the collection. If passes the end of the collection, the enumerator is positioned after the last element in the collection and returns false. When the enumerator is at this position, subsequent calls to also return false. An enumerator remains valid as long as the collection remains unchanged. If changes are made to the collection, such as adding, modifying, or deleting elements, the enumerator is irrecoverably invalidated and the next call to or throws an . Returns a string that represents the current collection. The presentation has a specific format. It is enclosed by curly brackets ("{}"). Keys and values are separated by '=', KeyValuePairs are separated by ', ' (comma and space). null values are represented as the string "null". A string that represents the current collection. Compares the specified object with this set for equality. Returns true if the given object is also a set, the two sets have the same size, and every member of the given set is contained in this set. This ensures that the equals method works properly across different implementations of the interface. This implementation first checks if the specified object is this set; if so it returns true. Then, it checks if the specified object is a set whose size is identical to the size of this set; if not, it returns false. If so, it uses the enumerator of this set and the specified object to determine if all of the contained values are present (using ). object to be compared for equality with this set true if the specified object is equal to this set Returns the hash code value for this set. The hash code of a set is defined to be the sum of the hash codes of the elements in the set, where the hash code of a null element is defined to be zero. This ensures that s1.Equals(s2) implies that s1.GetHashCode()==s2.GetHashCode() for any two sets s1 and s2. This implementation iterates over the set, calling the GetHashCode() method on each element in the set, and adding up the results. the hash code value for this set Copies the entire to a one-dimensional array, starting at the specified index of the target array. The one-dimensional Array that is the destination of the elements copied from . The Array must have zero-based indexing. is null. The number of elements in the source is greater than the available space in the destination array. Copies the entire to a one-dimensional array, starting at the specified index of the target array. The one-dimensional Array that is the destination of the elements copied from . The Array must have zero-based indexing. The zero-based index in array at which copying begins. is null. is less than zero. The number of elements in the source is greater than the available space from to the end of the destination array. Copies the entire to a one-dimensional array, starting at the specified index of the target array. The one-dimensional Array that is the destination of the elements copied from . The Array must have zero-based indexing. The zero-based index in array at which copying begins. is null. or is less than zero. is greater than the length of the destination . -or- is greater than the available space from the to the end of the destination . Copies the entire to a jagged array or of type char[], starting at the specified index of the target array. 
The jagged array or of type char[] that is the destination of the elements copied from . The Array must have zero-based indexing. is null. The number of elements in the source is greater than the available space in the destination array. Copies the entire to a jagged array or of type char[] starting at the specified index of the target array. The jagged array or of type char[] that is the destination of the elements copied from . The Array must have zero-based indexing. The zero-based index in array at which copying begins. is null. is less than zero. The number of elements in the source is greater than the available space from to the end of the destination array. Copies the entire to a jagged array or of type char[] starting at the specified index of the target array. The jagged array or of type char[] that is the destination of the elements copied from . The Array must have zero-based indexing. The zero-based index in array at which copying begins. is null. or is less than zero. is greater than the length of the destination . -or- is greater than the available space from the to the end of the destination . Copies the entire to a one-dimensional array, starting at the specified index of the target array. The one-dimensional Array that is the destination of the elements copied from . The Array must have zero-based indexing. is null. The number of elements in the source is greater than the available space in the destination array. Copies the entire to a one-dimensional array, starting at the specified index of the target array. The one-dimensional Array that is the destination of the elements copied from . The Array must have zero-based indexing. The zero-based index in array at which copying begins. is null. is less than zero. The number of elements in the source is greater than the available space from to the end of the destination array. Copies the entire to a one-dimensional array, starting at the specified index of the target array. The one-dimensional Array that is the destination of the elements copied from . The Array must have zero-based indexing. The zero-based index in array at which copying begins. is null. or is less than zero. is greater than the length of the destination . -or- is greater than the available space from the to the end of the destination . Determines whether the current set and the specified collection contain the same elements. The collection to compare to the current set. true if the current set is equal to other; otherwise, false. is null. Determines whether the current set and the specified collection contain the same elements. The collection to compare to the current set. true if the current set is equal to other; otherwise, false. is null. Determines whether the current set and the specified collection contain the same elements. The collection to compare to the current set. true if the current set is equal to other; otherwise, false. is null. Determines whether the current set and the specified collection contain the same elements. The collection to compare to the current set. true if the current set is equal to other; otherwise, false. is null. Modifies the current to contain all elements that are present in itself, the specified collection, or both. NOTE: If ignoreCase is true for this , the text arrays will be directly modified. The user should never modify these text arrays after calling this method. The collection whose elements should be merged into the . true if this changed as a result of the call. is null. This set instance is read-only. 
Modifies the current to contain all elements that are present in itself, the specified collection, or both. The collection whose elements should be merged into the . true if this changed as a result of the call. is null. -or- A given element within the collection is null. -or- The property for a given element in the collection returns false. This set instance is read-only. Modifies the current to contain all elements that are present in itself, the specified collection, or both. The collection whose elements should be merged into the . true if this changed as a result of the call. is null. This set instance is read-only. Modifies the current to contain all elements that are present in itself, the specified collection, or both. The collection whose elements should be merged into the . true if this changed as a result of the call. is null. This set instance is read-only. Determines whether a object is a subset of the specified collection. The collection to compare to the current object. true if this object is a subset of ; otherwise, false. is null. Determines whether a object is a subset of the specified collection. The collection to compare to the current object. true if this object is a subset of ; otherwise, false. is null. Determines whether a object is a subset of the specified collection. The collection to compare to the current object. true if this object is a subset of ; otherwise, false. is null. Determines whether a object is a subset of the specified collection. The collection to compare to the current object. true if this object is a subset of ; otherwise, false. is null. Determines whether a object is a superset of the specified collection. The collection to compare to the current object. true if this object is a superset of ; otherwise, false. is null. Determines whether a object is a superset of the specified collection. The collection to compare to the current object. true if this object is a superset of ; otherwise, false. is null. Determines whether a object is a superset of the specified collection. The collection to compare to the current object. true if this object is a superset of ; otherwise, false. is null. Determines whether a object is a superset of the specified collection. The collection to compare to the current object. true if this object is a superset of ; otherwise, false. is null. Determines whether a object is a proper subset of the specified collection. The collection to compare to the current object. true if this object is a proper subset of ; otherwise, false. is null. Determines whether a object is a proper subset of the specified collection. The collection to compare to the current object. true if this object is a proper subset of ; otherwise, false. is null. Determines whether a object is a proper subset of the specified collection. The collection to compare to the current object. true if this object is a proper subset of ; otherwise, false. is null. Determines whether a object is a proper subset of the specified collection. The collection to compare to the current object. true if this object is a proper subset of ; otherwise, false. is null. Determines whether a object is a proper superset of the specified collection. The collection to compare to the current object. true if this object is a proper superset of ; otherwise, false. is null. Determines whether a object is a proper superset of the specified collection. The collection to compare to the current object. true if this object is a proper superset of ; otherwise, false. is null. 
Determines whether a object is a proper superset of the specified collection. The collection to compare to the current object. true if this object is a proper superset of ; otherwise, false. is null. Determines whether a object is a proper superset of the specified collection. The collection to compare to the current object. true if this object is a proper superset of ; otherwise, false. is null. Determines whether the current object and a specified collection share common elements. The collection to compare to the current object. true if the object and share at least one common element; otherwise, false. is null. Determines whether the current object and a specified collection share common elements. The collection to compare to the current object. true if the object and share at least one common element; otherwise, false. is null. Determines whether the current object and a specified collection share common elements. The collection to compare to the current object. true if the object and share at least one common element; otherwise, false. is null. Determines whether the current object and a specified collection share common elements. The collection to compare to the current object. true if this object and share at least one common element; otherwise, false. is null. Returns true if this collection contains all of the elements in the specified collection. collection to be checked for containment in this collection true if this contains all of the elements in the specified collection; otherwise, false. Returns true if this collection contains all of the elements in the specified collection. collection to be checked for containment in this collection true if this contains all of the elements in the specified collection; otherwise, false. Returns true if this collection contains all of the elements in the specified collection. collection to be checked for containment in this collection true if this contains all of the elements in the specified collection; otherwise, false. Returns true if this collection contains all of the elements in the specified collection. collection to be checked for containment in this collection true if this contains all of the elements in the specified collection; otherwise, false. Returns true if this collection contains all of the elements in the specified collection. collection to be checked for containment in this collection true if this contains all of the elements in the specified collection; otherwise, false. Returns true if this collection contains all of the elements in the specified collection. collection to be checked for containment in this collection true if this contains all of the elements in the specified collection; otherwise, false. Extensions to for . Returns a copy of this as a new instance of with the specified and ignoreCase set to false. The type of collection. Typically a or . This collection. Compatibility match version. A copy of this as a . is null. Returns a copy of this as a new instance of with the specified and . The type of collection. Typically a or . This collection. Compatibility match version. false if and only if the set should be case sensitive otherwise true. A copy of this as a . is null. Abstract parent class for analysis factories that create instances. looks up a charfilter by name from the host project's dependent assemblies looks up a charfilter class by name from the host project's dependent assemblies returns a list of all available charfilter names Reloads the factory list. 
Changes to the factories are visible after the method ends, all iterators (,...) stay consistent. NOTE: Only new factories are added, existing ones are never removed or replaced. This method is expensive and should only be called for discovery of new factories on the given classpath/classloader! Initialize this factory via a set of key-value pairs. Wraps the given with a . An abstract base class for simple, character-oriented tokenizers. You must specify the required compatibility when creating : As of 3.1, uses an int based API to normalize and detect token codepoints. See and for details. A new API has been introduced with Lucene 3.1. This API moved from UTF-16 code units to UTF-32 codepoints to eventually add support for supplementary characters. The old char based API has been deprecated and should be replaced with the int based methods and . As of Lucene 3.1 each - constructor expects a argument. Based on the given either the new API or a backwards compatibility layer is used at runtime. For < 3.1 the backwards compatibility layer ensures correct behavior even for indexes build with previous versions of Lucene. If a >= 3.1 is used requires the new API to be implemented by the instantiated class. Yet, the old char based API is not required anymore even if backwards compatibility must be preserved. subclasses implementing the new API are fully backwards compatible if instantiated with < 3.1. Note: If you use a subclass of with >= 3.1 on an index build with a version < 3.1, created tokens might not be compatible with the terms in your index. Creates a new instance Lucene version to match the input to split up into tokens Creates a new instance Lucene version to match the attribute factory to use for this the input to split up into tokens LUCENENET specific - Added in the .NET version to assist with setting the attributes from multiple constructors. Returns true iff a codepoint should be included in a token. This tokenizer generates as tokens adjacent sequences of codepoints which satisfy this predicate. Codepoints for which this is false are used to define token boundaries and are not included in tokens. Called on each token character to normalize it before it is added to the token. The default implementation does nothing. Subclasses may use this to, e.g., lowercase tokens. Simple that uses and to open resources and s, respectively. Creates an instance using the System.Assembly of the given class to load Resources and classes Resource paths must be absolute. Removes elisions from a . For example, "l'avion" (the plane) will be tokenized as "avion" (plane). Elision in Wikipedia Constructs an elision filter with a of stop words the source a set of stopword articles Increments the with a without elisioned start Factory for . <fieldType name="text_elsn" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.ElisionFilterFactory" articles="stopwordarticles.txt" ignoreCase="true"/> </analyzer> </fieldType> Creates a new Simple that opens resource files from the local file system, optionally resolving against a base directory. This loader wraps a delegate that is used to resolve all files, the current base directory does not contain. is always resolved against the delegate, as an is needed. You can chain several s to allow lookup of files in more than one base directory. Creates a resource loader that requires absolute filenames or relative to CWD to resolve resources. 
Files not found in file system and class lookups are delegated to context classloader. Creates a resource loader that resolves resources against the given base directory (may be null to refer to CWD). Files not found in file system and class lookups are delegated to context classloader. Creates a resource loader that resolves resources against the given base directory (may be null to refer to CWD). Files not found in file system and class lookups are delegated to the given delegate . Abstract base class for TokenFilters that may remove tokens. You have to implement and return a boolean indicating whether the current token should be preserved. uses this method to decide if a token should be passed to the caller. As of Lucene 4.4, an is thrown when trying to disable position increments when filtering terms. Create a new . the Lucene match version whether to increment position increments when filtering out terms the input to consume @deprecated enablePositionIncrements=false is not supported anymore as of Lucene 4.4 Create a new . the Lucene match version the to consume Override this method and return if the current input token should be returned by . If true, this will preserve positions of the incoming tokens (i.e., accumulate and set position increments of the removed tokens). Generally, true is best as it does not lose information (positions of the original tokens) during indexing. When set, when a token is stopped (omitted), the position increment of the following token is incremented. Add to any analysis factory component to allow returning an analysis component factory for use with partial terms in prefix queries, wildcard queries, range query endpoints, regex queries, etc. @lucene.experimental Returns an analysis component to handle analysis of multi-term queries. The returned component must be a , or . A StringBuilder that allows one to access the array. Abstraction for loading resources (streams, files, and classes). Opens a named resource Finds class of the name NOTE: This was findClass() in Lucene Creates an instance of the name and expected type Interface for a component that needs to be initialized by an implementation of . Initializes this component with the provided (used for loading types, embedded resources, files, etc). Acts like a forever growing as you read characters into it from the provided reader, but internally it uses a circular buffer to only hold the characters that haven't been freed yet. This is like a PushbackReader, except you don't have to specify up-front the max size of the buffer, but you do have to periodically call . Clear array and switch to new reader. Absolute position read. NOTE: pos must not jump ahead by more than 1! I.e., it's OK to read arbitrarily far back (just not prior to the last ), but NOT ok to read arbitrarily far ahead. Returns -1 if you hit EOF. Call this to notify us that no chars before this absolute position are needed anymore. Some commonly-used stemming functions @lucene.internal Returns true if the character array starts with the prefix. Input Buffer length of input buffer Prefix string to test true if starts with Returns true if the character array ends with the suffix. Input Buffer length of input buffer Suffix string to test true if ends with Returns true if the character array ends with the suffix.
Input Buffer length of input buffer Suffix string to test true if ends with Delete a character in-place Input Buffer Position of character to delete length of input buffer length of input buffer after deletion Delete n characters in-place Input Buffer Position of character to delete Length of input buffer number of characters to delete length of input buffer after deletion Base class for s that need to make use of stopword sets. An immutable stopword set Returns the analyzer's stopword set or an empty set if the analyzer has no stopwords the analyzer's stopword set or an empty set if the analyzer has no stopwords Creates a new instance initialized with the given stopword set the Lucene version for cross version compatibility the analyzer's stopword set Creates a new with an empty stopword set the Lucene version for cross version compatibility Creates a from an embedded resource associated with a class. (See ). true if the set should ignore the case of the stopwords, otherwise false a class that is associated with the given stopwordResource name of the resource file associated with the given class comment string to ignore in the stopword file a containing the distinct stopwords from the given file if loading the stopwords throws an Creates a from a file. the stopwords file to load the Lucene version for cross version compatibility a containing the distinct stopwords from the given file if loading the stopwords throws an Creates a from a file. the stopwords reader to load the Lucene version for cross version compatibility a containing the distinct stopwords from the given reader if loading the stopwords throws an Abstract parent class for analysis factories that create instances. looks up a tokenfilter by name from the host project's referenced assemblies looks up a tokenfilter class by name from the host project's referenced assemblies returns a list of all available tokenfilter names from the host project's referenced assemblies Reloads the factory list. Changes to the factories are visible after the method ends, all iterators (,...) stay consistent. NOTE: Only new factories are added, existing ones are never removed or replaced. This method is expensive and should only be called for discovery of new factories on the given classpath/classloader! Initialize this factory via a set of key-value pairs. Transform the specified input Abstract parent class for analysis factories that create instances. looks up a tokenizer by name from the host project's referenced assemblies looks up a tokenizer class by name from the host project's referenced assemblies returns a list of all available tokenizer names from the host project's referenced assemblies Reloads the factory list. Changes to the factories are visible after the method ends, all iterators (,...) stay consistent. NOTE: Only new factories are added, existing ones are never removed or replaced. This method is expensive and should only be called for discovery of new factories on the given classpath/classloader! Initialize this factory via a set of key-value pairs. Creates a of the specified input using the default attribute factory. Creates a of the specified input using the given Loader for text files that represent a list of stopwords. to obtain instances. @lucene.internal Reads lines from a and adds every line as an entry to a (omitting leading and trailing whitespace). Every line of the should contain only one word. The words need to be in lowercase if you make use of an which uses (like ). 
containing the wordlist the to fill with the reader's words the given with the reader's words Reads lines from a and adds every line as an entry to a (omitting leading and trailing whitespace). Every line of the should contain only one word. The words need to be in lowercase if you make use of an which uses (like ). containing the wordlist the A with the reader's words Reads lines from a and adds every non-comment line as an entry to a (omitting leading and trailing whitespace). Every line of the should contain only one word. The words need to be in lowercase if you make use of an which uses (like ). containing the wordlist The string representing a comment. the A CharArraySet with the reader's words Reads lines from a and adds every non-comment line as an entry to a (omitting leading and trailing whitespace). Every line of the should contain only one word. The words need to be in lowercase if you make use of an which uses (like ). containing the wordlist The string representing a comment. the to fill with the reader's words the given with the reader's words Reads stopwords from a stopword list in Snowball format. The snowball format is the following: Lines may contain multiple words separated by whitespace. The comment character is the vertical line (|). Lines may contain trailing comments. containing a Snowball stopword list the to fill with the reader's words the given with the reader's words Reads stopwords from a stopword list in Snowball format. The snowball format is the following: Lines may contain multiple words separated by whitespace. The comment character is the vertical line (|). Lines may contain trailing comments. containing a Snowball stopword list the Lucene A with the reader's words Reads a stem dictionary. Each line contains: word\tstem (i.e. two tab-separated words) stem dictionary that overrules the stemming algorithm If there is a low-level I/O error. Accesses a resource by name and returns the (non-comment) lines containing data using the given character encoding. A comment line is any line that starts with the character "#" a list of non-blank non-comment lines with whitespace trimmed If there is a low-level I/O error. Extension of that is aware of Wikipedia syntax. It is based on the Wikipedia tutorial available at http://en.wikipedia.org/wiki/Wikipedia:Tutorial, but it may not be complete. @lucene.experimental String token types that correspond to token type int constants Only output tokens Only output untokenized tokens, which are tokens that would normally be split into several tokens Output both the untokenized token and the splits This flag is used to indicate that the produced "Token" would, if was used, produce multiple tokens. A private instance of the JFlex-constructed scanner Creates a new instance of the . Attaches the to a newly created JFlex scanner. The Input Creates a new instance of the . Attaches the to the newly created JFlex scanner. The input One of , , Untokenized types Creates a new instance of the . Attaches the to the newly created JFlex scanner. Uses the given . The The input One of , , Untokenized types Factory for . <fieldType name="text_wiki" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WikipediaTokenizerFactory"/> </analyzer> </fieldType> Creates a new JFlex-generated tokenizer that is aware of Wikipedia syntax.
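By way of example, the tokenizer can also be driven directly from C# rather than through the factory. The following sketch assumes the usual 4.8 attribute-based consumption pattern and the Lucene.Net.Analysis.Wikipedia namespace:

    using System;
    using System.IO;
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Analysis.Wikipedia;

    string wikiText = "[[Category:Weather]] The '''sun''' is a star.";
    using (var tokenizer = new WikipediaTokenizer(new StringReader(wikiText)))
    {
        var term = tokenizer.AddAttribute<ICharTermAttribute>();
        var type = tokenizer.AddAttribute<ITypeAttribute>();

        tokenizer.Reset();
        while (tokenizer.IncrementToken())
        {
            // Wiki-specific constructs (categories, bold/italic text, links, ...)
            // are reported with their own token type values.
            Console.WriteLine($"{term} [{type.Type}]");
        }
        tokenizer.End();
    }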
This character denotes the end of file initial size of the lookahead buffer lexical states ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line l is of the form l = 2*k, k a non negative integer Translates characters to character classes Translates characters to character classes Translates DFA states to action switch labels. Translates a state to a row index in the transition table The transition table of the DFA ZZ_ATTRIBUTE[aState] contains the attributes of state aState the input device the current state of the DFA the current lexical state this buffer contains the current text to be matched and is the source of the YyText string the textposition at the last accepting state the current text position in the buffer startRead marks the beginning of the YyText string in the buffer endRead marks the last character in the buffer, that has been read from input the number of characters up to the start of the matched text zzAtEOF == true <=> the scanner is at the EOF Returns the number of tokens seen inside a category or link, etc. the number of tokens seen inside the context of wiki syntax. Fills Lucene token with the current token text. Creates a new scanner the TextReader to read input from. Unpacks the compressed character translation table. the packed character translation table the unpacked character translation table Refills the input buffer. false, iff there was new input. if any I/O-Error occurs Disposes the input stream. Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset, the old input stream cannot be reused (internal buffer is discarded and lost). Lexical state is set to . Internal scan buffer is resized down to its initial length, if it has grown. the new input stream Returns the current lexical state. Enters a new lexical state the new lexical state Returns the text matched by the current regular expression. Returns the character at position from the matched text. It is equivalent to YyText[pos], but faster the position of the character to fetch. A value from 0 to YyLength-1. the character at position pos Returns the length of the matched text region. Reports an error that occured while scanning. In a wellformed scanner (no or only correct usage of YyPushBack(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen". If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner etc.). Usual syntax/scanner level error handling should be done in error fallback rules. the code of the errormessage to display Pushes the specified amount of characters back into the input stream. They will be read again by then next call of the scanning method the number of characters to be read again. This number must not be greater than YyLength! Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs. the next token if any I/O-Error occurs Snowball's among construction. Search string. Index to longest matching substring. Result of the lookup. Action to be invoked. Initializes a new instance of the class. The search string. The index to the longest matching substring. The result of the lookup. Initializes a new instance of the class. The search string. The index to the longest matching substring. The result of the lookup. The action to be performed, if any. 
Returns a that represents this instance. A that represents this instance. This class was automatically generated by a Snowball to Java compiler; it implements the stemming algorithm defined by a snowball script. (The same summary applies to each of the generated language-specific stemmer classes in this package.) This is the rev 502 of the Snowball SVN trunk, but modified:
  • made abstract and introduced abstract method stem to avoid expensive reflection in filter class
  • refactored StringBuffers to StringBuilder
  • uses char[] as buffer instead of StringBuffer/StringBuilder
  • eq_s, eq_s_b, insert, replace_s take CharSequence like eq_v and eq_v_b
  • reflection calls (Lovins, etc.) use EMPTY_ARGS/EMPTY_PARAMS
Set the current string. Set the current string. character array containing input valid length of text. Get the current string. Get the current buffer containing the stem. NOTE: this may be a reference to a different character array than the one originally provided with setCurrent, in the exceptional case that stemming produced a longer intermediate or result string. It is necessary to use to determine the valid length of the returned buffer. For example, many words are stemmed simply by subtracting from the length to remove suffixes. Get the valid length of the character array in to replace chars between and in current by the chars in . Copy the slice into the supplied
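For completeness, a minimal sketch of driving one of the generated stemmers directly. The namespace and class name used here (Lucene.Net.Tartarus.Snowball.Ext.EnglishStemmer) and the SetCurrent/Stem/Current members are assumptions based on the port described above; in an analysis chain a SnowballFilter would normally be used instead.

    // Hypothetical direct use of a generated Snowball stemmer.
    var stemmer = new Lucene.Net.Tartarus.Snowball.Ext.EnglishStemmer();
    stemmer.SetCurrent("consolidating");   // set the current string
    stemmer.Stem();                        // run the generated algorithm
    string stem = stemmer.Current;         // expected to be "consolid" for English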