Lucene.Net.Queries
A container Filter that allows Boolean composition of Filters.
Filters are allocated into one of three logical constructs:
SHOULD, MUST NOT, MUST.
The results Filter BitSet is constructed as follows:
SHOULD Filters are OR'd together.
The resulting Filter is NOT'd with the NOT Filters.
The resulting Filter is AND'd with the MUST Filters.
Returns a DocIdSet representing the Boolean composition
of the filters that have been added.
Adds a new FilterClause to the Boolean Filter container.
A FilterClause object containing a Filter and an Occur parameter
Gets the list of clauses
Returns an enumerator over the clauses in this filter. It implements the
IEnumerable<FilterClause> interface to make it possible to do:
foreach (FilterClause clause in booleanFilter) {}
Prints a user-readable version of this Filter.
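For example, a minimal sketch of composing term filters (the field names are hypothetical, and the usual Lucene.Net.Queries / Lucene.Net.Search namespaces are assumed):
var booleanFilter = new BooleanFilter
{
    new FilterClause(new TermFilter(new Term("status", "published")), Occur.MUST),
    new FilterClause(new TermFilter(new Term("lang", "en")), Occur.SHOULD),
    new FilterClause(new TermFilter(new Term("flag", "spam")), Occur.MUST_NOT)
};
The collection-initializer form works because BooleanFilter exposes Add(FilterClause) and implements IEnumerable<FilterClause>.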
The BoostingQuery class can be used to effectively demote results that match a given query.
Unlike the "NOT" clause, this still selects documents that contain undesirable terms,
but reduces their overall score:
Query balancedQuery = new BoostingQuery(positiveQuery, negativeQuery, 0.01f);
In this scenario the positiveQuery contains the mandatory, desirable criteria used to
select all matching documents, and the negativeQuery contains the undesirable elements
that are simply used to lessen the scores. Documents that match the negativeQuery have
their score multiplied by the supplied "boost" parameter, so this should be less than 1
to achieve a demoting effect.
This code was originally made available here: http://marc.theaimsgroup.com/?l=lucene-user&m=108058407130459&w=2
and is documented here: http://wiki.apache.org/lucene-java/CommunityContributions
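A slightly fuller sketch of the pattern above (the term values are hypothetical):
Query positiveQuery = new TermQuery(new Term("body", "apache"));      // mandatory, desirable criteria
Query negativeQuery = new TermQuery(new Term("body", "deprecated"));  // undesirable terms to demote
Query balancedQuery = new BoostingQuery(positiveQuery, negativeQuery, 0.01f);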
Allows multiple Filters to be chained.
Logical operations such as NOT and XOR
are applied between filters. One operation can be used
for all filters, or a specific operation can be declared
for each filter.
The order in which filters are called depends on
the position of the filter in the chain. It's probably
more efficient to place the most restrictive (and least
computationally intensive) filters first.
Logical operation when none is declared. Defaults to OR.
The filter chain
Ctor.
The chain of filters
Ctor.
The chain of filters
Logical operations to apply between filters
Ctor.
The chain of filters
Logical operation to apply to ALL filters
Delegates to each filter in the chain.
AtomicReaderContext
Logical operation
DocIdSet
Delegates to each filter in the chain.
AtomicReaderContext
Logical operation
DocIdSet
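For example, a sketch that ORs the first filter into the result and then AND-NOTs the second against it (the filters and field names are hypothetical):
Filter[] chain =
{
    new TermFilter(new Term("type", "article")),
    new TermFilter(new Term("status", "retracted"))
};
int[] logic = { ChainedFilter.OR, ChainedFilter.ANDNOT };
Filter chained = new ChainedFilter(chain, logic);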
A query that executes high-frequency terms in an optional sub-query to prevent
slow queries due to "common" terms like stopwords. This query
builds 2 queries off the added terms: low-frequency
terms are added to a required boolean clause and high-frequency terms are
added to an optional boolean clause. The optional clause is only executed if
the required "low-frequency" clause matches. Scores produced by this query
will be slightly different from those of a plain boolean query, mainly due to
differences in the number of leaf queries
in the required boolean clause. In most cases, high-frequency terms are
unlikely to significantly contribute to the document score unless at least
one of the low-frequency terms is matched. This query can improve
query execution times significantly if applicable.
CommonTermsQuery has several advantages over stopword filtering at
index or query time since a term can be "classified" based on the actual
document frequency in the index and can prevent slow queries even across
domains without specialized stopword files.
Note: if the query only contains high-frequency terms the query is
rewritten into a plain conjunction query, i.e. all high-frequency terms need to
match in order to match a document.
Collection initializer note: To create and populate a CommonTermsQuery
in a single statement, you can use the following example as a guide
(the constructor arguments shown are illustrative):
var query = new CommonTermsQuery(Occur.SHOULD, Occur.SHOULD, 0.1f) {
    new Term("field", "microsoft"),
    new Term("field", "office")
};
Creates a new CommonTermsQuery.
The Occur used for high frequency terms.
The Occur used for low frequency terms.
A value in [0..1) (or an absolute number >= 1) representing the
maximum threshold of a term's document frequency to be considered a
low frequency term.
Throws an exception if Occur.MUST_NOT is passed as the high or low frequency Occur.
Creates a new CommonTermsQuery.
The Occur used for high frequency terms.
The Occur used for low frequency terms.
A value in [0..1) (or an absolute number >= 1) representing the
maximum threshold of a term's document frequency to be considered a
low frequency term.
Disables coord in scoring for the low
/ high frequency sub-queries.
Throws an exception if Occur.MUST_NOT is passed as the high or low frequency Occur.
Adds a term to the CommonTermsQuery.
The term to add.
Returns true iff coord is disabled in scoring
for the high and low frequency query instances. The top level query will
always disable coords.
Gets or Sets a minimum number of the low frequency optional BooleanClauses which must be
satisfied in order to produce a match on the low frequency terms query
part. This property accepts a float value in the range [0..1) as a fraction
of the actual query terms in the low frequency clause, or a number
>= 1 as an absolute number of clauses that need to match.
By default no optional clauses are necessary for a match (unless there are
no required clauses). If this property is set, then the specified number of
clauses is required.
Gets or Sets a minimum number of the high frequency optional BooleanClauses which must be
satisfied in order to produce a match on the high frequency terms query
part. This property accepts a float value in the range [0..1) as a fraction
of the actual query terms in the high frequency clause, or a number
>= 1 as an absolute number of clauses that need to match.
By default no optional clauses are necessary for a match (unless there are
no required clauses). If this property is set, then the specified number of
clauses is required.
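Putting these pieces together, a sketch (the field, terms, and frequency cutoff are hypothetical):
var query = new CommonTermsQuery(Occur.SHOULD, Occur.SHOULD, 0.001f)
{
    new Term("body", "the"),
    new Term("body", "quick"),
    new Term("body", "fox")
};
// Require at least half of the low-frequency terms to match.
query.LowFreqMinimumNumberShouldMatch = 0.5f;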
Builds a new TermQuery instance.
This is intended for subclasses that wish to customize the generated queries.
The term.
The TermContext to be used to create the low level term query. Can be null.
A new TermQuery instance.
Returns an enumerator that iterates through the collection.
An enumerator that can be used to iterate through the collection.
Returns an enumerator that iterates through the collection.
An enumerator that can be used to iterate through the collection.
An instance of this subclass should be returned by
CustomScoreQuery.GetCustomScoreProvider(), if you want
to modify the custom score calculation of a CustomScoreQuery.
Since Lucene 2.9, queries operate on each segment of an index separately,
so the protected context field can be used to resolve doc IDs,
as the supplied doc ID is per-segment and without knowledge
of the IndexReader you cannot access the document or FieldCache.
@lucene.experimental
@since 2.9.2
Creates a new instance of the provider class for the given reader context.
Compute a custom score by the subQuery score and a number of
FunctionQuery scores.
Subclasses can override this method to modify the custom score.
If your custom scoring is different than the default herein you
should override at least one of the two CustomScore() methods.
If the number of FunctionQuerys is always < 2 it is
sufficient to override the other
CustomScore(int, float, float) method, which is simpler.
The default computation herein is a multiplication of the given scores:
ModifiedScore = subQueryScore * valSrcScores[0] * valSrcScores[1] * ...
The id of the scored doc.
The score of that doc by the subQuery.
The scores of that doc by the FunctionQuerys.
The custom score.
Compute a custom score by the subQuery score and the FunctionQuery score.
Subclasses can override this method to modify the custom score.
If your custom scoring is different than the default herein you
should override at least one of the two CustomScore() methods.
If the number of FunctionQuerys is always < 2 it is
sufficient to override this method, which is simpler.
The default computation herein is a multiplication of the two scores:
ModifiedScore = subQueryScore * valSrcScore
The id of the scored doc.
The score of that doc by the subQuery.
The score of that doc by the FunctionQuery.
The custom score.
Explain the custom score.
Whenever overriding CustomScore(int, float, float[]),
this method should also be overridden to provide the correct explanation
for the part of the custom scoring.
The doc being explained.
The explanation for the sub-query part.
The explanations for the value source parts.
An explanation for the custom score.
Explain the custom score.
Whenever overriding CustomScore(int, float, float),
this method should also be overridden to provide the correct explanation
for the part of the custom scoring.
The doc being explained.
The explanation for the sub-query part.
The explanation for the value source part.
An explanation for the custom score.
Query that sets document score as a programmatic function of several (sub) scores:
- the score of its subQuery (any query)
- (optional) the score of its FunctionQuery (or queries).
Subclasses can modify the computation by overriding GetCustomScoreProvider().
@lucene.experimental
Create a CustomScoreQuery over input subQuery.
The sub query whose score is being customized. Must not be null.
Create a CustomScoreQuery over input subQuery and a FunctionQuery.
The sub query whose score is being customized. Must not be null.
A value source query whose scores are used in the custom score
computation. This parameter is optional - it can be null.
Create a CustomScoreQuery over input subQuery and multiple FunctionQuerys.
The sub query whose score is being customized. Must not be null.
Value source queries whose scores are used in the custom score
computation. This parameter is optional - it can be null or even an empty array.
Returns true if o is equal to this.
Returns a hash code value for this object.
Returns a CustomScoreProvider that calculates the custom scores
for the given IndexReader. The default implementation returns a default
implementation as specified in the docs of CustomScoreProvider.
@since 2.9.2
A scorer that applies a (callback) function on scores of the subQuery.
Checks if this is strict custom scoring.
In strict custom scoring, the ValueSource part does not participate in weight normalization.
This may be useful when one wants full control over how scores are modified, and does
not care about normalizing by the ValueSource part.
One particular case where this is useful is for testing this query.
Note: only has effect when the ValueSource part is not null.
The sub-query that CustomScoreQuery wraps, affecting both the score and which documents match.
The scoring queries that only affect the score of CustomScoreQuery.
A short name of this query, used in ToString().
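A sketch of the override pattern described above; the square-root dampening and the class names here are illustrative, not part of the library:
public class SqrtScoreProvider : CustomScoreProvider
{
    public SqrtScoreProvider(AtomicReaderContext context) : base(context) { }

    // Dampen the value-source contribution instead of multiplying it in directly.
    public override float CustomScore(int doc, float subQueryScore, float valSrcScore)
        => subQueryScore * (float)System.Math.Sqrt(valSrcScore);
}

public class SqrtCustomScoreQuery : CustomScoreQuery
{
    public SqrtCustomScoreQuery(Query subQuery, FunctionQuery scoringQuery)
        : base(subQuery, scoringQuery) { }

    protected override CustomScoreProvider GetCustomScoreProvider(AtomicReaderContext context)
        => new SqrtScoreProvider(context);
}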
A Filter wrapped with an indication of how that filter
is used when composed with another filter.
(Follows the boolean logic in BooleanClause for composition
of queries.)
Create a new FilterClause.
A Filter object containing a BitSet.
An Occur parameter indicating SHOULD, MUST or MUST NOT.
Returns this FilterClause's filter.
A Filter object.
Returns this FilterClause's occur parameter.
An Occur object.
Query that is boosted by a ValueSource.
Abstract FunctionValues implementation which supports retrieving values.
Implementations can control how the values are loaded.
NOTE: This was shortVal() in Lucene
NOTE: This was floatVal() in Lucene
NOTE: This was intVal() in Lucene
NOTE: This was longVal() in Lucene
Serves as base class for FunctionValues based on DocTermsIndex.
@lucene.internal
Custom exception to be thrown when the DocTermsIndex for a field cannot be generated.
Initializes a new instance of this class with serialized data.
The SerializationInfo that holds the serialized object data about the exception being thrown.
The StreamingContext that contains contextual information about the source or destination.
Abstract FunctionValues implementation which supports retrieving values.
Implementations can control how the values are loaded.
NOTE: This was shortVal() in Lucene
NOTE: This was floatVal() in Lucene
NOTE: This was intVal() in Lucene
NOTE: This was longVal() in Lucene
Abstract FunctionValues implementation which supports retrieving values.
Implementations can control how the values are loaded.
NOTE: This was FloatDocValues in Lucene
NOTE: This was shortVal() in Lucene
NOTE: This was floatVal() in Lucene
NOTE: This was intVal() in Lucene
NOTE: This was longVal() in Lucene
Abstract FunctionValues implementation which supports retrieving values.
Implementations can control how the values are loaded.
NOTE: This was IntDocValues in Lucene
NOTE: This was shortVal() in Lucene
NOTE: This was floatVal() in Lucene
NOTE: This was intVal() in Lucene
NOTE: This was longVal() in Lucene
Abstract FunctionValues implementation which supports retrieving values.
Implementations can control how the values are loaded.
NOTE: This was LongDocValues in Lucene
NOTE: This was shortVal() in Lucene
NOTE: This was floatVal() in Lucene
NOTE: This was intVal() in Lucene
NOTE: This was externalToLong() in Lucene
Abstract FunctionValues implementation which supports retrieving values.
Implementations can control how the values are loaded.
Returns a score for each document based on a ValueSource,
often some function of the value of a field.
Note: This API is experimental and may change in non backward-compatible ways in the future.
Defines the function to be used for scoring.
The associated ValueSource.
Prints a user-readable version of this query.
Returns true if o is equal to this.
Returns a hash code value for this object.
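For instance, a sketch that scores every document by a numeric field (the "popularity" field and searcher are hypothetical; Int64FieldSource was LongFieldSource in Lucene):
var popularity = new Int64FieldSource("popularity");
Query q = new FunctionQuery(popularity);
TopDocs top = searcher.Search(q, 10);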
Represents field values as different types.
Normally created via a ValueSource for a particular field and reader.
NOTE: This was shortVal() in Lucene
NOTE: This was floatVal() in Lucene
NOTE: This was intVal() in Lucene
NOTE: This was longVal() in Lucene
returns the bytes representation of the string value - TODO: should this return the indexed raw bytes instead?
Native representation of the value
Returns true if there is a value for this document.
The doc to retrieve the sort ordinal for.
The sort ordinal for the specified doc.
TODO: Maybe we can just use IntVal for this...
The number of unique sort ordinals this instance has.
Abstraction of the logic required to fill the value of a specified doc into
a reusable MutableValue. Implementations of FunctionValues
are encouraged to define their own implementations of ValueFiller if their
value is not a float.
@lucene.experimental
The MutableValue will be reused across calls.
The MutableValue will be reused across calls. Returns true if the value exists.
This class may be used to create FunctionValues instances anonymously.
@lucene.experimental
NOTE: This was shortVal() in Lucene
NOTE: This was floatVal() in Lucene
NOTE: This was intVal() in Lucene
NOTE: This was longVal() in Lucene
Instantiates FunctionValues for a particular reader.
Often used when creating a FunctionQuery.
Gets the values for this reader and the context that was previously
passed to CreateWeight().
A description of the field, used in Explain().
Implementations should propagate CreateWeight to sub-ValueSources which can optionally store
weight info in the context. The context object will be passed to GetValues()
where this info can be retrieved.
Returns a new non-threadsafe context map.
EXPERIMENTAL: This method is subject to change.
Gets the SortField for this ValueSource. Uses the
FunctionValues to populate the SortField.
True if this is a reverse sort.
The SortField for the ValueSource.
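A sketch of function-based sorting (the "price" field and searcher are hypothetical; Int32FieldSource was IntFieldSource in Lucene):
var price = new Int32FieldSource("price");
Sort sort = new Sort(price.GetSortField(false));
TopDocs hits = searcher.Search(new MatchAllDocsQuery(), 10, sort);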
Implement a comparer that works
off of the FunctionValues for a ValueSource
instead of the normal Lucene FieldComparer that works off of a FieldCache.
A Scorer which returns the result of FloatVal() as
the score for a document.
When overriding this class, be aware that the ValueSourceScorer constructor is calling
its private SetCheckDeletesInternal method as opposed to the virtual SetCheckDeletes method.
This is done to avoid a virtual call in the constructor. You can call your own private
method for CheckDeletes initialization in your constructor if you need to.
This class may be used to create ValueSourceScorer instances anonymously.
Abstract parent class for those ValueSource implementations which
apply boolean logic to their values.
Obtains byte field values from the FieldCache
using GetBytes()
and makes those values available as other numeric types, casting as needed.
NOTE: This was shortVal() in Lucene
NOTE: This was floatVal() in Lucene
NOTE: This was intVal() in Lucene
NOTE: This was longVal() in Lucene
An implementation for retrieving FunctionValues instances for string-based fields.
ConstNumberSource is the base class for all constant numbers.
NOTE: This was getInt() in Lucene
NOTE: This was getLong() in Lucene
NOTE: This was getFloat() in Lucene
ConstValueSource returns a constant for all documents.
NOTE: This was floatVal() in Lucene
NOTE: This was intVal() in Lucene
NOTE: This was longVal() in Lucene
NOTE: This was getInt() in Lucene
NOTE: This was getLong() in Lucene
NOTE: This was getFloat() in Lucene
FunctionValues implementation which only returns the values from the provided
ValueSources which are available for a particular docId. Consequently, when combined
with a ConstValueSource, this function serves as a way to return a default
value when the values for a field are unavailable.
NOTE: This was shortVal() in Lucene
NOTE: This was floatVal() in Lucene
NOTE: This was intVal() in Lucene
NOTE: This was longVal() in Lucene
Function to divide "a" by "b"
NOTE: This was DivFloatFunction in Lucene
the numerator.
the denominator.
NOTE: This was ConstIntDocValues in Lucene
NOTE: This was floatVal() in Lucene
NOTE: This was intVal() in Lucene
NOTE: This was longVal() in Lucene
NOTE: This was floatVal() in Lucene
NOTE: This was intVal() in Lucene
NOTE: This was longVal() in Lucene
returns the number of documents containing the term.
@lucene.internal
Function that returns a constant double value for every document.
NOTE: This was floatVal() in Lucene
NOTE: This was intVal() in Lucene
NOTE: This was longVal() in Lucene
NOTE: This was getInt() in Lucene
NOTE: This was getLong() in Lucene
NOTE: This was getFloat() in Lucene
Obtains double field values from the FieldCache and makes
those values available as other numeric types, casting as needed.
Abstract ValueSource implementation which wraps two ValueSources
and applies an extendible float function to their values.
NOTE: This was DualFloatFunction in Lucene
The first source.
The second source.
NOTE: This was floatVal() in Lucene
Obtains enum field values from the FieldCache and makes
those values available as other numeric types, casting as needed.
The StrVal of the value is not the numeric value, but its (displayed) string value.
NOTE: This was intValueToStringValue() in Lucene
NOTE: This was stringValueToIntValue() in Lucene
NOTE: This was floatVal() in Lucene
NOTE: This was intVal() in Lucene
NOTE: This was longVal() in Lucene
A base class for ValueSource implementations that retrieve values for
a single field from the FieldCache.
Obtains float field values from the FieldCache and makes those
values available as other numeric types, casting as needed.
NOTE: This was FloatFieldSource in Lucene
NOTE: This was floatVal() in Lucene
Function that returns TFIDFSimilarity.Idf(long, long)
for every document.
Note that the configured Similarity for the field must be
a subclass of TFIDFSimilarity.
@lucene.internal
Depending on the value of the ifSource function,
returns the value of the trueSource or falseSource function.
NOTE: This was shortVal() in Lucene
NOTE: This was floatVal() in Lucene
NOTE: This was intVal() in Lucene
NOTE: This was longVal() in Lucene
Obtains int field values from the FieldCache and makes those
values available as other numeric types, casting as needed.
NOTE: This was IntFieldSource in Lucene
NOTE: This was floatVal() in Lucene
NOTE: This was intVal() in Lucene
NOTE: This was longVal() in Lucene
Use a field value and find the Document Frequency within another field.
@since solr 4.0
NOTE: This was intVal() in Lucene
LinearSingleFunction implements a linear function over
another ValueSource.
Normally used as an argument to a FunctionQuery.
NOTE: This was LinearFloatFunction in Lucene
NOTE: This was floatVal() in Lucene
Passes a field value through as a string, no matter the type.
Returns the literal string value.
Obtains long field values from the FieldCache and makes those
values available as other numeric types, casting as needed.
NOTE: This was LongFieldSource in Lucene
NOTE: This was externalToLong() in Lucene
NOTE: This was longToObject() in Lucene
NOTE: This was longToString() in Lucene
NOTE: This was longVal() in Lucene
NOTE: This was externalToLong() in Lucene
NOTE: This was longToString() in Lucene
Returns the value of IndexReader.MaxDoc
for every document. This is the number of documents
including deletions.
Returns the max of its components.
NOTE: This was MaxFloatFunction in Lucene
Returns the min of its components.
NOTE: This was MinFloatFunction in Lucene
Abstract BoolFunction implementation which wraps multiple ValueSources
and applies an extendible boolean function to their values.
Abstract ValueSource implementation which wraps multiple ValueSources
and applies an extendible float function to their values.
NOTE: This was MultiFloatFunction in Lucene
NOTE: This was floatVal() in Lucene
Abstract parent class for ValueSource implementations that wrap multiple
ValueSources and apply their own logic.
A ValueSource that abstractly represents ValueSources for
poly fields, and other things.
Function that returns TFIDFSimilarity.DecodeNormValue(long)
for every document.
Note that the configured Similarity for the field must be
a subclass of TFIDFSimilarity.
@lucene.internal
NOTE: This was floatVal() in Lucene
Returns the value of IndexReader.NumDocs
for every document. This is the number of documents
excluding deletions.
Obtains the ordinal of the field value from the default Lucene FieldCache using GetTermsIndex().
The native Lucene index order is used to assign an ordinal value for each field value.
Field values (terms) are lexicographically ordered by Unicode value, and numbered starting at 1.
Example:
If there were only three field values: "apple","banana","pear"
then ord("apple")=1, ord("banana")=2, ord("pear")=3
WARNING: Ord depends on the position in an index and can thus change when other documents are inserted or deleted,
or if a MultiSearcher is used.
WARNING: as of Solr 1.4, ord() and rord() can cause excess memory use since they must use a FieldCache entry
at the top level reader, while sorting and function queries now use entries at the segment level. Hence sorting
or using a different function query, in addition to ord()/rord() will double memory use.
NOTE: This was intVal() in Lucene
Function to raise the base "a" to the power "b"
NOTE: This was PowFloatFunction in Lucene
the base.
the exponent.
Returns the product of its components.
NOTE: This was ProductFloatFunction in Lucene
returns the relevance score of the query
NOTE: This was floatVal() in Lucene
RangeMapSingleFunction implements a map function over
another ValueSource whose values fall within min and max inclusive to target.
Normally used as an argument to a FunctionQuery.
NOTE: This was RangeMapFloatFunction in Lucene
NOTE: This was floatVal() in Lucene
ReciprocalSingleFunction implements a reciprocal function f(x) = a/(mx+b), based on
the float value of a field or function as exported by a ValueSource.
When a and b are equal, and x>=0, this function has a maximum value of 1 that drops as x increases.
Increasing the value of a and b together results in a movement of the entire function to a flatter part of the curve.
These properties make this an ideal function for boosting more recent documents.
Example: recip(ms(NOW,mydatefield),3.16e-11,1,1)
A multiplier of 3.16e-11 changes the units from milliseconds to years (since there are about 3.16e10 milliseconds
per year). Thus, a very recent date will yield a value close to 1/(0+1) or 1,
a date a year in the past will get a multiplier of about 1/(1+1) or 1/2,
and a date two years old will yield 1/(2+1) or 1/3.
NOTE: This was ReciprocalFloatFunction in Lucene
f(source) = a/(m*float(source)+b)
NOTE: This was floatVal() in Lucene
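In code, the same recency-boost idiom might look like the following sketch, assuming a hypothetical numeric field that stores a document's age in milliseconds:
var ageMs = new Int64FieldSource("age_ms");
var recency = new ReciprocalSingleFunction(ageMs, 3.16e-11f, 1f, 1f); // ~1 for brand-new docs
Query boost = new FunctionQuery(recency);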
Obtains the ordinal of the field value from the default Lucene FieldCache using GetTermsIndex()
and reverses the order.
The native Lucene index order is used to assign an ordinal value for each field value.
Field values (terms) are lexicographically ordered by Unicode value, and numbered starting at 1.
Example of reverse ordinal (rord):
If there were only three field values: "apple","banana","pear"
then rord("apple")=3, rord("banana")=2, rord("pear")=1
WARNING: Ord depends on the position in an index and can thus change when other documents are inserted or deleted,
or if a MultiSearcher is used.
WARNING: as of Solr 1.4, ord() and rord() can cause excess memory use since they must use a FieldCache entry
at the top level reader, while sorting and function queries now use entries at the segment level. Hence sorting
or using a different function query, in addition to ord()/rord() will double memory use.
NOTE: This was intVal() in Lucene
Scales values to be between min and max.
This implementation currently traverses all of the source values to obtain
their min and max.
This implementation currently cannot distinguish when documents have been
deleted or documents that have no value, and 0.0 values will be used for
these cases. This means that if values are normally all greater than 0.0, one can
still end up with 0.0 as the min value to map from. In these cases, an
appropriate map() function could be used as a workaround to change 0.0
to a value in the real range.
NOTE: This was ScaleFloatFunction in Lucene
NOTE: This was floatVal() in Lucene
Obtains short field values from the FieldCache
and makes those values available as other numeric types, casting as needed.
NOTE: This was ShortFieldSource in Lucene
NOTE: This was shortVal() in Lucene
NOTE: This was floatVal() in Lucene
NOTE: This was intVal() in Lucene
NOTE: This was longVal() in Lucene
BoolFunction implementation which applies an extendible
bool function to the values of a single wrapped ValueSource.
Functions this can be used for include whether a field has a value or not,
or inverting the bool value of the wrapped ValueSource.
A simple function with a single argument
NOTE: This was SimpleFloatFunction in Lucene
NOTE: This was floatVal() in Lucene
A function with a single (one) argument.
NOTE: This was SingleFunction in Lucene, changed to avoid confusion with operations on the System.Single datatype.
returns the sum of its components.
NOTE: This was SumFloatFunction in Lucene
returns the number of tokens.
(sum of term freqs across all documents, across all terms).
Returns -1 if frequencies were omitted for the field, or if
the codec doesn't support this statistic.
@lucene.internal
NOTE: This was longVal() in Lucene
Function that returns the term frequency for the
supplied term in every document.
If the term does not exist in the document, returns 0.
If frequencies are omitted, returns 1.
NOTE: This was intVal() in Lucene
Function that returns TFIDFSimilarity.Tf(float)
for every document.
Note that the configured Similarity for the field must be
a subclass of TFIDFSimilarity.
@lucene.internal
NOTE: This was floatVal() in Lucene
returns the total term freq
(sum of term freqs across all documents).
Returns -1 if frequencies were omitted for the field, or if
the codec doesn't support this statistic.
@lucene.internal
NOTE: This was longVal() in Lucene
Converts individual ValueSource instances to leverage the FunctionValues *Val functions that work with multiple values,
i.e. the overloads that fill an array of values per document.
NOTE: This was shortVal() in Lucene
NOTE: This was intVal() in Lucene
NOTE: This was longVal() in Lucene
NOTE: This was floatVal() in Lucene
NOTE: This was shortVal() in Lucene
NOTE: This was floatVal() in Lucene
NOTE: This was intVal() in Lucene
NOTE: This was longVal() in Lucene
Generate "more like this" similarity queries.
Based on this mail:
Lucene does let you access the document frequency of terms, with .
Term frequencies can be computed by re-tokenizing the text, which, for a single document,
is usually fast enough. But looking up the of every term in the document is
probably too slow.
You can use some heuristics to prune the set of terms, to avoid calling too much,
or at all. Since you're trying to maximize a tf*idf score, you're probably most interested
in terms with a high tf. Choosing a tf threshold even as low as two or three will radically
reduce the number of terms under consideration. Another heuristic is that terms with a
high idf (i.e., a low df) tend to be longer. So you could threshold the terms by the
number of characters, not selecting anything less than, e.g., six or seven characters.
With these sorts of heuristics you can usually find small set of, e.g., ten or fewer terms
that do a pretty good job of characterizing a document.
It all depends on what you're trying to do. If you're trying to eek out that last percent
of precision and recall regardless of computational difficulty so that you can win a TREC
competition, then the techniques I mention above are useless. But if you're trying to
provide a "more like this" button on a search results page that does a decent job and has
good performance, such techniques might be useful.
An efficient, effective "more-like-this" query generator would be a great contribution, if
anyone's interested. I'd imagine that it would take a Reader or a String (the document's
text), analyzer Analyzer, and return a set of representative terms using heuristics like those
above. The frequency and length thresholds could be parameters, etc.
Doug
Initial Usage
This class has lots of options to try to make it efficient and flexible.
The simplest possible usage is as follows (the searcher setup and field name are
placeholders to fill in for your own index):
IndexReader ir = ...
IndexSearcher searcher = ...
MoreLikeThis mlt = new MoreLikeThis(ir);
TextReader target = ... // orig source of doc you want to find similarities to
Query query = mlt.Like(target, "body"); // "body" is the field to analyze
TopDocs hits = searcher.Search(query, 10);
// now the usual iteration thru 'hits' - the only thing to watch for is to make sure
// you ignore the doc if it matches your 'target' document, as it should be similar to itself
Thus you:
- do your normal, Lucene setup for searching,
- create a MoreLikeThis,
- get the text of the doc you want to find similarities to,
- then call one of the Like() calls to generate a similarity query,
- call the searcher to find the similar docs.
More Advanced Usage
You may want to use the setter for FieldNames so you can examine
multiple fields (e.g. body and title) for similarity.
Depending on the size of your index and the size and makeup of your documents you
may want to call the other set methods to control how the similarity queries are
generated.
Changes: Mark Harwood 29/02/04
Some bugfixing, some refactoring, some optimisation.
- bugfix: retrieveTerms(int docNum) was not working for indexes without a termvector - added missing code
- bugfix: No significant terms being created for fields with a termvector - because
was only counting one occurrence per term/field pair in calculations (i.e. not including frequency info from TermVector)
- refactor: moved common code into isNoiseWord()
- optimise: when no termvector support available - used maxNumTermsParsed to limit amount of tokenization
Default maximum number of tokens to parse in each example doc field that is not stored with TermVector support.
Ignore terms with less than this frequency in the source doc.
Ignore words which do not occur in at least this many docs.
Ignore words which occur in more than this many docs.
Boost terms in query based on score.
Default field names. Null is used to specify that the field names should be looked
up at runtime from the provided reader.
Ignore words less than this length or if 0 then this has no effect.
Ignore words greater than this length or if 0 then this has no effect.
Default set of stopwords.
If null means to allow stop words.
Return a Query with no more than this many terms.
to use
Boost factor to use when boosting the terms
Gets or Sets the boost factor used when boosting terms
Constructor requiring an IndexReader.
For idf() calculations.
Gets or Sets the analyzer that will be used to parse source docs with. The default analyzer
is not set. An analyzer is not required for generating a query with the
Like(int) method; all other 'like' methods require an analyzer.
Gets or Sets the frequency below which terms will be ignored in the source doc. The default
frequency is DEFAULT_MIN_TERM_FREQ.
Gets or Sets the frequency at which words will be ignored which do not occur in at least this
many docs. The default frequency is DEFAULT_MIN_DOC_FREQ.
Gets or Sets the maximum frequency in which words may still appear.
Words that appear in more than this many docs will be ignored. The default frequency is
DEFAULT_MAX_DOC_FREQ.
Set the maximum percentage in which words may still appear. Words that appear
in more than this many percent of all docs will be ignored.
the maximum percentage of documents (0-100) that a term may appear
in to be still considered relevant
Gets or Sets whether to boost terms in the query based on "score" or not. The default is
DEFAULT_BOOST.
Gets or Sets the field names that will be used when generating the 'More Like This' query.
The default field names that will be used are DEFAULT_FIELD_NAMES.
Set this to null for the field names to be determined at runtime from the
IndexReader provided in the constructor.
Gets or Sets the minimum word length below which words will be ignored. Set this to 0 for no
minimum word length. The default is DEFAULT_MIN_WORD_LENGTH.
Gets or Sets the maximum word length above which words will be ignored. Set this to 0 for no
maximum word length. The default is DEFAULT_MAX_WORD_LENGTH.
Gets or Sets the set of stopwords.
Any word in this set is considered "uninteresting" and ignored.
Even if your Analyzer allows stopwords, you might want to tell the code to ignore them, as
for the purposes of document similarity it seems reasonable to assume that "a stop word is never interesting".
Gets or Sets the maximum number of query terms that will be included in any generated query.
The default is DEFAULT_MAX_QUERY_TERMS.
Gets or Sets the maximum number of tokens to parse in each example doc field that is not stored with TermVector support
Return a query that will return docs like the passed Lucene document ID.
The document ID of the Lucene doc to generate the 'More Like This' query for.
A query that will return docs like the passed Lucene document ID.
Return a query that will return docs like the passed TextReader.
A query that will return docs like the passed TextReader.
Create the 'More like' query from a PriorityQueue.
Create a BooleanQuery from a word->tf map.
A map of words keyed on the word (String) with Int32 objects as the values.
Describe the parameters that control how the "more like this" query is formed.
Find words for a more-like-this query former.
the id of the lucene document from which to find terms
Adds terms and frequencies found in the vector into the words map.
A map of terms and their frequencies.
A list of terms and their frequencies for a doc/field.
Adds term frequencies found by tokenizing text from the reader into the words map.
A source of text to be tokenized.
A map of terms and their frequencies.
Used by analyzer for any special per-field analysis
determines if the passed term is likely to be of interest in "more like" comparisons
The word being considered
true if should be ignored, false if should be used in further analysis
Find words for a more-like-this query former.
The result is a priority queue of ScoreTerm objects with one entry for every word in the document.
Each object has 6 properties.
The properties are:
- The word (String)
- The top field that this word comes from (String)
- The score for this word (float)
- The IDF value (float)
- The frequency of this word in the index (int)
- The frequency of this word in the source document (int)
This is a somewhat "advanced" routine, and in general only the word is of interest.
This method is exposed so that you can identify the "interesting words" in a document.
For an easier method to call, see RetrieveInterestingTerms().
The reader that has the content of the document.
The field passed to the analyzer to use when analyzing the content.
The most interesting words in the document ordered by score, with the highest scoring, or best entry, first.
Convenience routine to make it easy to return the most interesting words in a document.
More advanced users will call RetrieveTerms() directly.
The source document.
The field passed to the analyzer to use when analyzing the content.
The most interesting words in the document.
A PriorityQueue that orders words by score.
Use for frequencies and to avoid renewing Int32 objects.
NOTE: This was Int in Lucene
An "interesting word" and related top field, score and frequency information.
Gets the word.
Gets the top field that this word comes from.
Gets the score for this word ().
Gets the inverse document frequency (IDF) value ().
Gets the frequency of this word in the index ().
Gets the frequency of this word in the source document ().
A simple wrapper for MoreLikeThis for use in scenarios where a Query object is required, e.g.
in custom QueryParser extensions. At query.Rewrite() time the reader is used to construct the
actual MoreLikeThis object and obtain the real Query object.
The fields used for the similarity measure.
A filter that includes documents that match a specific term.
The term documents need to have in order to be a match for this filter.
Gets the term this filter includes documents with.
Constructs a filter for docs matching any of the terms added to this class.
Unlike a RangeFilter this can be used for filtering on multiple terms that are not necessarily in
a sequence. An example might be a collection of primary keys from a database query result or perhaps
a choice of "category" labels picked by the end user. As a filter, this is much faster than the
equivalent query (a BooleanQuery with many "should" TermQueries).
Creates a new TermsFilter from the given list. The list
can contain duplicate terms and multiple fields.
Creates a new TermsFilter from the given list for
a single field.
Creates a new TermsFilter from the given array for
a single field.
Creates a new TermsFilter from the given array. The array can
contain duplicate terms and multiple fields.
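For example, a sketch (the field names and searcher are hypothetical):
var filter = new TermsFilter(
    new Term("category", "news"),
    new Term("category", "sports"),
    new Term("id", "42")); // duplicate and mixed fields are allowed
TopDocs hits = searcher.Search(new MatchAllDocsQuery(), filter, 10);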