Taking a look at the Lucene code, this seems to be the closest query to your
requirement:
org.apache.lucene.search.spans.SpanPositionRangeQuery
But it is not used in Solr out of the box, as far as I know.
You could potentially develop a query parser and use it to reach your goal.
Given that, I thin
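Just to sketch the idea (untested; the field name "XYZ" and the term are placeholders from this thread, and this limits by token position rather than by character offset):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanPositionRangeQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class FirstPositionsQueryExample {
    // Matches "solr" only when it occurs within the first 100 positions
    // of the XYZ field; a custom query parser could build queries like this.
    public static SpanQuery firstHundredPositions() {
        SpanQuery term = new SpanTermQuery(new Term("XYZ", "solr"));
        return new SpanPositionRangeQuery(term, 0, 100);
    }
}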
This seems different from what you initially asked (and Diego responded to):
"One is simple, search query will look for whole content indexed in XYZ
field
Other one is, search query will have to look for first 100 characters
indexed in same XYZ field. "
This is still doable at Indexing time using a
Generally speaking, if a full re-index is happening every day, wouldn't it be
better to use a technique such as collection aliases?
You could point your search clients to the "Alias" which points to the
online collection "collection1".
When you re-index you build "collection2", when it is finished you
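As a rough SolrJ sketch of that swap (untested; the alias and collection names are just the ones used in this example):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class AliasSwapExample {
    // Once the re-index into "collection2" has finished, repoint the alias the
    // search clients use; CREATEALIAS overwrites an existing alias of the same name.
    public static void swapAlias(SolrClient client) throws Exception {
        CollectionAdminRequest.createAlias("search-alias", "collection2").process(client);
    }
}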
when you say : "However, for the phonetic matches, there are some matches
closer to the query text than others. How can we boost these results ? "
Do you mean closer in String edit distance ?
If that is the case you could use the string distance metrics implemented in
Solr with a function query:
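Something along these lines, as a sketch (the field "name" and the query term "polt" are placeholders; strdist supports the edit, jw and ngram metrics):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

public class StrdistBoostExample {
    // Multiplicatively boost docs whose "name" value is close in edit distance
    // to the query text (strdist returns 1.0 for identical strings, 0.0 for far).
    public static QueryResponse search(SolrClient client) throws Exception {
        SolrQuery q = new SolrQuery("polt");
        q.set("defType", "edismax");
        q.set("qf", "name");
        q.set("boost", "strdist(\"polt\",name,edit)");
        return client.query(q);
    }
}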
Can you tell us the request parameters used for the spellcheck ?
In particular, are you using these? (from the wiki):
" The *spellcheck.maxCollationTries* Parameter
This parameter specifies the number of collation possibilities for Solr to
try before giving up. Lower values ensure better perform
This is actually an interesting point.
The original Solr score alone will mean nothing; the ranking position of the
document would be a more relevant feature at that stage.
When you put the original score together with the rest of the features, it may
be of potential use (number of query terms, tf
Have you tried adding the "distrib=true" request parameter when building the
suggester?
It should be on by default, but trying it explicitly won't harm.
I think nowadays the suggester component is SolrCloud compatible; I have no
chance to test it right now but it should just work.
Worst case you can
Hi Mikhail,
but if he keeps the docs within a segment, the ordering may be correct just
temporarily, right?
As soon as a segment merge happens (for example after subsequent indexing
sessions or updates) the internal Lucene doc id may change, and the default
order on the Solr side may change, right?
I am just
I think this has nothing to do with LTR in particular.
Have you tried executing the function query on its own?
I think it doesn't exist at all, right ? [1]
So maybe the first approach to that would be to add this nested children
function query capability to Solr.
I think there is a document Trans
b2b-catalog-material-etl -> b2b-catalog-material
b2b-catalog-material -> b2b-catalog-material-180117
and we do a data load to b2b-catalog-material-etl
We see data being added to both b2b-catalog-material and
b2b-catalog-material-180117 -> *in here you wanted just to index in
b2b-catalog-mate
Hi,
let me see if I got your problem:
your "user specific" features are query-dependent features from the Solr side.
The value of this feature depends on a query component (the user id) and a
document component (the product id).
You can definitely use them.
You can model this feature as a binary feature.
I have never been a big fan of "getting N results from Solr and then filtering
them client side".
I get your point about the document modelling, so I will assume you properly
tested it and that having the small documents on the Solr side is really not
sustainable.
I also appreciate the fact you want to fin
Thanks Yonik and thanks Doug.
I agree with Doug about adding a few generic test corpora that Jenkins
automatically runs some metrics on, to verify that Apache Lucene/Solr changes
don't affect a golden truth too much.
This of course can be very complex, but I think it is a direction the Apache
Lucene/Solr comm
"Lucene/Solr doesn't actually delete documents when you delete them, it
just marks them as deleted. I'm pretty sure that the difference between
docCount and maxDoc is deleted documents. Maybe I don't understand what
I'm talking about, but that is the best I can come up with. "
Thanks Shawn, y
I would like to stress how important what Erick explained is.
A lot of times people want to use the score to show it to the
users / calculate probabilities / do weird calculations.
The score is used to rank results, given a query.
To give a local ordering.
This is the only useful information for the end
Furthermore, taking a look at the code for the BM25 similarity, it seems to me
it is currently working correctly:
- docCount is used per field if != -1
/**
* Computes a score factor for a simple term and returns an explanation
* for that score factor.
*
*
* The default implementation us
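In short, what it does with those statistics is roughly this (a paraphrase of the BM25 idf computation, not the exact code):

public class Bm25IdfSketch {
    // The per-field docCount is preferred; maxDoc is only the fallback
    // when docCount is not available (-1), as noted above.
    public static float idf(long docFreq, long docCount, long maxDoc) {
        long n = (docCount == -1) ? maxDoc : docCount;
        return (float) Math.log(1 + (n - docFreq + 0.5d) / (docFreq + 0.5d));
    }
}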
Hi Markus,
just out of interest, why did
" It was solved back then by using docCount instead of maxDoc when
calculating idf, it worked really well!" solve the problem ?
I assume you are using different fields, one per language.
Each field appears in a different number of docs, I guess.
e.g.
t
"Can you please suggest suitable configuration for spell check to work
correctly. I am indexing all the words in one column. With current
configuration I am not getting good suggestions "
This is very vague.
Spellchecking is working correctly according to your configuration...
Let's start from
Your spellcheck configuration is quite extensive!
In particular you specified :
0.01
This means that if the term appears in less than 1% of the total docs it will
be considered misspelled.
Is "cholera" occurring in more than 1% of the total docs in your corpus?
Hi all,
it may sound like a silly question, but is there any reason why the term
positions in the inverted index use 1-based numbering while the term
vector positions use 0-based numbering[1]?
This may affect different areas in Solr and cause problems which are quite
tricky to spot.
R
Do you mean you are over-spellchecking ?
Correcting even non-misspelled words?
Can you give us the request handler configuration, the spellcheck configuration
and the schema?
Cheers
When you say " caching 100.000 docs" what do you mean ?
being able to quickly find information in a corpus which increases in size (
100.000 docs) everyday ?
I second Erick, I think this is fairly normal Solr use case.
If you really care about fast searches, you will get a fairly acceptable
defaul
Hi,
it is on my TO DO list with low priority; there is a Jira issue already[1],
feel free to contribute!
[1] https://issues.apache.org/jira/browse/SOLR-8952
"The main motivation is to support a geo-specific relevancy
model which can easily be customized without stepping into each other"
Is your relevancy tuning massively index-time based?
i.e. will it create massively different index content based on the geo
location?
If it is just query-time based o
Which Solr version are you using ?
From the documentation:
"Only query words, which are absent in index or too rare ones (below
maxQueryFrequency ) are considered as misspelled and used for finding
suggestions.
...
These parameters (maxQueryFrequency and thresholdTokenFrequency) can be a
percen
It depends on how you want to use the payloads.
If you want to use the payloads to calculate additional features, you can
implement a payload feature:
this feature could calculate the sum of the numerical payloads for the query
terms in each document (so it will be a query-dependent feature and will
lev
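Just to give an idea of the underlying plumbing, a rough Lucene-level sketch of summing the numeric payloads of one term in one document (field and term are placeholders; this is not the LTR Feature API itself, only the kind of code such a feature would run per segment):

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class PayloadSumSketch {
    // Sums the float payloads attached to every occurrence of "term" in the
    // given document of the given field (0 if the term or payloads are absent).
    public static float sumPayloads(LeafReader reader, String field, BytesRef term, int docId)
            throws Exception {
        Terms terms = reader.terms(field);
        if (terms == null) return 0f;
        TermsEnum te = terms.iterator();
        if (!te.seekExact(term)) return 0f;
        PostingsEnum postings = te.postings(null, PostingsEnum.PAYLOADS);
        if (postings.advance(docId) != docId) return 0f;
        float sum = 0f;
        for (int i = 0; i < postings.freq(); i++) {
            postings.nextPosition();
            BytesRef payload = postings.getPayload();
            if (payload != null) {
                sum += PayloadHelper.decodeFloat(payload.bytes, payload.offset);
            }
        }
        return sum;
    }
}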
"In case you decide to use an entire new index for the
autosuggestion, you
can potentially manage that on your own"
This refers to the fact that it is possible to define an index just for
autocompletion.
You can model the document as you prefer in this additional index, defining
the field types tha
Hi Ruby,
I participated in the discussion at the time.
It's definitely still open.
It's on my long TO DO list, I hope I will be able to contribute a solution
sooner or later.
In case you decide to use an entire new index for the autosuggestion, you
can potentially manage that on your own.
But out
Apart from the performance aspect, getting a "word cloud" from a subset of
documents is a slightly different problem than getting the facets out of it.
If my understanding is correct, what you want is to extract the "significant
terms" out of your result set.[1]
Using faceting is a rough approximation
I opened a ticket for RankLib a long time ago to provide support for the Solr
model JSON format[1].
It is on my TO DO list but unfortunately very low priority.
Anyone who wants to contribute is welcome; I will help and commit it when
ready.
Cheers
[1] https://sourceforge.net/p/lemur/feature-requests
I know it is obvious, but...
have you done a full re-indexing or did you use the index migration tool?
In the latter case, it could be a bug in the tool itself.
I think this can actually be a good idea, and I think it would require a new
feature type implementation.
Specifically, I think you could leverage the existing data structures (such as
the term vectors) to calculate the matrix and then perform the calculations you
need.
Or maybe there is space for even a
In addition: bf=recip(ms(NOW/DAY,unixdate),3.16e-11,5,0.1) is an additive
boost.
I tend to prefer multiplicative ones, but that is up to you [1].
You can specify the order of magnitude of the values generated by that
function.
This means that you have control over how much the date will affect the
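For comparison, a small SolrJ sketch of the same recency function used as an additive boost (bf) versus a multiplicative one (the edismax boost parameter); the query text and field are placeholders:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DateBoostExample {
    public static QueryResponse search(SolrClient client, boolean multiplicative) throws Exception {
        SolrQuery q = new SolrQuery("laptop");
        q.set("defType", "edismax");
        String recency = "recip(ms(NOW/DAY,unixdate),3.16e-11,5,0.1)";
        if (multiplicative) {
            q.set("boost", recency);   // final score = original score * recency
        } else {
            q.set("bf", recency);      // final score = original score + recency
        }
        return client.query(q);
    }
}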
It strictly depends on the kind of features you are using.
At the moment there is just one cache for all the features.
This means that even if you have 1 query-dependent feature and 100
document-dependent features, a different value for the query-dependent one will
invalidate the cache entry for the
Hi John,
first of all, I may be stating the obvious, but have you tried docValues?
Apart from that, a friend of mine (Diego Ceccarelli) was discussing a
probabilistic implementation similar to HyperLogLog[1] to approximate
facet counting.
I didn't have time to take a look in detail / implement
According to the concept of immutability that drives the Lucene segmenting
approach, I think Emir's observation sounds correct.
Since docValues are a column-based data structure stored in segments, I guess
that when an in-place update happens it just re-indexes that field.
This means we need t
If you add a filter query to your original query :
fq=genre:A
You know that your results (group heads included) will just be of that
genre.
So I think we are not getting your question properly.
Can you try to express your requirement from the beginning?
Leave grouping or field collapsing out of it
Can results collapsing[1] be of use to you?
If it is, you can use that feature and explore its flexibility in
selecting the group head :
1) min | max for a numeric field
2) min | max for a function query
3) sort
[1]
https://lucene.apache.org/solr/guide/6_6/collapse-and-expand-results.h
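A quick SolrJ sketch of option 1, collapsing on a hypothetical "genre" field and keeping the head with the highest "popularity" value:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CollapseExample {
    // One head document per genre is kept; the head is the doc with max popularity.
    public static QueryResponse search(SolrClient client) throws Exception {
        SolrQuery q = new SolrQuery("*:*");
        q.addFilterQuery("{!collapse field=genre max=popularity}");
        return client.query(q);
    }
}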
In addition to what Amrit correctly stated, if you need to search on your id,
especially with range queries, I recommend using a copy field and leaving the
id field almost as default.
Cheers
But you used :
"spellcheck.q": "tag:polt",
Instead of :
"spellcheck.q": "polt",
Regards
I was having a discussion with a colleague of mine recently about
e-commerce search.
Of course there are tons of things you can do to improve relevancy:
custom similarity, edismax tuning, basic user event processing, machine
learning integrations, semantic search etc. etc.
The more you do, the bette
The Terms component[1] should do the trick for you.
Just use the regular expression or prefix filtering and you should be able
to get the stats you want.
If you were interested in extracting the DV when returning docs you may be
interested in function queries and specifically this one :
docfreq(f
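As a sketch, the Terms component request and the docfreq() pseudo-field could look like this in SolrJ (handler path, field and prefix are placeholders and assume the /terms handler is available):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TermsStatsExample {
    // Terms starting with a given prefix, with their document frequencies.
    public static QueryResponse termStats(SolrClient client) throws Exception {
        SolrQuery q = new SolrQuery();
        q.setRequestHandler("/terms");
        q.set("terms", "true");
        q.set("terms.fl", "title");
        q.set("terms.prefix", "sol");
        return client.query(q);
    }

    // Document frequency of a single term returned as a pseudo-field per hit.
    public static QueryResponse docFreqAsField(SolrClient client) throws Exception {
        SolrQuery q = new SolrQuery("*:*");
        q.setFields("id", "df:docfreq(title,'solr')");
        return client.query(q);
    }
}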
This is interesting; the EFI parameter resolution should work using the
quotes independently of the query parser.
At that point, the query parsers (both) receive multi-term text.
Both of them should work the same.
At the time I saw the mail I tried to reproduce it through the LTR module
tests and
Interesting, what happens when you pass it as spellcheck.q=polt ?
What is the behavior you get ?
Hi,
"And all what we got
only a overwriting doc came first by new one. Ok just put overwrite=false
to params, and dublicating docs appeare."
What exactly is the doc you get?
Were the fields originally in the first doc (before the atomic update) stored?
This is what you need to use :
https://luce
The only way Solr itself will fetch documents is through the Data Import
Handler.
Take a look at the URLDataSource[1] to see if it fits.
Possibly you will need to customize it.
[1]
https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#urldatasourc
Nabble mutilated my reply :
*Comment*: If you remove this field, you must _also_ disable the update log
in solrconfig.xml
or Solr won't start. _version_ and update log are required for
SolrCloud
*Comment*:points to the root document of a block of nested documents.
Required for nested
1) "_version_" is not "unecessary", actually the contrary, it is fundamendal
for Solr to work.
The same for types you use across your field definitions.
There was a time you could see these comments in the schema.xml (doesn't
seem the case anymore):
2) https://lucen
I don't think this is actually that much related to the LTR SolrFeature.
In the SolrFeature I see you specify a query with a specific query parser
(field).
Unless there is a bug in the SolrFeature for LTR, I expect the query parser
you defined to be used[1].
This means :
"rawquerystring":"{!field f
In line :
/"1. No zookeeper - I have burnt my hands with some zookeeper issues in the
past and it is no fun to deal with. Kafka and Storm are also trying to
burden zookeeper less and less because ZK cannot handle heavy traffic."/
Where did you get this information? Is it based on some publicly
rep
A while ago there was a Solr installation which had the same problem, and the
author explained to me that the choice was made for performance reasons.
Apparently he was sure that handling everything as primitive types would
give a boost to Solr searching/faceting performance.
I never agreed (and one
I expect the slides to be published here :
https://www.slideshare.net/lucidworks?utm_campaign=profiletracking&utm_medium=sssite&utm_source=ssslideview
The one you are looking for is not there yet, but keep an eye on it.
Regards
Does spellcheck.q=polt help ?
How do your queries normally look?
How would you like the collation to be returned ?
The weights you express could suggest a probabilistic view of your final score.
The model you quoted will calculate the final score as:
0.9*scorePersonalId + 0.1*originalScore
The final score will NOT necessarily be between 0 and 1.
https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#the-dismax-q
In addition to what Emir mentioned, when Solr opens a new transaction log
file it will delete the older ones, subject to some conditions:
keep at least N records [1] and at most K files [2].
N is specified in the solrconfig.xml (in the update handler section) and
can be documents related
I would try to use an additive boost and the ^= boost operator:
- name_property:(test^=2) will assign a fixed score of 2 if the match
happens (it is a constant score query)
- the additive boost will be 0
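As a SolrJ sketch, with a hypothetical main query and the name_property field from this thread:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ConstantScoreBoostExample {
    // A match on name_property:test contributes a fixed score of 2,
    // added to the main query score through the additive bq parameter.
    public static QueryResponse search(SolrClient client) throws Exception {
        SolrQuery q = new SolrQuery("laptop");
        q.set("defType", "edismax");
        q.set("bq", "name_property:(test^=2)");
        return client.query(q);
    }
}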
Are the norms a good approximation for you ?
If you preserve norms at indexing time (it is a configuration that you can
set in the schema.xml) you can retrieve them with this specific function
query:
*norm(field)*
Returns the "norm" stored in the index for the specified field. This is the
pr
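For example, something along these lines in SolrJ (the "title" field is a placeholder and must not have omitNorms=true):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

public class NormFunctionExample {
    // Returns the stored norm of the title field next to each hit as a pseudo-field.
    public static QueryResponse search(SolrClient client) throws Exception {
        SolrQuery q = new SolrQuery("solr");
        q.setFields("id", "score", "titleNorm:norm(title)");
        return client.query(q);
    }
}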
Hi Erick,
you said :
""mentions that for soft commit, "new segments are created that will
be merged""
Wait, how did that get in there? Ignore it, I'm taking it out. "
but I think you were not wrong; based on another mailing list message
by Shawn, I read:
[1]
"If you are using the corre
Hi Reth,
there are some problems in the debug output for distributed IDF [1].
Your case seems different though.
It has been a while since I experimented with that feature, but your config
seems OK to me.
What helped me a lot at that time was to debug my Solr instance.
[1] https://issues.apache.org/jira/browse/SOLR
Hi Alex,
just to explore your question a bit, why do you need that?
Do you need to reduce query time?
Have you tried enabling docValues for the fields of interest?
DocValues seem to me a pretty useful data structure when sorting is a
requirement.
I am curious to understand why that was not an o
From the Solr wiki[1]:
*Logical*
/Collection/ : It is a collection of documents which share the same logical
domain and data structure
*Physical*
/Solr Node/ : It is a single instance of a Solr Server. From OS point of
view it is a single Java Process ( internally it is the Solr Web App
deploy
I think this has nothing to do with the LTR plugin.
The problem here should just be the way you use the local params;
to properly pass multi-term local params in Solr you need to use *'* :
efi.case_description='added couple of fiber channel'
This should work.
If not, only the first term will be pa
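For instance, a SolrJ sketch of the rerank request (the model name, reRankDocs value and main query are placeholders and assume LTR is configured):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

public class LtrEfiExample {
    // The single quotes keep the whole multi-term text in one external feature value.
    public static QueryResponse rerank(SolrClient client) throws Exception {
        SolrQuery q = new SolrQuery("fiber channel issue");
        q.set("rq", "{!ltr model=myModel reRankDocs=100 "
                + "efi.case_description='added couple of fiber channel'}");
        q.setFields("id", "score", "[features]");
        return client.query(q);
    }
}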
Hi Billy,
there is a README.TXT in the contrib/ltr directory.
Reading that, you find this useful link[1].
From that useful link you can see where the jar of the plugin is located.
Specifically:
Taking a look at the contrib and dist structure, it seems quite a standard
approach to keep the readme i
Which version of Solr are you on?
Are you using SolrCloud or any distributed search?
In that case, I think (as already mentioned by Shawn) this could be related
[1].
If it is just plain Solr, my shot in the dark is your boost function:
{!boost b=recip(ms(NOW,field1),3.16e-11,1,1)}{!boost b=reci
In addition to that, I still believe More Like This is a better option for
you.
The reason is that the MLT is able to evaluate the interesting terms from
your document (title is the only field of interest for you) and boost them
accordingly.
Regarding your "80% of similarity", this is more tricky.
Hi Tommaso,
you are definitely right!
I see that the method : MultiFields.getTerms
returns :
if (termsPerLeaf.size() == 0) {
return null;
}
As you correctly mentioned this is not handled in :
org/apache/lucene/classification/document/SimpleNaiveBayesDocumentClassifier.java:115
org/apac
I finally have an explanation; I post it here for future reference.
The cause was a combination of:
1) the /select request handler has defaults with the spellcheck ON and a few
spellcheck options (such as collationQuery ON and max collation tries set
to 5)
2) the firstSearcher has a warm-up query wi
If you are referring to the number of words per suggestion, you may need to
play with the free text lookup type [1]
[1] http://alexbenedetti.blogspot.co.uk/2015/07/solr-you-complete-me.html
I get your point: the second KeepWordFilter is not keeping anything because
the token it gets is
"hey you" while the word it is supposed to keep is "hey",
which clearly does not match.
The KeepWordFilter just considers each row a single token (I may be wrong, I
didn't check the code, I am just assumi
I think this bit is the problem :
"I am using a Shingle filter right after the StandardTokenizer, not sure if
that has anything to do with it. "
When using the FreeTextLookup approach, you don't need to use shingles in
your analyser; shingles are added by the suggester itself.
As Erick mentioned
1) nope, no big tlog or replaying problem
2) Solr just seems frozen. Not responsive and nothing in the log.
Now I tried simply restarting after the Zookeeper config deploy, and on
restart the log completely freezes and the instances don't come up...
If I clean the indexes and then start, this work
Hi Ryan,
the issue you mentioned was mine:
https://sourceforge.net/p/lemur/feature-requests/144/
My bad, it got lost in a sea of "To Dos".
I still think it could be a good contribution to the library, but at the
moment I think going with a custom script/app to do the transformation is
the way to go
Additional information:
Trying a single core reload, I identified that an entire shard is not reloading
(while the other shard is).
Taking a look at the "not reloading" shard (2 replicas), it seems that the
core reload gets stuck here:
org.apache.solr.core.SolrCores#waitAddPendingCoreOps
The problem
Taking a look at the 4.10.2 source, I may see why the async call does not work:
/log.info("Reloading Collection : " + req.getParamString());
String name = req.getParams().required().get("name");
*ZkNodeProps m = new ZkNodeProps(Overseer.QUEUE_OPERATION,
OverseerCollectionProce
Assuming the solr service restart does its job, I think the only
thing I would do is to completely remove the data directory content, instead
of just running the delete query.
Bear in mind that when you delete a document in Solr, it is marked as
deleted, but it potentially takes a while
I doubt it is an environment problem at all.
How are you modifying your schema?
How are you reloading your core/collection?
Are you restarting your Solr instance ?
Regards
Thanks for the prompt response Erick.
The reason I am issuing a collection reload is that I modify the
solrconfig from time to time, for example with different spellcheck and
request parameter default params.
So after the upload to Zookeeper I reload the collection to reflect the
modifi
"I have different "sort preferences", so I can't build a index and use for
sorting.Maybe I have to sort by category then by source and by language or
by source, then by category and by date"
I would like to focus on this bit.
It is ok to go for a custom function and sort at query time, but I am
cu
I have recently been facing an issue with the collection reload in a couple
of SolrCloud clusters:
1) re-index a collection
2) collection happily working
3) trigger collection reload
4) reload times out ( silently, no message in any of the Solr node logs)
5) no effect on the collection ( it sti
To do what ?
If it is a use case, please explain it to us.
If it is just to check that the analysis chain worked correctly, you can
check the schema browser or use Luke.
If you just want to test your analysis chain, you can use the analysis tool
in the Solr admin.
Cheers
You don't need the TermVectorComponent at all for MLT.
The reason term vectors are suggested for the fields you are interested
in is just that they will speed up the way the MLT retrieves the
"interesting terms" out of your seed document to build the MLT query.
If you don't have the Ter
I would recommend this blog post of mine to get a better understanding of how
tokenization and the suggester work together [1].
If you take a look at the FuzzyLookupFactory, you will see that it is one of
the suggesters that return the entire content of the field.
You may be interested in the FreeTex
I would recommend playing with the defaults, appends and invariants [1]
elements of the requestHandler node.
Identify the request handler you want to use in the solrconfig.xml and then
add the parameters you want.
You should be able to manage this through your source version control
system.
Cheers
+1
I was trying to understand a reload collection timeout happening lately in
a SolrCloud cluster, and the Overseer status was hard to decipher.
More human readable names and some additional documentation could help here.
Cheers
Point 2 was the RAM buffer size:
*ramBufferSizeMB* sets the amount of RAM that may be used by Lucene indexing
for buffering added documents and deletions before they are flushed to the
Directory.
maxBufferedDocs sets a limit on the number of documents buffered
Is the physical machine dedicated? Is it a dedicated VM on shared metal?
Apart from these operational checks I will assume the machine is dedicated.
In Solr a write to the disk does not happen only on commit; I can think of
other scenarios:
1) *Transaction log* [1]
2)
3) Spellcheck and
In addition to what Chris has correctly suggested, I would like to focus on
this sentence :
" I am decently certain that at one point in time it worked in a way
that a higher match length would rank higher"
You mean a match in a longer field would rank higher than a match in a
shorter field ?
is
Hi Jacques,
this should satisfy your curiosity [1].
The mark tells you the relative position in the sorted set (and it is
mandatory to use the uniqueKey as tie breaker).
If you change your index, the query using an old mark should still work (but
may potentially return different documents if
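A minimal SolrJ sketch of paging with a cursor mark, assuming "id" is the uniqueKey:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorMarkExample {
    // The sort must end with the uniqueKey as tie breaker; each response hands
    // back the mark to use for the next page, until it stops changing.
    public static void pageThrough(SolrClient client) throws Exception {
        String cursorMark = CursorMarkParams.CURSOR_MARK_START;  // "*"
        while (true) {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(100);
            q.setSort("id", SolrQuery.ORDER.asc);
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
            QueryResponse rsp = client.query(q);
            // process rsp.getResults() here ...
            String nextMark = rsp.getNextCursorMark();
            if (cursorMark.equals(nextMark)) break;  // no more pages
            cursorMark = nextMark;
        }
    }
}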
Another path to follow could be to design a specific collection (index) for
the auto-suggestion.
In there you can define the analysis chain as you like (for example using
edge n-gram filtering on top of tokenisation) to provide infix
autocompletion.
Then you can play with your queries as you like an
Hi Angel,
can you give me an example of query, a couple of documents of example, and
the suggestions you get ( which you don't expect) ?
The config seems fine ( I remember there were some tricky problems with the
default separator, but a space should be fine there).
Cheers
" Don't use an heavy Analyzers, the suggested terms will come from the index,
so be sure they are meaningful tokens. A really basic analyser is suggested,
stop words and stemming are not "
This means that your suggestions will come from the index, so if you use
heavy analysers you can get terms su
Hi Angel,
you are looking for the FreeText lookup approach.
You can find more info in [1] and [2].
[1]
https://lucene.apache.org/solr/guide/6_6/suggester.html#Suggester-FreeTextLookupFactory
[2] http://alexbenedetti.blogspot.co.uk/2015/07/solr-you-complete-me.html
Quoting the official solr documentation :
" You Can Still Be Explicit
Even if you want to use schemaless mode for most fields, you can still use
the Schema API to pre-emptively create some fields, with explicit types,
before you index documents that use them.
Internally, the Schema API and the Sc
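For instance, a SolrJ sketch of pre-creating one field through the Schema API (the field name and the "pfloat" type are placeholders that assume a default configset):

import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;

public class ExplicitFieldExample {
    // Creates the field explicitly before any document containing it is indexed,
    // so schemaless guessing never gets a chance to pick the wrong type.
    public static void addPriceField(SolrClient client) throws Exception {
        Map<String, Object> fieldAttributes = new LinkedHashMap<>();
        fieldAttributes.put("name", "price");
        fieldAttributes.put("type", "pfloat");
        fieldAttributes.put("stored", true);
        new SchemaRequest.AddField(fieldAttributes).process(client);
    }
}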
Hi all,
I was just using the new Solr Ref Guide[1] (if I understood correctly this
is going to be the next official documentation for Solr).
Unfortunately, search within the guide works really badly...
The autocomplete seems to cover just the page titles (including headings would
help a lot).
If you don'
By automatic schema do you mean schemaless?
You will need to define a schema, managed or old legacy style as you prefer.
Then you define a field type that suits your needs (for example with an
edge n-gram token filter[1]).
Then you assign that field type to a specific field.
Then in your request
I second Erick,
it would be as easy as adding this field to the schema :
"/>
If you are using inter-collection queries, just be aware there are a lot of
tricky and subtle problems with them (such as: the unique identifier must have
the same field name, distributed IDF across collections, etc.)
I am preparing
A short answer seems to be no [1].
On the other hand, I discussed this in a couple of related Jira issues in the
past, as I (and other people) believe we should always return unique
suggestions anyway [2].
Although a year has passed, neither I nor others have actually progressed on
that issue :(
[1] o
As with any other search you can paginate by playing with the 'rows' and
'start' parameters (or cursors if you want to go deep); showing only the first
K results is your responsibility.
Is it not possible in your domain to identify a limit d (beyond which your
results lose meaning)?
You can not match
Hi Ryan,
first thing to know is that Learning To Rank is about relevancy, and
specifically it is about improving your relevancy function.
Deciding whether or not to use LTR has nothing to do with your index size or
update frequency (although LTR brings some performance considerations you
will need to ev
Interesting.
It seems almost correct to me.
Have you explored the content of the field (for example using the schema
browser)?
When you say "don't match", does it mean you don't get results at all or just
that the boost is not applied?
I would recommend simplifying the request handler, maybe just introduci