Re: Give boost only if entire value is present in Query

2017-06-19 Thread alessandro.benedetti
Isn't this a case where you don't want the query parser to split on whitespace before the analyser? Take a look at the "sow" param for the edismax query parser. In your case you should be OK, but be aware that it is not a silver bullet and that other problems could arise in similar scenarios
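
To make the "sow" suggestion concrete, here is a minimal sketch of the request parameters, not a definitive recipe. The helper name and the example field names are hypothetical; only defType, q, qf and sow are actual edismax parameters.

```python
def edismax_params(user_query, query_fields, split_on_whitespace=False):
    """Build edismax request parameters (helper name is hypothetical).

    With sow=false the whole query text reaches the field analyser,
    instead of being split on whitespace by the query parser first.
    """
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": " ".join(query_fields),
        "sow": "true" if split_on_whitespace else "false",
    }


# 'brand_exact' and 'title' are made-up field names for illustration.
params = edismax_params("north face jacket", ["brand_exact^10", "title"])
```

Whether this is enough depends on the analysis chain of the boosted field, which still has to keep the value as a single token.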

Re: Give boost only if entire value is present in Query

2017-06-20 Thread alessandro.benedetti
Interesting, it seems almost correct to me. Have you explored the content of the field (for example using the schema browser)? When you say "don't match", does it mean you don't get results at all, or just that the boost is not applied? I would recommend simplifying the request handler, maybe just introduci

Re: When to use LTR

2017-06-21 Thread alessandro.benedetti
Hi Ryan, the first thing to know is that Learning To Rank is about relevancy, and specifically about improving your relevancy function. Deciding whether or not to use LTR has nothing to do with your index size or update frequency (although LTR brings some performance considerations you will need to ev

Re: Spatial Search based on the amount of docs, not the distance

2017-06-21 Thread alessandro.benedetti
As with any other search, you can paginate by playing with the 'rows' and 'start' parameters (or cursors if you want to go deep); showing only the first K results is your responsibility. Is it not possible in your domain to identify a limit d (beyond which your results lose meaning)? You can not match
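
A minimal sketch of the pagination arithmetic, assuming 1-based page numbers on the client side. The helper name is hypothetical; 'start' and 'rows' are the actual Solr parameters.

```python
def page_params(page, page_size):
    """Translate a 1-based page number into Solr 'start' and 'rows' values."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    # 'start' is the zero-based offset of the first document to return.
    return {"start": (page - 1) * page_size, "rows": page_size}


# Third page of 10 results: skip the first 20 documents.
params = page_params(3, 10)
```

Showing only the first K results then just means never requesting a page whose offset goes past K.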

Re: Mixing distrib=true and false in one request handler?

2017-06-22 Thread alessandro.benedetti
A short answer seems to be No [1]. On the other hand, I discussed this in a couple of related Jira issues in the past, as I (+ other people) believe we should always return unique suggestions anyway [2]. Although a year has passed, neither I nor others have actually progressed on that issue :( [1] o

Re: Collection name in result

2017-06-23 Thread alessandro.benedetti
I second Erick, it would be as easy as adding this field to the schema : "/> If you are using inter-collection queries, just be aware there are a lot of tricky and subtle problems with them (such as the unique identifier needing the same field name, distributed IDF across collections, etc.) I am preparing

Re: Query Partial Matching on auto schema

2017-06-23 Thread alessandro.benedetti
By automatic schema do you mean schemaless? You will need to define a schema, managed or old legacy style as you prefer. Then you define a field type that suits your needs (for example with an edge n-gram token filter [1]). And you assign that field type to a specific field. Then in your request

[Solr Ref guide 6.6] Search not working

2017-06-23 Thread alessandro.benedetti
Hi all, I was just using the new Solr Ref Guide [1] (if I understood correctly this is going to be the next official documentation for Solr). Unfortunately search within the guide works really badly... The autocomplete seems to be just on page titles (including headings would help a lot). If you don'

Re: Query Partial Matching on auto schema

2017-06-23 Thread alessandro.benedetti
Quoting the official solr documentation : " You Can Still Be Explicit Even if you want to use schemaless mode for most fields, you can still use the Schema API to pre-emptively create some fields, with explicit types, before you index documents that use them. Internally, the Schema API and the Sc

Re: SOLR Suggester returns either the full field value or single terms only

2017-06-26 Thread alessandro.benedetti
Hi Angel, you are looking for the Free Text lookup approach. You can find more info in [1] and [2] [1] https://lucene.apache.org/solr/guide/6_6/suggester.html#Suggester-FreeTextLookupFactory [2] http://alexbenedetti.blogspot.co.uk/2015/07/solr-you-complete-me.html - --- Alessandro

Re: SOLR Suggester returns either the full field value or single terms only

2017-06-26 Thread alessandro.benedetti
" Don't use an heavy Analyzers, the suggested terms will come from the index, so be sure they are meaningful tokens. A really basic analyser is suggested, stop words and stemming are not " This means that your suggestions will come from the index, so if you use heavy analysers you can get terms su

Re: SOLR Suggester returns either the full field value or single terms only

2017-06-27 Thread alessandro.benedetti
Hi Angel, can you give me an example of query, a couple of documents of example, and the suggestions you get ( which you don't expect) ? The config seems fine ( I remember there were some tricky problems with the default separator, but a space should be fine there). Cheers - -

Re: Suggester and fuzzy/infix suggestions

2017-06-29 Thread alessandro.benedetti
Another path to follow could be to design a specific collection(index) for the auto-suggestion. In there you can define the analysis chain as you like ( for example using edge-ngram filtering on top of tokenisation) to provide infix autocompletion. Then you can play with your queries as you like an

Re: cursorMark / Deep Paging and SEO

2017-06-30 Thread alessandro.benedetti
Hi Jacques, this should satisfy your curiosity [1]. The mark tells you the relative position in the sorted set (and it is mandatory to use the uniqueKey as a tie breaker). If you change your index, a query using an old mark should still work (but may potentially return different documents if

Re: Same score for different length matches

2017-07-03 Thread alessandro.benedetti
In addition to what Chris has correctly suggested, I would like to focus on this sentence : " I am decently certain that at one point in time it worked in a way that a higher match length would rank higher" You mean a match in a longer field would rank higher than a match in a shorter field ? is

Re: High disk write usage

2017-07-05 Thread alessandro.benedetti
Is the physical machine dedicated? Is it a dedicated VM on shared metal? Apart from these operational checks, I will assume the machine is dedicated. In Solr a write to disk does not happen only on commit; I can think of other scenarios : 1) *Transaction log* [1] 2) 3) Spellcheck and

Re: High disk write usage

2017-07-05 Thread alessandro.benedetti
Point 2 was the ram Buffer size : *ramBufferSizeMB* sets the amount of RAM that may be used by Lucene indexing for buffering added documents and deletions before they are flushed to the Directory. maxBufferedDocs sets a limit on the number of documents buffered

Re: Collections API Overseer Status

2017-07-12 Thread alessandro.benedetti
+1 I was trying to understand a reload collection time out happening lately in a Solr Cloud cluster, and the Overseer Status was hard to decipher. More Human Readable names and some additional documentation could help here. Cheers - --- Alessandro Benedetti Search Consultant,

Re: enable fault-tolerance by default on collection?

2017-07-13 Thread alessandro.benedetti
I would recommend playing with the defaults, appends and invariants [1] elements of the requestHandler node. Identify the request handler you want to use in the solrconfig.xml and then add the parameter you want. You should be able to manage this through your source version control system. Cheers

Re: suggestors on shingles

2017-07-13 Thread alessandro.benedetti
I would recommend this blog of mine to get a better understanding of how tokenization and the suggester work together [1]. If you take a look at the FuzzyLookupFactory, you will see that it is one of the suggesters that return the entire content of the field. You may be interested in the FreeTex

Re: Do I need to declare TermVectorComponent for best MoreLikeThis results?

2017-07-13 Thread alessandro.benedetti
You don't need the TermVectorComponent at all for MLT. The reason the Term Vector is suggested for the fields you are interested in, is just because this will speed up the way the MLT will retrieve the "interesting terms" out of your seed document to build the MLT query. If you don't have the Ter

Re: suggestors on shingles

2017-07-13 Thread alessandro.benedetti
To do what? If it is a use case, please explain it to us. If it is just to check that the analysis chain worked correctly, you can check the schema browser or use Luke. If you just want to test your analysis chain, you can use the analysis tool in the Solr admin. Cheers - --- Aless

Apache Solr 4.10.x - Collection Reload times out

2017-07-14 Thread alessandro.benedetti
I have been recently facing an issue with the Collection Reload in a couple of Solr Cloud clusters : 1) re-index a collection 2) collection happily working 3) trigger collection reload 4) reload times out ( silently, no message in any of the Solr node logs) 5) no effect on the collection ( it sti

Re: Get results in multiple orders (multiple boosts)

2017-07-18 Thread alessandro.benedetti
"I have different "sort preferences", so I can't build a index and use for sorting.Maybe I have to sort by category then by source and by language or by source, then by category and by date" I would like to focus on this bit. It is ok to go for a custom function and sort at query time, but I am cu

Re: Apache Solr 4.10.x - Collection Reload times out

2017-07-20 Thread alessandro.benedetti
Thanks for the prompt response Erick, the reason I am issuing a Collection reload is that I modify the solrconfig from time to time, for example with different spellcheck and request parameter defaults. So after the upload to Zookeeper I reload the collection to reflect the modifi

Re: multiValued=false is not working in Solr 6.4 in RHEL/CentOS

2017-07-20 Thread alessandro.benedetti
I doubt it is an environment problem at all. How are you modifying your schema? How are you reloading your core/collection? Are you restarting your Solr instance? Regards - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Vi

Re: multiValued=false is not working in Solr 6.4 in RHEL/CentOS

2017-07-20 Thread alessandro.benedetti
Assuming the Solr service restart does its job, I think the only thing I would do is completely remove the data directory content, instead of just running the delete query. Bear in mind that when you delete a document in Solr, it is marked as deleted, but it potentially takes a while

Re: Apache Solr 4.10.x - Collection Reload times out

2017-07-20 Thread alessandro.benedetti
Taking a look at the 4.10.2 source I may see why the async call does not work : log.info("Reloading Collection : " + req.getParamString()); String name = req.getParams().required().get("name"); ZkNodeProps m = new ZkNodeProps(Overseer.QUEUE_OPERATION, OverseerCollectionProce

Re: Apache Solr 4.10.x - Collection Reload times out

2017-07-20 Thread alessandro.benedetti
Additional information : trying a single core reload, I identified that an entire shard is not reloading (while the other shard is). Taking a look at the "not reloading" shard (2 replicas), it seems that the core reload gets stuck here : org.apache.solr.core.SolrCores#waitAddPendingCoreOps The problem

Re: LambdaMART XML model to JSON

2017-07-24 Thread alessandro.benedetti
Hi Ryan, the issue you mentioned was mine : https://sourceforge.net/p/lemur/feature-requests/144/ My bad, it got lost in a sea of "To Dos". I still think it could be a good contribution to the library, but at the moment I think going with a custom script/app to do the transformation is the way to go

Re: Apache Solr 4.10.x - Collection Reload times out

2017-07-24 Thread alessandro.benedetti
1) nope, no big tlog or replaying problem 2) Solr just seems frozen. Not responsive and nothing in the log. Now I just tried to restart after the Zookeeper config deploy, and on restart the log completely freezes and the instances don't come up... If I clean the indexes and then start, this work

Re: FreeTextSuggester throwing error "token must not contain separator byte"

2017-07-25 Thread alessandro.benedetti
I think this bit is the problem : "I am using a Shingle filter right after the StandardTokenizer, not sure if that has anything to do with it. " When using the FreeTextLookup approach, you don't need to use shingles in your analyser, shingles are added by the suggester itself. As Erick mentioned

Re: Copy field a source of copy field

2017-07-26 Thread alessandro.benedetti
I get your point, the second KeepWordFilter is not keeping anything because the token it gets is : "hey you" and the word it is supposed to keep is "hey". Which clearly does not work. The KeepWordFilter just considers each row a single token (I may be wrong, I didn't check the code, I am just assumi

Re: Solr - google like suggestion

2017-09-18 Thread alessandro.benedetti
If you are referring to the number of words per suggestion, you may need to play with the free text lookup type [1] [1] http://alexbenedetti.blogspot.co.uk/2015/07/solr-you-complete-me.html - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd.

Re: Apache Solr 4.10.x - Collection Reload times out

2017-09-18 Thread alessandro.benedetti
I finally have an explanation, I post it here for future reference : The cause was a combination of : 1) /select request handler has default with the spellcheck ON and few spellcheck options ( such as collationQuery ON and max collation tries set to 5) 2) the firstSearcher has a warm-up query wi

Re: Knn classifier doesn't work

2017-09-18 Thread alessandro.benedetti
Hi Tommaso, you are definitely right! I see that the method : MultiFields.getTerms returns : if (termsPerLeaf.size() == 0) { return null; } As you correctly mentioned this is not handled in : org/apache/lucene/classification/document/SimpleNaiveBayesDocumentClassifier.java:115 org/apac

Re: Search by similarity?

2017-09-19 Thread alessandro.benedetti
In addition to that, I still believe More Like This is a better option for you. The reason is that the MLT is able to evaluate the interesting terms from your document (title is the only field of interest for you), and boost them accordingly. Regarding your "80% of similarity", this is more tricky.

Re: Solr returning same object in different page

2017-09-19 Thread alessandro.benedetti
Which version of Solr are you on? Are you using SolrCloud or any distributed search? In that case, I think( as already mentioned by Shawn) this could be related [1] . if it is just plain Solr, my shot in the dark is your boost function : {!boost+b=recip(ms(NOW,field1),3.16e-11,1,1)}{!boost+b=reci

Re: Cannot load LTRQParserPlugin inot my core

2017-09-20 Thread alessandro.benedetti
Hi Billy, there is a README.TXT in the contrib/ltr directory. Reading that you find this useful link [1]. From that useful link you see where the Jar of the plugin is located. Specifically : Taking a look at the contrib and dist structure, it seems quite a standard approach to keep the readme i

Re: Strange Behavior When Extracting Features

2017-09-22 Thread alessandro.benedetti
I think this has nothing to do with the LTR plugin. The problem here should be just the way you use the local params, to properly pass multi term local params in Solr you need to use *'* : efi.case_description='added couple of fiber channel' This should work. If not only the first term will be pa

Re: SOLR terminology

2017-09-28 Thread alessandro.benedetti
>From the Solr wiki[1] : *Logical* /Collection/ : It is a collection of documents which share the same logical domain and data structure *Physical* /Solr Node/ : It is a single instance of a Solr Server. From OS point of view it is a single Java Process ( internally it is the Solr Web App deploy

Re: Keeping the index naturally ordered by some field

2017-10-02 Thread alessandro.benedetti
Hi Alex, just to explore a bit your question, why do you need that ? Do you need to reduce query time ? Have you tried enabling docValues for the fields of interest ? Doc Values seem to me a pretty useful data structure when sorting is a requirement. I am curious to understand why that was not an o

Re: Distributed IDF configuration query

2017-10-02 Thread alessandro.benedetti
Hi Reth, there are some problems in the debug for the distributed IDF [1]. Your case seems different though. It has been a while since I experimented with that feature, but your config seems OK to me. What helped me a lot at that time was debugging my Solr instance. [1] https://issues.apache.org/jira/browse/SOLR

Re: solr cloud without hard commit?

2017-10-02 Thread alessandro.benedetti
Hi Erick, you said : ""mentions that for soft commit, "new segments are created that will be merged"" Wait, how did that get in there? Ignore it, I'm taking it out. " but I think you were not wrong, based on another mailing list thread message by Shawn, I read : [1] "If you are using the corre

Re: length of indexed value

2017-10-04 Thread alessandro.benedetti
Are the norms a good approximation for you ? If you preserve norms at indexing time ( it is a configuration that you can operate in the schema.xml) you can retrieve them with this specific function query : *norm(field)* Returns the "norm" stored in the index for the specified field. This is the pr

Re: Solr boost function taking precedence over relevance boosting

2017-10-05 Thread alessandro.benedetti
I would try to use an additive boost and the ^= boost operator: - name_property :( test^=2 ) will assign a fixed score of 2 if the match happens ( it is a constant score query) - additive boost will be 0http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: when transaction logs are closing?

2017-10-09 Thread alessandro.benedetti
In addition to what Emir mentioned, when Solr opens a new Transaction Log file it will delete the older ones up to some conditions : keep at least N number of records [1] and max K number of files[2]. N is specified in the solrconfig.xml ( in the update handler section) and can be documents related

Re: Rescoring from 0 - full

2017-10-09 Thread alessandro.benedetti
The weights you express could suggest a probabilistic view of your final score. The model you quoted will calculate the final score as : 0.9*scorePersonalId +0.1* originalScore The final score will NOT necessarily be 0https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#the-dismax-q
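
A small worked example of the linear combination quoted above, as a sketch only. The function and argument names are hypothetical and the input values are made up for illustration; the 0.9 and 0.1 weights come from the message.

```python
def rescored(score_personal_id, original_score,
             feature_weight=0.9, original_weight=0.1):
    """Weighted sum of a feature value and the original query score."""
    return feature_weight * score_personal_id + original_weight * original_score


# Made-up inputs: the feature fires with value 1.0 and the original
# Lucene score is 12.0, so the result is 0.9 * 1.0 + 0.1 * 12.0.
final_score = rescored(1.0, 12.0)
```

Weights that sum to 1 only keep the result inside a fixed range if both inputs are already inside that range, and the original query score has no such upper bound.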

Re: spell-check does not return collations when using search query with filter

2017-10-09 Thread alessandro.benedetti
Does spellcheck.q=polt help? How do your queries normally look? How would you like the collation to be returned? - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472

Re: Semantic Knowledge Graph

2017-10-09 Thread alessandro.benedetti
I expect the slides to be published here : https://www.slideshare.net/lucidworks?utm_campaign=profiletracking&utm_medium=sssite&utm_source=ssslideview The one you are looking for is not there yet, but keep an eye on it. Regards - --- Alessandro Benedetti Search Consultant, R&

Re: Newbie question about why represent timestamps as "float" values

2017-10-10 Thread alessandro.benedetti
Some time ago there was a Solr installation which had the same problem, and the author explained to me that the choice was made for performance reasons. Apparently he was sure that handling everything as primitive types would give a boost to the Solr searching/faceting performance. I never agreed ( and one

Re: Solr staying constant on popularity indexes

2017-10-10 Thread alessandro.benedetti
In line : /"1. No zookeeper - I have burnt my hands with some zookeeper issues in the past and it is no fun to deal with. Kafka and Storm are also trying to burden zookeeper less and less because ZK cannot handle heavy traffic."/ Where did you get this information? Is it based on some publicly rep

RE: Parsing of rq queries in LTR

2017-10-12 Thread alessandro.benedetti
I don't think this is actually that much related to LTR Solr Feature. In the Solr feature I see you specify a query with a specific query parser (field). Unless there is a bug in the SolrFeature for LTR, I expect the query parser you defined to be used[1]. This means : "rawquerystring":"{!field f

Re: Solr related questions

2017-10-13 Thread alessandro.benedetti
1) "_version_" is not "unecessary", quite the contrary: it is fundamental for Solr to work. The same goes for the types you use across your field definitions. There was a time you could see these comments in the schema.xml (doesn't seem the case anymore): 2) https://lucen

Re: Solr related questions

2017-10-13 Thread alessandro.benedetti
Nabble mutilated my reply : *Comment*: If you remove this field, you must _also_ disable the update log in solrconfig.xml or Solr won't start. _version_ and update log are required for SolrCloud *Comment*:points to the root document of a block of nested documents. Required for nested

Re: Solr related questions

2017-10-13 Thread alessandro.benedetti
The only way Solr will fetch documents is through the Data Import Handler. Take a look at the URLDataSource [1] to see if it fits. Possibly you will need to customize it. [1] https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#urldatasourc

Re: Appending fields to pre-existed document

2017-10-13 Thread alessandro.benedetti
Hi, "And all what we got only a overwriting doc came first by new one. Ok just put overwrite=false to params, and dublicating docs appeare." What is exactly the doc you get ? Are the fields originally in the first doc before the atomic update stored ? This is what you need to use : https://luce

Re: spell-check does not return collations when using search query with filter

2017-10-16 Thread alessandro.benedetti
Interesting, what happens when you pass it as spellcheck.q=polt ? What is the behavior you get ? - --- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Strange Behavior When Extracting Features

2017-10-16 Thread alessandro.benedetti
This is interesting, the EFI parameter resolution should work using the quotes independently of the query parser. At that point, the query parsers (both) receive a multi term text. Both of them should work the same. At the time I saw the mail I tried to reproduce it through the LTR module tests and

Re: HOW DO I UNSUBSCRIBE FROM GROUP?

2017-10-16 Thread alessandro.benedetti
The Terms component[1] should do the trick for you. Just use the regular expression or prefix filtering and you should be able to get the stats you want. If you were interested in extracting the DV when returning docs you may be interested in function queries and specifically this one : docfreq(f

Re: E-Commerce Search: tf-idf, tie-break and boolean model

2017-10-16 Thread alessandro.benedetti
I was having this discussion with a colleague of mine recently, about e-commerce search. Of course there are tons of things you can do to improve relevancy: Custom similarity - edismax tuning - basic user events processing - machine learning integrations - semantic search etc. The more you do, the bette

Re: spell-check does not return collations when using search query with filter

2017-10-17 Thread alessandro.benedetti
But you used : "spellcheck.q": "tag:polt", instead of : "spellcheck.q": "polt". Regards
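
A minimal sketch of the difference, assuming the main query stays field-qualified while spellcheck.q carries only the raw term. The helper name is hypothetical; q, spellcheck and spellcheck.q are the actual request parameters.

```python
from urllib.parse import urlencode, parse_qs


def spellcheck_query(main_query, raw_term):
    """Build a query string where spellcheck.q holds only the bare term."""
    params = {
        "q": main_query,           # may contain field prefixes such as tag:
        "spellcheck": "true",
        "spellcheck.q": raw_term,  # no field prefix, just the text to check
    }
    return urlencode(params)


query_string = spellcheck_query("tag:polt", "polt")
decoded = parse_qs(query_string)
```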

Re: Using pint field as uniqueKey

2017-10-17 Thread alessandro.benedetti
In addition to what Amrit correctly stated, if you need to search on your id, especially with range queries, I recommend using a copy field and leaving the id field almost as default. Cheers

Re: Influencing representing document in grouped search.

2017-10-17 Thread alessandro.benedetti
Can results collapsing[1] be of use for you ? if it is the case, you can use that feature and explore its flexibility in selecting the group head : 1) min | max for a numeric field 2) min | max for a function query 3) sort [1] https://lucene.apache.org/solr/guide/6_6/collapse-and-expand-results.h
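
The three group head options listed above can be sketched as a collapse filter query builder. This is only an illustration: the helper name and the field names are hypothetical, while the {!collapse ...} local params (field, min, max, sort) are the ones described in the reference guide.

```python
def collapse_filter(field, selector=None, value=None):
    """Build an fq value for the collapse query parser.

    selector: None (default group head), 'min', 'max' or 'sort'.
    value: a numeric field or function for min/max, a sort spec for sort.
    """
    local_params = ["field=" + field]
    if selector in ("min", "max"):
        local_params.append(f"{selector}={value}")
    elif selector == "sort":
        # A sort spec contains spaces, so it has to be quoted.
        local_params.append(f"sort='{value}'")
    elif selector is not None:
        raise ValueError("selector must be None, 'min', 'max' or 'sort'")
    return "{!collapse " + " ".join(local_params) + "}"


# 'genre', 'popularity' and 'price' are made-up field names.
by_max = collapse_filter("genre", "max", "popularity")
by_sort = collapse_filter("genre", "sort", "price asc")
```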

Re: Influencing representing document in grouped search.

2017-10-18 Thread alessandro.benedetti
If you add a filter query to your original query : fq=genre:A You know that your results ( group heads included) will just be of that genre. So I think we are not getting your question properly. Can you try to express your requirement from the beginning. Leave outside grouping or field collapsing

Re: AW: Howto verify that update is "in-place"

2017-10-18 Thread alessandro.benedetti
According to the concept of immutability that should drive Lucene's segment approach, I think Emir's observation sounds correct. Since docValues are a column-based data structure stored in segments, I guess that when an in-place update happens it just re-indexes that one field. This means we need t

Re: Facets based on sampling

2017-10-23 Thread alessandro.benedetti
Hi John, first of all, I may state the obvious, but have you tried docValues? Apart from that, a friend of mine (Diego Ceccarelli) was discussing a probabilistic implementation similar to the hyperloglog [1] to approximate facet counting. I didn't have time to take a look in detail / implement

Re: LTR feature extraction performance issues

2017-10-23 Thread alessandro.benedetti
It strictly depends on the kind of features you are using. At the moment there is just one cache for all the features. This means that even if you have 1 query dependent feature and 100 document dependent features, a different value for the query dependent one will invalidate the cache entry for the

Re: Goal: reverse chronological display Methods? (1) boost, and/or (2) disable idf

2017-10-23 Thread alessandro.benedetti
In addition : bf=recip(ms(NOW/DAY,unixdate),3.16e-11,5,0.1) is an additive boost. I tend to prefer multiplicative ones but that is up to you [1]. You can specify the order of magnitude of the values generated by that function. This means that you have control of how much the date will affect the
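
To see the order of magnitude that function generates, here is a small sketch that evaluates it outside Solr. recip(x,m,a,b) is defined as a / (m*x + b); the helper names are hypothetical and the document age stands in for ms(NOW/DAY,unixdate).

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def recip(x, m, a, b):
    """Same formula as Solr's recip function: a / (m*x + b)."""
    return a / (m * x + b)


def date_boost(age_in_days):
    """Value of recip(ms(NOW/DAY,unixdate),3.16e-11,5,0.1) for a given age."""
    return recip(age_in_days * MS_PER_DAY, 3.16e-11, 5, 0.1)


today = date_boost(0)        # 5 / 0.1, i.e. about 50
one_year = date_boost(365)   # about 4.56, since 3.16e-11 * ms_in_a_year is close to 1
```

So with these constants the additive boost ranges from about 50 for a brand new document down to under 5 after a year, which is what has to be compared against the typical relevance scores of the query.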

Re: How to Efficiently Extract Learning to Rank Similarity Features From Solr?

2017-10-24 Thread alessandro.benedetti
I think this can actually be a good idea, and I think it would require a new feature type implementation. Specifically I think you could leverage the existing data structures (such as TermVector) to calculate the matrix and then perform the calculations you need. Or maybe there is space for even a

Re: Date range queries no longer work 6.6 to 7.1

2017-10-24 Thread alessandro.benedetti
I know it is obvious, but ... have you done a full re-indexing, or did you use the index migration tool? In the latter case, it could be a bug in the tool itself.

Re: Given path of Ranklib model in Solr Model Json

2017-11-08 Thread alessandro.benedetti
I opened a ticket for RankLib a long time ago to provide support for the Solr model JSON format [1]. It is on my TO DO list but unfortunately very low on priority. Anyone who wants to contribute is welcome, I will help and commit it when ready. Cheers [1] https://sourceforge.net/p/lemur/feature-requests

Re: Faceting Word Count

2017-11-08 Thread alessandro.benedetti
Apart from the performance, getting a "word cloud" from a subset of documents is a slightly different problem than getting the facets out of it. If my understanding is correct, what you want is to extract the "significant terms" out of your results set. [1] Using faceting is a rough approximation

Re: Solr - phrase suggestion returning duplicate

2017-11-08 Thread alessandro.benedetti
Hi Ruby, I participated in the discussion at the time. It's definitely still open. It's on my long TO DO list, I hope I will be able to contribute a solution sooner or later. In case you decide to use an entire new index for the autosuggestion, you can potentially manage that on your own. But out

Re: Solr - phrase suggestion returning duplicate

2017-11-10 Thread alessandro.benedetti
"In case you decide to use an entire new index for the autosuggestion, you can potentially manage that on your own" This refers to the fact that is possible to define an index just for autocompletion. You can model the document as you prefer in this additional index, defining the field types tha

Re: Using Ltr and payload together

2017-11-13 Thread alessandro.benedetti
It depends how you want to use the payloads. If you want to use the payloads to calculate additional features, you can implement a payload feature: This feature could calculate the sum of numerical payload for the query terms in each document ( so it will be a query dependent feature and will lev

Re: Spellcheck returning suggestions for words that exist in the dictionary

2017-11-13 Thread alessandro.benedetti
Which Solr version are you using? From the documentation : "Only query words, which are absent in index or too rare ones (below maxQueryFrequency ) are considered as misspelled and used for finding suggestions. ... These parameters (maxQueryFrequency and thresholdTokenFrequency) can be a percen

Re: Sol rCloud collection design considerations / best practice

2017-11-15 Thread alessandro.benedetti
"The main motivation is to support a geo-specific relevancy model which can easily be customized without stepping into each other" Is your relevancy tuning massively index time based ? i.e. will create massively different index content based on the geo location ? If it is just query time based o

Re: TimeZone issue

2017-11-27 Thread alessandro.benedetti
Hi, it is on my TO DO list with low priority, there is a Jira issue already [1], feel free to contribute it! [1] https://issues.apache.org/jira/browse/SOLR-8952

Re: Embedded SOLR - Best practice?

2017-11-27 Thread alessandro.benedetti
When you say " caching 100.000 docs" what do you mean ? being able to quickly find information in a corpus which increases in size ( 100.000 docs) everyday ? I second Erick, I think this is fairly normal Solr use case. If you really care about fast searches, you will get a fairly acceptable defaul

Re: Solr Spellcheck

2017-11-27 Thread alessandro.benedetti
Do you mean you are over-spellchecking? Correcting even "not misspelled words"? Can you give us the request handler configuration, spellcheck configuration and the schema? Cheers

Inverted Index positions vs Term Vector positions

2017-11-27 Thread alessandro.benedetti
Hi all, it may sounds a silly question, but is there any reason that the term positions in the inverted index are using 1 based numbering while the Term Vector positions are using a 0 based numbering[1] ? This may affect different areas in Solr and cause problems which are quite tricky to spot. R

RE: Solr Spellcheck

2017-11-28 Thread alessandro.benedetti
Your spellcheck configurations are quite extensive! In particular you specified : 0.01 This means that if the term appears in less than 1% of total docs it will be considered misspelled. Is cholera occurring in more than 1% of the docs in your corpus?
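
The 1% rule described above can be sketched as a simple check. This is only an illustration of the arithmetic, assuming the 0.01 value is a fraction of the total number of documents; the function name and the example counts are made up.

```python
def considered_misspelled(doc_freq, total_docs, threshold=0.01):
    """True when the term appears in less than `threshold` of all docs."""
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    return doc_freq / total_docs < threshold


# Made-up corpus of 1000 docs: a term found in 5 docs (0.5%) is treated
# as misspelled, a term found in 50 docs (5%) is not.
rare_term = considered_misspelled(5, 1000)
common_term = considered_misspelled(50, 1000)
```

So a correctly spelled but rare word can still get suggestions if its document frequency falls under the threshold.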

RE: Solr Spellcheck

2017-11-29 Thread alessandro.benedetti
"Can you please suggest suitable configuration for spell check to work correctly. I am indexing all the words in one column. With current configuration I am not getting good suggestions " This is very vague. Spellchecking is working correctly according to your configurations... Let's start from

Re: Skewed IDF in multi lingual index, again

2017-12-04 Thread alessandro.benedetti
Hi Markus, just out of interest, why did " It was solved back then by using docCount instead of maxDoc when calculating idf, it worked really well!" solve the problem? I assume you are using different fields, one per language. Each field is appearing on a different number of docs I guess. e.g. t

Re: Skewed IDF in multi lingual index, again

2017-12-04 Thread alessandro.benedetti
Furthermore, taking a look at the code for BM25 similarity, it seems to me it is currently working right: - docCount is used per field if != -1 /** * Computes a score factor for a simple term and returns an explanation * for that score factor. * * * The default implementation us
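To see why the choice between docCount and maxDoc matters in a multilingual index, here is a minimal sketch of the BM25 idf formula as I understand it from Lucene, log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). The numbers are made up for illustration: a language-specific field present in only 10,000 docs of a 1,000,000 doc index.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """BM25-style idf: log(1 + (N - n + 0.5) / (n + 0.5)),
    where N is the number of docs considered and n the term's doc frequency."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Illustrative (assumed) numbers, not taken from the thread.
max_doc = 1_000_000        # all docs in the index, every language
field_doc_count = 10_000   # docs that actually have this language's field
doc_freq = 5_000           # docs containing the term in that field

idf_with_max_doc = bm25_idf(doc_freq, max_doc)
idf_with_doc_count = bm25_idf(doc_freq, field_doc_count)
```

With maxDoc the term looks rare (it is in 0.5% of all docs) and gets a high idf, while with the per-field docCount it is in half of the docs that have the field and gets a much lower idf.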

Re: Solr score use cases

2017-12-04 Thread alessandro.benedetti
I would like to stress how important what Erick explained is. A lot of times people want to use the score to show it to the users, calculate a probability, or do weird calculations. The score is used to rank results, given a query. To give a local ordering. This is the only useful information for the end

Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread alessandro.benedetti
"Lucene/Solr doesn't actually delete documents when you delete them, it just marks them as deleted. I'm pretty sure that the difference between docCount and maxDoc is deleted documents. Maybe I don't understand what I'm talking about, but that is the best I can come up with. " Thanks Shawn, y

Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread alessandro.benedetti
Thanks Yonik and thanks Doug. I agree with Doug on adding a few generic test corpora that Jenkins automatically runs some metrics on, to verify that Apache Lucene/Solr changes don't affect a golden truth too much. This of course can be very complex, but I think it is a direction the Apache Lucene/Solr comm

Re: Can't get spelling suggestions to work properly

2017-01-12 Thread alessandro.benedetti
Hi Jimi, taking a look at the *maxQueryFrequency* param: Your understanding is correct. 1) we don't provide misspelled suggestions if we set the param to 1, and we have a minimum of 1 doc freq for the term. 2) we don't provide misspelled suggestions if the doc frequency of the term is greater

Re: How to train the model using user clicks when use ltr(learning to rank) module?

2017-01-12 Thread alessandro.benedetti
Hi Jeffery, I just noticed your comment on my blog, I will try to respond asap. Regarding your doubt, I second Diego's readme. If you have other user signals as well (apart from clicks) it may be interesting to use them too. User signals such as: "Add To Favorites", "Add to the basket", "Shar

Re: Commit required after delete ?

2017-01-12 Thread alessandro.benedetti
Interesting Michael, can you pass me the code reference? Cheers -- View this message in context: http://lucene.472066.n3.nabble.com/Commit-required-after-delete-tp4312697p4313692.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: A feature idea for discussion -- fields that can only be explicitly retrieved

2017-01-17 Thread alessandro.benedetti
Shawn Heisey wrote > If the data for a field in the results comes from docValues instead of > stored fields, I don't think it is compressed, which hopefully means > that if a field is NOT requested, the corresponding docValues data is > never read. I think we need to make a consideration here.

Re: Information on classifier based key word suggestion

2017-01-24 Thread alessandro.benedetti
Hi Shamik, for classification you can take a look at the Lucene module and the Solr integration (through UpdateRequestProcessor [1]). Unfortunately I didn't have the time to work on the request handler version [2], anyway you are free to contribute! Regarding the extraction of interesting terms

Re: Upgrade SOLR version - facets perfomance regression

2017-01-24 Thread alessandro.benedetti
Hi Solr, I admit the issue you mentioned has not been transparently solved, and indeed you would need to explicitly use method=uif to get the 4.10.1 behavior. This is valid if you were using the fc/fcs approaches with high cardinality fields. In case your facet method is enum (Term Enumeration),
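As a rough illustration of the suggestion above, the facet method can be set explicitly on the request through the facet.method parameter. The sketch below only builds the request URL; the host, collection name and field name are placeholders, not values from the thread.

```python
from urllib.parse import urlencode


def facet_query_url(base_url: str, field: str, method: str = "uif") -> str:
    """Build a Solr select URL that facets on `field` with an explicit facet.method."""
    params = {
        "q": "*:*",
        "rows": 0,               # only the facet counts are of interest here
        "facet": "true",
        "facet.field": field,
        "facet.method": method,  # e.g. uif, fc, fcs or enum
    }
    return f"{base_url}/select?{urlencode(params)}"


# Placeholder host, collection and field names.
url = facet_query_url("http://localhost:8983/solr/mycollection", "category")
```

Running the same request with different method values against the same field is a simple way to compare timings before and after an upgrade.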

Re: Upgrade SOLR version - facets perfomance regression

2017-01-27 Thread alessandro.benedetti
Which kind of field are you faceting on? Cardinality? Field type? Does it have docValues? Which facet algorithm are you using? Which facet parameters? Cheers -- View this message in context: http://lucene.472066.n3.nabble.com/Upgrade-SOLR-version-facets-perfomance-regression-tp4315027p4317513.html S

Re: Documents issue

2017-01-27 Thread alessandro.benedetti
I may be wrong and don't have time to check the code in detail now, but I would say you need to define the default in the destination field as well. The copy field should take as input the plain content of the field (which is null) and then pass that content to the destination field. Properties

Distributed IDF in inter collections distributed queries

2017-01-27 Thread alessandro.benedetti
Hi all, I was playing a bit with the distributed IDF, I debugged and explored the code a lot and it is a nice feature in a sharded environment. I tried to see what the behaviour is in case we run a distributed query across collections (...&collection=a,b,c) Distributed IDF should work in this sce

Re: Distributed IDF in inter collections distributed queries

2017-01-27 Thread alessandro.benedetti
I have an update on this, I have identified at least 2 bugs: 1) Real score and debug score are not aligned. When we operate a shard request with purpose '16388' (GET_TOP_IDS,SET_TERM_STATS) we correctly pass the global collection stats and we calculate the real score. When we operate a shard reques

RE: Distributed IDF in inter collections distributed queries

2017-01-27 Thread alessandro.benedetti
Thanks Markus, I commented on the Jira issue with a very naive approach to solve that. It's a shot in the dark, I will double check if it makes sense at all :) Cheers -- View this message in context: http://lucene.472066.n3.nabble.com/Distributed-IDF-in-inter-collections-distributed-queries-tp431
