Access permission
Hi, I'm indexing data off a DB. The data is secured with access permissions. That is, record-A can be seen by users-x, record-B can be seen by users-y, and record-C can be seen by both users x and y. Even more, the group access permissions can change over time. The question I have is this: how do I handle this in Solr? Is there anything I can do during index and / or search time? What's the best practice for handling access permissions in search? Thanks! - MJ
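A common way to handle this (a sketch only; the field and group names below are made up, not anything Solr prescribes) is to index the permitted groups with each record and then filter at search time:

    schema.xml -- a multi-valued field holding the groups allowed to see a record:
      <field name="acl_groups" type="string" indexed="true" stored="false" multiValued="true"/>

    index time:
      record-A  ->  acl_groups: group_x
      record-B  ->  acl_groups: group_y
      record-C  ->  acl_groups: group_x, group_y

    search time -- the application resolves the current user's groups and appends a filter query:
      q=<user query>&fq=acl_groups:(group_x OR group_y)

Because the user-to-group mapping is resolved outside Solr, group membership can change without re-indexing; only a change to a record's own permissions requires re-indexing that record.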
Cores and ranking (search quality)
Hi, I have data which I will index and search on. This data is well defined, such that I can index it into a single core or multiple cores, like so: core_1:Jan2015, core_2:Feb2015, core_3:Mar2015, etc. My question is this: if I put my data in multiple cores and use distributed search, will the ranking be different than if I had all my data in a single core? If yes, how will it be different? Also, will facet and more-like-this quality / results be the same? Also, reading the distributed search wiki (http://wiki.apache.org/solr/DistributedSearch) it looks like Solr does the search and result merging (all I have to do is issue a search), is this correct? Thanks! - MJ
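For reference, a distributed request is an ordinary query sent to one core with a shards parameter listing the cores to fan out to (host and core names here are placeholders); Solr queries each shard and merges the results before responding:

    http://host:8983/solr/core_1/select?q=some+query&shards=host:8983/solr/core_1,host:8983/solr/core_2,host:8983/solr/core_3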
RE: Cores and ranking (search quality)
Help me understand this better (regarding ranking). Suppose I have two docs that are 100% identical with the exception of uid (which is stored but not indexed). In a single-core setup, I search "xyz" and those 2 docs end up ranking as #1 and #2. When I switch over to a two-core setup, doc-A goes to core-A (which has 10 records) and doc-B goes to core-B (which has 100,000 records). Now, are you saying that in the two-core setup, if I search on "xyz" (just like in the single-core setup), this time I will not see doc-A and doc-B as #1 and #2 in ranking? That is, are you saying doc-A may now be somewhere at the top / bottom far away from doc-B? If so, which will be #1: the doc off core-A (that has 10 records) or doc-B off core-B (that has 100,000 records)? If I got all this right, are you saying SOLR-1632 will fix this issue such that the end result will now be as if I had 1 core? - MJ -Original Message- From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] Sent: Thursday, March 5, 2015 9:06 AM To: solr-user@lucene.apache.org Subject: Re: Cores and ranking (search quality) On Thu, 2015-03-05 at 14:34 +0100, johnmu...@aol.com wrote: > My question is this: if I put my data in multiple cores and use > distributed search will the ranking be different if I had all my data > in a single core? Yes, it will be different. The practical impact depends on how homogeneous your data are across the shards and how large your shards are. If you have small and dissimilar shards, your ranking will suffer a lot. Work is being done to remedy this: https://issues.apache.org/jira/browse/SOLR-1632 > Also, will facet and more-like-this quality / result be the same? It is not formally guaranteed, but for most practical purposes, faceting on multi-shards will give you the same results as single-shards. I don't know about more-like-this. My guess is that it will be affected in the same way that standard searches are. > Also, reading the distributed search wiki > (http://wiki.apache.org/solr/DistributedSearch) it looks like Solr > does the search and result merging (all I have to do is issue a > search), is this correct? Yes. From a user-perspective, searches are no different. - Toke Eskildsen, State and University Library, Denmark
Re: Cores and ranking (search quality)
(reposting this to see if anyone can help) Help me understand this better (regarding ranking). Suppose I have two docs that are 100% identical with the exception of uid (which is stored but not indexed). In a single-core setup, I search "xyz" and those 2 docs end up ranking as #1 and #2. When I switch over to a two-core setup, doc-A goes to core-A (which has 10 records) and doc-B goes to core-B (which has 100,000 records). Now, are you saying that in the two-core setup, if I search on "xyz" (just like in the single-core setup), this time I will not see doc-A and doc-B as #1 and #2 in ranking? That is, are you saying doc-A may now be somewhere at the top / bottom far away from doc-B? If so, which will be #1: the doc off core-A (that has 10 records) or doc-B off core-B (that has 100,000 records)? If I got all this right, are you saying SOLR-1632 will fix this issue such that the end result will now be as if I had 1 core? - MJ -Original Message- From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] Sent: Thursday, March 5, 2015 9:06 AM To: solr-user@lucene.apache.org Subject: Re: Cores and ranking (search quality) On Thu, 2015-03-05 at 14:34 +0100, johnmu...@aol.com wrote: > My question is this: if I put my data in multiple cores and use > distributed search will the ranking be different if I had all my data > in a single core? Yes, it will be different. The practical impact depends on how homogeneous your data are across the shards and how large your shards are. If you have small and dissimilar shards, your ranking will suffer a lot. Work is being done to remedy this: https://issues.apache.org/jira/browse/SOLR-1632 > Also, will facet and more-like-this quality / result be the same? It is not formally guaranteed, but for most practical purposes, faceting on multi-shards will give you the same results as single-shards. I don't know about more-like-this. My guess is that it will be affected in the same way that standard searches are. > Also, reading the distributed search wiki > (http://wiki.apache.org/solr/DistributedSearch) it looks like Solr > does the search and result merging (all I have to do is issue a > search), is this correct? Yes. From a user-perspective, searches are no different. - Toke Eskildsen, State and University Library, Denmark
Re: Cores and ranking (search quality)
Thanks Erick for trying to help, I really appreciate it. Unfortunately, I'm still stuck. There are times one must know the inner working and behavior of the software to make design decision and this one is one of them. If I know the inner working of Solr, I would not be asking. In addition, I'm in the design process, so I'm not able to fully test. Beside my test could be invalid because I may not set it up right due to my lack of understanding the inner working of Solr. Given this, I hope you don't mind me asking again. If I have two cores, one core has 10 docs another has 100,000 docs. I then submit two docs that are 100% identical (with the exception of the unique-ID fields, which is stored but not indexed) one to each core. The question is, during search, will both of those docs rank near each other or not? If so, this is great because it will behave the same as if I had one core and index both docs to this single core. If not, which core's doc will rank higher and how far apart the two docs be from each other in the ranking? Put another way: are docs from the smaller core (the one has 10 docs only) rank higher or lower compared to docs from the larger core (the one with 100,000) docs? Thanks! -- MJ -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Tuesday, March 10, 2015 11:47 AM To: solr-user@lucene.apache.org Subject: Re: Cores and and ranking (search quality) SOLR-1632 will certainly help. But trying to predict whether your core A or core B will appear first doesn't really seem like a good use of time. If you actually have a setup like you describe, add &debug=all to your query on both cores and you'll see all the gory detail of how the scores are calculated, providing a definitive answer in _your_ situation. Best, Erick On Mon, Mar 9, 2015 at 5:44 AM, wrote: > (reposing this to see if anyone can help) > > > Help me understand this better (regarding ranking). > > If I have two docs that are 100% identical with the exception of uid (which > is stored but not indexed). In a single core setup, if I search "xyz" such > that those 2 docs end up ranking as #1 and #2. When I switch over to two > core setup, doc-A goes to core-A (which has 10 records) and doc-B goes to > core-B (which has 100,000 records). > > Now, are you saying in 2 core setup if I search on "xyz" (just like in singe > core setup) this time I will not see doc-A and doc-B as #1 and #2 in ranking? > That is, are you saying doc-A may now be somewhere at the top / bottom far > away from doc-B? If so, which will be #1: the doc off core-A (that has 10 > records) or doc-B off core-B (that has 100,000 records)? > > If I got all this right, are you saying SOLR-1632 will fix this issue such > that the end result will now be as if I had 1 core? > > - MJ > > > -Original Message- > From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] > Sent: Thursday, March 5, 2015 9:06 AM > To: solr-user@lucene.apache.org > Subject: Re: Cores and and ranking (search quality) > > On Thu, 2015-03-05 at 14:34 +0100, johnmu...@aol.com wrote: >> My question is this: if I put my data in multiple cores and use >> distributed search will the ranking be different if I had all my data >> in a single core? > > Yes, it will be different. The practical impact depends on how homogeneous > your data are across the shards and how large your shards are. If you have > small and dissimilar shards, your ranking will suffer a lot. 
> > Work is being done to remedy this: > https://issues.apache.org/jira/browse/SOLR-1632 > >> Also, will facet and more-like-this quality / result be the same? > > It is not formally guaranteed, but for most practical purposes, faceting on > multi-shards will give you the same results as single-shards. > > I don't know about more-like-this. My guess is that it will be affected in > the same way that standard searches are. > >> Also, reading the distributed search wiki >> (http://wiki.apache.org/solr/DistributedSearch) it looks like Solr >> does the search and result merging (all I have to do is issue a >> search), is this correct? > > Yes. From a user-perspective, searches are no different. > > - Toke Eskildsen, State and University Library, Denmark >
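To make Erick's suggestion concrete: run the same query against each core with debugging enabled and compare the "explain" sections, which break every score down into its tf/idf components (URLs are illustrative; on older versions use debugQuery=true instead of debug=all):

    http://host:8983/solr/core-A/select?q=xyz&debug=all
    http://host:8983/solr/core-B/select?q=xyz&debug=all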
Re: Cores and ranking (search quality)
Thanks Walter. The design decision I'm trying to solve is this: using multiple cores, will my ranking be impacted vs. using single core? I have records to index and each record can be grouped into object-types, such as object-A, object-B, object-C, etc. I have a total of 30 (maybe more) object-types. There may be only 10 records of object-A, but 10 million records of object-B or 1 million of object-C, etc. I need to be able to search against a single object-type and / or across all object-types. >From my past experience, in a single core setup, if I have two identical >records, and I search on the term " XYZ" that matches one of the records, the >second record ranks right next to the other (because it too contains "XYZ"). >This is good and is the expected behavior. If I want to limit my search to an >object-type, I AND "XYZ" with that object-type. So all is well. What I'm considering to do for my new design is use multi-cores and distributed search. I am considering to create a core for each object-type: core-A will hold records from object-A, core-B will hold records from object-B, etc. Before I can make a decision on this design, I need to know how ranking will be impacted. Going back to my earlier example: if I have 2 identical records, one of them went to core-A which has 10 records, and the other went to core-B which has 10 million records, using distributed search, if I now search across all cores on the term " XYZ" (just like in the single core case), it will match both of those records all right, but will those two records be ranked next to each other just like in the single core case? If not, which will rank higher, the one from core-A or the one from core-B? My concern is, using multi-cores and distributed search means I will give up on rank quality when records are not distributed across cores evenly. If so, than maybe this is not a design I can use. - MJ -Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: Tuesday, March 10, 2015 2:39 PM To: solr-user@lucene.apache.org Subject: Re: Cores and and ranking (search quality) On Mar 10, 2015, at 10:17 AM, johnmu...@aol.com wrote: > If I have two cores, one core has 10 docs another has 100,000 docs. I then > submit two docs that are 100% identical (with the exception of the unique-ID > fields, which is stored but not indexed) one to each core. The question is, > during search, will both of those docs rank near each other or not? […] > > Put another way: are docs from the smaller core (the one has 10 docs only) > rank higher or lower compared to docs from the larger core (the one with > 100,000) docs? These are not quite the same question. tf.idf ranking depends on the other documents in the collection (the idf term). With 10 docs, the document frequency statistics are effectively random noise, so the ranking is unpredictable. Identical documents should rank identically, but whether they are higher or lower in the two cores depends on the rest of the docs. idf statistics don’t settle down until at least 10K docs. You still sometimes see anomalies under a million documents. What design decision do you need to make? We can probably answer that for you. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)
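To put rough numbers on Walter's point (an illustration only, using the classic Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)) and made-up document frequencies):

    core-A: 10 docs,      "xyz" in 2 of them   ->  idf ~ 1 + ln(10 / 3)       ~ 2.2
    core-B: 100,000 docs, "xyz" in 50 of them  ->  idf ~ 1 + ln(100,000 / 51) ~ 8.6

Two otherwise identical documents can therefore score very differently depending on which core they landed in, which is why the merged ranking can drift from the single-core ranking.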
Re: Cores and ranking (search quality)
Thanks Walter. This explains a lot. - MJ -Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: Tuesday, March 10, 2015 4:41 PM To: solr-user@lucene.apache.org Subject: Re: Cores and and ranking (search quality) If the documents are distributed randomly across shards/cores, then the statistics will be similar in each core and the results will be similar. If the documents are distributed semantically (say, by topic or type), the statistics of each core will be skewed towards that set of documents and the results could be quite different. Assume I have tech support documents and I put all the LaserJet docs in one core. That term is very common in that core (poor idf) and rare in other cores (strong idf). But for the query “laserjet”, all the good answers are in the LaserJet-specific core, where they will be scored low. An identical document that mentions “LaserJet” once will score fairly low in the LaserJet-specific collection and fairly high in the other collection. Global IDF fixes this, by using corpus-wide statistics. That’s how we ran Infoseek and Ultraseek in the late 1990’s. Random allocation to cores avoids it. If you have significant traffic directed to one object type AND you need peak performance, you may want to segregate your cores by object type. Otherwise, I’d let SolrCloud spread them around randomly and filter based on an object type field. That should work well for most purposes. Any core with less than 1000 records is likely to give somewhat mysterious results. A word that is common in English, like “next”, will only be in one document and will score too high. A less-common word, like “unreasonably”, will be in 20 and will score low. You need lots of docs for the language statistics to even out. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Mar 10, 2015, at 1:23 PM, johnmu...@aol.com wrote: > Thanks Walter. > > The design decision I'm trying to solve is this: using multiple cores, will > my ranking be impacted vs. using single core? > > I have records to index and each record can be grouped into object-types, > such as object-A, object-B, object-C, etc. I have a total of 30 (maybe more) > object-types. There may be only 10 records of object-A, but 10 million > records of object-B or 1 million of object-C, etc. I need to be able to > search against a single object-type and / or across all object-types. > > From my past experience, in a single core setup, if I have two identical > records, and I search on the term " XYZ" that matches one of the records, the > second record ranks right next to the other (because it too contains "XYZ"). > This is good and is the expected behavior. If I want to limit my search to > an object-type, I AND "XYZ" with that object-type. So all is well. > > What I'm considering to do for my new design is use multi-cores and > distributed search. I am considering to create a core for each object-type: > core-A will hold records from object-A, core-B will hold records from > object-B, etc. Before I can make a decision on this design, I need to know > how ranking will be impacted. 
> > Going back to my earlier example: if I have 2 identical records, one of them > went to core-A which has 10 records, and the other went to core-B which has > 10 million records, using distributed search, if I now search across all > cores on the term " XYZ" (just like in the single core case), it will match > both of those records all right, but will those two records be ranked next to > each other just like in the single core case? If not, which will rank > higher, the one from core-A or the one from core-B? > > My concern is, using multi-cores and distributed search means I will give up > on rank quality when records are not distributed across cores evenly. If so, > than maybe this is not a design I can use. > > - MJ > > -Original Message- > From: Walter Underwood [mailto:wun...@wunderwood.org] > Sent: Tuesday, March 10, 2015 2:39 PM > To: solr-user@lucene.apache.org > Subject: Re: Cores and and ranking (search quality) > > On Mar 10, 2015, at 10:17 AM, johnmu...@aol.com wrote: > >> If I have two cores, one core has 10 docs another has 100,000 docs. I then >> submit two docs that are 100% identical (with the exception of the unique-ID >> fields, which is stored but not indexed) one to each core. The question is, >> during search, will both of those docs rank near each other or not? […] >> >> Put another way: are docs from the smaller core (the one has 10 docs only) >> rank higher or lower compared to docs from the larger core (the one with >> 100,000) docs? > > These are not quite the same question. > > tf.idf ranking depends on the other documents in the collection (the idf > term). With 10 docs, the document frequency statistics are effectively random > noise, so the ranking is unpredictable. > > Identical documents s
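A sketch of the alternative Walter recommends — one index, documents of all types mixed together, with the object type kept as a plain field (the field name is illustrative):

    search one type:     q=XYZ&fq=object_type:object_A
    search a few types:  q=XYZ&fq=object_type:(object_A OR object_B)
    search all types:    q=XYZ

Since every document shares the same corpus statistics, two identical records get the same score regardless of their type, and the fq only restricts which documents are returned.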
Re: [Poll]: User need for Solr security
I would love to see record level (or even field level) restricted access in Solr / Lucene. This should be group level, LDAP like or some rule base (which can be dynamic). If the solution means having a second core, so be it. The following is the closest I found: https://wiki.apache.org/solr/SolrSecurity#Document_Level_Security but I cannot use Manifold CF (Connector Framework). Does anyone know how Manifold does it? - MJ -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Thursday, March 12, 2015 6:51 PM To: solr-user@lucene.apache.org Subject: RE: [Poll]: User need for Solr security Jan - we don't really need any security for our products, nor for most clients. However, one client does deal with very sensitive data so we proposed to encrypt the transfer of data and the data on disk through a Lucene Directory. It won't fill all gaps but it would adhere to such a client's guidelines. I think many approaches of security in Solr/Lucene would find advocates, be it index encryption or authentication/authorization or transport security, which is now possible. I understand the reluctance of the PMC, and i agree with it, but some users would definitately benefit and it would certainly make Solr/Lucene the search platform to use for some enterprises. Markus -Original message- > From:Henrique O. Santos > Sent: Thursday 12th March 2015 23:43 > To: solr-user@lucene.apache.org > Subject: Re: [Poll]: User need for Solr security > > Hi, > > I’m currently working with indexes that need document level security. Based > on the user logged in, query results would omit documents that this user > doesn’t have access to, with LDAP integration and such. > > I think that would be nice to have on a future Solr release. > > Henrique. > > > On Mar 12, 2015, at 7:32 AM, Jan Høydahl wrote: > > > > Hi, > > > > Securing various Solr APIs has once again surfaced as a discussion > > in the developer list. See e.g. SOLR-7236 Would be useful to get some > > feedback from Solr users about needs "in the field". > > > > Please reply to this email and let us know what security aspect(s) would be > > most important for your company to see supported in a future version of > > Solr. > > Examples: Local user management, AD/LDAP integration, SSL, > > authenticated login to Admin UI, authorization for Admin APIs, e.g. > > admin user vs read-only user etc > > > > -- > > Jan Høydahl, search solution architect Cominvent AS - > > www.cominvent.com > > > >
Which Lucene search syntax is faster
Hi, Given the following Lucene document that I’m adding to my index (and I expect to have over 10 million of them, each with various sizes from 1 Kb to 50 Kb): PDF Some name Some summary Who owns this 10 1234567890 DOC Some name Some summary Who owns this 10 0987654321 My question is this: what Lucene search syntax will give me back results the fastest? If my user is interested in finding data within the “title” and “owner” fields only, for “doc_type” “DOC”, should I build my Lucene search syntax as: 1) skyfall ian fleming AND doc_type:DOC 2) title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR fleming) AND doc_type:DOC 3) Something else I don't know about. Of the 10 million documents I will be indexing, 80% will be of "doc_type" PDF, and about 10% of type DOC, so please keep that in mind as a factor (if that will mean anything in terms of which syntax I should use). Thanks in advance, - MJ
Re: Which Lucene search syntax is faster
Thank you Shawn and Erick for the quick response. A follow-up question. Based on https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter, I see the "fl" (field list) parameter. Does this mean I can build my Lucene search syntax as follows: q=skyfall OR ian OR fleming&fl=title&fl=owner&fq=doc_type:DOC And get the same result as (per Shawn's example, changed a bit to add OR): q=title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR fleming)&fq=doc_type:DOC Btw, my default search operator is set to AND. My need is to find whatever the user types in both of those two fields (or maybe some other fields, which is controlled by the UI). For example, a user types "skyfall ian fleming" and selects 3 fields, and wants to narrow down to doc_type DOC. - MJ -Original Message- From: Erick Erickson To: solr-user Sent: Wed, Apr 30, 2014 5:33 pm Subject: Re: Which Lucene search syntax is faster I'd add that I think you're worrying about the wrong thing. 10M documents is not very many by modern Solr standards. I rather suspect that you won't notice much difference in performance due to how you construct the query. Shawn's suggestion to use fq clauses is spot on, though. fq clauses are re-used (see filterCache in solrconfig.xml). My rule of thumb is to use fq clauses for most everything that does NOT contribute to scoring... Best, Erick On Wed, Apr 30, 2014 at 2:18 PM, Shawn Heisey wrote: > On 4/30/2014 2:29 PM, johnmu...@aol.com wrote: >> My question is this: what Lucene search syntax will give me back results the fastest? If my user is interested in finding data within “title” and “owner” fields only “doc_type” “DOC”, should I build my Lucene search syntax as: >> >> 1) skyfall ian fleming AND doc_type:DOC > > If your default field is text, I'm fairly sure this will become > equivalent to the following which is probably NOT what you want. > Parentheses can be very important. > > text:skyfall OR text:ian OR (text:fleming AND doc_type:DOC) > >> 2) title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR fleming) AND doc_type:DOC > > This kind of query syntax is probably what you should shoot for. Not > from a performance perspective -- just from the perspective of making > your queries completely correct. Note that the +/- syntax combined with > parentheses is far more precise than using AND/OR/NOT. > >> 3) Something else I don't know about. > > The edismax query parser is very powerful. That might be something > you're interested in. > > https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser > > >> Of the 10 million documents I will be indexing, 80% will be of "doc_type" PDF, and about 10% of type DOC, so please keep that in mind as a factor (if that will mean anything in terms of which syntax I should use). > > For the most part, whatever general query format you choose to use will > not matter very much. There are exceptions, but mostly Solr (Lucene) is > smart enough to convert your query to an efficient final parsed format. > Turn on the debugQuery parameter to see what it does with each query. > > Regardless of whether you use the standard lucene query parser or > edismax, incorporate filter queries into your query constructing logic. > Your second example above would be better to express like this, with the > default operator set to OR. 
This uses both q and fq parameters: > > q=title:(skyfall ian fleming) owner:(skyfall ian fleming)&fq=doc_type:DOC > > https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter > > Thanks, > Shawn >
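One note on the follow-up above: fl only controls which stored fields are returned in the response, not which fields are searched. With edismax the searched fields go in qf, so the request could look roughly like this (parameter values are illustrative, and the usual default-operator / mm caveats apply):

    q=skyfall ian fleming&defType=edismax&qf=title owner&fq=doc_type:DOC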
Using fq as OR
Hi, Currently, I'm building my search as follows: q=(search string ...) AND (type:type_a OR type:type_b OR type:type_c OR ...) Which means anything I search for is AND'ed with the requirement that the document's type field be "type_a", "type_b", "type_c", etc. (I have defaultOperator set to "AND") Now, I need to use "fq", but I'm not sure how to build my search string to get the same result!! I have tried the following: q=search string ...&fq=type:type_a&fq=type:type_b&fq=type:type_c&... But this isn't the same because each additional "fq" is now being treated as AND (keep in mind, I have defaultOperator set to "AND" and I cannot change that). I have tried the following: q=search string ...&fq=type:(type_a OR type_b OR type_c OR ...) But the result I get back is not the same. Thanks in advance !!! -- MJ
Re: Using fq as OR
Answering Jack's question first: the result is different, by a few counts, but I found my problem: I was using the wrong syntax in my code vs. what I posted here: I was using q=(search string ...) AND (type:type_a OR type_b OR type_c OR ...) (see how I left out "type:" from "type_b" and "type_c", etc.)?! Shawn and all, now the hit count is the same but the ranking is totally different, how come?!!! I'm not using edismax, I'm using the default query parser, and I'm also using the default sort. You said the "order" will likely be different, which it is, but why? If I cannot explain it to my users, they will be confused, because they can type in the search syntax directly (when "fq" is not used) and expect to see the same result as when I programmatically apply "fq" in my code. Same data, but a different path, giving me different ranking results, is not good. -- MJ -Original Message- From: Shawn Heisey To: solr-user Sent: Wed, May 21, 2014 11:42 am Subject: Re: Using fq as OR On 5/21/2014 9:26 AM, johnmu...@aol.com wrote: > Currently, I'm building my search as follows: > > > q=(search string ...) AND (type:type_a OR type:type_b OR type:type_c OR ...) > > > Which means anything I search for will be AND'ed to be in either fields that have "type_a", "type_b", "type_c", etc. (I have defaultOperator set to "AND") > > > Now, I need to use "fq" so I'm not sure how to build my search string to get the same result!! > > > I have tried the following: > > > q=search string ...&fq=type:type_a&fq=type:type_b&fq=type:type_c&... > > > But this isn't the same because each additional "fq" is now being treated as AND (keep in mind, I have defaultOperator set to "AND" and I cannot change that). > > > I have tried the following: > > > q=search string ...&fq=type:(type_a OR type_b OR type_c OR ...) > > > But the result I get back is not the same. If you are using the standard (lucene) query parser for your queries, then fq should behave exactly the same. If you are using a different query parser (edismax, for example) then fq may not behave the same, because it will use the lucene query parser. With the standard query parser, if your original query looks like the following: q=(query) AND (filter) The query below should produce exactly the same results -- although if you are using the default relevance sort, the *order* is likely to be different, because filter queries do not affect the document scores, but everything in the q parameter does. q=(query)&fq=(filter) Thanks, Shawn
Re: Using fq as OR
Hi Jack, I'm going after speed per: https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter If using "fq" means the ranking will now be different, I need to understand why. Even more, I'm now wondering which ranking is correct, the one with "fq" or without?!!! I'm now more puzzled about this than ever. If the following two q=(search string ...) AND (type:type_a OR type:type_b OR type:type_c OR ...) q=search string ...&fq=type:(type_a OR type_b OR type_c OR ...) will not give me the same ranking, then why? -- MJ -Original Message- From: Jack Krupansky To: solr-user Sent: Wed, May 21, 2014 5:06 pm Subject: Re: Using fq as OR The whole point of a filter query is to hide data but without impacting the scoring for the non-hidden data. A second goal is performance since the filter query can be cached. So, the immediate question for you is whether you really want a true filter query, or if you actually do what the filtering terms to participate in the document scoring. In other words, what exactly were you trying to achieve by using fq? -- Jack Krupansky -Original Message- From: johnmu...@aol.com Sent: Wednesday, May 21, 2014 12:19 PM To: solr-user@lucene.apache.org Subject: Re: Using fq as OR Answering Jack's question first: the result is different, by few counts, but I found my problem:I was using the wrong syntax in my code vs. what I posted here: I was using q=(search string ...) AND (type:type_a OR type_b OR type_c OR ...) (see how I left out "type:" from "type_b" and "type_c", etc.?! Shawn and all, now the hit count is the same but ranking is totally different, how come ?!!! I'm not using edismax, I'm using the default query parser, I'm also using the default sort. You said the "order" will likely be different, which it is, why? If I cannot explain it to my users, they will be confused because they can type in directly the search syntax (when "fq" is not used) and expect to see the same result for when I grammatically in my code apply "fq". Same data, but different path, giving me different rank result, is not good. -- MJ -Original Message- From: Shawn Heisey To: solr-user Sent: Wed, May 21, 2014 11:42 am Subject: Re: Using fq as OR On 5/21/2014 9:26 AM, johnmu...@aol.com wrote: > Currently, I'm building my search as follows: > > > q=(search string ...) AND (type:type_a OR type:type_b OR type:type_c > OR ...) > > > Which means anything I search for will be AND'ed to be in either fields > that have "type_a", "type_b", "type_c", etc. (I have defaultOperator set to "AND") > > > Now, I need to use "fq" so I'm not sure how to build my search string to > get the same result!! > > > I have tried the following: > > > q=search string ...&fq=type:type_a&fq=type:type_b&fq=type:type_c&... > > > But this isn't the same because each additional "fq" is now being treated > as AND (keep in mind, I have defaultOperator set to "AND" and I cannot change that). > > > I have tried the following: > > > q=search string ...&fq=type:(type_a OR type_b OR type_c OR ...) > > > But the result I get back is not the same. If you are using the standard (lucene) query parser for your queries, then fq should behave exactly the same. If you are using a different query parser (edismax, for example) then fq may not behave the same, because it will use the lucene query parser. 
With the standard query parser, if your original query looks like the following: q=(query) AND (filter) The query below should produce exactly the same results -- although if you are using the default relevance sort, the *order* is likely to be different, because filter queries do not affect the document scores, but everything in the q parameter does. q=(query)&fq=(filter) Thanks, Shawn
Re: Using fq as OR
Interesting!! I did not know that using "fq" means the result will NOT be scored. When you say "add a boosting query using the bq parameter" can you give me an example? I read on "bq" but could not figure out how to convert: q=(searchstring ...) AND (type:type_a OR type:type_b OR type:type_c OR ...) to use "bq boosting". Maybe my question should be rephrase to this: to narrow down my search to within 1 or more fields, is the syntax that I'm currently using the optimal one or is there some Solr trick I should be using? My users are currently used to the score result that I give them with the syntax that I am currently using (that I showed above). I looking to see if there is some other way to get the same result faster. This is why I ended up looking into "fq" after reading about it Thanks to everyone for helping out with this topic. I am learning a lot -- MJ -Original Message- From: Jack Krupansky To: solr-user Sent: Wed, May 21, 2014 6:07 pm Subject: Re: Using fq as OR As I indicated in my original response, the fq query terms do not participate in any way in the scoring of documents - they merely filter (eliminate or keep) documents. If you actually do want the fq terms to participate in the scoring of documents, either keep them on the original q query, or add a boosting query using the bq parameter. The latter approach works for the dismax and edismax query parsers only. -- Jack Krupansky -Original Message- From: johnmu...@aol.com Sent: Wednesday, May 21, 2014 5:51 PM To: solr-user@lucene.apache.org Subject: Re: Using fq as OR Hi Jack, I'm going after speed per: https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter If using "fq" ranking will now be different, I need to understand why. Even more, I'm now wandering, which ranking is correct the one with "fq" or without ?!!! I'm now more puzzled about this than ever If the following two q=(searchstring ...) AND (type:type_a OR type:type_b OR type:type_c OR ...) q=search string...&fq=type:(type_a OR type_b OR type_c OR ...) will not give me the same ranking, than why? -- MJ -Original Message- From: Jack Krupansky To: solr-user Sent: Wed, May 21, 2014 5:06 pm Subject: Re: Using fq as OR The whole point of a filter query is to hide data but without impacting the scoring for the non-hidden data. A second goal is performance since the filter query can be cached. So, the immediate question for you is whether you really want a true filter query, or if you actually do what the filtering terms to participate in the document scoring. In other words, what exactly were you trying to achieve by using fq? -- Jack Krupansky -Original Message- From: johnmu...@aol.com Sent: Wednesday, May 21, 2014 12:19 PM To: solr-user@lucene.apache.org Subject: Re: Using fq as OR Answering Jack's question first: the result is different, by few counts, but I found my problem:I was using the wrong syntax in my code vs. what I posted here: I was using q=(search string ...) AND (type:type_a OR type_b OR type_c OR ...) (see how I left out "type:" from "type_b" and "type_c", etc.?! Shawn and all, now the hit count is the same but ranking is totally different, how come ?!!! I'm not using edismax, I'm using the default query parser, I'm also using the default sort. You said the "order" will likely be different, which it is, why? 
If I cannot explain it to my users, they will be confused because they can type in directly the search syntax (when "fq" is not used) and expect to see the same result for when I grammatically in my code apply "fq". Same data, but different path, giving me different rank result, is not good. -- MJ -Original Message- From: Shawn Heisey To: solr-user Sent: Wed, May 21, 2014 11:42 am Subject: Re: Using fq as OR On 5/21/2014 9:26 AM, johnmu...@aol.com wrote: > Currently, I'm building my search as follows: > > > q=(search string ...) AND (type:type_a OR type:type_b OR type:type_c > OR ...) > > > Which means anything I search for will be AND'ed to be in either fields > that have "type_a", "type_b", "type_c", etc. (I have defaultOperator set to "AND") > > > Now, I need to use "fq" so I'm not sure how to build my search string to > get the same result!! > > > I have tried the following: > > > q=search string ...&fq=type:type_a&fq=type:type_b&fq=type:type_c&... > > > But this isn't the same because each additional "fq" is now being treated > as AND (keep in mind, I have defaultOperator set to "AND" and I cannot change that). > > > I have tried the following: > > > q=search string ...&fq=type:(type_a OR type_b OR type_c OR ...) > > > But the result I get back is not the same. If you are using the standard (lucene) query parser for your queries, then fq should behave exactly the same. If you are using a different query parser (edismax, for e
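A rough sketch of the bq approach Jack mentions (the use of edismax and the exact clause are assumptions, not a recommendation from the thread): keep the type clause as a cached filter, and repeat it as a boost query so it contributes to scoring again, approximating the original single-q behavior:

    q=search string ...&defType=edismax&fq=type:(type_a OR type_b OR type_c)&bq=type:(type_a OR type_b OR type_c)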
How much free disk space will I need to optimize my index
Hi, I need to de-fragment my index. My question is, how much free disk space do I need before I can do so? My understanding is, I need free disk space equal to 1X the size of my current un-optimized index before I can optimize it. Is this true? That is, let's say my index is 20 GB (un-optimized); then I must have 20 GB of free disk space to make sure the optimization is successful. The reason for this is because during optimization the index is re-written (is this the case?) and, even if it is already optimized, the re-write will create a new 20 GB index before it deletes the old one (is this true?), which is why there must be at least 20 GB of free disk space. Can someone help me with this or point me to a wiki on this topic? Thanks!!! - MJ
Re: How much free disk space will I need to optimize my index
Thank you all for the reply and shedding more light on this topic. A follow up question: during optimization, If I run out of disk space, what happens other than the optimizer failing? Am I now left with even a larger index than I started with or am I back to the original none optimized index size?!!! -- MJ -Original Message- From: Walter Underwood To: solr-user Sent: Thu, Jun 26, 2014 10:50 am Subject: Re: How much free disk space will I need to optimize my index The 3x worst case is: 1. All documents are in one segment. 2. Without merging, all documents are deleted, then re-added and committed. 3. A merge is done. At the end of step 2, there are two equal-sized segments, 2X the space needed. During step 3, a third segment of that size is created. This can only happen if you disable merging. 2X is a conservative margin that should work fine for regular merges. Forced full merges ("optimize") can use more overhead because they move every document in the index. Yet another reason to avoid forced merges. wunder On Jun 26, 2014, at 12:50 AM, Thomas Egense wrote: > That is correct, but twice the disk space is theoretically not enough. > Worst case is actually three times the storage, I guess this worst case can > happen if you also submit new documents to the index while optimizing. > I have experienced 2.5 times the disk space during an optimize for a large > index, it was a 1TB index that temporarily used 2.5TB disc space during the > optimize (near the end of the optimization). > > From, > Thomas Egense > > > On Wed, Jun 25, 2014 at 8:21 PM, Markus Jelsma > wrote: > >> >> >> >> >> -Original message- >>> From:johnmu...@aol.com >>> Sent: Wednesday 25th June 2014 20:13 >>> To: solr-user@lucene.apache.org >>> Subject: How much free disk space will I need to optimize my index >>> >>> Hi, >>> >>> >>> I need to de-fragment my index. My question is, how much free disk >> space I need before I can do so? My understanding is, I need 1X free disk >> space of my current index un-optimized index size before I can optimize it. >> Is this true? >> >> Yes, 20 GB of FREE space to force merge an existing 20 GB index. >> >>> >>> >>> That is, let say my index is 20 GB (un-optimized) then I must have 20 GB >> of free disk space to make sure the optimization is successful. The reason >> for this is because during optimization the index is re-written (is this >> the case?) and if it is already optimized, the re-write will create a new >> 20 GB index before it deletes the old one (is this true?), thus why there >> must be at least 20 GB free disk space. >>> >>> >>> Can someone help me with this or point me to a wiki on this topic? >>> >>> >>> Thanks!!! >>> >>> >>> - MJ >>> >> -- Walter Underwood wun...@wunderwood.org
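For completeness, a forced merge is just an explicit optimize request to the update handler (URL and core name are illustrative), so the disk behaviour is easy to observe on a copy of the index before trying it in production:

    curl 'http://localhost:8983/solr/mycore/update?optimize=true'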
Searching on special characters
Hi, How should I set up Solr so I can search and get hits on special characters such as: + - && || ! ( ) { } [ ] ^ " ~ * ? : \ My need is, if a user has text like so: Doc-#1: "(Solr)" Doc-#2: "Solr" and they type "(solr)", I want a hit only in document #1, with the brackets matching. And if they type "solr", they should get a hit in Document #2 only. An additional nice-to-have is, if they type "solr", I want a hit in both documents #1 and #2. Here is what my current schema.xml looks like: Currently, special characters are being stripped. Any idea how I can configure Solr to do this? I'm using Solr 3.6. Thanks !! -MJ
Re: Searching on special characters
I'm not sure what you mean. Based on what you are saying, is there an example of how I can setup my schema.xml to get the result I need? Also, the way I execute a search is using http://localhost:8080/solr/select/?q= Does your solution require me to change this? If so, in what way? It would be great if all this is documented somewhere, so I won't have to bug you guys !!! --MJ -Original Message- From: Jack Krupansky To: solr-user Sent: Thu, Oct 24, 2013 9:39 am Subject: Re: Searching on special characters Have two or three copies of the text, one field could be raw string and boosted heavily for exact match, a second could be text using the keyword tokenizer but with lowercase filter also heavily boosted, and the third field general, tokenized text with a lower boost. You could also have a copy that uses the keyword tokenizer to maintain a single token but also applies a regex filter to strip special characters and applies a lower case filter and give that an intermediate boost. -- Jack Krupansky -Original Message- From: johnmu...@aol.com Sent: Thursday, October 24, 2013 9:20 AM To: solr-user@lucene.apache.org Subject: Searching on special characters Hi, How should I setup Solr so I can search and get hit on special characters such as: + - && || ! ( ) { } [ ] ^ " ~ * ? : \ My need is, if a user has text like so: Doc-#1: "(Solr)" Doc-#2: "Solr" And they type "(solr)" I want a hit on "(solr)" only in document #1, with the brackets matching. And if they type "solr", they will get a hit in Document #2 only. An additional nice-to-have is, if they type "solr", I want a hit in both document #1 and #2. Here is what my current schema.xml looks like: Currently, special characters are being stripped. Any idea how I can configure Solr to do this? I'm using Solr 3.6. Thanks !! -MJ
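A minimal sketch of the kind of setup Jack describes (every field name, type name, and boost below is made up; adjust to your own schema):

    <fieldType name="text_keyword_lc" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_keyword_stripped" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="[^\p{L}\p{N}]" replacement="" replace="all"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="body"          type="text"                  indexed="true" stored="true"/>
    <field name="body_keyword"  type="text_keyword_lc"       indexed="true" stored="false"/>
    <field name="body_stripped" type="text_keyword_stripped" indexed="true" stored="false"/>

    <copyField source="body" dest="body_keyword"/>
    <copyField source="body" dest="body_stripped"/>

A query that prefers the exact (brackets-included) match but still falls back to the general field might then look like:

    q=body_keyword:"(solr)"^10 OR body_stripped:"solr"^5 OR body:solr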
How to deal with underscore
Hi, In my schema.xml, I have the following settings: This does a great job for most of my text, but one thing it does that I don't like is that it won't replace underscores with spaces; it strips them. For example, if I have "Solr_Lucene" it becomes "solrlucene" (one word). What I want is two words: "solr lucene". Thanks -MJ
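One way to get that (a sketch; the original settings aren't shown above, so this assumes an otherwise ordinary text field) is to map underscores to spaces with a char filter before tokenizing, so "Solr_Lucene" is indexed as two words:

    <analyzer>
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="_" replacement=" "/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>

Alternatively, WordDelimiterFilterFactory with generateWordParts="1" splits Solr_Lucene into solr and lucene; it only collapses them into "solrlucene" when catenation is the only output kept.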
Will Solr work with a mapped drive?
Hi, I'm having this same problem as described here: http://stackoverflow.com/questions/17708163/absolute-paths-in-solr-xml-configuration-using-tomcat6-on-windows Any one knows if this is a limitation of Solr or not? I searched the web, nothing came up. Thanks!!! -- MJ
Unsubscribing from JIRA
Hi, Can someone show me how to unsubscribe from JIRA? Years ago, I subscribed to JIRA and since then I have been receiving emails from JIRA for all kinds of issues: when an issue is created, closed, or commented on. Yes, I looked around and could not figure out how to unsubscribe, but maybe I didn't look hard enough? Here is an example email subject line from JIRA: "[jira] [Commented] (LUCENE-3842) Analyzing Suggester" I have the same issue with "Jenkins" (an example: "[JENKINS] Lucene-Solr-Tests-4.x-Java6 - Build # 1537 - Still Failing"). Thanks in advance!!! -MJ
Re: Unsubscribing from JIRA
Are you saying that because I'm subscribed to dev (which I am), I'm getting JIRA mails too, and the only way I can stop the JIRA mails is to unsubscribe from dev? I don't think so. I'm subscribed to other projects, both dev and user, and yet I do not receive JIRA mails. --MJ -Original Message- From: Alan Woodward To: solr-user Sent: Wed, May 1, 2013 12:52 pm Subject: Re: Unsubscribing from JIRA Hi MJ, It looks like you're subscribed to the lucene dev list. Send an email to dev-unsubscr...@lucene.apache.org to get yourself taken off the list. Alan Woodward www.flax.co.uk On 1 May 2013, at 17:25, johnmu...@aol.com wrote: > Hi, > > > Can someone show me how to unsubscribe from JIRA? > > > Years ago, I subscribed to JIRA and since then I have been receiving emails from JIRA for all kind of issues: when an issue is created, closed or commented on. Yes, I looked around and could not figure out how to unsubscribe, but maybe I didn't look hard enough? > > > Here is an example email subject line header from JIRA: "[jira] [Commented] (LUCENE-3842) Analyzing Suggester" I have the same issue from "Jenkins" (and example: "[JENKINS] Lucene-Solr-Tests-4.x-Java6 - Build # 1537 - Still Failing"). > > > Thanks in advance!!! > > > -MJ
RE: Unsubscribing from JIRA
For someone like me, who wants to follow dev discussions but not JIRA, having a separate mailing list subscription for each would be ideal. The incoming mail traffic would be cut drastically (for me, I get far more non-relevant emails from JIRA vs. dev). -- MJ -Original Message- From: Raymond Wiker [mailto:rwi...@gmail.com] Sent: Wednesday, May 01, 2013 2:01 PM To: solr-user@lucene.apache.org Subject: Re: Unsubscribing from JIRA On May 1, 2013, at 19:07, johnmunir@aol.com wrote: > Are you saying because I'm subscribed to dev, which I'm, is why I'm getting > JIRA mails too, and the only way I can stop JIRA mails is to unsubscribe from > dev? I don't think so. I'm subscribed to other projects, both dev and user, > and yet I do not receive JIRA mails. > I'm pretty sure that's the case... I subscribed to dev, and got the JIRA mails. I unsubscribed from dev, and the JIRA mails stopped.
Phrase search
Hi All, I don't understand why I'm getting this behavior. I was under the impression that if I search for "Apple 2" (with quotes and a space before “2”) it will give me different results vs. if I search for "Apple2" (with quotes and no space before “2”), but I'm not! Why? Here is my fieldType setting from my schema.xml: What am I missing?!! What part of my solr.WordDelimiterFilterFactory needs to change (if that’s where the issue is)? I’m using Solr 1.2 Thanks in advance. -M
Re: Phrase search
Thanks for the quick response. Which part of my WordDelimiterFilterFactory is changing "Apple 2" to "Apple2"? How do I fix it? Also, I'm really confused about this. I was under the impression a phrase search is not impacted by the analyzer, no? -M -Original Message- From: Markus Jelsma To: solr-user@lucene.apache.org Sent: Mon, Aug 2, 2010 2:27 pm Subject: RE: Phrase search Well, the WordDelimiterFilterFactory in your query analyzer clearly makes "Apple 2" out of "Apple2", that's what it's for. If you're looking for an exact match, use a string field. Check the output with the debugQuery=true parameter. Cheers, -----Original message----- From: johnmu...@aol.com Sent: Mon 02-08-2010 20:18 To: solr-user@lucene.apache.org; Subject: Phrase search Hi All, I don't understand why I'm getting this behavior. I was under the impression that if I search for "Apple 2" (with quotes and a space before “2”) it will give me different results vs. if I search for "Apple2" (with quotes and no space before “2”), but I'm not! Why? Here is my fieldType setting from my schema.xml: What am I missing?!! What part of my solr.WordDelimiterFilterFactory needs to change (if that's where the issue is)? I'm using Solr 1.2 Thanks in advance. -M
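For what Markus suggests, the debug output shows how the phrase is analyzed before it is matched; something like (URL illustrative, the parsed query depends on your analyzer):

    http://localhost:8080/solr/select/?q=%22Apple+2%22&debugQuery=true

and the parsedquery entry in the response shows exactly which tokens the quoted phrase was turned into.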
Re: Phrase search
I'm using Solr 1.2, so I don't have splitOnNumerics. Reading that URL, is my use of catenateNumbers="1" causing this? Should I set it to "0" vs. "1" as I have it now? -M -Original Message- From: Markus Jelsma To: solr-user@lucene.apache.org Sent: Mon, Aug 2, 2010 3:54 pm Subject: RE: Re: Phrase search Hi, Queries on an analyzed field will need to be analyzed as well or it might not match. You can configure the WordDelimiterFilterFactory so it will not split into multiple tokens because of numerics, see the splitOnNumerics parameter [1]. [1]: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory Cheers, -----Original message----- From: johnmu...@aol.com Sent: Mon 02-08-2010 21:29 To: solr-user@lucene.apache.org; Subject: Re: Phrase search Thanks for the quick response. Which part of my WordDelimiterFilterFactory is changing "Apple 2" to "Apple2"? How do I fix it? Also, I'm really confused about this. I was under the impression a phrase search is not impacted by the analyzer, no? -M -Original Message- From: Markus Jelsma To: solr-user@lucene.apache.org Sent: Mon, Aug 2, 2010 2:27 pm Subject: RE: Phrase search Well, the WordDelimiterFilterFactory in your query analyzer clearly makes "Apple 2" out of "Apple2", that's what it's for. If you're looking for an exact match, use a string field. Check the output with the debugQuery=true parameter. Cheers, -----Original message----- From: johnmu...@aol.com Sent: Mon 02-08-2010 20:18 To: solr-user@lucene.apache.org; Subject: Phrase search Hi All, I don't understand why I'm getting this behavior. I was under the impression that if I search for "Apple 2" (with quotes and a space before "2") it will give me different results vs. if I search for "Apple2" (with quotes and no space before "2"), but I'm not! Why? Here is my fieldType setting from my schema.xml: What am I missing?!! What part of my solr.WordDelimiterFilterFactory needs to change (if that's where the issue is)? I'm using Solr 1.2 Thanks in advance. -M
Re: Phrase search
I'm trying to match "Apple 2" but not "Apple2" using phrase search, which is why I have it quoted. I was under the impression that --when I use phrase search-- all the analyzer magic would not apply, but it does!!! Otherwise, how would I search for a phrase?! Using Google, when I search for "Windows 7" (with quotes), unlike Solr, I don't get hits on "Windows7". I want to use catenateNumbers="1", which I want to take effect on other searches but not on phrase searches. Is this possible? Yes, we are in the process of planning to upgrade to Solr 1.4.1 -- it takes time and a lot of effort to do such an upgrade where I work. Thank you for your help and understanding. -M -Original Message- From: Chris Hostetter To: solr-user@lucene.apache.org Sent: Mon, Aug 2, 2010 5:41 pm Subject: Re: Phrase search I don't understand why i'm getting this behavior. I was under the impression if I search for "Apple 2" (with quotes and space before “2”) it will give me different results vs. if I search for "Apple2" (with quotes and no space before “2”), but I'm not! Why? if you search "Apple 2" in quotes, then the analyzer for your field gets the full string (with the space) and whatever it does with it and whatever Terms it produces determines what Query gets executed. If you search "Apple2" (w/ or w/o quotes) then the analyzer for your field gets the full string and whatever it does with it and whatever Terms it produces determines what Query gets executed. None of that changes based on the analyzer you use. With that in mind: I really don't understand your question. Let's step back and instead of trying to explain *why* you are getting the results you are getting (short answer: because that's how your analyzer works) let's ask the question: what do you *want* to do? What do you *want* to see happen when you enter various query strings? http://people.apache.org/~hossman/#xyproblem XY Problem Your question appears to be an "XY Problem" ... that is: you are dealing with "X", you are assuming "Y" will help you, and you are asking about "Y" without giving more details about the "X" so that we can understand the full issue. Perhaps the best solution doesn't involve "Y" at all? See Also: http://www.perlmonks.org/index.pl?node_id=542341 : I’m using Solr 1.2 PS: Solr 1.2 had numerous bugs which were really really bad and which were fixed in Solr 1.3. Solr 1.3 had numerous bugs which were really really bad and were fixed in Solr 1.4. Solr 1.4 had a couple of bugs which were really really bad and which were fixed in Solr 1.4.1 ... so even if you don't want any of the new features, you should *REALLY* consider upgrading. -Hoss
Upgrading from Solr 1.2 to 1.4.1
I'm using Solr 1.2. If I upgrade to 1.4.1, must I re-index because of LUCENE-1142? If so, how will this affect me if I don’t re-index (I'm using EnglishPorterFilterFactory)? What about when I’m using non-English stemmers from Snowball? Besides the brief note "IMPORTANT UPGRADE NOTE" about this in CHANGES.txt, where can I read more about this? I looked in JIRA, LUCENE-1142, there isn't much. -M
XML 1.1 and Solr 3.6.1
Can someone tell me if Solr 3.6.1 supports XML 1.1 or must I stick with XML 1.0? Thanks! -MJ
Please ignore, testing my email
Hi, Please ignore, I'm testing my email (I have not received any email from Solr mailing list for over 12 hours now). -- MJ
Questions about schema.xml
Hi, Can someone help me understand the meaning of <analyzer type="index"> and <analyzer type="query"> in schema.xml, how they are used, and what I get back when the two are not the same? For example, given: If I make the entire content of "index" the same as "query" (or the other way around), how will that impact my search? And why would I want to not make those two blocks the same? Thanks!!! -MJ
Re: Questions about schema.xml
Thanks Prithu. But why would I use different settings for the index and query? I would think that if the setting is not the same for both, then search results for end users would be confusing, no? To illustrate my point (this may be drastic): if I don't use "solr.LowerCaseFilterFactory" in one case, then many searches (mixed-case, for example) won't give me any hits. A more realistic example is, if I don't match the rules for "solr.WordDelimiterFilterFactory", again, I could miss hits. If my understanding is correct, and there is value in using different rules for "query" and "index", I'd like to see a concrete example, a use-case I can apply. -- MJ -Original Message- From: Prithu Banerjee To: solr-user Sent: Thu, Nov 8, 2012 12:34 am Subject: Re: Questions about schema.xml Those two values are used to specify the analyzer type you want. That can be of two kinds, one for the indexer- the analyzer you specify analyzes the input documents accordingly to build the index. The other one is for query, it analyzes your query. Typically the specified analyzer for index and query are same so that you can search over exactly the token you created while indexing. But you are free to provide any customized analyzer according to your need. -- best regards, Prithu On Thu, Nov 8, 2012 at 8:43 AM, wrote: > > HI, > > > Can someone help me understand the meaning of <analyzer type="index"> and > <analyzer type="query"> in schema.xml, how they are used and what do I get > back when the values are not the same? > > > For example, given: > > > autoGeneratePhraseQueries="true"> > > >words="stopwords.txt" enablePositionIncrements="true" /> >generateWordParts="1" generateNumberParts="1" catenateWords="1" > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/> > >protected="protwords.txt"/> > > > > >ignoreCase="true" expand="true"/> >words="stopwords.txt" enablePositionIncrements="true" /> >generateWordParts="1" generateNumberParts="1" catenateWords="0" > catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/> > >protected="protwords.txt"/> > > > > > > If I make the entire content of "index" the same as "query" (or the other > way around) how will that impact my search? And why would I want to not > make those two blocks the same? > > > Thanks!!! > > > -MJ >
Re: Questions about schema.xml
Thank you everyone for your explanation. So for WordDelimiterFilter, let me see if I got it right. Given that the out-of-the-box setting for catenateWords is "0" for query but "1" for index, I don't see how this will give me any hits. That is, if my document has "wi-fi", at index time it will be stored as "wifi". Well, then at query time if I type "wi-fi" (without quotes) I will be searching for "wi fi" and thus won't get a hit, no? What about when I *do* quote my search, i.e. I search for "wi-fi" with quotes, now what am I sending to the searcher, "wi-fi", "wi fi" or "wifi"? Again, this is using the default out-of-the-box settings per the above. The same applies to catenateNumbers. Btw, I'm looking at this link for the above values: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters --MJ -Original Message- From: Erick Erickson To: solr-user Sent: Thu, Nov 8, 2012 6:57 pm Subject: Re: Questions about schema.xml And, in fact, you do NOT need to have two. If they are both identical, just specify one analysis chain with no qualifier, i.e. On Thu, Nov 8, 2012 at 9:44 AM, Jack Krupansky wrote: > Many token filters will be used 100% identically for both "index" and > "query" analysis, but WordDelimiterFilter is a rare exception. The issue is > that at index time it has the ability to generate multiple tokens at the > same position (the "catenate" options), any of which can be queried, but at > query time it can be problematic to have these "extra" terms (except in > some conditions), so the WDF settings suppress generation of the extra > terms. > > Another example is synonyms - generate extra terms at index time for > greater precision of searches, but limit the query terms to exclude the > "extra" terms. > > That's the reason for the occasional asymmetry between index-time and > query-time analyzers. > > -- Jack Krupansky > > -Original Message- From: johnmu...@aol.com > Sent: Wednesday, November 07, 2012 7:13 PM > To: solr-user@lucene.apache.org > Subject: Questions about schema.xml > > > > HI, > > > Can someone help me understand the meaning of <analyzer type="index"> and > <analyzer type="query"> in schema.xml, how they are used and what do I get > back when the values are not the same? > > > For example, given: > > > autoGeneratePhraseQueries="true"> > > > words="stopwords.txt" enablePositionIncrements="true" /> > generateWordParts="1" generateNumberParts="1" catenateWords="1" > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/> > > protected="protwords.txt"/> > > > > > ignoreCase="true" expand="true"/> > words="stopwords.txt" enablePositionIncrements="true" /> > generateWordParts="1" generateNumberParts="1" catenateWords="0" > catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/> > > protected="protwords.txt"/> > > > > > > If I make the entire content of "index" the same as "query" (or the other > way around) how will that impact my search? And why would I want to not > make those two blocks the same? > > > Thanks!!! > > > -MJ >
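A contrived sketch of the two forms being discussed (type names and filter choices are purely illustrative):

    <!-- different chains for index and query -->
    <fieldType name="text_example" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- identical chains: one analyzer element with no type attribute, as Erick says -->
    <fieldType name="text_simple" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>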
Is leading wildcard search turned on by default in Solr 3.6.1?
Hi, I'm migrating from Solr 1.2 to 3.6.1. I used the same analyzer as before and re-indexed my data. I did not add solr.ReversedWildcardFilterFactory to my index analyzer, and yet leading wildcards are working!! Does this mean they're turned on by default? If so, how do I turn them off, and what are the implications of leaving them ON? Won't my searches be slower and consume more memory? Thanks, --MJ
Re: Is leading wildcard search turned on by default in Solr 3.6.1?
Thanks for the quick response. So, I do not want to use ReversedWildcardFilterFactory, but leading wildcards are working and thus are ON by default. How do I disable them, to prevent their use and the issues that come with them? -- MJ -Original Message- From: François Schiettecatte To: solr-user Sent: Mon, Nov 12, 2012 5:39 pm Subject: Re: Is leading wildcard search turned on by default in Solr 3.6.1? John You can still use leading wildcards even if you don't have the ReversedWildcardFilterFactory in your analysis, but it means you will be scanning the entire dictionary when the search is run, which can be a performance issue. If you do use ReversedWildcardFilterFactory you won't have that performance issue, but you will increase the overall size of your index. It's a tradeoff. When I looked into it for a site I built, I decided that the tradeoff was not worth it (after benchmarking) given how few leading wildcard searches it was getting. Best regards François On Nov 12, 2012, at 5:33 PM, johnmu...@aol.com wrote: > Hi, > I'm migrating from Solr 1.2 to 3.6.1. I used the same analyzer as before and re-indexed my data. I did not add solr.ReversedWildcardFilterFactory to my index analyzer, and yet leading wildcards are working!! Does this mean they're turned on by default? If so, how do I turn them off, and what are the implications of leaving them ON? Won't my searches be slower and consume more memory? > Thanks, > --MJ
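For anyone weighing the trade-off François describes, opting in looks roughly like the following -- a hypothetical field type (the name and attribute values here are illustrative, borrowed from the example schema's defaults) with ReversedWildcardFilterFactory on the index side only. The query parser detects the filter and rewrites a leading-wildcard query against the reversed terms instead of scanning the whole term dictionary:

<fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- indexes a reversed copy of each token alongside the original, so *phone can run as a cheap prefix query on the reversed form -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Without the filter, the leading-wildcard query still runs in 3.6.1 -- that is the behavior this thread is about -- it just has to walk every term in the field, which is where the slowdown comes from.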
RE: Is leading wildcard search turned on by default in Solr 3.6.1?
At one point, in some version of Solr, it was OFF by default, and you had to enable it via a setting (either in solrconfig.xml or schema.xml, I don't remember which). It looks like this is no longer the case. Even worse, if this is true, it no longer seems possible to disable it via a Solr setting!! -- MJ -Original Message- From: François Schiettecatte [mailto:fschietteca...@gmail.com] Sent: Monday, November 12, 2012 7:48 PM To: solr-user@lucene.apache.org Subject: Re: Is leading wildcard search turned on by default in Solr 3.6.1? I suspect it is just part of the wildcard handling, maybe someone can chime in here; you may need to catch this before it gets to SOLR. François On Nov 12, 2012, at 5:44 PM, johnmu...@aol.com wrote: > Thanks for the quick response. > So, I do not want to use ReversedWildcardFilterFactory, but leading wildcards are working and thus are ON by default. How do I disable them, to prevent their use and the issues that come with them? > -- MJ > -Original Message- > From: François Schiettecatte > To: solr-user > Sent: Mon, Nov 12, 2012 5:39 pm > Subject: Re: Is leading wildcard search turned on by default in Solr 3.6.1? > John > You can still use leading wildcards even if you don't have the ReversedWildcardFilterFactory in your analysis, but it means you will be scanning the entire dictionary when the search is run, which can be a performance issue. > If you do use ReversedWildcardFilterFactory you won't have that performance issue, but you will increase the overall size of your index. It's a tradeoff. > When I looked into it for a site I built, I decided that the tradeoff was not worth it (after benchmarking) given how few leading wildcard searches it was getting. > Best regards > François > On Nov 12, 2012, at 5:33 PM, johnmu...@aol.com wrote: >> Hi, >> I'm migrating from Solr 1.2 to 3.6.1. I used the same analyzer as before and re-indexed my data. I did not add solr.ReversedWildcardFilterFactory to my index analyzer, and yet leading wildcards are working!! Does this mean they're turned on by default? If so, how do I turn them off, and what are the implications of leaving them ON? Won't my searches be slower and consume more memory? >> Thanks, >> --MJ
RE: Is leading wildcard search turned on by default in Solr 3.6.1?
I'm surprised that this has not been logged as a defect. The fact that this is ON by default means someone can bring down a server; that is bad enough to categorize this as a security issue. --MJ -Original Message- From: Michael Ryan [mailto:mr...@moreover.com] Sent: Monday, November 12, 2012 8:10 PM To: solr-user@lucene.apache.org Subject: RE: Is leading wildcard search turned on by default in Solr 3.6.1? Yeah, the situation is kind of a pain right now. In https://issues.apache.org/jira/browse/SOLR-2438, it was enabled by default and there is no way to disable it without patching SolrQueryParser. There's also the edismax parser, which doesn't have a setting for this either; I've made a jira for that at https://issues.apache.org/jira/browse/SOLR-3031. I'm surprised other people haven't requested this, as any instance of serious size can be brought to its knees by a wildcard query. -Michael -Original Message- From: johnmu...@aol.com [mailto:johnmu...@aol.com] Sent: Monday, November 12, 2012 7:58 PM To: solr-user@lucene.apache.org Subject: RE: Is leading wildcard search turned on by default in Solr 3.6.1? At one point, in some version of Solr, it was OFF by default, and you had to enable it via a setting (either in solrconfig.xml or schema.xml, I don't remember which). It looks like this is no longer the case. Even worse, if this is true, it no longer seems possible to disable it via a Solr setting!! -- MJ
Using CJK analyzer
Hi, Using Solr 1.2.0, the following works (and I get hits searching on Chinese text): [first field type definition stripped by the archive] and with [second field type definition, also stripped] it won't work. I ran it through the analyzer and I see this (I hope the table shows up fine on the mailing list):

Index Analyzer: org.apache.lucene.analysis.cn.ChineseAnalyzer {}
  position:    1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
  term text:   去 除 商 品 操 作 在 订 购 单 中 留 下 空 白 行
  startOffset: 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
  endOffset:   1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16

Query Analyzer: org.apache.lucene.analysis.cn.ChineseAnalyzer {}
  (identical output: the same sixteen single-character tokens, positions 1-16, offsets 0-16)

--MJ
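Since the field definitions above were stripped, here is the shape a Solr 1.2-era Chinese/CJK field typically took; a sketch only, with made-up type and field names, not the poster's actual config:

<!-- one-token-per-character analysis, matching the output shown above -->
<fieldtype name="text_zh" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cn.ChineseAnalyzer"/>
</fieldtype>

<!-- the bigram-based alternative the subject line alludes to -->
<fieldtype name="text_cjk" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
</fieldtype>

<field name="content_zh" type="text_zh" indexed="true" stored="true"/>

Note that in the analysis output above the index-time and query-time token streams are identical, so whatever is causing the second configuration to return no hits, it does not appear to be a mismatch between index and query analysis.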
Tokenization and wild card search
Hi, I have an issue and I'm not sure how to address it, so I hope someone can help me. I have the following text in one of my fields: "ABC_Expedition_ERROR". When I search on it like: "MyField:SDD_Expedition_PCB" (without quotes) it fails to find just this one word "ABC_Expedition_ERROR", which I think is due to tokenization on the underscore. My solution is: "MyField:"SDD_Expedition_PCB"" (without the outer quotes, but with quotes around the word "ABC_Expedition_ERROR"). This works fine. But then, how do I search on "SDD_Expedition_PCB" with a wildcard? For example: "MyField:SDD_Expedition*" will not work. Any help is greatly appreciated. Thanks. -- JM
RE: Tokenization and wild card search
I want the following searches to work: MyField:SDD_Expedition_PCB This should match the word "SDD_Expedition_PCB" only, and not match individual words such as "SDD", "Expedition", or "PCB". And the following search: MyField:SDD_Expedition* should match any word starting with "SDD_Expedition" and ending with anything else, such as "SDD_Expedition_PBC", "SDD_Expedition_One", "SDD_Expedition_Two", "SDD_ExpeditionSolr", "SDD_ExpeditionSolr1.4", etc., but again not match individual words such as "SDD" or "Expedition". The field type for "MyField" (the actual field name is keywords) is: [field type definition stripped by the archive] And here is the analyzer I'm using: [analyzer definition stripped by the archive] Any help on how I can achieve the above is greatly appreciated. Btw, if at all possible, I would like to achieve this without having to change how I'm indexing / tokenizing the data -- I'm looking for search syntax to make this work. -- JM -Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com] Sent: Tuesday, January 19, 2010 7:57 AM To: solr-user@lucene.apache.org Subject: Re: Tokenization and wild card search > I have an issue and I'm not sure how to address it, so I hope someone can help me. > I have the following text in one of my fields: "ABC_Expedition_ERROR". When I search on it like: "MyField:SDD_Expedition_PCB" (without quotes) it fails to find just this one word "ABC_Expedition_ERROR", which I think is due to tokenization on the underscore. Do you want, or do you not want, your query MyField:SDD_Expedition_PCB to return documents containing ABC_Expedition_ERROR? > My solution is: "MyField:"SDD_Expedition_PCB"" (without the outer quotes, but with quotes around the word "ABC_Expedition_ERROR"). This works fine. But then, how do I search on "SDD_Expedition_PCB" with a wildcard? For example: "MyField:SDD_Expedition*" will not work. Can you paste your field type for MyField? And give some examples of what queries should return what documents.
Re: Tokenization and wild card search
You are correct, the way I'm using tokenization is my issue. It's too late to re-index now, which is why I'm looking for a search syntax that will make the search work. I have tried various search syntaxes with no luck. Is there no search syntax to make this work without re-indexing?! -- JM -Original Message- From: Erick Erickson To: solr-user@lucene.apache.org Sent: Tue, Jan 19, 2010 10:30 am Subject: Re: Tokenization and wild card search I'm pretty sure you're going to be disappointed about the re-indexing part. I'm pretty sure that WordDelimiterFilterFactory is tokenizing your input in ways you don't expect, making your use-case hard to accomplish. It's basically splitting your input on all non-alpha characters, so you're indexing the individual pieces; see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory I'd *strongly* suggest you examine the results of your indexing in order to understand what's possible. Get a copy of Luke and examine your index, or use the Solr admin Analysis page... I suspect what you're really looking for is WhitespaceAnalyzer or Keyword. On Tue, Jan 19, 2010 at 9:50 AM, wrote: > I want the following searches to work: MyField:SDD_Expedition_PCB This should match the word "SDD_Expedition_PCB" only, and not match individual words such as "SDD", "Expedition", or "PCB". And the following search: MyField:SDD_Expedition* should match any word starting with "SDD_Expedition" and ending with anything else, such as "SDD_Expedition_PBC", "SDD_Expedition_One", "SDD_Expedition_Two", "SDD_ExpeditionSolr", "SDD_ExpeditionSolr1.4", etc., but again not match individual words such as "SDD" or "Expedition". The field type for "MyField" (the actual field name is keywords) is: [field type definition stripped by the archive] And here is the analyzer I'm using: [analyzer definition stripped by the archive] Any help on how I can achieve the above is greatly appreciated. Btw, if at all possible, I would like to achieve this without having to change how I'm indexing / tokenizing the data -- I'm looking for search syntax to make this work. -- JM -Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com] Sent: Tuesday, January 19, 2010 7:57 AM To: solr-user@lucene.apache.org Subject: Re: Tokenization and wild card search > I have an issue and I'm not sure how to address it, so I hope someone can help me. > I have the following text in one of my fields: "ABC_Expedition_ERROR". When I search on it like: "MyField:SDD_Expedition_PCB" (without quotes) it fails to find just this one word "ABC_Expedition_ERROR", which I think is due to tokenization on the underscore. Do you want, or do you not want, your query MyField:SDD_Expedition_PCB to return documents containing ABC_Expedition_ERROR? > My solution is: "MyField:"SDD_Expedition_PCB"" (without the outer quotes, but with quotes around the word "ABC_Expedition_ERROR"). This works fine. But then, how do I search on "SDD_Expedition_PCB" with a wildcard? For example: "MyField:SDD_Expedition*" will not work. Can you paste your field type for MyField? And give some examples of what queries should return what documents.
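Sketching Erick's suggestion for completeness: a whitespace-tokenized (or keyword-tokenized) copy of the field keeps "SDD_Expedition_PCB" as a single term, which makes both the exact search and the SDD_Expedition* prefix search behave as asked. It does require adding a field and re-indexing -- the step this thread was hoping to avoid -- and the type and field names below are made up for illustration:

<fieldType name="text_ws_lc" class="solr.TextField">
  <analyzer>
    <!-- splits only on whitespace, so underscores stay inside the token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="keywords_exact" type="text_ws_lc" indexed="true" stored="false"/>
<copyField source="keywords" dest="keywords_exact"/>

<!-- keywords_exact:sdd_expedition_pcb   matches only the whole token
     keywords_exact:sdd_expedition*      works as a prefix query
     (wildcard terms are not analyzed in this era of Solr, so lowercase them in the query yourself) -->

If the field only ever holds one value per document, solr.KeywordTokenizerFactory (or a plain string field) is the stricter alternative.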
what's up with: java -Ddata=args -jar post.jar "<optimize/>"
Hi, I'm a new Solr user. I figured my way around Solr just fine (I think) ... I can index and search etc., and so far I have indexed over 300k documents. What I can't figure out is the following. I'm using: java -Ddata=args -jar post.jar "<optimize/>" to post an optimize command. What I'm finding is that I have to do it twice in order for the files to be "optimized" ... i.e. the first post takes 3-4 minutes but leaves the file count as-is at 44, while the second post takes 2-3 seconds but shrinks the file count from 44 to 8. So my question is the following: is this the expected behavior, or am I doing something wrong? Do I need two optimize posts to really optimize my index?! Thanks in advance -JM
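For anyone landing on this thread: the argument being posted is simply Solr's XML update message, so the command is equivalent to sending the snippet below to the core's /update handler. One common explanation for the "second pass" effect is that the pre-optimize segment files are still referenced by the old searcher (and on Windows cannot be deleted while open), so they only disappear on the next commit or optimize -- the second post is mostly cleanup, not a second merge. Treat that as a likely cause rather than a definitive diagnosis.

<!-- sent by: java -Ddata=args -jar post.jar "<optimize/>" -->
<optimize/>

<!-- a plain commit afterwards is often enough to let the old, now-unreferenced segment files be removed -->
<commit/>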