Access permission

2015-03-03 Thread johnmunir

Hi,


I'm indexing data off a DB.  The data is secured with access permissions.  That 
is, record-A can be seen by users-x, record-B can be seen by users-y, and 
record-C can be seen by users x and y.  Moreover, the group access 
permissions can change over time.


The question I have is this: how do I handle this in Solr?  Is there anything I 
can do at index and/or search time?  What's the best practice for handling 
access permissions in search?


Thanks!


- MJ
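[One common approach, sketched here as an assumption rather than as an answer from 
the list: index a multiValued "acl" field on each record holding the group tokens 
allowed to see it, and append a filter query built from the current user's groups 
at search time.  The field and group names below are hypothetical.

  Indexed record (Solr XML update format):
    <add>
      <doc>
        <field name="id">record-C</field>
        <field name="acl">group-x</field>
        <field name="acl">group-y</field>
      </doc>
    </add>

  Search for a user who belongs only to group-x:
    q=...&fq=acl:(group-x)

  Search for a user in both groups:
    q=...&fq=acl:(group-x OR group-y)

When group permissions change, the affected records have to be re-indexed (or the 
ACL lookup kept outside Solr and translated into an fq at query time).]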



Cores and ranking (search quality)

2015-03-05 Thread johnmunir
Hi,

I have data that I will index and search on.  This data is well defined, such 
that I can index it into a single core or into multiple cores like so: core_1:Jan2015, 
core_2:Feb2015, core_3:Mar2015, etc.

My question is this: if I put my data in multiple cores and use distributed 
search, will the ranking be different than if I had all my data in a single core?  If 
yes, how will it be different?  Also, will facet and more-like-this quality / 
results be the same?

Also, reading the distributed search wiki 
(http://wiki.apache.org/solr/DistributedSearch) it looks like Solr does the 
search and result merging (all I have to do is issue a search), is this correct?

Thanks!

- MJ


RE: Cores and ranking (search quality)

2015-03-06 Thread johnmunir
Help me understand this better (regarding ranking).

If I have two docs that are 100% identical with the exception of the uid (which is 
stored but not indexed), then in a single-core setup a search for "xyz" ends up 
ranking those two docs as #1 and #2.  When I switch over to a two-core 
setup, doc-A goes to core-A (which has 10 records) and doc-B goes to core-B 
(which has 100,000 records).

Now, are you saying that in the two-core setup, if I search on "xyz" (just like in the 
single-core setup), this time I will not see doc-A and doc-B as #1 and #2 in the ranking?  
That is, are you saying doc-A may now be somewhere at the top / bottom, far away 
from doc-B?  If so, which will be #1: the doc off core-A (which has 10 records) 
or doc-B off core-B (which has 100,000 records)?

If I got all this right, are you saying SOLR-1632 will fix this issue such that 
the end result will now be as if I had 1 core?

- MJ


-Original Message-
From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] 
Sent: Thursday, March 5, 2015 9:06 AM
To: solr-user@lucene.apache.org
Subject: Re: Cores and ranking (search quality)

On Thu, 2015-03-05 at 14:34 +0100, johnmu...@aol.com wrote:
> My question is this: if I put my data in multiple cores and use 
> distributed search will the ranking be different if I had all my data 
> in a single core?

Yes, it will be different. The practical impact depends on how homogeneous your 
data are across the shards and how large your shards are. If you have small and 
dissimilar shards, your ranking will suffer a lot.

Work is being done to remedy this:
https://issues.apache.org/jira/browse/SOLR-1632

> Also, will facet and more-like-this quality / result be the same?

It is not formally guaranteed, but for most practical purposes, faceting on 
multi-shards will give you the same results as single-shards.

I don't know about more-like-this. My guess is that it will be affected in the 
same way that standard searches are.

> Also, reading the distributed search wiki
> (http://wiki.apache.org/solr/DistributedSearch) it looks like Solr 
> does the search and result merging (all I have to do is issue a 
> search), is this correct?

Yes. From a user-perspective, searches are no different.
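[For reference, with plain multi-core (non-SolrCloud) setups the merge is triggered 
by listing the shards in the request, per the wiki page above; the host and core 
names here are only placeholders:

  http://host1:8983/solr/core_1/select?q=xyz
      &shards=host1:8983/solr/core_1,host2:8983/solr/core_2,host3:8983/solr/core_3

Solr queries every listed core, merges the results, and returns a single ranked 
list.]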

- Toke Eskildsen, State and University Library, Denmark



Re: Cores and ranking (search quality)

2015-03-09 Thread johnmunir
(reposting this to see if anyone can help)


Help me understand this better (regarding ranking).

If I have two docs that are 100% identical with the exception of the uid (which is 
stored but not indexed), then in a single-core setup a search for "xyz" ends up 
ranking those two docs as #1 and #2.  When I switch over to a two-core 
setup, doc-A goes to core-A (which has 10 records) and doc-B goes to core-B 
(which has 100,000 records).

Now, are you saying that in the two-core setup, if I search on "xyz" (just like in the 
single-core setup), this time I will not see doc-A and doc-B as #1 and #2 in the ranking?  
That is, are you saying doc-A may now be somewhere at the top / bottom, far away 
from doc-B?  If so, which will be #1: the doc off core-A (which has 10 records) 
or doc-B off core-B (which has 100,000 records)?

If I got all this right, are you saying SOLR-1632 will fix this issue such that 
the end result will now be as if I had 1 core?

- MJ


-Original Message-
From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
Sent: Thursday, March 5, 2015 9:06 AM
To: solr-user@lucene.apache.org
Subject: Re: Cores and ranking (search quality)

On Thu, 2015-03-05 at 14:34 +0100, johnmu...@aol.com wrote:
> My question is this: if I put my data in multiple cores and use 
> distributed search will the ranking be different if I had all my data 
> in a single core?

Yes, it will be different. The practical impact depends on how homogeneous your 
data are across the shards and how large your shards are. If you have small and 
dissimilar shards, your ranking will suffer a lot.

Work is being done to remedy this:
https://issues.apache.org/jira/browse/SOLR-1632

> Also, will facet and more-like-this quality / result be the same?

It is not formally guaranteed, but for most practical purposes, faceting on 
multi-shards will give you the same results as single-shards.

I don't know about more-like-this. My guess is that it will be affected in the 
same way that standard searches are.

> Also, reading the distributed search wiki
> (http://wiki.apache.org/solr/DistributedSearch) it looks like Solr 
> does the search and result merging (all I have to do is issue a 
> search), is this correct?

Yes. From a user-perspective, searches are no different.

- Toke Eskildsen, State and University Library, Denmark



Re: Cores and ranking (search quality)

2015-03-10 Thread johnmunir
Thanks Erick for trying to help, I really appreciate it.  Unfortunately, I'm 
still stuck.

There are times one must know the inner workings and behavior of the software to 
make a design decision, and this is one of them.  If I knew the inner workings 
of Solr, I would not be asking.  In addition, I'm in the design process, so I'm 
not able to fully test.  Besides, my test could be invalid because I may not set 
it up right, due to my lack of understanding of Solr's inner workings.

Given this, I hope you don't mind me asking again.

If I have two cores, one with 10 docs and another with 100,000 docs, and I then 
submit two docs that are 100% identical (with the exception of the unique-ID 
field, which is stored but not indexed), one to each core, the question is: 
during search, will both of those docs rank near each other or not?  If so, 
this is great, because it will behave the same as if I had one core and indexed 
both docs into that single core.  If not, which core's doc will rank higher, and 
how far apart will the two docs be from each other in the ranking?

Put another way: will docs from the smaller core (the one with only 10 docs) rank 
higher or lower compared to docs from the larger core (the one with 100,000 
docs)?

Thanks!

-- MJ

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, March 10, 2015 11:47 AM
To: solr-user@lucene.apache.org
Subject: Re: Cores and ranking (search quality)

SOLR-1632 will certainly help. But trying to predict whether your core A or 
core B will appear first doesn't really seem like a good use of time. If you 
actually have a setup like you describe, add &debug=all to your query on both 
cores and you'll see all the gory detail of how the scores are calculated, 
providing a definitive answer in _your_ situation.
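[A sketch of what that looks like; the host, port and core name are placeholders, 
and on older Solr releases the equivalent parameter is debugQuery=true:

  http://localhost:8983/solr/core-A/select?q=xyz&fl=id,score&debug=all

The "explain" section of the debug output breaks each document's score into its 
tf, idf and norm components, so the two cores can be compared directly.]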

Best,
Erick

On Mon, Mar 9, 2015 at 5:44 AM,   wrote:
> (reposing this to see if anyone can help)
>
>
> Help me understand this better (regarding ranking).
>
> If I have two docs that are 100% identical with the exception of uid (which 
> is stored but not indexed).  In a single core setup, if I search "xyz" such 
> that those 2 docs end up ranking as #1 and #2.  When I switch over to two 
> core setup, doc-A goes to core-A (which has 10 records) and doc-B goes to 
> core-B (which has 100,000 records).
>
> Now, are you saying in 2 core setup if I search on "xyz" (just like in singe 
> core setup) this time I will not see doc-A and doc-B as #1 and #2 in ranking? 
>  That is, are you saying doc-A may now be somewhere at the top / bottom far 
> away from doc-B?  If so, which will be #1: the doc off core-A (that has 10 
> records) or doc-B off core-B (that has 100,000 records)?
>
> If I got all this right, are you saying SOLR-1632 will fix this issue such 
> that the end result will now be as if I had 1 core?
>
> - MJ
>
>
> -Original Message-
> From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
> Sent: Thursday, March 5, 2015 9:06 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Cores and ranking (search quality)
>
> On Thu, 2015-03-05 at 14:34 +0100, johnmu...@aol.com wrote:
>> My question is this: if I put my data in multiple cores and use 
>> distributed search will the ranking be different if I had all my data 
>> in a single core?
>
> Yes, it will be different. The practical impact depends on how homogeneous 
> your data are across the shards and how large your shards are. If you have 
> small and dissimilar shards, your ranking will suffer a lot.
>
> Work is being done to remedy this:
> https://issues.apache.org/jira/browse/SOLR-1632
>
>> Also, will facet and more-like-this quality / result be the same?
>
> It is not formally guaranteed, but for most practical purposes, faceting on 
> multi-shards will give you the same results as single-shards.
>
> I don't know about more-like-this. My guess is that it will be affected in 
> the same way that standard searches are.
>
>> Also, reading the distributed search wiki
>> (http://wiki.apache.org/solr/DistributedSearch) it looks like Solr 
>> does the search and result merging (all I have to do is issue a 
>> search), is this correct?
>
> Yes. From a user-perspective, searches are no different.
>
> - Toke Eskildsen, State and University Library, Denmark
>



Re: Cores and ranking (search quality)

2015-03-10 Thread johnmunir
Thanks Walter.

The design decision I'm trying to make is this: using multiple cores, will my 
ranking be impacted vs. using a single core?

I have records to index and each record can be grouped into object-types, such 
as object-A, object-B, object-C, etc.  I have a total of 30 (maybe more) 
object-types.  There may be only 10 records of object-A, but 10 million records 
of object-B or 1 million of object-C, etc.  I need to be able to search against 
a single object-type and / or across all object-types.

From my past experience, in a single core setup, if I have two identical 
records and I search on the term "XYZ" that matches one of the records, the 
second record ranks right next to the other (because it too contains "XYZ").  
This is good and is the expected behavior.  If I want to limit my search to an 
object-type, I AND "XYZ" with that object-type.  So all is well.

What I'm considering for my new design is to use multiple cores and distributed 
search.  I am considering creating a core for each object-type: core-A will 
hold records from object-A, core-B will hold records from object-B, etc.  
Before I can make a decision on this design, I need to know how ranking will be 
impacted.

Going back to my earlier example: say I have 2 identical records, one of which 
went to core-A (which has 10 records) and the other to core-B (which has 10 
million records).  Using distributed search, if I now search across all cores on 
the term "XYZ" (just like in the single core case), it will match both of 
those records all right, but will those two records be ranked next to each 
other, just like in the single core case?  If not, which will rank higher: the 
one from core-A or the one from core-B?

My concern is that using multiple cores and distributed search means I will give up 
rank quality when records are not distributed across cores evenly.  If so, then 
maybe this is not a design I can use.

- MJ

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Tuesday, March 10, 2015 2:39 PM
To: solr-user@lucene.apache.org
Subject: Re: Cores and ranking (search quality)

On Mar 10, 2015, at 10:17 AM, johnmu...@aol.com wrote:

> If I have two cores, one core has 10 docs another has 100,000 docs.  I then 
> submit two docs that are 100% identical (with the exception of the unique-ID 
> fields, which is stored but not indexed) one to each core.  The question is, 
> during search, will both of those docs rank near each other or not? […]
> 
> Put another way: are docs from the smaller core (the one has 10 docs only) 
> rank higher or lower compared to docs from the larger core (the one with 
> 100,000) docs?

These are not quite the same question.

tf.idf ranking depends on the other documents in the collection (the idf term). 
With 10 docs, the document frequency statistics are effectively random noise, 
so the ranking is unpredictable.

Identical documents should rank identically, but whether they are higher or 
lower in the two cores depends on the rest of the docs.

idf statistics don’t settle down until at least 10K docs. You still sometimes 
see anomalies under a million documents. 
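[A rough illustration of the idf term, assuming Lucene's classic TF-IDF similarity, 
idf(t) = 1 + ln(numDocs / (docFreq + 1)), and made-up document frequencies:

  core-A: numDocs = 10,      docFreq("xyz") = 1   ->  idf = 1 + ln(10 / 2)       = 2.6
  core-B: numDocs = 100,000, docFreq("xyz") = 50  ->  idf = 1 + ln(100,000 / 51) = 8.6

Ignoring the other scoring factors, the identical document in core-B scores far 
higher for "xyz" than its twin in core-A, purely because each core computes idf 
from its own statistics.]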

What design decision do you need to make? We can probably answer that for you.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Cores and ranking (search quality)

2015-03-11 Thread johnmunir
Thanks Walter.  This explains a lot.

- MJ

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Tuesday, March 10, 2015 4:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Cores and ranking (search quality)

If the documents are distributed randomly across shards/cores, then the 
statistics will be similar in each core and the results will be similar.

If the documents are distributed semantically (say, by topic or type), the 
statistics of each core will be skewed towards that set of documents and the 
results could be quite different.

Assume I have tech support documents and I put all the LaserJet docs in one 
core. That term is very common in that core (poor idf) and rare in other cores 
(strong idf). But for the query “laserjet”, all the good answers are in the 
LaserJet-specific core, where they will be scored low.

An identical document that mentions “LaserJet” once will score fairly low in 
the LaserJet-specific collection and fairly high in the other collection.

Global IDF fixes this, by using corpus-wide statistics. That’s how we ran 
Infoseek and Ultraseek in the late 1990’s.

Random allocation to cores avoids it.

If you have significant traffic directed to one object type AND you need peak 
performance, you may want to segregate your cores by object type. Otherwise, 
I’d let SolrCloud spread them around randomly and filter based on an object 
type field. That should work well for most purposes.

Any core with less than 1000 records is likely to give somewhat mysterious 
results. A word that is common in English, like “next”, will only be in one 
document and will score too high. A less-common word, like “unreasonably”, will 
be in 20 and will score low. You need lots of docs for the language statistics 
to even out.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Mar 10, 2015, at 1:23 PM, johnmu...@aol.com wrote:

> Thanks Walter.
> 
> The design decision I'm trying to solve is this: using multiple cores, will 
> my ranking be impacted vs. using single core?
> 
> I have records to index and each record can be grouped into object-types, 
> such as object-A, object-B, object-C, etc.  I have a total of 30 (maybe more) 
> object-types.  There may be only 10 records of object-A, but 10 million 
> records of object-B or 1 million of object-C, etc.  I need to be able to 
> search against a single object-type and / or across all object-types.
> 
> From my past experience, in a single core setup, if I have two identical 
> records, and I search on the term " XYZ" that matches one of the records, the 
> second record ranks right next to the other (because it too contains "XYZ").  
> This is good and is the expected behavior.  If I want to limit my search to 
> an object-type, I AND "XYZ" with that object-type.  So all is well.
> 
> What I'm considering to do for my new design is use multi-cores and 
> distributed search.  I am considering to create a core for each object-type: 
> core-A will hold records from object-A, core-B will hold records from 
> object-B, etc.  Before I can make a decision on this design, I need to know 
> how ranking will be impacted.
> 
> Going back to my earlier example: if I have 2 identical records, one of them 
> went to core-A which has 10 records, and the other went to core-B which has 
> 10 million records, using distributed search, if I now search across all 
> cores on the term " XYZ" (just like in the single core case), it will match 
> both of those records all right, but will those two records be ranked next to 
> each other just like in the single core case?  If not, which will rank 
> higher, the one from core-A or the one from core-B?
> 
> My concern is, using multi-cores and distributed search means I will give up 
> on rank quality when records are not distributed across cores evenly.  If so, 
> than maybe this is not a design I can use.
> 
> - MJ
> 
> -Original Message-
> From: Walter Underwood [mailto:wun...@wunderwood.org] 
> Sent: Tuesday, March 10, 2015 2:39 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Cores and ranking (search quality)
> 
> On Mar 10, 2015, at 10:17 AM, johnmu...@aol.com wrote:
> 
>> If I have two cores, one core has 10 docs another has 100,000 docs.  I then 
>> submit two docs that are 100% identical (with the exception of the unique-ID 
>> fields, which is stored but not indexed) one to each core.  The question is, 
>> during search, will both of those docs rank near each other or not? […]
>> 
>> Put another way: are docs from the smaller core (the one has 10 docs only) 
>> rank higher or lower compared to docs from the larger core (the one with 
>> 100,000) docs?
> 
> These are not quite the same question.
> 
> tf.idf ranking depends on the other documents in the collection (the idf 
> term). With 10 docs, the document frequency statistics are effectively random 
> noise, so the ranking is unpredictable.
> 
> Identical documents s

Re: [Poll]: User need for Solr security

2015-03-12 Thread johnmunir
I would love to see record level (or even field level) restricted access in 
Solr / Lucene.

This should be group-level, LDAP-like, or some rule-based scheme (which can be dynamic). 
 If the solution means having a second core, so be it.

The following is the closest I found: 
https://wiki.apache.org/solr/SolrSecurity#Document_Level_Security but I cannot 
use Manifold CF (Connector Framework).  Does anyone know how Manifold does it?

- MJ

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Thursday, March 12, 2015 6:51 PM
To: solr-user@lucene.apache.org
Subject: RE: [Poll]: User need for Solr security

Jan - we don't really need any security for our products, nor for most clients. 
However, one client does deal with very sensitive data so we proposed to 
encrypt the transfer of data and the data on disk through a Lucene Directory. 
It won't fill all gaps but it would adhere to such a client's guidelines. 

I think many approaches to security in Solr/Lucene would find advocates, be it 
index encryption or authentication/authorization or transport security, which 
is now possible. I understand the reluctance of the PMC, and I agree with it, 
but some users would definitely benefit and it would certainly make 
Solr/Lucene the search platform to use for some enterprises.

Markus 
 
-Original message-
> From:Henrique O. Santos 
> Sent: Thursday 12th March 2015 23:43
> To: solr-user@lucene.apache.org
> Subject: Re: [Poll]: User need for Solr security
> 
> Hi,
> 
> I’m currently working with indexes that need document level security. Based 
> on the user logged in, query results would omit documents that this user 
> doesn’t have access to, with LDAP integration and such.
> 
> I think that would be nice to have on a future Solr release.
> 
> Henrique.
> 
> > On Mar 12, 2015, at 7:32 AM, Jan Høydahl  wrote:
> > 
> > Hi,
> > 
> > Securing various Solr APIs has once again surfaced as a discussion 
> > in the developer list; see e.g. SOLR-7236. It would be useful to get some 
> > feedback from Solr users about needs "in the field".
> > 
> > Please reply to this email and let us know what security aspect(s) would be 
> > most important for your company to see supported in a future version of 
> > Solr.
> > Examples: Local user management, AD/LDAP integration, SSL, 
> > authenticated login to Admin UI, authorization for Admin APIs, e.g. 
> > admin user vs read-only user etc
> > 
> > --
> > Jan Høydahl, search solution architect Cominvent AS - 
> > www.cominvent.com
> > 
> 
> 



Which Lucene search syntax is faster

2014-04-30 Thread johnmunir

Hi,


Given the following Lucene document that I’m adding to my index (and I expect to 
have over 10 million of them, each varying in size from 1 KB to 50 KB):



  
[The XML example was mangled by the mail archive.  It showed two documents: one 
with doc_type PDF and one with doc_type DOC, each with a title ("Some name"), a 
summary ("Some summary"), an owner ("Who owns this"), the value 10, and a 
numeric id (1234567890 / 0987654321).]




My question is this: what Lucene search syntax will give me back results the 
fastest?  If my user is interested in finding data within the “title” and “owner” 
fields of “doc_type” “DOC” only, should I build my Lucene search syntax as:
 
1) skyfall ian fleming AND doc_type:DOC
2) title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR fleming) AND 
doc_type:DOC
3) Something else I don't know about?


Of the 10 million documents I will be indexing, 80% will be of "doc_type" PDF, 
and about 10% of type DOC, so please keep that in mind as a factor (if that 
will mean anything in terms of which syntax I should use).


Thanks in advance,
 
- MJ 


Re: Which Lucene search syntax is faster

2014-04-30 Thread johnmunir

Thank you Shawn and Erick for the quick response.


A follow up question.


Based on 
https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter,
I see the "fl" (field list) parameter.  Does this mean I can build my Lucene 
search syntax as follows:


q=skyfall OR ian OR fleming&fl=title&fl=owner&fq=doc_type:DOC


And get the same result as (per Shawn's example, changed a bit to add OR):


q=title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR 
fleming)&fq=doc_type:DOC


Btw, my default search operator is set to AND.  My need is to find whatever the 
user types in both of those two fields (or maybe some other fields, which is 
controlled by the UI).  For example, the user types "skyfall ian fleming", 
selects 3 fields, and wants to narrow down to doc_type DOC.
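[For what it's worth, fl only controls which stored fields come back in the 
response; it does not restrict which fields are searched.  Restricting the 
searched fields is usually done with the dismax/edismax qf parameter.  A sketch, 
using the field names from the example above:

  q=skyfall ian fleming&defType=edismax&qf=title owner&fq=doc_type:DOC]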


- MJ




-Original Message-
From: Erick Erickson 
To: solr-user 
Sent: Wed, Apr 30, 2014 5:33 pm
Subject: Re: Which Lucene search syntax is faster


I'd add that I think you're worrying about the wrong thing. 10M
documents is not very many by modern Solr standards. I rather suspect
that you won't notice much difference in performance due to how you
construct the query.

Shawn's suggestion to use fq clauses is spot on, though. fq clauses
are re-used (see filterCache in solrconfig.xml). My rule of thumb is
to use fq clauses for most everything that does NOT contribute to
scoring...
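[The filterCache Erick mentions is configured in solrconfig.xml; a typical stock 
entry looks roughly like this (the sizes are only illustrative):

  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>

Each distinct fq clause is cached as a set of matching documents and reused by 
later queries that send the same fq.]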

Best,
Erick

On Wed, Apr 30, 2014 at 2:18 PM, Shawn Heisey  wrote:
> On 4/30/2014 2:29 PM, johnmu...@aol.com wrote:
>> My question is this: what Lucene search syntax will give me back results the
>> fastest?  If my user is interested in finding data within the “title” and “owner”
>> fields of “doc_type” “DOC” only, should I build my Lucene search syntax as:
>>
>> 1) skyfall ian fleming AND doc_type:DOC
>
> If your default field is text, I'm fairly sure this will become
> equivalent to the following which is probably NOT what you want.
> Parentheses can be very important.
>
> text:skyfall OR text:ian OR (text:fleming AND doc_type:DOC)
>
>> 2) title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR fleming) AND
>> doc_type:DOC
>
> This kind of query syntax is probably what you should shoot for.  Not
> from a performance perspective -- just from the perspective of making
> your queries completely correct.  Note that the +/- syntax combined with
> parentheses is far more precise than using AND/OR/NOT.
>
>> 3) Something else I don't know about.
>
> The edismax query parser is very powerful.  That might be something
> you're interested in.
>
> https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser
>
>
>> Of the 10 million documents I will be indexing, 80% will be of "doc_type"
>> PDF, and about 10% of type DOC, so please keep that in mind as a factor (if
>> that will mean anything in terms of which syntax I should use).
>
> For the most part, whatever general query format you choose to use will
> not matter very much.  There are exceptions, but mostly Solr (Lucene) is
> smart enough to convert your query to an efficient final parsed format.
> Turn on the debugQuery parameter to see what it does with each query.
>
> Regardless of whether you use the standard lucene query parser or
> edismax, incorporate filter queries into your query constructing logic.
> Your second example above would be better to express like this, with the
> default operator set to OR.  This uses both q and fq parameters:
>
> q=title:(skyfall ian fleming) owner:(skyfall ian fleming)&fq=doc_type:DOC
>
> https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter
>
> Thanks,
> Shawn
>

 


Using fq as OR

2014-05-21 Thread johnmunir
Hi,


Currently, I'm building my search as follows:


q=(search string ...) AND (type:type_a OR type:type_b OR type:type_c OR ...)


Which means anything I search for will be AND'ed with a requirement that the type 
field is one of "type_a", "type_b", "type_c", etc.  (I have defaultOperator set to "AND")


Now I need to use "fq", but I'm not sure how to build my search string to get 
the same result!!


I have tried the following:


q=search string ...&fq=type:type_a&fq=type:type_b&fq=type:type_c&...


But this isn't the same because each additional "fq" is now being treated as 
AND (keep in mind, I have defaultOperator set to "AND" and I cannot change 
that).


I have tried the following:


q=search string ...&fq=type:(type_a OR type_b OR type_c OR ...)


But the result I get back is not the same.


Thanks in advance!!!


-- MJ






Re: Using fq as OR

2014-05-21 Thread johnmunir

Answering Jack's question first: the result is different by a few counts, but I 
found my problem: I was using the wrong syntax in my code vs. what I posted here:


I was using


q=(search string ...) AND (type:type_a OR type_b OR type_c OR ...)


(see how I left out "type:" from "type_b" and "type_c", etc.?)


Shawn and all, now the hit count is the same but the ranking is totally different.  
How come?!  I'm not using edismax, I'm using the default query parser, and I'm 
also using the default sort.  You said the "order" will likely be different, 
which it is, but why?  If I cannot explain it to my users, they will be confused, 
because they can type the search syntax in directly (when "fq" is not used) and 
expect to see the same result as when I programmatically apply "fq" in my code.


Same data, but a different path that gives me a different rank result, is not good.


-- MJ



-Original Message-
From: Shawn Heisey 
To: solr-user 
Sent: Wed, May 21, 2014 11:42 am
Subject: Re: Using fq as OR


On 5/21/2014 9:26 AM, johnmu...@aol.com wrote:
> Currently, I'm building my search as follows:
>
>
> q=(search string ...) AND (type:type_a OR type:type_b OR type:type_c OR 
...)
>
>
> Which means anything I search for will be AND'ed to be in either fields that 
have "type_a", "type_b", "type_c", etc.  (I have defaultOperator set to "AND")
>
>
> Now, I need to use "fq" so I'm not sure how to build my search string to get 
the same result!!
>
>
> I have tried the following:
>
>
> q=search string ...&fq=type:type_a&fq=type:type_b&fq=type:type_c&...
>
>
> But this isn't the same because each additional "fq" is now being treated as 
AND (keep in mind, I have defaultOperator set to "AND" and I cannot change 
that).
>
>
> I have tried the following:
>
>
> q=search string ...&fq=type:(type_a OR type_b OR type_c OR ...)
>
>
> But the result I get back is not the same.

If you are using the standard (lucene) query parser for your queries,
then fq should behave exactly the same.  If you are using a different
query parser (edismax, for example) then fq may not behave the same,
because it will use the lucene query parser.

With the standard query parser, if your original query looks like the
following:

q=(query) AND (filter)

The query below should produce exactly the same results -- although if
you are using the default relevance sort, the *order* is likely to be
different, because filter queries do not affect the document scores, but
everything in the q parameter does.

q=(query)&fq=(filter)

Thanks,
Shawn


 


Re: Using fq as OR

2014-05-21 Thread johnmunir

Hi Jack,


I'm going after speed per: 
https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter


If using "fq" ranking will now be different, I need to understand why.  Even 
more, I'm now wandering, which ranking is correct the one with "fq" or without 
?!!!


I'm now more puzzled about this than ever   If the following two



q=(searchstring ...) AND (type:type_a OR type:type_b OR type:type_c OR ...)



q=search string...&fq=type:(type_a OR type_b OR type_c OR ...)



will not give me the same ranking, than why?


-- MJ


-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Wed, May 21, 2014 5:06 pm
Subject: Re: Using fq as OR


The whole point of a filter query is to hide data but without impacting the 
scoring for the non-hidden data. A second goal is performance since the 
filter query can be cached.

So, the immediate question for you is whether you really want a true filter 
query, or if you actually do want the filtering terms to participate in the 
document scoring.

In other words, what exactly were you trying to achieve by using fq?

-- Jack Krupansky

-Original Message- 
From: johnmu...@aol.com
Sent: Wednesday, May 21, 2014 12:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Using fq as OR


Answering Jack's question first: the result is different, by few counts, but 
I found my problem:I was using the wrong syntax in my code vs. what I posted 
here:


I was using


q=(search string ...) AND (type:type_a OR type_b OR type_c OR ...)


(see how I left out "type:" from "type_b" and "type_c", etc.?!


Shawn and all, now the hit count is the same but ranking is totally 
different, how come ?!!!  I'm not using edismax, I'm using the default query 
parser, I'm also using the default sort.  You said the "order" will likely 
be different, which it is, why?  If I cannot explain it to my users, they 
will be confused because they can type in directly the search syntax (when 
"fq" is not used) and expect to see the same result for when I grammatically 
in my code apply "fq".


Same data, but different path, giving me different rank result, is not good.


-- MJ



-Original Message-
From: Shawn Heisey 
To: solr-user 
Sent: Wed, May 21, 2014 11:42 am
Subject: Re: Using fq as OR


On 5/21/2014 9:26 AM, johnmu...@aol.com wrote:
> Currently, I'm building my search as follows:
>
>
> q=(search string ...) AND (type:type_a OR type:type_b OR type:type_c 
> OR
...)
>
>
> Which means anything I search for will be AND'ed to be in either fields 
> that
have "type_a", "type_b", "type_c", etc.  (I have defaultOperator set to 
"AND")
>
>
> Now, I need to use "fq" so I'm not sure how to build my search string to 
> get
the same result!!
>
>
> I have tried the following:
>
>
> q=search string ...&fq=type:type_a&fq=type:type_b&fq=type:type_c&...
>
>
> But this isn't the same because each additional "fq" is now being treated 
> as
AND (keep in mind, I have defaultOperator set to "AND" and I cannot change
that).
>
>
> I have tried the following:
>
>
> q=search string ...&fq=type:(type_a OR type_b OR type_c OR ...)
>
>
> But the result I get back is not the same.

If you are using the standard (lucene) query parser for your queries,
then fq should behave exactly the same.  If you are using a different
query parser (edismax, for example) then fq may not behave the same,
because it will use the lucene query parser.

With the standard query parser, if your original query looks like the
following:

q=(query) AND (filter)

The query below should produce exactly the same results -- although if
you are using the default relevance sort, the *order* is likely to be
different, because filter queries do not affect the document scores, but
everything in the q parameter does.

q=(query)&fq=(filter)

Thanks,
Shawn




 


Re: Using fq as OR

2014-05-21 Thread johnmunir
Interesting!!  I did not know that using "fq" means the result will NOT be 
scored.


When you say "add a boosting query using the bq parameter" can you give me an 
example?  I read on "bq" but could not figure out how to convert:

q=(searchstring ...) AND (type:type_a OR type:type_b OR type:type_c OR ...)
to use "bq boosting".
Maybe my question should be rephrase to this: to narrow down my search to 
within 1 or more fields, is the syntax that I'm currently using the optimal one 
or is there some Solr trick I should be using?
My users are currently used to the score result that I give them with the 
syntax that I am currently using (that I showed above).  I looking to see if 
there is some other way to get the same result faster.  This is why I ended up 
looking into "fq" after reading about it
Thanks to everyone for helping out with this topic.  I am learning a lot 
-- MJ


-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Wed, May 21, 2014 6:07 pm
Subject: Re: Using fq as OR


As I indicated in my original response, the fq query terms do not 
participate in any way in the scoring of documents - they merely filter 
(eliminate or keep) documents.

If you actually do want the fq terms to participate in the scoring of 
documents, either keep them on the original q query, or add a boosting query 
using the bq parameter. The latter approach works for the dismax and edismax 
query parsers only.
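[A sketch of what that could look like for the earlier example, assuming the 
edismax parser and the same hypothetical type values.  The fq still restricts the 
result set, while the bq clause lets a matching type contribute to the score again:

  q=search string ...&defType=edismax
      &fq=type:(type_a OR type_b OR type_c)
      &bq=type:(type_a OR type_b OR type_c)]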

-- Jack Krupansky

-Original Message- 
From: johnmu...@aol.com
Sent: Wednesday, May 21, 2014 5:51 PM
To: solr-user@lucene.apache.org
Subject: Re: Using fq as OR


Hi Jack,


I'm going after speed per: 
https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter


If using "fq" ranking will now be different, I need to understand why.  Even 
more, I'm now wandering, which ranking is correct the one with "fq" or 
without ?!!!


I'm now more puzzled about this than ever   If the following two



q=(searchstring ...) AND (type:type_a OR type:type_b OR type:type_c OR 
...)



q=search string...&fq=type:(type_a OR type_b OR type_c OR ...)



will not give me the same ranking, than why?


-- MJ


-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Wed, May 21, 2014 5:06 pm
Subject: Re: Using fq as OR


The whole point of a filter query is to hide data but without impacting the
scoring for the non-hidden data. A second goal is performance since the
filter query can be cached.

So, the immediate question for you is whether you really want a true filter
query, or if you actually do what the filtering terms to participate in the
document scoring.

In other words, what exactly were you trying to achieve by using fq?

-- Jack Krupansky

-Original Message- 
From: johnmu...@aol.com
Sent: Wednesday, May 21, 2014 12:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Using fq as OR


Answering Jack's question first: the result is different, by few counts, but
I found my problem:I was using the wrong syntax in my code vs. what I posted
here:


I was using


q=(search string ...) AND (type:type_a OR type_b OR type_c OR ...)


(see how I left out "type:" from "type_b" and "type_c", etc.?!


Shawn and all, now the hit count is the same but ranking is totally
different, how come ?!!!  I'm not using edismax, I'm using the default query
parser, I'm also using the default sort.  You said the "order" will likely
be different, which it is, why?  If I cannot explain it to my users, they
will be confused because they can type in directly the search syntax (when
"fq" is not used) and expect to see the same result for when I grammatically
in my code apply "fq".


Same data, but different path, giving me different rank result, is not good.


-- MJ



-Original Message-
From: Shawn Heisey 
To: solr-user 
Sent: Wed, May 21, 2014 11:42 am
Subject: Re: Using fq as OR


On 5/21/2014 9:26 AM, johnmu...@aol.com wrote:
> Currently, I'm building my search as follows:
>
>
> q=(search string ...) AND (type:type_a OR type:type_b OR type:type_c
> OR
...)
>
>
> Which means anything I search for will be AND'ed to be in either fields
> that
have "type_a", "type_b", "type_c", etc.  (I have defaultOperator set to
"AND")
>
>
> Now, I need to use "fq" so I'm not sure how to build my search string to
> get
the same result!!
>
>
> I have tried the following:
>
>
> q=search string ...&fq=type:type_a&fq=type:type_b&fq=type:type_c&...
>
>
> But this isn't the same because each additional "fq" is now being treated
> as
AND (keep in mind, I have defaultOperator set to "AND" and I cannot change
that).
>
>
> I have tried the following:
>
>
> q=search string ...&fq=type:(type_a OR type_b OR type_c OR ...)
>
>
> But the result I get back is not the same.

If you are using the standard (lucene) query parser for your queries,
then fq should behave exactly the same.  If you are using a different
query parser (edismax, for e

How much free disk space will I need to optimize my index

2014-06-25 Thread johnmunir
Hi,


I need to de-fragment my index.  My question is: how much free disk space do I 
need before I can do so?  My understanding is that I need free disk space equal to 
1X the size of my current un-optimized index before I can optimize it.  Is this true?


That is, let's say my index is 20 GB (un-optimized); then I must have 20 GB of 
free disk space to make sure the optimization is successful.  The reason for 
this is that during optimization the index is re-written (is this the case?), and 
even if it is already optimized, the re-write will create a new 20 GB index 
before it deletes the old one (is this true?), which is why there must be at least 
20 GB of free disk space.


Can someone help me with this or point me to a wiki on this topic?


Thanks!!!


- MJ


Re: How much free disk space will I need to optimize my index

2014-06-26 Thread johnmunir
Thank you all for the reply and shedding more light on this topic.


A follow-up question: during optimization, if I run out of disk space, what 
happens other than the optimize failing?  Am I now left with an even larger 
index than I started with, or am I back to the original non-optimized index 
size?


-- MJ



-Original Message-
From: Walter Underwood 
To: solr-user 
Sent: Thu, Jun 26, 2014 10:50 am
Subject: Re: How much free disk space will I need to optimize my index


The 3x worst case is:

1. All documents are in one segment.
2. Without merging, all documents are deleted, then re-added and committed.
3. A merge is done.

At the end of step 2, there are two equal-sized segments, 2X the space needed.

During step 3, a third segment of that size is created.

This can only happen if you disable merging. 2X is a conservative margin that 
should work fine for regular merges. 

Forced full merges ("optimize") can use more overhead because they move every 
document in the index. Yet another reason to avoid forced merges.
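[Putting rough numbers on that for the 20 GB index from the original question (an 
estimate, not a guarantee):

  existing index:              20 GB
  typical forced merge:        writes a new ~20 GB copy before the old segments are
                               deleted  ->  ~40 GB on disk at the peak, i.e. ~20 GB free
  worst case described above:  up to ~3 x 20 GB = 60 GB on disk, i.e. ~40 GB free]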

wunder

On Jun 26, 2014, at 12:50 AM, Thomas Egense  wrote:

> That is correct, but twice the disk space is theoretically not enough.
> Worst case is actually three times the storage, I guess this worst case can
> happen if you also submit new documents to the index while optimizing.
> I have experienced 2.5 times the disk space during an optimize for a large
> index, it was a 1TB index that temporarily used 2.5TB disc space during the
> optimize (near the end of the optimization).
> 
> From,
> Thomas Egense
> 
> 
> On Wed, Jun 25, 2014 at 8:21 PM, Markus Jelsma 
> wrote:
> 
>> 
>> 
>> 
>> 
>> -Original message-
>>> From:johnmu...@aol.com 
>>> Sent: Wednesday 25th June 2014 20:13
>>> To: solr-user@lucene.apache.org
>>> Subject: How much free disk space will I need to optimize my index
>>> 
>>> Hi,
>>> 
>>> 
>>> I need to de-fragment my index.  My question is, how much free disk
>> space I need before I can do so?  My understanding is, I need 1X free disk
>> space of my current index un-optimized index size before I can optimize it.
>> Is this true?
>> 
>> Yes, 20 GB of FREE space to force merge an existing 20 GB index.
>> 
>>> 
>>> 
>>> That is, let say my index is 20 GB (un-optimized) then I must have 20 GB
>> of free disk space to make sure the optimization is successful.  The reason
>> for this is because during optimization the index is re-written (is this
>> the case?) and if it is already optimized, the re-write will create a new
>> 20 GB index before it deletes the old one (is this true?), thus why there
>> must be at least 20 GB free disk space.
>>> 
>>> 
>>> Can someone help me with this or point me to a wiki on this topic?
>>> 
>>> 
>>> Thanks!!!
>>> 
>>> 
>>> - MJ
>>> 
>> 

--
Walter Underwood
wun...@wunderwood.org




 


Searching on special characters

2013-10-24 Thread johnmunir
Hi,


How should I set up Solr so I can search and get hits on special characters such 
as: + - && || ! ( ) { } [ ] ^ " ~ * ? : \


My need is, if a user has text like so:


Doc-#1: "(Solr)"
Doc-#2: "Solr"


And they type "(solr)" I want a hit on "(solr)" only in document #1, with the 
brackets matching.  And if they type "solr", they will get a hit in Document #2 
only.


An additional nice-to-have is, if they type "solr", I want a hit in both 
document #1 and #2.


Here is what my current schema.xml looks like:

[fieldType definition mangled by the mail archive -- the tokenizer and filter 
elements were stripped when the message was archived.]



Currently, special characters are being stripped.



Any idea how I can configure Solr to do this?  I'm using Solr 3.6.



Thanks !!


-MJ


Re: Searching on special characters

2013-10-24 Thread johnmunir
I'm not sure what you mean.  Based on what you are saying, is there an example 
of how I can set up my schema.xml to get the result I need?


Also, the way I execute a search is using 
http://localhost:8080/solr/select/?q=  Does your solution require 
me to change this?  If so, in what way?


It would be great if all this is documented somewhere, so I won't have to bug 
you guys !!!



--MJ



-Original Message-
From: Jack Krupansky 
To: solr-user 
Sent: Thu, Oct 24, 2013 9:39 am
Subject: Re: Searching on special characters


Have two or three copies of the text.  One field could be a raw string, boosted 
heavily for exact match; a second could be text using the keyword tokenizer but 
with a lowercase filter, also heavily boosted; and the third field could be 
general tokenized text with a lower boost.  You could also have a copy that uses 
the keyword tokenizer to maintain a single token but also applies a regex filter 
to strip special characters and a lowercase filter, and give that an intermediate 
boost.
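[A sketch of the second copy Jack describes (keyword tokenizer plus lowercasing); 
the field and type names are hypothetical, and the copyField keeps a single source 
field searchable in both forms:

  <fieldType name="text_exactish" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="body" type="text" indexed="true" stored="true"/>
  <field name="body_exact" type="text_exactish" indexed="true" stored="false"/>
  <copyField source="body" dest="body_exact"/>

A query can then search both fields and boost the exact one, for example 
body:(solr) OR body_exact:"(solr)"^10.]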

-- Jack Krupansky

-Original Message- 
From: johnmu...@aol.com
Sent: Thursday, October 24, 2013 9:20 AM
To: solr-user@lucene.apache.org
Subject: Searching on special characters

Hi,


How should I setup Solr so I can search and get hit on special characters 
such as: + - && || ! ( ) { } [ ] ^ " ~ * ? : \


My need is, if a user has text like so:


Doc-#1: "(Solr)"
Doc-#2: "Solr"


And they type "(solr)" I want a hit on "(solr)" only in document #1, with 
the brackets matching.  And if they type "solr", they will get a hit in 
Document #2 only.


An additional nice-to-have is, if they type "solr", I want a hit in both 
document #1 and #2.


Here is what my current schema.xml looks like:

[fieldType definition mangled by the mail archive -- the tokenizer and filter 
elements were stripped when the message was archived.]


Currently, special characters are being stripped.



Any idea how I can configure Solr to do this?  I'm using Solr 3.6.



Thanks !!


-MJ 


 



How to deal with underscore

2013-07-03 Thread johnmunir
Hi,


In my schema.xml, I have the following settings:

[fieldType definition mangled by the mail archive -- the tokenizer and filter 
elements were stripped when the message was archived.]


This does a great job for most of my text, but one thing it does that I don't like 
is that it won't replace underscores with spaces; it strips them.  For example, if I 
have "Solr_Lucene" it becomes "solrlucene" (one word).  What I want is two 
words: "solr lucene".


Thanks


-MJ


Will Solr work with a mapped drive?

2013-09-19 Thread johnmunir
Hi,


I'm having the same problem as described here: 
http://stackoverflow.com/questions/17708163/absolute-paths-in-solr-xml-configuration-using-tomcat6-on-windows
  Does anyone know whether this is a limitation of Solr or not?


I searched the web, nothing came up.


Thanks!!!


-- MJ


Unsubscribing from JIRA

2013-05-01 Thread johnmunir
Hi,


Can someone show me how to unsubscribe from JIRA?


Years ago, I subscribed to JIRA and since then I have been receiving emails 
from JIRA for all kinds of issues: when an issue is created, closed, or commented 
on.  Yes, I looked around and could not figure out how to unsubscribe, but 
maybe I didn't look hard enough?


Here is an example email subject line header from JIRA: "[jira] [Commented] 
(LUCENE-3842) Analyzing Suggester"  I have the same issue from "Jenkins" (and 
example: "[JENKINS] Lucene-Solr-Tests-4.x-Java6 - Build # 1537 - Still 
Failing").


Thanks in advance!!!


-MJ


Re: Unsubscribing from JIRA

2013-05-01 Thread johnmunir
Are you saying that because I'm subscribed to dev (which I am), that is why I'm 
getting JIRA mails too, and the only way I can stop JIRA mails is to unsubscribe 
from dev?  I don't think so.  I'm subscribed to other projects, both dev and user, 
and yet I do not receive JIRA mails.


--MJ



-Original Message-
From: Alan Woodward 
To: solr-user 
Sent: Wed, May 1, 2013 12:52 pm
Subject: Re: Unsubscribing from JIRA


Hi MJ,

It looks like you're subscribed to the lucene dev list.  Send an email to 
dev-unsubscr...@lucene.apache.org to get yourself taken off the list.

Alan Woodward
www.flax.co.uk


On 1 May 2013, at 17:25, johnmu...@aol.com wrote:

> Hi,
> 
> 
> Can someone show me how to unsubscribe from JIRA?
> 
> 
> Years ago, I subscribed to JIRA and since then I have been receiving emails 
from JIRA for all kind of issues: when an issue is created, closed or commented 
on.  Yes, I looked around and could not figure out how to unsubscribe, but 
maybe 
I didn't look hard enough?
> 
> 
> Here is an example email subject line header from JIRA: "[jira] [Commented] 
(LUCENE-3842) Analyzing Suggester"  I have the same issue from "Jenkins" (and 
example: "[JENKINS] Lucene-Solr-Tests-4.x-Java6 - Build # 1537 - Still 
Failing").
> 
> 
> Thanks in advance!!!
> 
> 
> -MJ


 


RE: Unsubscribing from JIRA

2013-05-07 Thread johnmunir

For someone like me, who wants to follow dev discussions but not JIRA, having a 
separate mailing list subscription for each would be ideal.  The incoming mail 
traffic would be cut drastically (for me, I get far more non-relevant emails 
from JIRA than from dev).


-- MJ
 
-Original Message-
From: Raymond Wiker [mailto:rwi...@gmail.com] 
Sent: Wednesday, May 01, 2013 2:01 PM
To: solr-user@lucene.apache.org
Subject: Re: Unsubscribing from JIRA
 
On May 1, 2013, at 19:07, johnmunir@aol.com wrote:
> Are you saying because I'm subscribed to dev, which I am, is why I'm getting 
> JIRA mails too, and the only way I can stop JIRA mails is to unsubscribe from 
> dev?  I don't think so.  I'm subscribed to other projects, both dev and user, 
> and yet I do not receive JIRA mails.
> 
 
I'm pretty sure that's the case... I subscribed to dev, and got the JIRA mails. 
I unsubscribed from dev, and the JIRA mails stopped.


Phrase search

2010-08-02 Thread johnmunir

Hi All,
 
I don't understand why I'm getting this behavior.  I was under the impression 
that if I search for "Apple 2" (with quotes and a space before “2”) it will give me 
different results vs. if I search for "Apple2" (with quotes and no space before 
“2”), but I'm not getting different results!  Why?
 
Here is my fieldType setting from my schema.xml:

[fieldType definition mangled by the mail archive -- it contained an index 
analyzer and a query analyzer whose tokenizer and filter elements (including a 
solr.WordDelimiterFilterFactory) were stripped when the message was archived.]
 
What am I missing?!  What part of my solr.WordDelimiterFilterFactory needs to 
change (if that’s where the issue is)?
 
I’m using Solr 1.2
 
Thanks in advance.
 
-M
 


Re: Phrase search

2010-08-02 Thread johnmunir




Thanks for the quick response.

Which part of my WordDelimiterFilterFactory is changing "Apple 2" to "Apple2"?  
How do I fix it?  Also, I'm really confused about this.  I was under the 
impression a phrase search is not impacted by the analyzer, no?

-M


-Original Message-
From: Markus Jelsma 
To: solr-user@lucene.apache.org
Sent: Mon, Aug 2, 2010 2:27 pm
Subject: RE: Phrase search


Well, the WordDelimiterFilterFactory in your query analyzer clearly makes "Apple 
2" out of "Apple2", that's what it's for. If you're looking for an exact match, 
use a string field. Check the output with the debugQuery=true parameter.
 
 
Cheers, 

-Original message-
From: johnmu...@aol.com
Sent: Mon 02-08-2010 20:18
To: solr-user@lucene.apache.org
Subject: Phrase search

Hi All,
I don't understand why I'm getting this behavior.  I was under the impression if 
I search for "Apple 2" (with quotes and space before "2") it will give me 
different results vs. if I search for "Apple2" (with quotes and no space before 
"2"), but I'm not!  Why? 
Here is my fieldType setting from my schema.xml:

[fieldType definition mangled by the mail archive; see the original message 
earlier in this thread.]

What am I missing?!  What part of my solr.WordDelimiterFilterFactory needs to 
change (if that's where the issue is)?
I'm using Solr 1.2
Thanks in advance.
-M



Re: Phrase search

2010-08-02 Thread johnmunir

I'm using Solr 1.2, so I don't have splitOnNumerics.  Reading that URL, is my 
use of catenateNumbers="1" causing this?  Should I set it to "0" vs. "1" as I 
have it now?
 
-M




-Original Message-
From: Markus Jelsma 
To: solr-user@lucene.apache.org
Sent: Mon, Aug 2, 2010 3:54 pm
Subject: RE: Re: Phrase search


Hi,
 
Queries on an analyzed field will need to be analyzed as well or they might not 
match. You can configure the WordDelimiterFilterFactory so it will not split 
into multiple tokens because of numerics, see the splitOnNumerics parameter [1].
 
[1]: 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
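[On Solr releases newer than 1.2 that support it, that would look roughly like this 
on the query-side WordDelimiterFilterFactory; the other attribute values here are 
only illustrative:

  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          splitOnNumerics="0"
          catenateWords="0" catenateNumbers="0" catenateAll="0"/>

With splitOnNumerics="0", "Apple2" stays a single token instead of being split 
into "Apple" and "2".]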
 
Cheers,


-Original message-
From: johnmu...@aol.com
Sent: Mon 02-08-2010 21:29
To: solr-user@lucene.apache.org
Subject: Re: Phrase search

Thanks for the quick response.
Which part of my WordDelimiterFilterFactory is changing "Apple 2" to "Apple2"? 
How do I fix it?  Also, I'm really confused about this.  I was under the 
impression a phrase search is not impacted by the analyzer, no?
-M

-Original Message-
From: Markus Jelsma 
To: solr-user@lucene.apache.org
Sent: Mon, Aug 2, 2010 2:27 pm
Subject: RE: Phrase search

Well, the WordDelimiterFilterFactory in your query analyzer clearly makes "Apple 
2" out of "Apple2", that's what it's for. If you're looking for an exact match, 
use a string field. Check the output with the debugQuery=true parameter.
Cheers, 

-Original message-
From: johnmu...@aol.com
Sent: Mon 02-08-2010 20:18
To: solr-user@lucene.apache.org
Subject: Phrase search

Hi All,
I don't understand why I'm getting this behavior.  I was under the impression if 
I search for "Apple 2" (with quotes and space before "2") it will give me 
different results vs. if I search for "Apple2" (with quotes and no space before 
"2"), but I'm not!  Why? 
Here is my fieldType setting from my schema.xml:

[fieldType definition mangled by the mail archive; see the original message 
earlier in this thread.]

What am I missing?!  What part of my solr.WordDelimiterFilterFactory needs to 
change (if that's where the issue is)?
I'm using Solr 1.2
Thanks in advance.
-M



Re: Phrase search

2010-08-02 Thread johnmunir


I'm trying to match "Apple 2" but not "Apple2" using phrase search, this is why 
I have it quoted.
 
I was under the impression --when I use phrase search-- all the analyzer magic 
would not apply, but it is!!!  Otherwise, how would I search for a phrase?!
 
Using Google, when I search for "Windows 7" (with quotes), unlike Solr, I don't 
get hits on "Window7".  I want to use catenateNumbers="1" which I want it to 
take effect on other searches but no phrase searches.  Is this possible ?
 
Yes, we are in the process of planning to upgrade to Solr 1.4.1 -- it takes 
time and a lot of effort to do such an upgrade at where I work.
 
Thank you for your help and understanding.
 
-M






-Original Message-
From: Chris Hostetter 
To: solr-user@lucene.apache.org
Sent: Mon, Aug 2, 2010 5:41 pm
Subject: Re: Phrase search



: I don't understand why i'm getting this behavior.  I was under the 
: impression if I search for "Apple 2" (with quotes and space before “2”) 
: it will give me different results vs. if I search for "Apple2" (with 
: quotes and no space before “2”), but I'm not!  Why?

if you search "Apple 2" in quotes, then the analyzer for your field gets 
the full string (with the space) and whatever it does with it and whatever 
terms it produces determines what Query gets executed.  If you search 
"Apple2" (w/ or w/o quotes) then the analyzer for your field gets the full 
string and whatever it does with it and whatever Terms it produces determines 
what Query gets executed.

None of that changes based on the analyzer you use.

With that in mind: I really don't understand your question.  Let's step 
back and instead of trying to explain *why* you are getting the results 
you are getting (short answer: because that's how your analyzer works) 
let's ask the question: what do you *want* to do?  What do you *want* to 
see happen when you enter various query strings?

http://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing 
with "X", you are assuming "Y" will help you, and you are asking about "Y" 
without giving more details about the "X" so that we can understand the 
full issue.  Perhaps the best solution doesn't involve "Y" at all?

See Also: http://www.perlmonks.org/index.pl?node_id=542341

: I'm using Solr 1.2

PS: Solr 1.2 had numerous bugs which were really really bad and which were 
fixed in Solr 1.3.  Solr 1.3 had numerous bugs which were really really 
bad and were fixed in Solr 1.4.  Solr 1.4 had a couple of bugs which were 
really really bad and which were fixed in Solr 1.4.1 ... so even if you 
don't want any of the new features, you should *REALLY* consider 
upgrading.

Hoss



Upgrading from Solr 1.2 to 1.4.1

2010-10-28 Thread johnmunir

I'm using Solr 1.2.  If I upgrade to 1.4.1, must I re-index because of 
LUCENE-1142?  If so, how will this affect me if I don’t re-index (I'm using 
EnglishPorterFilterFactory)?  What about when I’m using non-English stemmers 
from Snowball?
 
Besides the brief note "IMPORTANT UPGRADE NOTE" about this in CHANGES.txt, where 
can I read more about this?  I looked in JIRA at LUCENE-1142; there isn't much.
 
-M


XML 1.1 and Solr 3.6.1

2013-01-25 Thread johnmunir
Can someone tell me if Solr 3.6.1 supports XML 1.1 or must I stick with XML 1.0?


Thanks!



-MJ


Please ignore, testing my email

2013-02-27 Thread johnmunir
Hi,


Please ignore, I'm testing my email (I have not received any email from Solr 
mailing list for over 12 hours now).


-- MJ


Questions about schema.xml

2012-11-07 Thread johnmunir

Hi,


Can someone help me understand the meaning of <analyzer type="index"> and 
<analyzer type="query"> in schema.xml, how they are used, and what do I get back 
when the values are not the same?


For example, given:

[fieldType definition mangled by the mail archive.  From the attribute fragments 
that survive: both the index and query analyzers apply a stop filter 
(stopwords.txt) and a solr.WordDelimiterFilterFactory; the index analyzer uses 
catenateWords="1" catenateNumbers="1", while the query analyzer adds a synonym 
filter (ignoreCase="true" expand="true") and uses catenateWords="0" 
catenateNumbers="0".]



If I make the entire content of "index" the same as "query" (or the other way 
around) how will that impact my search?  And why would I want to not make those 
two blocks the same?


Thanks!!!


-MJ 


Re: Questions about schema.xml

2012-11-08 Thread johnmunir
Thanks Prithu.


But why would I use different settings for index and query?  I would think 
that if the settings are not the same for both, then search results for end users 
would be confusing, no?  To illustrate my point (this may be drastic): if I don't 
use solr.LowerCaseFilterFactory in one case, then many searches (mixed-case, for 
example) won't give me any hits.  A more realistic example is, if I don't match 
the rules for solr.WordDelimiterFilterFactory, again, I could miss hits.  If 
my understanding is correct, and there is value in using different rules for 
"query" and "index", I'd like to see a concrete example, a use-case I can apply.


-- MJ



-Original Message-
From: Prithu Banerjee 
To: solr-user 
Sent: Thu, Nov 8, 2012 12:34 am
Subject: Re: Questions about schema.xml


Those two values are used to specify the analyzer type you want. It can
be of two kinds: one is for the indexer, where the analyzer you specify analyzes the
input documents to build the index. The other one is for query, where
it analyzes your query. Typically the analyzers specified for index and
query are the same, so that you search over exactly the tokens you created
while indexing. But you are free to provide any customized analyzer
according to your need.
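A concrete (and entirely hypothetical) illustration of analyzers that differ on purpose: index-time synonym expansion with a plain query-time chain.

<fieldType name="text_syn" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- synonyms are expanded into the index, so the query side does not need them -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Both sides tokenize the same way; the only asymmetry is that the index side injects the synonym terms.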

-- 
best regards,
Prithu

On Thu, Nov 8, 2012 at 8:43 AM, johnmu...@aol.com wrote:

>
> HI,
>
>
> Can someone help me understand the meaning of <analyzer type="index"> and
> <analyzer type="query"> in schema.xml, how they are used and what do I get
> back when the values are not the same?
>
>
> For example, given:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
>    <analyzer type="index">
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>    </analyzer>
>    <analyzer type="query">
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>    </analyzer>
> </fieldType>
>
>
> If I make the entire content of "index" the same as "query" (or the other
> way around) how will that impact my search?  And why would I want to not
> make those two blocks the same?
>
>
> Thanks!!!
>
>
> -MJ
>

 


Questions about schema.xml

2012-11-08 Thread johnmunir
HI,


Can someone help me understand the meaning of <analyzer type="index"> and 
<analyzer type="query"> in schema.xml, how they are used and what do I get back 
when the values are not the same?


For example, given:


<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
   <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
</fieldType>




If I make the entire content of "index" the same as "query" (or the other way 
around) how will that impact my search?  And why would I want to not make those 
two blocks the same?


Thanks!!!


-M 


Re: Questions about schema.xml

2012-11-08 Thread johnmunir

Thank you everyone for your explanation.  So for WordDelimiterFilter, let me 
see if I got it right.


Given that the out-of-the-box setting for catenateWords is "0" for query but is "1" 
for index, I don't see how this will give me any hits.  That is, if my 
document has "wi-fi", at index time it will be stored as "wifi".  Well, then at 
query time if I type "wi-fi" (without quotes) I will be searching for "wi fi" 
and thus won't get a hit, no?


What about when I *do* quote my search, i.e. I search for "wi-fi" with quotes: 
now what am I sending to the searcher, "wi-fi", "wi fi" or "wifi"?  Again, this 
is using the default out-of-the-box setting per the above.


The same applies for catenateNumbers.


Btw, I'm looking at this link for the above values: 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters


--MJ
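For what it's worth, a sketch of what those two settings do to "wi-fi" (based on the filter's documented behavior, not on this exact schema):

<!-- index side: catenateWords="1"
     "wi-fi" -> wi, fi, wifi (the catenated term is added alongside the parts) -->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

<!-- query side: catenateWords="0"
     "wi-fi" -> wi, fi -->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
        catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>

So at index time "wi-fi" is not stored only as "wifi"; the parts "wi" and "fi" are indexed as well (generateWordParts="1"), which is why a query that ends up as "wi fi" can still match.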





-Original Message-
From: Erick Erickson 
To: solr-user 
Sent: Thu, Nov 8, 2012 6:57 pm
Subject: Re: Questions about schema.xml


And, in fact, you do NOT need to have two. If they are both identical, just
specify one analysis chain with no qualifier, i.e. a single <analyzer> element
with no type attribute.
On Thu, Nov 8, 2012 at 9:44 AM, Jack Krupansky wrote:

> Many token filters will be used 100% identically for both "index" and
> "query" analysis, but WordDelimiterFilter is a rare exception. The issue is
> that at index time it has the ability to generate multiple tokens at the
> same position (the "catenate" options), any of which can be queried, but at
> query time it can be problematic to have these "extra" terms (except in
> some conditions), so the WDF settings suppress generation of the extra
> terms.
>
> Another example is synonyms - generate extra terms at index time for
> greater precision of searches, but limit the query terms to exclude the
> "extra" terms.
>
> That's the reason for the occasional asymmetry between index-time and
> query-time analyzers.
>
> -- Jack Krupansky
>
> -Original Message- From: johnmu...@aol.com
> Sent: Wednesday, November 07, 2012 7:13 PM
> To: solr-user@lucene.apache.org
> Subject: Questions about schema.xml
>
>
>
> HI,
>
>
> Can someone help me understand the meaning of <analyzer type="index"> and
> <analyzer type="query"> in schema.xml, how they are used and what do I get
> back when the values are not the same?
>
>
> For example, given:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
>    <analyzer type="index">
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>    </analyzer>
>    <analyzer type="query">
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>    </analyzer>
> </fieldType>
>
>
> If I make the entire content of "index" the same as "query" (or the other
> way around) how will that impact my search?  And why would I want to not
> make those two blocks the same?
>
>
> Thanks!!!
>
>
> -MJ
>

 


Is leading wildcard search turned on by default in Solr 3.6.1?

2012-11-12 Thread johnmunir


Hi,


I'm migrating from Solr 1.2 to 3.6.1.  I used the same analyzer as before, and 
re-indexed my data.  I did not add 
solr.ReversedWildcardFilterFactory to my index analyzer, and yet leading 
wildcards are working!  Does this mean it's turned on by default?  If so, how do I 
turn it off, and what are the implications of leaving it ON?  Won't my searches be 
slower and consume more memory?


Thanks,


--MJ
 


Re: Is leading wildcard search turned on by default in Solr 3.6.1?

2012-11-12 Thread johnmunir
Thanks for the quick response.


So, I do not want to use ReversedWildcardFilterFactory, but leading wildcard 
search is working and thus is ON by default.  How do I disable it, to avoid 
the issues that come with it?


-- MJ



-Original Message-
From: François Schiettecatte 
To: solr-user 
Sent: Mon, Nov 12, 2012 5:39 pm
Subject: Re: Is leading wildcard search turned on by default in Solr 3.6.1?


John

You can still use leading wildcards even if you don't have the 
ReversedWildcardFilterFactory in your analysis, but it means you will be scanning 
the entire dictionary when the search is run, which can be a performance issue. 
If you do use ReversedWildcardFilterFactory you won't have that performance issue, 
but you will increase the overall size of your index.  It's a tradeoff. 

When I looked into it for a site I built, I decided that the tradeoff was not 
worth it (after benchmarking) given how few leading-wildcard searches it was 
getting.

Best regards

François
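For reference, a sketch of what wiring it in looks like; it goes on the index-time chain only, and the attribute values here are illustrative:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- also indexes each term reversed, so leading-wildcard queries can run as prefix scans -->
  <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
          maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
</analyzer>

The query-time chain stays as it is; the query parser detects the filter in the index chain and rewrites leading-wildcard queries against the reversed terms.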


On Nov 12, 2012, at 5:33 PM, johnmu...@aol.com wrote:

> 
> 
> Hi,
> 
> 
> I'm migrating from Solr 1.2 to 3.6.1.  I used the same analyzer as I was, and 
re-indexed my data.  I did not add 
> solr.ReversedWildcardFilterFactory to my index analyzer, but yet leading wild 
cards are working!!  Does this mean it's turned on by default?  If so, how do I 
turn it off, and what are the implication of leaving ON?  Won't my searches be 
slower and consume more memory?
> 
> 
> Thanks,
> 
> 
> --MJ
> 


 



RE: Is leading wildcard search turned on by default in Solr 3.6.1?

2012-11-12 Thread johnmunir

At one point, in some version of Solr, it was OFF by default, and you had to 
enable it via a setting (either in solrconfig.xml or schema.xml, I don't 
remember).  It looks like this is no longer the case.  Even worse, if this is 
true, it no longer seems possible to disable it via a Solr setting!!


-- MJ


-Original Message-
From: François Schiettecatte [mailto:fschietteca...@gmail.com] 
Sent: Monday, November 12, 2012 7:48 PM
To: solr-user@lucene.apache.org
Subject: Re: Is leading wildcard search turned on by default in Solr 3.6.1?


I suspect it is just part of the wildcard handling, maybe someone can chime in 
here, you may need to catch this before it gets to SOLR.


François


On Nov 12, 2012, at 5:44 PM, johnmu...@aol.com wrote:


> Thanks for the quick response.
> 
> 
> So, I do not want to use ReversedWildcardFilterFactory, but leading wildcard 
> is working and thus is ON by default.  How do I disable it to prevent the use 
> of it and the issues that come with it?
> 
> 
> -- MJ
> 
> 
> 
> -Original Message-
> From: François Schiettecatte 
> To: solr-user 
> Sent: Mon, Nov 12, 2012 5:39 pm
> Subject: Re: Is leading wildcard search turned on by default in Solr 3.6.1?
> 
> 
> John
> 
> You can still use leading wildcards even if you dont have the 
> ReversedWildcardFilterFactory in your analysis but it means you will 
> be scanning the entire dictionary when the search is run which can be a 
> performance issue.
> If you do use ReversedWildcardFilterFactory you wont have that 
> performance issue but you will increase the overall size of your index. Its a 
> tradeoff.
> 
> When I looked into it for a site I built I decided that the tradeoff 
> was not worth it (after benchmarking) given how few leading wildcards 
> searches it was getting.
> 
> Best regards
> 
> François
> 
> 
> On Nov 12, 2012, at 5:33 PM, johnmu...@aol.com wrote:
> 
>> 
>> 
>> Hi,
>> 
>> 
>> I'm migrating from Solr 1.2 to 3.6.1.  I used the same analyzer as I 
>> was, and
> re-indexed my data.  I did not add
>> solr.ReversedWildcardFilterFactory to my index analyzer, but yet 
>> leading wild
> cards are working!!  Does this mean it's turned on by default?  If so, 
> how do I turn it off, and what are the implication of leaving ON?  
> Won't my searches be slower and consume more memory?
>> 
>> 
>> Thanks,
>> 
>> 
>> --MJ
>> 
> 
> 
> 
> 




RE: Is leading wildcard search turned on by default in Solr 3.6.1?

2012-11-12 Thread johnmunir

I'm surprised that this has not been logged as a defect.  The fact that this is 
ON by default means someone can bring down a server; this is bad enough to 
categorize as a security issue.
 
--MJ
 
-Original Message-
From: Michael Ryan [mailto:mr...@moreover.com] 
Sent: Monday, November 12, 2012 8:10 PM
To: solr-user@lucene.apache.org
Subject: RE: Is leading wildcard search turned on by default in Solr 3.6.1?
 
Yeah, the situation is kind of a pain right now. In 
https://issues.apache.org/jira/browse/SOLR-2438, it was enabled by default and 
there is no way to disable it without patching SolrQueryParser. There's also the 
edismax parser which doesn't have a setting for this, which I've made a jira for 
at https://issues.apache.org/jira/browse/SOLR-3031.
 
I'm surprised other people haven't requested this, as any instance of serious 
size can be brought to its knees by a wildcard query.
 
-Michael
 
-Original Message-
From: johnmu...@aol.com [mailto:johnmu...@aol.com] 
Sent: Monday, November 12, 2012 7:58 PM
To: solr-user@lucene.apache.org
Subject: RE: Is leading wildcard search turned on by default in Solr 3.6.1?
 
 
At one point, in some version of Solr, it was OFF by default, and you had to 
enable it via a setting (either in solrconfig.xml or schema.xml, I don't 
remember).  It looks like this is no longer the case.  Even worse, if this is 
true, it no longer seems possible to disable it via a Solr setting!!
 
 
-- MJ
 


Using CJK analyzer

2012-11-13 Thread johnmunir
Hi,


Using Solr 1.2.0, the following works (and I get hits searching on Chinese 
text):




  






  




and it won't work.
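For reference, a field type along the lines of what the analysis output below points at (the analyzer class name is taken from that output; the field type name is made up):

<fieldType name="text_cjk" class="solr.TextField">
  <!-- a single analyzer used for both index and query; emits one token per CJK character -->
  <analyzer class="org.apache.lucene.analysis.cn.ChineseAnalyzer"/>
</fieldType>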


I run it through the analyzer and I see this (I hope the table will show up 
fine on the mailing list):


Index Analyzer: org.apache.lucene.analysis.cn.ChineseAnalyzer {}

position | term text | startOffset | endOffset
1  | 去 | 0  | 1
2  | 除 | 1  | 2
3  | 商 | 2  | 3
4  | 品 | 3  | 4
5  | 操 | 4  | 5
6  | 作 | 5  | 6
7  | 在 | 6  | 7
8  | 订 | 7  | 8
9  | 购 | 8  | 9
10 | 单 | 9  | 10
11 | 中 | 10 | 11
12 | 留 | 11 | 12
13 | 下 | 12 | 13
14 | 空 | 13 | 14
15 | 白 | 14 | 15
16 | 行 | 15 | 16

Query Analyzer: org.apache.lucene.analysis.cn.ChineseAnalyzer {}

(identical output: the same sixteen single-character tokens with the same positions and offsets)






--MJ


Tokenization and wild card search

2010-01-18 Thread johnmunir

Hi,
 
I have an issue and I'm not sure how to address it, so I hope someone can help 
me.
 
I have the following text in one of my fields: "ABC_Expedition_ERROR".   When I 
search on it like: "MyField:SDD_Expedition_PCB" (without quotes) it will fail 
to find me only this word “ABC_Expedition_ERROR” which I think is due to 
tokenization because of the underscore.
 
My solution is: "MyField:"SDD_Expedition_PCB"" (without the outer quotes, but 
quotes around the word “ABC_Expedition_ERROR”).  This works fine.  But then, 
how do I search on "SDD_Expedition_PCB" with wild card?  For example: 
"MyField:SDD_Expedition*" will not work.
 
Any help is greatly appreciated.
 
Thanks.
 
-- JM
 


RE: Tokenization and wild card search

2010-01-19 Thread johnmunir


I want the following searches to work:
 
  MyField:SDD_Expedition_PCB
 
This should match the word "SDD_Expedition_PCB" only, and not match 
individual words such as "SDD", "Expedition", or "PCB".

And the following search:
 
  MyField:SDD_Expedition*
 
Should match any word starting with "SDD_Expedition" and ending with anything 
else, such as "SDD_Expedition_PBC", "SDD_Expedition_One", "SDD_Expedition_Two", 
"SDD_ExpeditionSolr", "SDD_ExpeditionSolr1.4", etc., but not match individual 
words such as "SDD" or "Expedition".
 

The field type for "MyField" is (the field name is keywords):
 

 
And here is the analyzer I'm using:
 

  







  
  







  

 
Any help on how I can achieve the above is greatly appreciated.
 
Btw, if at all possible, I would like to be able to achieve this search without 
having to change how I'm indexing / tokenizing the data.  I'm looking for 
search syntax to make this work.
 
-- JM
 
-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Tuesday, January 19, 2010 7:57 AM
To: solr-user@lucene.apache.org
Subject: Re: Tokenization and wild card search
 
> I have an issue and I'm not sure how to address it, so I
> hope someone can help me.
>  
> I have the following text in one of my fields:
> "ABC_Expedition_ERROR".���When I search on it
> like: "MyField:SDD_Expedition_PCB" (without quotes) it will
> fail to find me only this word �ABC_Expedition_ERROR�
> which I think is due to tokenization because of the
> underscore.
 
Do you want or do not want your query MyField:SDD_Expedition_PCB to return 
documents containing ABC_Expedition_ERROR?
 
> My solution is: "MyField:"SDD_Expedition_PCB"" (without the
> outer quotes, but quotes around the word
> "ABC_Expedition_ERROR").  This works fine.
> But then, how do I search on "SDD_Expedition_PCB" with wild
> card?  For example: "MyField:SDD_Expedition*" will not
> work.
 
Can you paste your field type of MyField? And give some examples what queries 
should return what documents.



Re: Tokenization and wild card search

2010-01-19 Thread johnmunir


You are correct, the way I'm using tokenization is my issue.  It's too late to 
re-index now, which is why I'm looking for a search syntax that will make the 
search work.
 
I have tried various search syntax with no luck.  Is there no search syntax to 
make this work without re-indexing?!
 
-- JM


-Original Message-
From: Erick Erickson 
To: solr-user@lucene.apache.org
Sent: Tue, Jan 19, 2010 10:30 am
Subject: Re: Tokenization and wild card search


I'm pretty sure you're going to be disappointed about the re-indexing part.

I'm pretty sure that WordDelimiterFilterFactory is tokenizing your input in 
ways you don't expect, making your use-case hard to accomplish. It's basically 
splitting your input on all non-alpha characters, so you're indexing the 
individual parts; see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

I'd *strongly* suggest you examine the results of your indexing in order to 
understand what's possible. Get a copy of Luke and examine your index, or use 
the SOLR admin Analysis page...

I suspect what you're really looking for is WhitespaceAnalyzer or Keyword.
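A sketch of the Keyword route (type name made up; it means re-indexing into such a field): keep the whole value as a single token so both the exact query and the prefix wildcard behave as described above.

<fieldType name="keyword_exact" class="solr.TextField">
  <analyzer>
    <!-- the entire field value becomes one token: "SDD_Expedition_PCB" stays intact -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

A plain string field (solr.StrField) behaves the same way. With either, MyField:SDD_Expedition_PCB matches only that exact token and MyField:SDD_Expedition* works as a prefix query, but that is a schema change plus a re-index, not a query-syntax trick.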
On Tue, Jan 19, 2010 at 9:50 AM,  wrote:
>

 I want the following searches to work:

  MyField:SDD_Expedition_PCB

 This should match the word "SDD_Expedition_PCB" only, and not matching
 individual words such as "SDD" or "Expedition", or "PCB".

 And the following search:

  MyField:SDD_Expedition*

 Should match any word starting with "SDD_Expedition" and ending with
 anything else such as "SDD_Expedition_PBC", "SDD_Expedition_One",
 "SDD_Expedition_Two", "SDD_ExpeditionSolr", "SDD_ExpeditionSolr1.4", etc,
 but not matching individual words such as "SDD" or "Expedition".


 The field type for "MyField" is (the field name is keywords):



 And here is the analyzer I'm using:


  







  
  







  


 Any help on how I can achieve the above is greatly appreciated.

 Btw, if at all possible, I would like to be able to achieve this search
 without having to change how I'm indexing / tokenizing the data.  I'm
 looking for search syntax to make this work.

 -- JM

 -Original Message-
 From: Ahmet Arslan [mailto:iori...@yahoo.com]
 Sent: Tuesday, January 19, 2010 7:57 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Tokenization and wild card search

 > I have an issue and I'm not sure how to address it, so I
 > hope someone can help me.
 >
 > I have the following text in one of my fields:
 > "ABC_Expedition_ERROR".���When I search on it
 > like: "MyField:SDD_Expedition_PCB" (without quotes) it will
 > fail to find me only this word �ABC_Expedition_ERROR�
 > which I think is due to tokenization because of the
 > underscore.

 Do you want or do not want your query MyField:SDD_Expedition_PCB to return
 documents containing ABC_Expedition_ERROR?

 > My solution is: "MyField:"SDD_Expedition_PCB"" (without the
 > outer quotes, but quotes around the word
 > "ABC_Expedition_ERROR").  This works fine.
 > But then, how do I search on "SDD_Expedition_PCB" with wild
 > card?  For example: "MyField:SDD_Expedition*" will not
 > work.

 Can you paste your field type of MyField? And give some examples what
 queries should return what documents.





what's up with: java -Ddata=args -jar post.jar "<optimize/>"

2008-03-19 Thread johnmunir

Hi,



I'm a new Solr user. I figured my way around Solr just fine (I think) ... I can 
index and search etc. And so far I have indexed over 300k documents.



What I can't figure out is the following. I'm using:



    java -Ddata=args -jar post.jar "<optimize/>"


to post an optimize command. What I'm finding is that I have to do it twice in 
order for the files to be "optimized" ... i.e.: the first post takes 3-4 
minutes but leaves the file count as is at 44 ... the second post takes 2-3 
seconds but shrinks the file count from 44 to 8.
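For reference, the argument being posted is just an XML update command; a couple of variants (attribute values shown are illustrative, not a recommendation):

<optimize/>
<optimize waitFlush="true" waitSearcher="true"/>
<commit/>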


So my question is the following, is this the expected behavior or am I doing 
something wrong? Do I need two optimize posts to really optimize my index?!


Thanks in advance


-JM