Re: Dismax + Dynamic fields

2008-06-17 Thread Norberto Meijome
On Mon, 16 Jun 2008 14:22:12 -0400
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:

> There are two levels of dynamic field support.
> 
> Specific dynamic fields can be queried with dismax, but you can't
> wildcard the "qf" or other field parameters.


Thanks Yonik. ok, that matches what I've seen - if i know the actual name of 
the field I'm after, I can use it in a query it, but i can't use the 
dynamic_field_name_* (with wildcard) in the config.

Is adding support for this something that is desirable / needed (doable??) , 
and is it being worked on ?

thanks,
B
_
{Beto|Norberto|Numard} Meijome

"First they ignore you, then they laugh at you, then they fight you, then you 
win."
  Mahatma Gandhi.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


easiest roadmap for server deployment

2008-06-17 Thread Bram de Jong
Hello All,


It looks like all my tests with solr have been very conclusive: it's the way
to go.
Sadly enough, me nor our sysadmin have any experience with setting up
tomcat, jetty, orion, .
We have plenty of experience with other servers like lighttpd and apache,
but that doesn't particularly help.

What would be the easiest roadmap to set up Solr in our live environment and
would that easy roadmap (whatever it is) be good enough for us (given the
data below)?

Tech data:
  There are 60K documents (and growing slowly at << 100/day) and about
20K-30K searches per day (growing faster than the #documents, but not that
fast either).
  Solr will have to share a (quad xeon, 12GB of RAM, SAS disks) with
Postgresql.
  In all my tests (replaying stored searches) I had 0 cache misses and
between 0.7 and 0.99 hit rate for all 3 caches.
  I will use plenty of faceting to create various tag clouds in various
places.


 - Bram

-- 
http://www.freesound.org
http://www.smartelectronix.com
http://www.musicdsp.org


Talk on Solr - Oakland, CA June 18, 2008

2008-06-17 Thread Tom Hill - Solr

Hi -

I'll be giving a talk on Solr at the East Bay Innovations Group (eBig) Java
SIG on Wed, June 18.

http://www.ebig.org/index.cfm?fuseaction=Calendar.eventDetail&eventID=16

This is an introductory / overview talk intended to get you from "What is
Solr & Why Would I Use It" to "Cool, now I know enough go home and start
playing with Solr".

Tom

-- 
View this message in context: 
http://www.nabble.com/Talk-on-Solr---Oakland%2C-CA-June-18%2C-2008-tp17880636p17880636.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: multicore vs. multiple instances

2008-06-17 Thread Nico Heid
I think I'm about going the same way too.
The thing is, we have few 100.000 users who are allowed to upload data which
will be indexed.
We're thinking of partitioning/sharding the index by user(groups).

In the beginning, I see no need for one server per shard. So we'll probably
put a certain ammount of shards on one server. But as time goes by we'll
reach a point where we have to migrate shards to new machine as load or
shard size grows.

So I see this as a scalabilty bonus.
I still have to look into the servlet container / JNDI "stuff".
I hope I didn't get a wrong idea, I've been on this topic for a short time
only.

Nico


Otis wrote:

>Short-circuit attempt.  Why put 3 shards on a single server in the first
place?  If you are
>working with large index and need to break it into smaller shards, break it
in shards where
>each shard fully utilizes the server it is on.




Updating index

2008-06-17 Thread Mihails Agafonovs
Hi!

Updating index with post.jar just replaces the index with the defined
xml's. But if there are, for example, two fields in all xml's that
were changed, is there a way to update only these fields (incremental
update)? If there are a lot of large xml's, it would be performance
slowdown each time rewriting the index, and also an unreal job to
change the fields manually.
 Ar cieņu, Mihails

Re: Dismax + Dynamic fields

2008-06-17 Thread Yonik Seeley
On Tue, Jun 17, 2008 at 3:36 AM, Norberto Meijome <[EMAIL PROTECTED]> wrote:
> On Mon, 16 Jun 2008 14:22:12 -0400
> "Yonik Seeley" <[EMAIL PROTECTED]> wrote:
>
>> There are two levels of dynamic field support.
>>
>> Specific dynamic fields can be queried with dismax, but you can't
>> wildcard the "qf" or other field parameters.
>
> Thanks Yonik. ok, that matches what I've seen - if i know the actual name of 
> the field I'm after, I can use it in a query it, but i can't use the 
> dynamic_field_name_* (with wildcard) in the config.
>
> Is adding support for this something that is desirable / needed (doable??) , 
> and is it being worked on ?

It does make sense in certain scenarios, but I don't think anyone is
working on it.

-Yonik


Re: Dismax + Dynamic fields

2008-06-17 Thread Daniel Papasian
Norberto Meijome wrote:
> Thanks Yonik. ok, that matches what I've seen - if i know the actual
> name of the field I'm after, I can use it in a query it, but i can't
> use the dynamic_field_name_* (with wildcard) in the config.
> 
> Is adding support for this something that is desirable / needed
> (doable??) , and is it being worked on ?

You can use a wildcard with copyFrom to copy the dynamic fields that
match the pattern to another field that you can then query on. It seems
like that would cover your needs, no?

Daniel


Re: Updating index

2008-06-17 Thread Otis Gospodnetic
Mihails,

Update is done as delete + re-add.  You may also want to look at SOLR-139 in 
Solr JIRA.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Mihails Agafonovs <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, June 17, 2008 6:25:26 AM
> Subject: Updating index
> 
> Hi!
> 
> Updating index with post.jar just replaces the index with the defined
> xml's. But if there are, for example, two fields in all xml's that
> were changed, is there a way to update only these fields (incremental
> update)? If there are a lot of large xml's, it would be performance
> slowdown each time rewriting the index, and also an unreal job to
> change the fields manually.
> Ar cieņu, Mihails



Re: Dismax + Dynamic fields

2008-06-17 Thread Chris Hostetter
: > Is adding support for this something that is desirable / needed
: > (doable??) , and is it being worked on ?
: 
: You can use a wildcard with copyFrom to copy the dynamic fields that
: match the pattern to another field that you can then query on. It seems
: like that would cover your needs, no?

bingo.  even if a wildcard like syntax was allowed in the qf/pf of dismax 
since the same boost would be applied to each field the results would be 
roughly the same as if you used copyField -- and searching a single field 
will be faster then searching N other fields)

(i say "roughly" the same because the tf/idf of the individual fields 
would be different then a single consolidated field, so there would be 
variations in the score ... but the basic results should be the same.  if 
you really care about tuning the score, you'd wnat to assign seperate 
boosts per field name anyway, and then you're right back to not 
needing/wanting to use a glob syntax)





-Hoss



Re: Re[2]: How to limit number of pages per domain

2008-06-17 Thread Otis Gospodnetic
That looks like the correct way to apply the patch.  I tried it and it worked 
for me.

Otis
P.S.
no need to use reply-all, I'm subscribed to solr-user :)
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: JLIST <[EMAIL PROTECTED]>
> To: Otis Gospodnetic <[EMAIL PROTECTED]>
> Cc: solr-user@lucene.apache.org
> Sent: Tuesday, June 17, 2008 1:16:04 AM
> Subject: Re[2]: How to limit number of pages per domain
> 
> Hello Otis,
> 
> https://issues.apache.org/jira/browse/SOLR-236 has links for
> a lot of files. I figure this is what I need:
> 10. solr-236.patch (24 kb)
> 
> So I downloaded the patch file, and also downloaded 2008/06/16
> nightly build, then I ran this, and got an error:
> 
> $ patch -p0 -i solr-236.patch --dry-run
> patching file `src/test/org/apache/solr/search/TestDocSet.java'
> patching file `src/java/org/apache/solr/search/CollapseFilter.java'
> patching file `src/java/org/apache/solr/search/NegatedDocSet.java'
> patching file `src/java/org/apache/solr/search/SolrIndexSearcher.java'
> patching file `src/java/org/apache/solr/common/params/CollapseParams.java'
> patching file 
> `src/java/org/apache/solr/handler/component/CollapseComponent.java'
> patch:  malformed patch at line 680:
> 
> Am I doing it wrong, or missing some other steps?
> 
> Thanks,
> Jack
> 
> > I don't know yet, so I asked directly in that JIRA issue :)
> 
> > Applying patches is done something like this:
> > 
> > Ah, just added it to the Solr FAQ on the Wiki for everyone:
> 
> > http://wiki.apache.org/solr/FAQ#head-bd01dc2c65240a36e7c0ee78eaef88912a0e4030
> 
> > Can you provide feedback about this particular patch once you try
> > it?  I'd like to get it on Solr 1.3, actually, so any feedback would
> > help.
> 
> > Thanks,
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> > - Original Message 
> >> From: Jack 
> >> To: solr-user@lucene.apache.org
> >> Sent: Thursday, May 22, 2008 12:35:28 PM
> >> Subject: Re: How to limit number of pages per domain
> >> 
> >> I think I'll give it a try. I haven't done this before. Are there any
> >> instructions regarding how to apply the patch? I see 9 files, some
> >> displayed in gray links, some in blue links; some named as .diff, some
> >> .patch; one has 1.3 in file name, one has 1.3, I suppose the other
> >> files are for both versions. Should I apply all of them?
> >> https://issues.apache.org/jira/browse/SOLR-236
> >> 
> >> > Actually, the best documentation are really the comments in the JIRA 
> >> > issue
> >> itself.
> >> > Is there anyone actually using Solr with this patch?



RE: Search query optimization

2008-06-17 Thread Yongjun Rong
Hi,
  Thanks for your reply. I did some test on my test machine. 
http://stage.boomi.com:8080/solr/select/?q=account:1&rows=1000. It will
return resultset 384 in 3ms. If I add a new AND condition as below:
http://stage.boomi.com:8080/solr/select/?q=account:1+AND+recordeddate_dt
:[NOW/DAYS-7DAYS+TO+NOW]&rows=1000. It will take 18236 to return 21
resultset. If I only use the recordedate_dt condition like
http://stage.boomi.com:8080/solr/select/?q=recordeddate_dt:[NOW/DAYS-7DA
YS+TO+NOW]&rows=1000. It takes 20271 ms to get 412800 results. All the
above URL are live, you test it.

Can anyone give me some explaination why this happens if we have the
query optimization? Thank you very much.
Yongjun Rong
 

-Original Message-
From: Walter Underwood [mailto:[EMAIL PROTECTED] 
Sent: Thursday, May 29, 2008 4:57 PM
To: solr-user@lucene.apache.org
Subject: Re: Search query optimization

The people working on Lucene are pretty smart, and this sort of query
optimization is a well-known trick, so I would not worry about it.

A dozen years ago at Infoseek, we checked the count of matches for each
term in an AND, and evaluated the smallest one first.
If any of them had zero matches, we didn't evaluate any of them.

I expect that Doug Cutting and the other Lucene folk know those same
tricks.

wunder

On 5/29/08 1:50 PM, "Yongjun Rong" <[EMAIL PROTECTED]> wrote:

> Hi Yonik,
>   Thanks for your quick reply. I'm very new to the lucene source code.
> Can you give me a little more detail explaination about this.
> Do you think it will save some memory if docnum = find_match("A") > 
> docnum = find_match("B") and put B in the front of the AND query like 
> "B AND A AND C"? How about sorting (sort=A,B,C&q=A AND B AND C)? Do 
> you think the order of conditions (A,B,C) in a query will affect the 
> performance of the query?
>   Thank you very much.
>   Yongjun
> 
> 
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik 
> Seeley
> Sent: Thursday, May 29, 2008 4:12 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Search query optimization
> 
> On Thu, May 29, 2008 at 4:05 PM, Yongjun Rong <[EMAIL PROTECTED]>
> wrote:
>>  I have a question about how the lucene query parser. For example, I 
>> have query "A AND B AND C". Will lucene extract all documents satisfy

>> condition A in memory and then filter it with condition B and C?
> 
> No, Lucene will try and optimize this the best it can.
> 
> It roughly goes like this..
> docnum = find_match("A")
> docnum = find_first_match_after(docnum, "B") docnum =
> find_first_match_after(docnum,"C")
> etc...
> until the same docnum is returned for "A","B", and "C".
> 
> See ConjunctionScorer for the gritty details.
> 
> -Yonik
> 
> 
> 
>> or only
>> the documents satisfying "A AND B AND C" will be put into memory? Is 
>> there any articles discuss about how to build a optimization query to

>> save memory and improve performance?
>>  Thank you very much.
>>  Yongjun Rong
>> 



Re[4]: How to limit number of pages per domain

2008-06-17 Thread JLIST
Hmm... I tried it with a Windows native port of patch, cygwin patch
and also on Linux and got the same error.

Is this, by any chance, going to be in solr 1.3 soon?

Thanks,
Jack

> That looks like the correct way to apply the patch.  I tried it and it worked 
> for me.

> Otis

> - Original Message 
>> https://issues.apache.org/jira/browse/SOLR-236 has links for
>> a lot of files. I figure this is what I need:
>> 10. solr-236.patch (24 kb)
>> 
>> So I downloaded the patch file, and also downloaded 2008/06/16
>> nightly build, then I ran this, and got an error:
>> 
>> $ patch -p0 -i solr-236.patch --dry-run
>> patching file `src/test/org/apache/solr/search/TestDocSet.java'
>> patching file `src/java/org/apache/solr/search/CollapseFilter.java'
>> patching file `src/java/org/apache/solr/search/NegatedDocSet.java'
>> patching file
>> `src/java/org/apache/solr/search/SolrIndexSearcher.java'
>> patching file
>> `src/java/org/apache/solr/common/params/CollapseParams.java'
>> patching file 
>> `src/java/org/apache/solr/handler/component/CollapseComponent.java'
>> patch:  malformed patch at line 680:
>> 
>> Am I doing it wrong, or missing some other steps?




Re: Search query optimization

2008-06-17 Thread Otis Gospodnetic
Hi,

Probably because the [NOW/DAYS-7DAYS+TO+NOW] part gets rewritten as lots of OR 
clauses.  I think that you'll see that if you add &debugQuery=true to the URL.  
Make sure your recorded_date_dt is not too granular (e.g. if you don't need 
minutes, round the values to hours. If you don't need hours, round the values 
to days).


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Yongjun Rong <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, June 17, 2008 11:56:06 AM
> Subject: RE: Search query optimization
> 
> Hi,
>   Thanks for your reply. I did some test on my test machine. 
> http://stage.boomi.com:8080/solr/select/?q=account:1&rows=1000. It will
> return resultset 384 in 3ms. If I add a new AND condition as below:
> http://stage.boomi.com:8080/solr/select/?q=account:1+AND+recordeddate_dt
> :[NOW/DAYS-7DAYS+TO+NOW]&rows=1000. It will take 18236 to return 21
> resultset. If I only use the recordedate_dt condition like
> http://stage.boomi.com:8080/solr/select/?q=recordeddate_dt:[NOW/DAYS-7DA
> YS+TO+NOW]&rows=1000. It takes 20271 ms to get 412800 results. All the
> above URL are live, you test it.
> 
> Can anyone give me some explaination why this happens if we have the
> query optimization? Thank you very much.
> Yongjun Rong
> 
> 
> -Original Message-
> From: Walter Underwood [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, May 29, 2008 4:57 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Search query optimization
> 
> The people working on Lucene are pretty smart, and this sort of query
> optimization is a well-known trick, so I would not worry about it.
> 
> A dozen years ago at Infoseek, we checked the count of matches for each
> term in an AND, and evaluated the smallest one first.
> If any of them had zero matches, we didn't evaluate any of them.
> 
> I expect that Doug Cutting and the other Lucene folk know those same
> tricks.
> 
> wunder
> 
> On 5/29/08 1:50 PM, "Yongjun Rong" wrote:
> 
> > Hi Yonik,
> >   Thanks for your quick reply. I'm very new to the lucene source code.
> > Can you give me a little more detail explaination about this.
> > Do you think it will save some memory if docnum = find_match("A") > 
> > docnum = find_match("B") and put B in the front of the AND query like 
> > "B AND A AND C"? How about sorting (sort=A,B,C&q=A AND B AND C)? Do 
> > you think the order of conditions (A,B,C) in a query will affect the 
> > performance of the query?
> >   Thank you very much.
> >   Yongjun
> >
> > 
> > -Original Message-
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik 
> > Seeley
> > Sent: Thursday, May 29, 2008 4:12 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Search query optimization
> > 
> > On Thu, May 29, 2008 at 4:05 PM, Yongjun Rong 
> > wrote:
> >>  I have a question about how the lucene query parser. For example, I 
> >> have query "A AND B AND C". Will lucene extract all documents satisfy
> 
> >> condition A in memory and then filter it with condition B and C?
> > 
> > No, Lucene will try and optimize this the best it can.
> > 
> > It roughly goes like this..
> > docnum = find_match("A")
> > docnum = find_first_match_after(docnum, "B") docnum =
> > find_first_match_after(docnum,"C")
> > etc...
> > until the same docnum is returned for "A","B", and "C".
> > 
> > See ConjunctionScorer for the gritty details.
> > 
> > -Yonik
> > 
> > 
> > 
> >> or only
> >> the documents satisfying "A AND B AND C" will be put into memory? Is 
> >> there any articles discuss about how to build a optimization query to
> 
> >> save memory and improve performance?
> >>  Thank you very much.
> >>  Yongjun Rong
> >> 



RE: Search query optimization

2008-06-17 Thread Yongjun Rong
Thanks for reply. Here is the debugQuery output:

−

account:1 AND recordeddate_dt:[NOW/DAYS-1DAYS TO NOW]

−

account:1 AND recordeddate_dt:[NOW/DAYS-1DAYS TO NOW]

−

+account:1 +recordeddate_dt:[2008-06-16T00:00:00.000Z TO 
2008-06-17T17:07:57.420Z]

−

+account:1 +recordeddate_dt:[2008-06-16T00:00:00.000 TO 2008-06-17T17:07:57.420]

−

−


10.88071 = (MATCH) sum of:
  10.788804 = (MATCH) weight(account:1 in 6515410), product of:
0.9957678 = queryWeight(account:1), product of:
  10.834659 = idf(docFreq=348, numDocs=6515640)
  0.09190578 = queryNorm
10.834659 = (MATCH) fieldWeight(account:1 in 6515410), product of:
  1.0 = tf(termFreq(account:1)=1)
  10.834659 = idf(docFreq=348, numDocs=6515640)
  1.0 = fieldNorm(field=account, doc=6515410)
  0.09190578 = (MATCH) 
ConstantScoreQuery(recordeddate_dt:[2008-06-16T00:00:00.000-2008-06-17T17:07:57.420]),
 product of:
1.0 = boost
0.09190578 = queryNorm


 

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 17, 2008 12:43 PM
To: solr-user@lucene.apache.org
Subject: Re: Search query optimization

Hi,

Probably because the [NOW/DAYS-7DAYS+TO+NOW] part gets rewritten as lots of OR 
clauses.  I think that you'll see that if you add &debugQuery=true to the URL.  
Make sure your recorded_date_dt is not too granular (e.g. if you don't need 
minutes, round the values to hours. If you don't need hours, round the values 
to days).


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Yongjun Rong <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, June 17, 2008 11:56:06 AM
> Subject: RE: Search query optimization
> 
> Hi,
>   Thanks for your reply. I did some test on my test machine. 
> http://stage.boomi.com:8080/solr/select/?q=account:1&rows=1000. It 
> will return resultset 384 in 3ms. If I add a new AND condition as below:
> http://stage.boomi.com:8080/solr/select/?q=account:1+AND+recordeddate_
> dt :[NOW/DAYS-7DAYS+TO+NOW]&rows=1000. It will take 18236 to return 21 
> resultset. If I only use the recordedate_dt condition like 
> http://stage.boomi.com:8080/solr/select/?q=recordeddate_dt:[NOW/DAYS-7
> DA
> YS+TO+NOW]&rows=1000. It takes 20271 ms to get 412800 results. All the
> above URL are live, you test it.
> 
> Can anyone give me some explaination why this happens if we have the 
> query optimization? Thank you very much.
> Yongjun Rong
> 
> 
> -Original Message-
> From: Walter Underwood [mailto:[EMAIL PROTECTED]
> Sent: Thursday, May 29, 2008 4:57 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Search query optimization
> 
> The people working on Lucene are pretty smart, and this sort of query 
> optimization is a well-known trick, so I would not worry about it.
> 
> A dozen years ago at Infoseek, we checked the count of matches for 
> each term in an AND, and evaluated the smallest one first.
> If any of them had zero matches, we didn't evaluate any of them.
> 
> I expect that Doug Cutting and the other Lucene folk know those same 
> tricks.
> 
> wunder
> 
> On 5/29/08 1:50 PM, "Yongjun Rong" wrote:
> 
> > Hi Yonik,
> >   Thanks for your quick reply. I'm very new to the lucene source code.
> > Can you give me a little more detail explaination about this.
> > Do you think it will save some memory if docnum = find_match("A") > 
> > docnum = find_match("B") and put B in the front of the AND query 
> > like "B AND A AND C"? How about sorting (sort=A,B,C&q=A AND B AND 
> > C)? Do you think the order of conditions (A,B,C) in a query will 
> > affect the performance of the query?
> >   Thank you very much.
> >   Yongjun
> >
> > 
> > -Original Message-
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of 
> > Yonik Seeley
> > Sent: Thursday, May 29, 2008 4:12 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Search query optimization
> > 
> > On Thu, May 29, 2008 at 4:05 PM, Yongjun Rong
> > wrote:
> >>  I have a question about how the lucene query parser. For example, 
> >> I have query "A AND B AND C". Will lucene extract all documents 
> >> satisfy
> 
> >> condition A in memory and then filter it with condition B and C?
> > 
> > No, Lucene will try and optimize this the best it can.
> > 
> > It roughly goes like this..
> > docnum = find_match("A")
> > docnum = find_first_match_after(docnum, "B") docnum =
> > find_first_match_after(docnum,"C")
> > etc...
> > until the same docnum is returned for "A","B", and "C".
> > 
> > See ConjunctionScorer for the gritty details.
> > 
> > -Yonik
> > 
> > 
> > 
> >> or only
> >> the documents satisfying "A AND B AND C" will be put into memory? 
> >> Is there any articles discuss about how to build a optimization 
> >> query to
> 
> >> save memory and improve performance?
> >>  Thank you very much.
> >>  Yongjun Rong
> >> 



Re: Faceting on date fields

2008-06-17 Thread Chris Hostetter

: If I do a search that returns 1 result with a created date of
: "1993-01-01T00:00:00.000Z", I get this:
: 
: 
:...
:   1 
:   1

This is because the "range queries" used in date faceting are "inclusive" 
of both bounding dates ... this was the simplest solution that we came up 
with at the time.  There was disucssion at one time about adding 
additional options to control which bounds were inclusive and which were 
exclusive but I don't think anyone ever proposed anything concrete (or 
opend a Jira issue)

: Am I better off storing the year separately in an integer field and
: faceting on that?

you mean using facet.field?  that will certianly work.  

the other worarround (which is admitedly really hackish) is to 
add/subtract a few milliseconds, either to the dates of docs when you 
index them, or to your facet.date.start param (in your case: either add 
amilli to the docs, or subtract a milli from the param)





-Hoss



Re: Search query optimization

2008-06-17 Thread Otis Gospodnetic
Hi,

This is what I was talking about:

recordeddate_dt:[2008-06-16T00:00:00.000Z TO 2008-06-17T17:07:57.420Z]

Note that the granularity of this date field is down to milliseconds.  You 
should change that to be more coarse if you don't need such precision (e.g. no 
milliseconds, no seconds, no minutes, no hours...)


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Yongjun Rong <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, June 17, 2008 1:09:19 PM
> Subject: RE: Search query optimization
> 
> Thanks for reply. Here is the debugQuery output:
> 
> −
> 
> account:1 AND recordeddate_dt:[NOW/DAYS-1DAYS TO NOW]
> 
> −
> 
> account:1 AND recordeddate_dt:[NOW/DAYS-1DAYS TO NOW]
> 
> −
> 
> +account:1 +recordeddate_dt:[2008-06-16T00:00:00.000Z TO 
> 2008-06-17T17:07:57.420Z]
> 
> −
> 
> +account:1 +recordeddate_dt:[2008-06-16T00:00:00.000 TO 
> 2008-06-17T17:07:57.420]
> 
> −
> 
> −
> 
> name="id=e03dbd92-3d41-4693-8b69-ac9a0d332446-atom-d52484f5-7aa8-40b3-ad6f-ba3a9071999e,internal_docid=6515410">
> 
> 10.88071 = (MATCH) sum of:
>   10.788804 = (MATCH) weight(account:1 in 6515410), product of:
> 0.9957678 = queryWeight(account:1), product of:
>   10.834659 = idf(docFreq=348, numDocs=6515640)
>   0.09190578 = queryNorm
> 10.834659 = (MATCH) fieldWeight(account:1 in 6515410), product of:
>   1.0 = tf(termFreq(account:1)=1)
>   10.834659 = idf(docFreq=348, numDocs=6515640)
>   1.0 = fieldNorm(field=account, doc=6515410)
>   0.09190578 = (MATCH) 
> ConstantScoreQuery(recordeddate_dt:[2008-06-16T00:00:00.000-2008-06-17T17:07:57.420]),
>  
> product of:
> 1.0 = boost
> 0.09190578 = queryNorm
> 
> 
>  
> 
> -Original Message-
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, June 17, 2008 12:43 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Search query optimization
> 
> Hi,
> 
> Probably because the [NOW/DAYS-7DAYS+TO+NOW] part gets rewritten as lots of 
> OR 
> clauses.  I think that you'll see that if you add &debugQuery=true to the 
> URL.  
> Make sure your recorded_date_dt is not too granular (e.g. if you don't need 
> minutes, round the values to hours. If you don't need hours, round the values 
> to 
> days).
> 
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> - Original Message 
> > From: Yongjun Rong 
> > To: solr-user@lucene.apache.org
> > Sent: Tuesday, June 17, 2008 11:56:06 AM
> > Subject: RE: Search query optimization
> > 
> > Hi,
> >   Thanks for your reply. I did some test on my test machine. 
> > http://stage.boomi.com:8080/solr/select/?q=account:1&rows=1000. It 
> > will return resultset 384 in 3ms. If I add a new AND condition as below:
> > http://stage.boomi.com:8080/solr/select/?q=account:1+AND+recordeddate_ 
> > dt :[NOW/DAYS-7DAYS+TO+NOW]&rows=1000. It will take 18236 to return 21 
> > resultset. If I only use the recordedate_dt condition like 
> > http://stage.boomi.com:8080/solr/select/?q=recordeddate_dt:[NOW/DAYS-7
> > DA
> > YS+TO+NOW]&rows=1000. It takes 20271 ms to get 412800 results. All the
> > above URL are live, you test it.
> > 
> > Can anyone give me some explaination why this happens if we have the 
> > query optimization? Thank you very much.
> > Yongjun Rong
> > 
> > 
> > -Original Message-
> > From: Walter Underwood [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, May 29, 2008 4:57 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Search query optimization
> > 
> > The people working on Lucene are pretty smart, and this sort of query 
> > optimization is a well-known trick, so I would not worry about it.
> > 
> > A dozen years ago at Infoseek, we checked the count of matches for 
> > each term in an AND, and evaluated the smallest one first.
> > If any of them had zero matches, we didn't evaluate any of them.
> > 
> > I expect that Doug Cutting and the other Lucene folk know those same 
> > tricks.
> > 
> > wunder
> > 
> > On 5/29/08 1:50 PM, "Yongjun Rong" wrote:
> > 
> > > Hi Yonik,
> > >   Thanks for your quick reply. I'm very new to the lucene source code.
> > > Can you give me a little more detail explaination about this.
> > > Do you think it will save some memory if docnum = find_match("A") > 
> > > docnum = find_match("B") and put B in the front of the AND query 
> > > like "B AND A AND C"? How about sorting (sort=A,B,C&q=A AND B AND 
> > > C)? Do you think the order of conditions (A,B,C) in a query will 
> > > affect the performance of the query?
> > >   Thank you very much.
> > >   Yongjun
> > >
> > > 
> > > -Original Message-
> > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of 
> > > Yonik Seeley
> > > Sent: Thursday, May 29, 2008 4:12 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Search query optimization
> > > 
> > > On Thu, May 29, 2008 at 4:05 PM, Yongjun Rong
> > > wrote:
> > >>  I have a question abo

Re: Re[2]: "null" in admin page

2008-06-17 Thread Chris Hostetter
: Steps as follow:
: 1. I download the solr example app.
: 2. Unpack it.
: 3. cd 
: 4. java -jar start.jar
: 5. Try do use one of the links in admin webapp
: 6. Get core=null

I don't understand what you mean by "Get core=null" ... i understand that 
on the stats.jsp page the word "null" appears in between sections as 
initially discussed in this thread, but that's just a single piece of 
(admitedly confusing) information on one page.  in your previous email 
you said...

: > : It surely comes on the example, as I got this problem all times I get the
: > : example, and I have to remove the file multicore.xml or I get the error.

what error are you refering to?  All of the functionality of Solr should 
be working fine. do you actaully see an error message anywhere?

Removing example/multicore/multicore.xml (or even the entire multicore 
directory) doesn't change the behavior of stats.jsp (Solr isn't even 
looking at that directory unless you explicitly use it as your solr.home.



-Hoss



Re: Search query optimization

2008-06-17 Thread Chris Hostetter

: Probably because the [NOW/DAYS-7DAYS+TO+NOW] part gets rewritten as lots 
: of OR clauses.  I think that you'll see that if you add &debugQuery=true 
: to the URL.  Make sure your recorded_date_dt is not too granular (e.g. 
: if you don't need minutes, round the values to hours. If you don't need 
: hours, round the values to days).

for the record: it doesn't get rewritten to a lot of OR clauses, it's 
using ConstantScoreRangeQuery.

granularity is definitely important however, bth when indexing and when 
querying.  

"NOW" is milliseconds, so every time you execute that query it's different 
and there is almost no caching possible.

if you use [NOW/DAY-7DAYS TO NOW/DAY] or even 
[NOW/DAY-7DAYS TO NOW/HOUR] you'll get a lot better caching behavior.  it 
looks like you are trying to find anything in the past week, so you may 
want [NOW/DAY-7DAYS TO NOW/DAY+1DAY] (to go to the end of the current day)

once you have a less granular date restriction, it can frequently make 
sense to put this in a seperate fq clause, so it will get cached 
independently of your main query. 

But Otis's point about reducing granularity can also help when indexing 
... the fewer "unique" dates that apepar in your index, the faster range 
queries will be ... if you've got 1000 documents that all of a 
recordeddate of June 11 2008, but at different times, and you're never 
going to care aboutthe times (just the date) then strip those times off 
when indexing so they all have the same fieled value of 
2008-06-11T00:00:00Z

BTW: the solr port you sent out a URL to ... all of it's caching is 
turned off (the filterCache and queryResultCache configs are commented out 
of your solrconfig.xml) ... you're going to wnat to turn on some caching 
or you'll never see really *great* request times.


-Hoss



RE: Search query optimization

2008-06-17 Thread Yongjun Rong
Hi Otis,
  Thanks for your advice. Do you mean that when we add the date data we need 
to carefully select the granularity of the date field to make sure it is more 
coarse? How can we do this? We just access Solr via the HTTP URL, not the API. 
If you mean the query syntax, we do have NOW/DAY to round to the day.
  Yongjun Rong
   

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 17, 2008 1:32 PM
To: solr-user@lucene.apache.org
Subject: Re: Search query optimization

Hi,

This is what I was talking about:

recordeddate_dt:[2008-06-16T00:00:00.000Z TO 2008-06-17T17:07:57.420Z]

Note that the granularity of this date field is down to milliseconds.  You 
should change that to be more coarse if you don't need such precision (e.g. no 
milliseconds, no seconds, no minutes, no hours...)


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Yongjun Rong <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, June 17, 2008 1:09:19 PM
> Subject: RE: Search query optimization
> 
> Thanks for reply. Here is the debugQuery output:
> 
> −
> 
> account:1 AND recordeddate_dt:[NOW/DAYS-1DAYS TO NOW]
> 
> −
> 
> account:1 AND recordeddate_dt:[NOW/DAYS-1DAYS TO NOW]
> 
> −
> 
> +account:1 +recordeddate_dt:[2008-06-16T00:00:00.000Z TO
> 2008-06-17T17:07:57.420Z]
> 
> −
> 
> +account:1 +recordeddate_dt:[2008-06-16T00:00:00.000 TO 
> +2008-06-17T17:07:57.420]
> 
> −
> 
> −
> 
> name="id=e03dbd92-3d41-4693-8b69-ac9a0d332446-atom-d52484f5-7aa8-40b3-
> ad6f-ba3a9071999e,internal_docid=6515410">
> 
> 10.88071 = (MATCH) sum of:
>   10.788804 = (MATCH) weight(account:1 in 6515410), product of:
> 0.9957678 = queryWeight(account:1), product of:
>   10.834659 = idf(docFreq=348, numDocs=6515640)
>   0.09190578 = queryNorm
> 10.834659 = (MATCH) fieldWeight(account:1 in 6515410), product of:
>   1.0 = tf(termFreq(account:1)=1)
>   10.834659 = idf(docFreq=348, numDocs=6515640)
>   1.0 = fieldNorm(field=account, doc=6515410)
>   0.09190578 = (MATCH)
> ConstantScoreQuery(recordeddate_dt:[2008-06-16T00:00:00.000-2008-06-17
> T17:07:57.420]),
> product of:
> 1.0 = boost
> 0.09190578 = queryNorm
> 
> 
>  
> 
> -Original Message-
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, June 17, 2008 12:43 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Search query optimization
> 
> Hi,
> 
> Probably because the [NOW/DAYS-7DAYS+TO+NOW] part gets rewritten as 
> lots of OR clauses.  I think that you'll see that if you add &debugQuery=true 
> to the URL.
> Make sure your recorded_date_dt is not too granular (e.g. if you don't 
> need minutes, round the values to hours. If you don't need hours, 
> round the values to days).
> 
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> - Original Message 
> > From: Yongjun Rong
> > To: solr-user@lucene.apache.org
> > Sent: Tuesday, June 17, 2008 11:56:06 AM
> > Subject: RE: Search query optimization
> > 
> > Hi,
> >   Thanks for your reply. I did some tests on my test machine.
> > http://stage.boomi.com:8080/solr/select/?q=account:1&rows=1000 will
> > return a result set of 384 in 3 ms. If I add a new AND condition as below:
> > http://stage.boomi.com:8080/solr/select/?q=account:1+AND+recordeddate_dt:[NOW/DAYS-7DAYS+TO+NOW]&rows=1000,
> > it will take 18236 ms to return a result set of 21. If I only use the
> > recordeddate_dt condition, like
> > http://stage.boomi.com:8080/solr/select/?q=recordeddate_dt:[NOW/DAYS-7DAYS+TO+NOW]&rows=1000,
> > it takes 20271 ms to get 412800 results. All the above URLs are live;
> > you can test them.
> > 
> > Can anyone give me an explanation of why this happens if we have the 
> > query optimization? Thank you very much.
> > Yongjun Rong
> > 
> > 
> > -Original Message-
> > From: Walter Underwood [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, May 29, 2008 4:57 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Search query optimization
> > 
> > The people working on Lucene are pretty smart, and this sort of 
> > query optimization is a well-known trick, so I would not worry about it.
> > 
> > A dozen years ago at Infoseek, we checked the count of matches for 
> > each term in an AND, and evaluated the smallest one first.
> > If any of them had zero matches, we didn't evaluate any of them.
> > 
> > I expect that Doug Cutting and the other Lucene folk know those same 
> > tricks.
> > 
> > wunder
> > 
> > On 5/29/08 1:50 PM, "Yongjun Rong" wrote:
> > 
> > > Hi Yonik,
> > >   Thanks for your quick reply. I'm very new to the lucene source code.
> > > Can you give me a little more detailed explanation about this.
> > > Do you think it will save some memory if docnum = find_match("A") 
> > > > docnum = find_match("B") and put B in the front of the AND query 
> > > like "B AND A AND C"? How about sorting (sort=A,B,C&q=A AND B AND 
> > > C)? Do 

Re: get the fields of solr

2008-06-17 Thread Chris Hostetter

: I'm able to get the fields specified in my schema with this query:
: /solr/admin/luke?show=schema&numTerms=0

that's what the "show=schema" does .. if you want all the field names 
regardless of whether they were explicitly or dynamically created you just 
leave that option off...

http://localhost:8983/solr/admin/luke?numTerms=0

...the dynamic fields are the ones with a "dynamicBase" child tag.
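For illustration, a sketch of pulling those names out of the Luke response; SAMPLE below is a made-up, heavily trimmed stand-in for the real XML, which is much larger but nests each field under <lst name="fields"> in the same way:

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for /admin/luke?numTerms=0 output.
SAMPLE = """<response><lst name='fields'>
  <lst name='title'><str name='type'>text</str></lst>
  <lst name='price_f'><str name='type'>float</str>
      <str name='dynamicBase'>*_f</str></lst>
</lst></response>"""

def dynamic_fields(luke_xml):
    # Fields whose entry carries a dynamicBase child were created by a
    # dynamicField pattern; plain schema fields lack that child.
    root = ET.fromstring(luke_xml)
    fields = root.find(".//lst[@name='fields']")
    return [f.get("name") for f in fields
            if f.find("str[@name='dynamicBase']") is not None]

print(dynamic_fields(SAMPLE))  # ['price_f']
```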




-Hoss



RE: Search query optimization

2008-06-17 Thread Yongjun Rong
Hi Chris,
   Thanks for your suggestions. I did try the [NOW/DAY-7DAYS TO
NOW/DAY], but it is not better. And I tried [NOW/DAY-7DAYS TO
NOW/DAY+1DAY], I got some exception as below:
org.apache.solr.core.SolrException: Query parsing error: Cannot parse
'account:1 AND recordeddate_dt:[NOW/DAYS-7DAYS TO NOW/DAY 1DAY]':
Encountered "1DAY" at line 1, column 57.
Was expecting:
"]" ...

at
org.apache.solr.search.QueryParsing.parseQuery(QueryParsing.java:104)
at
org.apache.solr.request.StandardRequestHandler.handleRequestBody(Standar
dRequestHandler.java:109)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerB
ase.java:77)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
at
org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:66)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:487)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHan
dler.java:1093)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.j
ava:185)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHan
dler.java:1084)
at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:360)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:2
16)
at
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:726)
at
org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandler
Collection.java:206)
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.jav
a:114)
at
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:505)
at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConne
ction.java:828)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:514)
at
org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:211)
at
org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:380)
at
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:
395)
at
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.ja
va:450)
Caused by: org.apache.lucene.queryParser.ParseException: Cannot parse
'account:1 AND recordeddate_dt:[NOW/DAYS-7DAYS TO NOW/DAY 1DAY]':
Encountered "1DAY" at line 1, column 57.
Was expecting:
"]" ...

at
org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:152)
at
org.apache.solr.search.QueryParsing.parseQuery(QueryParsing.java:94)
... 26 more

And I will try to open the cache and see if I can get better query time.
I will let you know.
Thank you very much.
Yongjun Rong

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 17, 2008 1:55 PM
To: solr-user@lucene.apache.org
Subject: Re: Search query optimization


: Probably because the [NOW/DAYS-7DAYS+TO+NOW] part gets rewritten as
lots
: of OR clauses.  I think that you'll see that if you add
&debugQuery=true
: to the URL.  Make sure your recorded_date_dt is not too granular (e.g.

: if you don't need minutes, round the values to hours. If you don't
need
: hours, round the values to days).

for the record: it doesn't get rewritten to a lot of OR clauses, it's
using ConstantScoreRangeQuery.

granularity is definitely important however, both when indexing and when
querying.  

"NOW" is milliseconds, so every time you execute that query it's
different and there is almost no caching possible.

if you use [NOW/DAY-7DAYS TO NOW/DAY] or even [NOW/DAY-7DAYS TO
NOW/HOUR] you'll get a lot better caching behavior.  it looks like you
are trying to find anything in the past week, so you may want
[NOW/DAY-7DAYS TO NOW/DAY+1DAY] (to go to the end of the current day)

once you have a less granular date restriction, it can frequently make
sense to put this in a separate fq clause, so it will get cached
independently of your main query. 

But Otis's point about reducing granularity can also help when indexing
... the fewer "unique" dates that appear in your index, the faster range
queries will be ... if you've got 1000 documents that all have a
recordeddate of June 11 2008, but at different times, and you're never
going to care about the times (just the date) then strip those times off
when indexing so they all have the same field value of
2008-06-11T00:00:00Z

BTW: the solr port you sent out a URL to ... all of it's caching is
turned off (the filterCache a

RE: Search query optimization

2008-06-17 Thread Chris Hostetter
:Thanks for your suggestions. I did try the [NOW/DAY-7DAYS TO
: NOW/DAY], but it is not better. And I tried [NOW/DAY-7DAYS TO
: NOW/DAY+1DAY], I got some exception as below:
: org.apache.solr.core.SolrException: Query parsing error: Cannot parse
: 'account:1 AND recordeddate_dt:[NOW/DAYS-7DAYS TO NOW/DAY 1DAY]':
: Encountered "1DAY" at line 1, column 57.

you need to properly URL escape the "+" character as %2B in your URLs.
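In Python, for example, that escaping can be done with urllib (a sketch; quote() percent-encodes "+" to %2B so the date math survives the trip):

```python
from urllib.parse import quote

raw = "recordeddate_dt:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]"
# safe="" forces every non-alphanumeric character to be percent-encoded,
# including "+" (%2B) and "/" (%2F); an unescaped "+" would be decoded as
# a space on the server side, producing exactly the parse error above.
escaped = quote(raw, safe="")
print(escaped)
```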

: And I will try to open the cache and see if I can get better query time.

the first request won't be any faster.  but the second request will be.  
and if filtering by week is something you expect people to do a lot of, 
you can put it in a newSearcher so it's always warmed up and fast 
for everyone.


-Hoss



Re: easiest roadmap for server deployment

2008-06-17 Thread Mike Klaas


On 17-Jun-08, at 12:55 AM, Bram de Jong wrote:

It looks like all my tests with solr have been very conclusive: it's  
the way

to go.


Glad to hear it!


Sadly enough, neither I nor our sysadmin have any experience with setting up
tomcat, jetty, orion, etc.
We have plenty of experience with other servers like lighttpd and apache,
but that doesn't particularly help.

What would be the easiest roadmap to set up Solr in our live environment, and
would that easy roadmap (whatever it is) be good enough for us (given the
data below)?


I'd suggest using Jetty that comes with the distribution.  Treat it as  
you would a unix process, with 'java -jar start.jar' the launch  
command and sending SIGTERM to kill it.  You'll want to give it more  
ram with the -Xmx and -Xms parameters, as well as specify -server.


Other than that, you'll need a way to keep an eye on the log file.  
That's about it.  One quite useful debugging tool is to send SIGQUIT  
to Solr, which prints out a stack trace for every live thread at the  
current timeslice.



Tech data:
 There are 60K documents (and growing slowly at << 100/day) and about
20K-30K searches per day (growing faster than the #documents, but not that
fast either).
 Solr will have to share a machine (quad xeon, 12GB of RAM, SAS disks) with
Postgresql.
 In all my tests (replaying stored searches) I had 0 cache misses and
between 0.7 and 0.99 hit rate for all 3 caches.
 I will use plenty of faceting to create various tag clouds in various
places.


That should be fine.  You might want to schedule an occasional  
optimize (you can do that with cron+shell script)
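A sketch of such a scheduled optimize: POST the <optimize/> command to the standard /solr/update endpoint (the base URL is an assumption; run the script from cron):

```python
from urllib.request import Request, urlopen

BASE = "http://localhost:8983/solr"  # assumed deployment URL

# Building a Request with a body makes it a POST; Solr's update handler
# accepts the <optimize/> command as an XML message.
req = Request(BASE + "/update", data=b"<optimize/>",
              headers={"Content-Type": "text/xml"})
print(req.get_method())  # POST

# Uncomment to actually send it:
# print(urlopen(req).read())
```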


-Mike


RE: Search query optimization

2008-06-17 Thread Yongjun Rong
Hi Chris,
  Thank you very much for the detailed suggestions. I just did the cache
test. If most requests return the same set of data, the cache will
improve query performance. But in our usage, almost all requests
return different data sets, so the cache hit ratio is very low.
That's the reason we disabled the cache, to save memory.  Another
question is: 
q=account:1+AND+recordeddate_dt:[NOW/DAY-7DAYS+TO+NOW/DAY] will combine
the resultset of account:1 and
recordeddate_dt:[NOW/DAY-7DAYS+TO+NOW/DAY]. How lucene handle it? From
my previous test examples, it seems lucene will not check the size of
the subconditions (like account:1 or
recordeddate_dt:[NOW/DAY-7DAYS+TO+NOW/DAY]). Q=account:1 will return a
small set of data. But q=recordeddate_dt:[NOW/DAY-7DAYS+TO+NOW/DAY] will
return a large set of data. If we combine them with "AND" like:
q=account:1+AND+recordeddate_dt:[NOW/DAY-7DAYS+TO+NOW/DAY]. It should
return the small set of data and then apply the subcondition
"recordeddate_dt:[NOW/DAY-7DAYS+TO+NOW/DAY]". But from the response
time, it seems that is not the case.
Can anyone give me a detailed explanation of this?
Thank you very much.
Yongjun Rong

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 17, 2008 2:32 PM
To: solr-user@lucene.apache.org
Subject: RE: Search query optimization

:Thanks for your suggestions. I did try the [NOW/DAY-7DAYS TO
: NOW/DAY], but it is not better. And I tried [NOW/DAY-7DAYS TO
: NOW/DAY+1DAY], I got some exception as below:
: org.apache.solr.core.SolrException: Query parsing error: Cannot parse
: 'account:1 AND recordeddate_dt:[NOW/DAYS-7DAYS TO NOW/DAY 1DAY]':
: Encountered "1DAY" at line 1, column 57.

you need to properly URL escape the "+" character as %2B in your URLs.

: And I will try to open the cache and see if I can get better query
time.

the first request won't be any faster.  but the second request will be.

and if filtering by week is something you expect people to do a lot of,
you can put it in a newSearcher so it's always warmed up and fast for
everyone.


-Hoss



Re: easiest roadmap for server deployment

2008-06-17 Thread Chris Hostetter

: Sadly enough, neither I nor our sysadmin have any experience with setting up
: tomcat, jetty, orion, etc.
: We have plenty of experience with other servers like lighttpd and apache,
: but that doesn't particularly help.

if you already use a package management system for your servers (RPM, 
deb, ports, etc...) you might consider using the pre-packaged versions of 
Tomcat or Jetty, which should take care of things like init.d scripts and 
standard log file placement.


-Hoss



RE: scaling / sharding questions

2008-06-17 Thread Norskog, Lance
I cannot facet on one huge index; it runs out of ram when it attempts to
allocate a giant array. If I store several shards in one JVM, there is
no problem.

Are there any performance benefits to a large index v.s. several small
indexes?

Lance 

-Original Message-
From: Marcus Herou [mailto:[EMAIL PROTECTED] 
Sent: Sunday, June 15, 2008 10:24 PM
To: solr-user@lucene.apache.org
Subject: Re: scaling / sharding questions

Yep got that.

Thanks.

/M

On Sun, Jun 15, 2008 at 8:42 PM, Otis Gospodnetic <
[EMAIL PROTECTED]> wrote:

> With Lance's MD5 schema you'd do this:
>
> 1 shard: 0-f*
> 2 shards: 0-8*, 9-f*
> 3 shards: 0-5*, 6-a*, b-f*
> 4 shards: 0-3*, 4-7*, 8-b*, c-f*
> ...
> 16 shards: 0*, 1*, 2*... d*, e*, f*
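The prefix scheme above can be sketched as a routing function (shard_for is a hypothetical helper, not from the thread; banding the first hex digit of the MD5 reproduces the splits listed above for shard counts up to 16):

```python
import hashlib

def shard_for(doc_id, num_shards):
    # Route a document by the first hex digit (nibble) of its MD5 id.
    # nibble * num_shards // 16 reproduces the prefix bands above, e.g.
    # 4 shards -> 0-3*, 4-7*, 8-b*, c-f*.
    first_nibble = int(hashlib.md5(doc_id.encode()).hexdigest()[0], 16)
    return first_nibble * num_shards // 16

ids = ["doc-%d" % i for i in range(1000)]
counts = [0] * 4
for i in ids:
    counts[shard_for(i, 4)] += 1
print(counts)  # a roughly even split across the 4 shards
```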
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> - Original Message 
> > From: Marcus Herou <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Cc: [EMAIL PROTECTED]
> > Sent: Saturday, June 14, 2008 5:53:35 AM
> > Subject: Re: scaling / sharding questions
> >
> > Hi.
> >
> > We as well use md5 as the uid.
> >
> > I guess by saying each 1/16th is because the md5 is hex, right?
(0-f).
> > Thinking about md5 sharding.
> > 1 shard: 0-f
> > 2 shards: 0-7:8-f
> > 3 shards: problem!
> > 4 shards: 0-3
> >
> > This technique would require that you double the amount of shards 
> > each
> time
> > you split right ?
> >
> > Split by delete sounds really smart, damn that I didn't think of 
> > that :)
> >
> > Anyway over time the technique of moving the whole index to a new 
> > shard
> and
> > then delete would probably be more than challenging.
> >
> >
> >
> >
> > I will never ever store the data in Lucene mainly because of bad exp

> > and since I want to create modules which are fast,  scalable and 
> > flexible and storing the data alongside with the index do not match 
> > that for me at
> least.
> >
> > So yes I will have the need to do a "foreach id in ids get document"
> > approach in the searcher code, but at least I can optimize the 
> > retrieval
> of
> > docs myself and let Lucene do what it's good at: indexing and 
> > searching
> not
> > storage.
> >
> > I am more and more thinking in terms of having different levels of
> > searching instead of searching in all shards at the same time.
> >
> > Let's say you start with 4 shards where you each document is 
> > replicated 4 times based on publishdate. Since all shards have the 
> > same data you can
> lb
> > the query to any of the 4 shards.
> >
> > One day you find that 4 shards is not enough because of search
> performance
> > so you add 4 new shards. Now you only index these 4 new shards with 
> > the
> new
> > documents making the old ones readonly.
> >
> > The searcher would then prioritize the new shards and only if the 
> > query returns less than X results you start querying the old shards.
> >
> > This have a nice side effect of having the most relevant/recent 
> > entries
> in
> > the index which is searched the most. Since the old shards will be 
> > mostly idle you can as well convert 2 of the old shards to "new" 
> > shards reducing the need for buying new servers.
> >
> > What I'm trying to say is that you will end up with an architecture 
> > which have many nodes on top which each have few documents and fewer

> > and fewer nodes as you go down the architecture but where each node 
> > store more documents since the search speed get's less and less
relevant.
> >
> > Something like this:
> >
> >  - Primary: 10M docs per shard, make sure 95% of the results
> comes
> > from here.
> > - Standby: 100M docs per shard - merges of 10 primary
indices.
> >  zz - Archive: 1000M docs per shard - merges of 10 standby
indices.
> >
> > Search top-down.
> > The numbers are just speculative. The drawback with this 
> > architecture is that you get no indexing benefit at all if the 
> > architecture drawn above
> is
> > the same as which you use for indexing. I think personally you 
> > should use
> X
> > indexers which then merge indices (MapReduce) for max performance 
> > and lay them out as described above.
> >
> > I think Google do something like this.
> >
> >
> > Kindly
> >
> > //Marcus
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > On Sat, Jun 14, 2008 at 2:27 AM, Lance Norskog wrote:
> >
> > > Yes, I've done this split-by-delete several times. The halved 
> > > index
> still
> > > uses as much disk space until you optimize it.
> > >
> > > As to splitting policy: we use an MD5 signature as our unique ID. This
> > > has the lovely property that we can wildcard.  'contentid:f*' denotes
> > > 1/16 of the whole index. This 1/16 is a very random sample of the whole
> > > index. We use this for several things. If we use this for shards, we
> > > have a query that matches a shard's contents.
> > >
> > > The Solr/Lucene syntax does not support modular arithmetic, and so it
> > > will not let you query a subset

RE: Search query optimization

2008-06-17 Thread Chris Hostetter
: test. If most of requests return the same set of data, cache will
: improve the query performance. But in our usage, almost all requests
: have different data set to return. The cache hit ratio is very low.

that's why i suggested moving clauses that are likely to be common (ie: 
your "within the last week" clause) into a separate fq param where it can 
be cached independently from the main query.  if you do that *and* you 
have the filterCache turned on then after this query...
  q=account:1&fq=recordeddate_dt:[NOW/DAY-7DAYS+TO+NOW/DAY]
...these other queries will all be fairly fast because of the cache hit...
  q=account:&fq=recordeddate_dt:[NOW/DAY-7DAYS+TO+NOW/DAY]
  q=account:&fq=recordeddate_dt:[NOW/DAY-7DAYS+TO+NOW/DAY]
  q=anything+you+want&fq=recordeddate_dt:[NOW/DAY-7DAYS+TO+NOW/DAY]

: my previous test examples, it seems lucene will not check the size of
: the subconditions (like account:1 or
: recordeddate_dt:[NOW/DAY-7DAYS+TO+NOW/DAY]). Q=account:1 will return a
: small set of data. But q=recordeddate_dt:[NOW/DAY-7DAYS+TO+NOW/DAY] will
: return a large set of data. If we combine them with "AND" like:
: q=account+AND+recordeddate_dt:[NOW/DAY-7DAYS+TO+NOW/DAY]. It should
: return the small set of data and then apply the subcondition
: "recordeddate_dt:[NOW/DAY-7DAYS+TO+NOW/DAY]". But from the response

the ConjunctionScorer will do that (as mentioned earlier in this thread) 
but even if the account:1 clause indicates that it can skip ahead to 
*document* #1234567, the ConstantScoreRangeQuery still 
needs to iterate over all of the *terms* in the specified range before it 
knows what the lowest matching doc id above #1234567 is.

that's why putting "range queries" into separate "fq" params can be a lot 
better ... that term iteration only needs to be done once and can then be 
cached and reused.



-Hoss



Hadoop get together @ Berlin

2008-06-17 Thread idrost
Hello,

I am happy to announce the first German Hadoop Meetup in Berlin. We will meet 
at 5 p.m. MESZ next Tuesday (24th of June) at the newthinking store in Berlin 
Mitte:

newthinking store GmbH
Tucholskystr. 48
10117 Berlin

Please see also: http://upcoming.yahoo.com/event/807782/

A big Thanks to the newthinking store for providing a room in the center of 
Berlin for us.

There will be drinks provided by newthinking. You can order pizza if you like. 
There are quite a few good restaurants nearby, so we can go there after the 
official part. 

Talks scheduled so far:

Stefan Groschupf will talk about Hadoop in action in one of his customer 
projects. Of course there will be time to ask him questions on his new 
project katta.

Isabel Drost will talk about the new project Mahout.

There will be a few more slots for talks of about 20 minutes with another 10 
minutes for discussion. There will be a beamer, so feel free to bring some 
slides. In case you are interested in giving a talk, please contact me by 
mail. Please also contact me by mail if you plan to attend the Meetup.

Feel free to resend this mail to any communities interested in the meeting.


Isabel

PS: In case this mail reaches the list, I am sorry for the double posting, I 
did not see my first mail arrive and realized only thereafter that I had used 
the wrong From: :(

-- 
The "cutting edge" is getting rather dull.  -- Andy Purshottam




"Did you mean" functionality

2008-06-17 Thread Lucas F. A. Teixeira

Hello everybody,

I need to integrate the Lucene SpellChecker contrib lib in my 
application, but I'm using the EmbeddedSolrServer to access all indexes.
I want to know what I should do (if someone has any step-by-step guide, 
link, tutorial or smoke signal): what I need to do during indexing, and of 
course how to search through the words generated by this API.


I can use the lib itself to search the suggestions, without using Solr, 
but I'm confused about how I should proceed when indexing these docs.


Thanks a lot,

[]s,

--
Lucas Frare A. Teixeira
[EMAIL PROTECTED] 
Tel: +55 11 3660.1622 - R3018



Re: How does solr.StrField handle punctuation?

2008-06-17 Thread Chris Hostetter

: However, the following query produces no hits, even though I know from the
: facets info that there are over 4000 matches in the index:
: 
: 
fl=*,score&start=0&q=division_t:"Accounting"&company_facet:"Deloitte+%26+Touche"&qt=standard&wt=ruby&rows=30

That's not a "legal" URL ... note the "...&company_facet...".  You've 
specified a URL param named: 'company_facet:"Deloitte+%26+Touche"' which 
has no value.

I think you meant to use...

fl=*,score&start=0&q=division_t:"Accounting"&fq=company_facet:"Deloitte+%26+Touche"&qt=standard&wt=ruby&rows=30

: I was under the impression that Solr.StrField just indexes the literal
: string, so I'm confused why this won't work. What's the proper way to feed

for the record: that is in fact exactly what StrField does.


-Hoss



Re: "Did you mean" functionality

2008-06-17 Thread Otis Gospodnetic
Hi Lucas,

Have a look at (the patch in) SOLR-572, lots of work happening there as we 
speak.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Lucas F. A. Teixeira <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, June 17, 2008 4:30:12 PM
> Subject: "Did you mean" functionality
> 
> Hello everybody,
> 
> I need to integrate the Lucene SpellChecker Contrib lib in my 
> applycation, but I`m using the EmbeededSolrServer to access all indexes.
> I want to know what should I do (if someone have any step-by-step, link, 
> tutorial or smoke signal) of what I need to do during indexing, and of 
> course to search through this words generated by this API.
> 
> I can use the lib itself to search the suggestions, w/out using solr, 
> but I`m confused about how may I proceed when indexing this docs.
> 
> Thanks a lot,
> 
> []s,
> 
> -- 
> Lucas Frare A. Teixeira
> [EMAIL PROTECTED] 
> Tel: +55 11 3660.1622 - R3018



Re: "Did you mean" functionality

2008-06-17 Thread Lucas F. A. Teixeira

Hello Otis! Thanks a lot!


[]s,

Lucas



Lucas Frare A. Teixeira
[EMAIL PROTECTED] 
Tel: +55 11 3660.1622 - R3018



Otis Gospodnetic escreveu:

Hi Lucas,

Have a look at (the patch in) SOLR-572, lots of work happening there as we 
speak.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
  

From: Lucas F. A. Teixeira <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, June 17, 2008 4:30:12 PM
Subject: "Did you mean" functionality

Hello everybody,

I need to integrate the Lucene SpellChecker Contrib lib in my 
applycation, but I`m using the EmbeededSolrServer to access all indexes.
I want to know what should I do (if someone have any step-by-step, link, 
tutorial or smoke signal) of what I need to do during indexing, and of 
course to search through this words generated by this API.


I can use the lib itself to search the suggestions, w/out using solr, 
but I`m confused about how may I proceed when indexing this docs.


Thanks a lot,

[]s,

--
Lucas Frare A. Teixeira
[EMAIL PROTECTED] 
Tel: +55 11 3660.1622 - R3018





  


Re: Analyser doubt in solr

2008-06-17 Thread Chris Hostetter

I'm not really sure that i understand your question at all ... it seems 
like maybe you are asking about natural language type problems (ie: 
question answering) but i'm not really sure.  there are no "question 
answering" type features provided by Solr, but off the top of my head you 
might have some success using a StopFilter that ignores all "question" 
type words (who, what, where, how, is, are, many, can, etc...) to see if 
that helps distill the queries down to just the interesting terms.
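For illustration, a hedged schema.xml sketch of such a setup (the field type name and the question_words.txt file are invented for this example; only the factory class names are standard Solr):

```xml
<!-- A text field whose analyzer drops question words via a custom
     stopword list; question_words.txt would contain entries like
     "what", "is", "how", "many", "can". -->
<fieldType name="text_qa" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            words="question_words.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

A query analyzed through this type reduces "what is java" to just "java", which is the "loose" match the question asks for.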

: Hi to all,
: 
: I am using solr for searching in my application.
: my problem  is ,
:  for example if i want to serach 
: 
:  "what is java?" means ,
: 
: The highly matched result from solr should come (ie java based result ,which
: should be a lose search ) .what kind of ANALYSER i have to use and how to
: configure the analyser in solr .
: 
: 
: Iam waiting for the reply
: 
: with regards,
: T.Rekha.


-Hoss



Re: How does solr.StrField handle punctuation?

2008-06-17 Thread terhorst

Thanks for the reply. I was in a hurry and made the URL up to illustrate my
point. The real query string is more like what you suggest. In any case I'm
certain that the actual query being used is valid (Solr would complain if it
weren't) and that the ampersand is somehow affecting results. Is there any
way I can get Solr to dump some information about how it stores indexes,
keys, etc. for a certain record? I'm wondering if the ampersand was handled
in a weird way by my application when the records were added to the index.
(Although I doubt this since it shows up properly in the facets.) Thanks
again for your help.

Jonathan


hossman wrote:
> 
> 
> : However, the following query produces no hits, even though I know from
> the
> : facets info that there are over 4000 matches in the index:
> : 
> :
> fl=*,score&start=0&q=division_t:"Accounting"&company_facet:"Deloitte+%26+Touche"&qt=standard&wt=ruby&rows=30
> 
> That's not a "legal" URL ... note the "...&company_facet...".  You've 
> specified a URL param named: 'company_facet:"Deloitte+%26+Touche"' which 
> has no value.
> 
> I think you ment to use...
> 
> fl=*,score&start=0&q=division_t:"Accounting"&fq=company_facet:"Deloitte+%26+Touche"&qt=standard&wt=ruby&rows=30
> 
> : I was under the impression that Solr.StrField just indexes the literal
> : string, so I'm confused why this won't work. What's the proper way to
> feed
> 
> the record: that is in fact exactly what StrField does.
> 
> 
> -Hoss
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/How-does-solr.StrField-handle-punctuation--tp17759824p17956690.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: How does solr.StrField handle punctuation?

2008-06-17 Thread Chris Hostetter

: Thanks for the reply. I was in a hurry and made the URL up to illustrate my
: point. The real query string is more like what you suggest. In any case I'm
: certain that the actual query being used is valid (Solr would complain if it
: weren't) and that the ampersand is somehow affecting results. Is there any

no, actually it wouldn't complain in that case ... a URL param with a name 
it's not expecting would just be ignored.

if you send us the exact URLs you're having problems with there may be 
other nuances about it that we can spot to help figure out your problem. 
(for example: are you absolutely sure the ampersand in your field value is 
URL escaped?)

: way I can get Solr to dump some information about how it stores indexes,
: keys, etc. for a certain record? I'm wondering if the ampersand was handled
: in a weird way by my application when the records were added to the index.
: (Although I doubt this since it shows up properly in the facets.) Thanks
: again for your help.

yep, there are a couple of things you can do in general to 
troubleshoot things like this...

1) debugQuery=true ... add that param into your URL and Solr will give you 
some nice debugging info about how your queries are being parsed.  this is 
important to post when asking followup questions.

2) analysis.jsp ... this is the "Analysis" link on the admin page, it will 
show you how your analyzer is treating the fields you index ... but this 
isn't really relevant to your specific problem since you are using 
StrField.

3) LukeRequestHandler, in the example schema it's mapped to /admin/luke 
... this will let you see the actual terms indexed for your fields ... but 
as you said, this isn't going to be much help for you in this specific 
case since you used facet.field to get the value in the first place -- 
that means it's definitely indexed that way.

debugQuery=true is definitely your best first step ... send us the exact 
URLs you're having problems with (that have debugQuery=true) along with the 
full output of that URL and people can probably help spot your problem.



-Hoss



Re: Dismax + Dynamic fields

2008-06-17 Thread Norberto Meijome
On Tue, 17 Jun 2008 09:43:58 -0400
Daniel Papasian <[EMAIL PROTECTED]> wrote:

> Norberto Meijome wrote:
> > Thanks Yonik. ok, that matches what I've seen - if i know the actual
> > name of the field I'm after, I can use it in a query, but i can't
> > use the dynamic_field_name_* (with wildcard) in the config.
> > 
> > Is adding support for this something that is desirable / needed
> > (doable??) , and is it being worked on ?
> 
> You can use a wildcard with copyField to copy the dynamic fields that
> match the pattern to another field that you can then query on. It seems
> like that would cover your needs, no?

indeed, that's what I did for this prototype. I was just wondering whether I 
was missing something that would prevent it from working with dynfields.
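
For the archives, the copyField trick Daniel describes looks roughly like
the following in schema.xml -- a sketch only, with made-up field names:

```xml
<!-- hypothetical dynamic fields: title_t, body_t, etc. -->
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>

<!-- one catch-all field that dismax qf can name explicitly -->
<field name="all_text" type="text" indexed="true" stored="false"
       multiValued="true"/>

<!-- wildcard copy: anything matching *_t is also indexed into all_text -->
<copyField source="*_t" dest="all_text"/>
```

Then qf just points at all_text instead of needing a glob.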

cheers,
B

_
{Beto|Norberto|Numard} Meijome

Law of Conservation of Perversity: 
  we can't make something simpler without making something else more complex

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: Dismax + Dynamic fields

2008-06-17 Thread Norberto Meijome
On Tue, 17 Jun 2008 08:16:41 -0700 (PDT)
Chris Hostetter <[EMAIL PROTECTED]> wrote:

> bingo.  even if a wildcard-like syntax was allowed in the qf/pf of dismax 
> since the same boost would be applied to each field the results would be 
> roughly the same as if you used copyField -- and searching a single field 
> will be faster than searching N other fields.
> 
> (i say "roughly" the same because the tf/idf of the individual fields 
> would be different than a single consolidated field, so there would be 
> variations in the score ... but the basic results should be the same.  if 
> you really care about tuning the score, you'd want to assign separate 
> boosts per field name anyway, and then you're right back to not 
> needing/wanting to use a glob syntax)


thanks Chris, that makes perfect sense.
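
Concretely, the per-field boosts Chris mentions go straight into the dismax
qf parameter, one weight per field -- something like this (field names and
weights here are hypothetical):

```
/solr/select?qt=dismax&q=accounting&qf=title_t^2.0+body_t^0.8+misc_t^0.2
```

Each field carries its own boost, which is exactly what a single glob
pattern couldn't express.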

B
_
{Beto|Norberto|Numard} Meijome

"The music business is a cruel and shallow money trench, a long plastic hallway 
where thieves and pimps run free, and good men die like dogs. There's also a 
negative side."
   Hunter S. Thompson

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: "Did you mean" functionality

2008-06-17 Thread Grant Ingersoll

Also see http://wiki.apache.org/solr/SpellCheckComponent

I expect to commit fairly soon.
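
As the patch currently stands (details may still change before commit), the
component is wired up in solrconfig.xml along these lines -- field name and
index dir are placeholders:

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <!-- field whose indexed terms feed the spelling dictionary -->
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>
</searchComponent>
```

and queried by adding spellcheck=true to the request; see the wiki page
above for the current parameter names.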

On Jun 17, 2008, at 5:46 PM, Otis Gospodnetic wrote:


Hi Lucas,

Have a look at (the patch in) SOLR-572, lots of work happening there  
as we speak.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 

From: Lucas F. A. Teixeira <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, June 17, 2008 4:30:12 PM
Subject: "Did you mean" functionality

Hello everybody,

I need to integrate the Lucene SpellChecker contrib lib in my
application, but I'm using the EmbeddedSolrServer to access all
indexes.
I'd like to know what I need to do during indexing (if someone has any
step-by-step guide, link, tutorial or smoke signal), and of course how
to search through the words generated by this API.

I can use the lib itself to search the suggestions, w/out using solr,
but I'm confused about how to proceed when indexing these docs.

Thanks a lot,

[]s,

--
Lucas Frare A. Teixeira
[EMAIL PROTECTED]
Tel: +55 11 3660.1622 - R3018







Re: How does solr.StrField handle punctuation?

2008-06-17 Thread terhorst

Here are the exact query strings I'm using. The only modification I made is
to change the output formatter from Ruby to XML and run the output through a
pretty printer.

This is the one that returns the facet.fields I'm interested in. The problem
field is the first one returned:

Query:
/solr/select/?facet=true&facet.mincount=1&facet.offset=0&facet.limit=22&wt=xml&rows=0&fl=*,score&start=0&facet.sort=true&q=division_t:%22Accounting%22;last_name_facet+asc&facet.field=company_facet&qt=standard&fq=in_redbook_b:true&debugQuery=true

Response:


  
0
488

  true
  0
  1
  22
  xml
  0
  *,score
  true
  true
  0
  division_t:"Accounting";last_name_facet
asc
  company_facet
  standard
  in_redbook_b:true

  
  
  


  
4114
1379
1257
206
154
134
86
80
68
64
56
49
49
45
44
42
41
40
36
36
36
35
division_t:"Accounting";last_name_facet asc
division_t:"Accounting";last_name_facet asc
division_t:account
division_t:account
in_redbook_b:true
in_redbook_b:true
488.0
1.0
1.0
0.0
0.0
0.0
0.0
487.0
1.0
486.0
0.0
0.0
0.0

-

And then this is the one where I select the first facet.field returned
above, and attempt to pull up those results:


Query: 
/solr/select/?fl=*,score&start=0&wt=json&q=division_t:%22Accounting%22;last_name_facet+asc&qt=standard&fq=company_facet:%22Deloitte+%26+Touche%22&fq=in_redbook_b:true&rows=30&debugQuery=true

Response:


  
0
1

  *,score
  true
  0
  division_t:"Accounting";last_name_facet
asc
  xml
  standard
  
company_facet:"Deloitte & Touche"
in_redbook_b:true
  
  30

  
  
  
division_t:"Accounting";last_name_facet asc
division_t:"Accounting";last_name_facet
asc
division_t:account
division_t:account


  company_facet:"Deloitte & Touche"
  in_redbook_b:true


  company_facet:Deloitte & Touche
  in_redbook_b:true


  1.0
  
1.0

  1.0


  0.0


  0.0


  0.0


  0.0

  
  
0.0

  0.0


  0.0


  0.0


  0.0


  0.0

  

  


(The other filter query, in_redbook_b, is a boolean field used to partition
our dataset. It shouldn't affect the results since it's in both queries.)

Thanks again for your help, I really appreciate your time.

Jonathan


hossman wrote:
> 
> 
> : Thanks for the reply. I was in a hurry and made the URL up to illustrate
> my
> : point. The real query string is more like what you suggest. In any case
> I'm
> : certain that the actual query being used is valid (Solr would complain
> if it
> : weren't) and that the ampersand is somehow affecting results. Is there
> any
> 
> no, actually it wouldn't complain in that case ... a URL param with a name 
> it's not expecting would just be ignored.
> 
> if you send us the exact URLs you're having problems with there may be 
> other nuances about it that we can spot to help figure out your problem. 
> (for example: are you absolute

Re: How does solr.StrField handle punctuation?

2008-06-17 Thread Chris Hostetter

: 4114
: 1379

A-Ha! ... this is where the details really matter.  unless your email 
program did something funky with the XML you sent, what this tells me is 
that you don't actually have the values "Deloitte & Touche" or "Ernst & 
Young" in your index.  The literal values in your index are "Deloitte 
&amp; Touche" and "Ernst &amp; Young" ... most likely you are "double XML 
escaping" your source data before indexing.  if i'm right, then when you 
use the ruby output format, you'll see...
  ...
  'facet_fields'=>{
    'cat'=>[
   'Deloitte &amp; Touche',4114
   'Ernst &amp; Young',1379
  ...

If you change your fq to...

fq=company_facet:%22Deloitte+%26amp%3B+Touche%22

...so that it is an URL escaping of the value you get after un-XML 
escaping the response (just once!) you should start seeing the correct 
results.  but your long term solution is to stop double escaping your data 
before indexing it.
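
To make the mix-up concrete, here's a small Python sketch (not part of the
original thread -- the field and value are just the ones from this example)
showing how double XML escaping ends up in the index and what fq has to
look like to match it:

```python
from urllib.parse import quote_plus
from xml.sax.saxutils import escape

raw = 'Deloitte & Touche'

# XML-escaping the value *before* putting it in the <add> document means
# Solr's XML parser unescapes it only once, so the index stores the
# already-escaped string rather than the raw one:
indexed = escape(raw)   # 'Deloitte &amp; Touche'

# To match that broken value, fq must URL-escape the escaped form:
fq_broken = 'company_facet:' + quote_plus('"%s"' % indexed)
# company_facet:%22Deloitte+%26amp%3B+Touche%22

# The long-term fix: index the raw value, then a plain URL escape works:
fq_good = 'company_facet:' + quote_plus('"%s"' % raw)
# company_facet:%22Deloitte+%26+Touche%22
```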



-Hoss



Re: How does solr.StrField handle punctuation?

2008-06-17 Thread terhorst

Nailed it right on the head. That solves it. Thanks much!

Jonathan


hossman wrote:
> 
> 
> : 4114
> : 1379
> 
> A-Ha! ... this is where the details really matter.  unless your email 
> program did something funky with the XML you sent, what this tells me is 
> that you don't actually have the values "Deloitte & Touche" or "Ernst & 
> Young" in your index.  The literal values in your index are "Deloitte 
> &amp; Touche" and "Ernst &amp; Young" ... most likely you are "double XML 
> escaping" your source data before indexing.  if i'm right, then when you 
> use the ruby output format, you'll see...
>   ...
>   'facet_fields'=>{
>   'cat'=>[
>'Deloitte &amp; Touche',4114
>'Ernst &amp; Young',1379
>   ...
> 
> If you change your fq to...
> 
>   fq=company_facet:%22Deloitte+%26amp%3B+Touche%22
> 
> ...so that it is an URL escaping of the value you get after un-XML 
> escaping the response (just once!) you should start seeing the correct 
> results.  but your long term solution is to stop double escaping your data 
> before indexing it.
> 
> 
> 
> -Hoss
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/How-does-solr.StrField-handle-punctuation--tp17759824p17959058.html
Sent from the Solr - User mailing list archive at Nabble.com.