Dismax: Impossible to search for a _phrase_ in tokenized and untokenized fields at the same time

2009-10-10 Thread Alex Baranov

Hello,

It seems to me that there is no way I can use the dismax handler for
searching in both tokenized and untokenized fields while I'm searching for a
phrase.

Consider the following example. I have two fields in the index: product_name
and product_name_un. The schema looks like:

  <fieldType name="text_untokenized" class="solr.TextField"
      positionIncrementGap="100" omitNorms="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="0"
          catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    </analyzer>
  </fieldType>

  <field name="product_name" type="text" indexed="true" stored="true"/>
  <field name="product_name_un" type="text_untokenized" indexed="true"
      stored="true"/>

I'm using dismax to search in both of them at the same time:
"defType=dismax&qf=product_name product_name_un^2.0" (this is done to bring
to the top of the results the products whose name _equals_ the entered
criteria).

1. When I'm searching for a phrase (two or more keywords), e.g. <blue car>,
the input string is tokenized, and even though I have
product_name_un="blue car" in the index, the "product_name_un^2.0" part of
the dismax config has no effect.
2. When I enter <"blue car"> (in quotes), the string is not tokenized and the
"product_name_un^2.0" part works, but nothing can be found in the
product_name field.

I.e. there is no way to get a proper search against the two fields at the
same time. The workaround that I found is using the "bq" parameter to specify
a boost query against the product_name_un field. But I don't think that this
should be the only solution.
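
For illustration, the workaround looks roughly like this (field and term
values are just examples):

  /select?defType=dismax&q=blue car&qf=product_name
      &bq=product_name_un:"blue car"^2.0

Because the bq value is quoted, the whole phrase reaches the untokenized
field's analyzer intact and matches the single indexed token.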


Another note, related to that: when I set product_name_un as the default
search field and query with ../select/?q=blue car&rows=10&..., I get
empty results despite the fact that I have the "blue car" value in the index
in that field. I have to use quotes again to fix that... Shouldn't it
determine the field type and apply the corresponding
analyzers/tokenizers/etc.?

-- 
View this message in context: 
http://www.nabble.com/Dismax%3A-Impossible-to-search-for-a-_phrase_-in-tokenized-and-untokenized-fields-at-the-same-time-tp25832932p25832932.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Dismax: Impossible to search for a _phrase_ in tokenized and untokenized fields at the same time

2009-10-10 Thread Yonik Seeley
On Sat, Oct 10, 2009 at 6:34 AM, Alex Baranov  wrote:
>
> Hello,
>
> It seems to me that there is no way I can use the dismax handler for
> searching in both tokenized and untokenized fields while I'm searching for a
> phrase.
>
> Consider the following example. I have two fields in the index: product_name
> and product_name_un. The schema looks like:
>
> <fieldType name="text_untokenized" class="solr.TextField"
>     positionIncrementGap="100" omitNorms="true">
>   <analyzer>
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>         generateWordParts="1" generateNumberParts="1" catenateWords="0"
>         catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="English"/>
>   </analyzer>
> </fieldType>
>
> <field name="product_name" type="text" indexed="true" stored="true"/>
> <field name="product_name_un" type="text_untokenized" indexed="true"
>     stored="true"/>
>
> I'm using dismax to search in both of them at the same time:
> "defType=dismax&qf=product_name product_name_un^2.0" (this is done to bring
> to the top of the results the products whose name _equals_ the entered
> criteria).
>
> 1. When I'm searching for a phrase (two or more keywords), e.g. <blue car>,
> the input string is tokenized, and even though I have
> product_name_un="blue car" in the index, the "product_name_un^2.0" part of
> the dismax config has no effect.

Hmmm, right.  This is due to the fact that the Lucene query parser
(still actually used in dismax) breaks things up by whitespace
*before* analysis (so the analyzer for the untokenized field never
sees the two tokens together).
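
Roughly speaking (the exact parsed query will vary), q=blue car ends up as
per-word disjunctions along the lines of:

  +((product_name:blue | product_name_un:blue^2.0)
    (product_name:car | product_name_un:car^2.0))

so product_name_un, which indexed "blue car" as a single token, can never
match either single-word clause.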

> 2. When I enter <"blue car"> (in quotes), the string is not tokenized and
> the "product_name_un^2.0" part works, but nothing can be found in the
> product_name field.

Using explicit quotes will make a phrase query, so blue and car must
appear right next to each other in product_name.
If it's OK to require both blue and car in product_name, then you can
just set a slop for explicit phrase queries with the qs parameter.
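
For example, with an illustrative slop value:

  defType=dismax&qf=product_name product_name_un^2.0&qs=10&q="blue car"

qs relaxes how close together the quoted terms must be in the tokenized
field, while the quoted string still reaches the untokenized field as a
single token.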

-Yonik
http://www.lucidimagination.com





> I.e. there is no way to get a proper search against the two fields at the
> same time. The workaround that I found is using the "bq" parameter to
> specify a boost query against the product_name_un field. But I don't think
> that this should be the only solution.
>
>
> Another note, related to that: when I set product_name_un as the default
> search field and query with ../select/?q=blue car&rows=10&..., I get
> empty results despite the fact that I have the "blue car" value in the
> index in that field. I have to use quotes again to fix that... Shouldn't it
> determine the field type and apply the corresponding
> analyzers/tokenizers/etc.?
>
> --
> View this message in context: 
> http://www.nabble.com/Dismax%3A-Impossible-to-search-for-a-_phrase_-in-tokenized-and-untokenized-fields-at-the-same-time-tp25832932p25832932.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: Dismax: Impossible to search for a _phrase_ in tokenized and untokenized fields at the same time

2009-10-10 Thread Alex Baranov
I guess this is a bug that should be filed in JIRA (if it is not there
already). Should I add it?


> Hmmm, right.  This is due to the fact that the Lucene query parser
> (still actually used in dismax) breaks things up by whitespace
> *before* analysis (so the analyzer for the untokenized field never
> sees the two tokens together).
>

Is there a way to tell the Lucene parser not to break things up on
whitespace? Should one use some whitespace escape code instead of an actual
space character?

I think what we need here is some kind of "special quotes" which would tell
Solr not to use the Lucene query parser at all (might be very useful for
situations like this, when the search is applied to the default field, i.e.
when the field is not specified).

> If it's OK to require both blue and car in product_name, then you can
> just set a slop for explicit phrase queries with the qs parameter.
>

It's not good for me unfortunately, but thanks for the suggestion.

Alex Baranov.

On Sat, Oct 10, 2009 at 3:01 PM, Yonik Seeley wrote:

> On Sat, Oct 10, 2009 at 6:34 AM, Alex Baranov 
> wrote:
> >
> > Hello,
> >
> > It seems to me that there is no way I can use the dismax handler for
> > searching in both tokenized and untokenized fields while I'm searching
> > for a phrase.
> >
> > Consider the following example. I have two fields in the index:
> > product_name and product_name_un. The schema looks like:
> >
> > <fieldType name="text_untokenized" class="solr.TextField"
> >     positionIncrementGap="100" omitNorms="true">
> >   <analyzer>
> >     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >   <analyzer>
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.WordDelimiterFilterFactory"
> >         generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >         catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.StopFilterFactory"/>
> >     <filter class="solr.SnowballPorterFilterFactory" language="English"/>
> >   </analyzer>
> > </fieldType>
> >
> > <field name="product_name" type="text" indexed="true" stored="true"/>
> > <field name="product_name_un" type="text_untokenized" indexed="true"
> >     stored="true"/>
> >
> > I'm using dismax to search in both of them at the same time:
> > "defType=dismax&qf=product_name product_name_un^2.0" (this is done to
> > bring to the top of the results the products whose name _equals_ the
> > entered criteria).
> >
> > 1. When I'm searching for a phrase (two or more keywords), e.g.
> > <blue car>, the input string is tokenized, and even though I have
> > product_name_un="blue car" in the index, the "product_name_un^2.0" part
> > of the dismax config has no effect.
>
> Hmmm, right.  This is due to the fact that the Lucene query parser
> (still actually used in dismax) breaks things up by whitespace
> *before* analysis (so the analyzer for the untokenized field never
> sees the two tokens together).
>
> > 2. When I enter <"blue car"> (in quotes), the string is not tokenized
> > and the "product_name_un^2.0" part works, but nothing can be found in
> > the product_name field.
>
> Using explicit quotes will make a phrase query, so blue and car must
> appear right next to each other in product_name.
> If it's OK to require both blue and car in product_name, then you can
> just set a slop for explicit phrase queries with the qs parameter.
>
> -Yonik
> http://www.lucidimagination.com
>
>
>
>
>
> > I.e. there is no way to get a proper search against the two fields at
> > the same time. The workaround that I found is using the "bq" parameter
> > to specify a boost query against the product_name_un field. But I don't
> > think that this should be the only solution.
> >
> >
> > Another note, related to that: when I set product_name_un as the default
> > search field and query with ../select/?q=blue car&rows=10&..., I get
> > empty results despite the fact that I have the "blue car" value in the
> > index in that field. I have to use quotes again to fix that... Shouldn't
> > it determine the field type and apply the corresponding
> > analyzers/tokenizers/etc.?
> >
> > --
> > View this message in context:
> http://www.nabble.com/Dismax%3A-Impossible-to-search-for-a-_phrase_-in-tokenized-and-untokenized-fields-at-the-same-time-tp25832932p25832932.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
> >
>


Re: DIH and EmbeddedSolr

2009-10-10 Thread rohan rai
import org.apache.solr.client.solrj.SolrRequest.METHOD;
import org.apache.solr.common.params.ModifiableSolrParams;

ModifiableSolrParams p = new ModifiableSolrParams();
p.add("qt", "/dataimport");      // route the request to the DataImportHandler
p.add("command", "full-import");
server.query(p, METHOD.POST);

I do this

But it starts giving me this exception

SEVERE: Full Import failed
java.util.concurrent.RejectedExecutionException
at
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1760)
at
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
at
java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:216)
at
java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:366)
at
org.apache.solr.update.DirectUpdateHandler2$CommitTracker.scheduleCommitWithin(DirectUpdateHandler2.java:466)
at
org.apache.solr.update.DirectUpdateHandler2.deleteByQuery(DirectUpdateHandler2.java:322)
at
org.apache.solr.update.processor.RunUpdateProcessor.processDelete(RunUpdateProcessorFactory.java:69)
at
org.apache.solr.handler.dataimport.SolrWriter.doDeleteAll(SolrWriter.java:192)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:332)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:386)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:377)

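For comparison, the same request can also be issued through
SolrRequest.setPath instead of the qt parameter. A minimal sketch, assuming
"server" is an already-initialized EmbeddedSolrServer:

  import org.apache.solr.client.solrj.request.QueryRequest;
  import org.apache.solr.common.params.ModifiableSolrParams;

  ModifiableSolrParams params = new ModifiableSolrParams();
  params.set("command", "full-import");
  QueryRequest req = new QueryRequest(params);
  req.setPath("/dataimport");  // override the default /select path
  req.process(server);         // works against an embedded server too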



2009/10/10 Noble Paul നോബിള്‍ नोब्ळ् 

> you may need to extend a SolrRequest and set appropriate path
> ("/dataimport") and other params
> then you may invoke the request method.
>
> On Sat, Oct 10, 2009 at 11:07 AM, rohan rai  wrote:
> > The configuration is not an issue...
> > But how do I invoke it...
> >
> > I have only known a URL way to invoke it and thus import the data into
> > the index...
> > like http://localhost:8983/solr/db/dataimport?command=full-import
> > But with embedded I haven't been able to figure it out
> >
> > Regards
> > Rohan
> > 2009/10/10 Noble Paul നോബിള്‍ नोब्ळ् 
> >>
> >> I guess it should be possible... what are the problems you encounter?
> >>
> >> On Sat, Oct 10, 2009 at 10:56 AM, rohan rai 
> wrote:
> >> > Have been unable to use DIH for Embedded Solr
> >> >
> >> > Is there a way??
> >> >
> >> > Regards
> >> > Rohan
> >> >
> >>
> >>
> >>
> >> --
> >> -
> >> Noble Paul | Principal Engineer| AOL | http://aol.com
> >
> >
>
>
>
> --
> -
> Noble Paul | Principal Engineer| AOL | http://aol.com
>


Solr 1.4 Release Party

2009-10-10 Thread Israel Ekpo
I can't wait...

-- 
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.


Re: DIH and EmbeddedSolr

2009-10-10 Thread rohan rai
This is pretty unstable... anyone have any clue? Sometimes it even creates
the index, sometimes it does not.

But every time I do get this exception.

Regards
Rohan
On Sat, Oct 10, 2009 at 6:07 PM, rohan rai  wrote:

> ModifiableSolrParams p = new ModifiableSolrParams();
> p.add("qt", "/dataimport");
> p.add("command", "full-import");
> server.query(p, METHOD.POST);
>
> I do this
>
> But it starts giving me this exception
>
> SEVERE: Full Import failed
> java.util.concurrent.RejectedExecutionException
> at
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1760)
> at
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:216)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:366)
> at
> org.apache.solr.update.DirectUpdateHandler2$CommitTracker.scheduleCommitWithin(DirectUpdateHandler2.java:466)
> at
> org.apache.solr.update.DirectUpdateHandler2.deleteByQuery(DirectUpdateHandler2.java:322)
> at
> org.apache.solr.update.processor.RunUpdateProcessor.processDelete(RunUpdateProcessorFactory.java:69)
> at
> org.apache.solr.handler.dataimport.SolrWriter.doDeleteAll(SolrWriter.java:192)
> at
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:332)
> at
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:386)
> at
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:377)
>
>
>
>
>
> 2009/10/10 Noble Paul നോബിള്‍ नोब्ळ् 
>
>> you may need to extend a SolrRequest and set appropriate path
>> ("/dataimport") and other params
>> then you may invoke the request method.
>>
>> On Sat, Oct 10, 2009 at 11:07 AM, rohan rai  wrote:
>> > The configuration is not an issue...
>> > But how do I invoke it...
>> >
>> > I have only known a URL way to invoke it and thus import the data into
>> > the index...
>> > like http://localhost:8983/solr/db/dataimport?command=full-import
>> > But with embedded I haven't been able to figure it out
>> >
>> > Regards
>> > Rohan
>> > 2009/10/10 Noble Paul നോബിള്‍ नोब्ळ् 
>> >>
>> >> I guess it should be possible... what are the problems you encounter?
>> >>
>> >> On Sat, Oct 10, 2009 at 10:56 AM, rohan rai 
>> wrote:
>> >> > Have been unable to use DIH for Embedded Solr
>> >> >
>> >> > Is there a way??
>> >> >
>> >> > Regards
>> >> > Rohan
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> -
>> >> Noble Paul | Principal Engineer| AOL | http://aol.com
>> >
>> >
>>
>>
>>
>> --
>> -
>> Noble Paul | Principal Engineer| AOL | http://aol.com
>>
>
>


Re: Question regarding proximity search

2009-10-10 Thread AHMET ARSLAN

> Hi
> I would appreciate if someone can throw some light on the following point
> regarding proximity search.
> I have a search box, and if a user comes and types in "honda car" WITHOUT
> any double quotes, I want to get all documents with matches, and they
> should also be ranked based on proximity, i.e. the nearer the two terms
> are to each other, the higher the rank.
> From the admin it looks like, in order to test proximity, I have to always
> give the words in double quotes and a slop value:
> http://localhost:8983/solr/select/?q="honda+car"~12&version=2.2&start=0&rows=10&indent=on
>
> Hence it looks like, from the admin point of view, in order to do
> proximity I have to always give it in double quotes.
>
> My question is: in order to do a proximity search, do we always have to
> pass the query as a phrase, i.e. in double quotes?

Yes, if you are using LuceneQParserPlugin.
 
> The next question is that I thought, using the dismax handler, I could do
> a search on a field and specify the ps value in order to define proximity.
>
> This is the query I am giving, and I get back no results. Any advice on
> where I am going wrong?
>
> http://localhost:8983/solr/proxTest/?q="honda car"

Can you try http://localhost:8983/solr/proxTest/?q=honda+car ?
You don't need quotes in dismax.
You can append &debugQuery=true to the URL to see what's going on.
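
For proximity-based ranking without quotes, dismax's pf/ps parameters build
an implicit sloppy phrase query over the whole input, e.g. with illustrative
field and slop values:

  /select?defType=dismax&q=honda+car&qf=text&pf=text&ps=12

Documents where the terms appear within 12 positions of each other get the
extra phrase boost.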

Hope this helps.



Customizing solr search: SpanQueries (revisited)

2009-10-10 Thread seanoc5

Hi all,
I am trying to use SpanQueries to save *all* hits for a custom query type
(e.g. defType=fooSpanQuery), along with token positions. I have this working
in straight Lucene, so my challenge is to implement it half-intelligently in
Solr. At the moment, I can't figure out where and how to customize the
'inner' search process.

So far, I have my own SpanQParser and SpanQParserPlugin, which
successfully return a hard-coded span query (but this is not critical for my
current challenge, I believe).
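
For context, a minimal sketch of what that wiring looks like (names and the
hard-coded query are illustrative; the extension points are the stock Solr
1.4 ones):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.spans.*;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.common.util.NamedList;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.search.QParser;
  import org.apache.solr.search.QParserPlugin;

  public class SpanQParserPlugin extends QParserPlugin {
    public void init(NamedList args) {}

    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
      return new QParser(qstr, localParams, params, req) {
        public Query parse() {
          // hard-coded span query, a stand-in for real query-string parsing
          SpanQuery blue = new SpanTermQuery(new Term("text", "blue"));
          SpanQuery car  = new SpanTermQuery(new Term("text", "car"));
          return new SpanNearQuery(new SpanQuery[]{blue, car}, 5, true);
        }
      };
    }
  }

It is registered in solrconfig.xml with something like
<queryParser name="fooSpanQuery" class="SpanQParserPlugin"/>.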

I also have managed to configure Solr to call my custom
SpanQueryComponent, which I believe is the focus of my challenge. At this
initial stage, I have simply extended QueryComponent and overridden
QueryComponent.process() while I am trying to find my way through the code
:-).

So, with all that setup, can someone point me in the right direction for
custom processing of a query (or just the query results)? A few differences
for my use-case are:
-- I want to save every hit along with position information. I believe this
means I want to use SpanQueries (like I have in Lucene), but perhaps there
are other options.
-- I do not need to build much in the way of a response. This is an
automated analysis, so no user will see the Solr results. I will save them
to a database, but for simplicity just a
log.info("Score:{}, Term:{}, TokenNumber:{}",...)
would be great at the moment (see the sketch after this list).
-- I will always process every span, even those with near-zero 'score'.
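
In straight Lucene, that per-span logging loop is essentially the following
(a sketch, assuming an IndexReader and an already-built SpanQuery):

  Spans spans = spanQuery.getSpans(reader);
  while (spans.next()) {
    // doc id plus the start/end token positions of this span hit
    log.info("Doc:{}, Start:{}, End:{}",
             new Object[]{spans.doc(), spans.start(), spans.end()});
  }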

    I think I want to focus on SpanQueryComponent.process(), probably
overriding the functionality in (SolrIndexSearcher)searcher.search(result,cmd),
which seems to just call
getDocListC(qr,cmd); // ?? is this my main focus point??

    Does this seem like a reasonable approach? If so, how do I do it? I
think I'm missing something obvious; perhaps there is an easy way to extend
SolrIndexSearcher in solrconfig.xml to have my custom SpanQueryComponent
call a custom IndexSearcher where I simply override getDocListC()?

    And for extra-karma-credit: any thoughts on performance gains (or losses)
if I basically drop most of the advanced optimization like TopDocsCollector
and such? If I have thousands of queries, and want to save *every* span for
each query, is there likely to be significant overhead from the
optimizations which are intended for users to 'page' through windows of
hits?

Also, thanks to Grant for replying to my previous inquiry
(http://osdir.com/ml/solr-dev.lucene.apache.org/2009-05/msg00010.html). This
email is partly me trying to implement his suggestion, and partly just
trying to understand basic Solr customization. I tried sending out a
previous draft of this message yesterday, but haven't seen it on the lists,
so my apologies if this becomes a duplicate post.
Thank you,

Sean
-- 
View this message in context: 
http://www.nabble.com/Customizing-solr-search%3A-SpanQueries-%28revisited%29-tp25838412p25838412.html
Sent from the Solr - User mailing list archive at Nabble.com.



http replication transfer speed

2009-10-10 Thread Mark Miller
Anyone know why you would see a transfer speed of just 10-20MB/s over a
gigabit network connection?

Even with standard drives, I would expect to at least see around 40MB/s.
Has anyone seen over 10-20 using replication?

Any ideas on what the bottleneck could be? I think even a standard
drive can do writes of a bit over 40MB/s, and certainly reads over that.

Thoughts?

-- 
- Mark

http://www.lucidimagination.com





Optimize on slaves?

2009-10-10 Thread Matthew Painter
Hi,
 
Simple question! I have a nightly cron job to send the optimize command
to Solr on our master instance. Is this also required on Solr replicated
slaves to optimise their indexes?
 
Thanks,
Matt

This e-mail message and any attachments are CONFIDENTIAL to the addressee(s) 
and may also be LEGALLY PRIVILEGED.  If you are not the intended addressee, 
please do not use, disclose, copy or distribute the message or the information 
it contains.  Instead, please notify me as soon as possible and delete the 
e-mail, including any attachments.  Thank you.


Re: Optimize on slaves?

2009-10-10 Thread Walter Underwood

No. The slaves will copy the current index, optimized or not. --wunder

On Oct 10, 2009, at 4:33 PM, Matthew Painter wrote:

> Hi,
>
> Simple question! I have a nightly cron job to send the optimize command
> to Solr on our master instance. Is this also required on Solr replicated
> slaves to optimise their indexes?
>
> Thanks,
> Matt




RE: Optimize on slaves?

2009-10-10 Thread Matthew Painter
My apologies; I've just found the answer (that optimisation should be on
the master server only)



From: Matthew Painter 
Sent: Sunday, 11 October 2009 12:34 p.m.
To: 'solr-user@lucene.apache.org'
Subject: Optimize on slaves?


Hi,
 
Simple question! I have a nightly cron job to send the optimize command
to Solr on our master instance. Is this also required on Solr replicated
slaves to optimise their indexes?
 
Thanks,
Matt



Re: http replication transfer speed

2009-10-10 Thread Mark Miller



Might a drive that can do 40+ MB/s, but that's also getting query load, have
its writes knocked down to that?


- Mark

http://www.lucidimagination.com (mobile)

On Oct 10, 2009, at 6:41 PM, Mark Miller  wrote:


> Anyone know why you would see a transfer speed of just 10-20MB/s over a
> gigabit network connection?
>
> Even with standard drives, I would expect to at least see around 40MB/s.
> Has anyone seen over 10-20 using replication?
>
> Any ideas on what the bottleneck could be? I think even a standard
> drive can do writes of a bit over 40MB/s, and certainly reads over that.
>
> Thoughts?
>
> --
> - Mark
>
> http://www.lucidimagination.com





Tips on speeding up indexing needed...

2009-10-10 Thread William Pierce

Folks:

I have a corpus of approx 6M documents, each of approx 4K bytes.
Currently, the way indexing is set up, I read documents from a database and
issue Solr post requests in batches (batches are sized so that the
maxPostSize of Tomcat, which is set to 2MB, is adhered to). This means that
in each batch we write approx 600 or so documents to Solr. What I am seeing
is that I am able to push about 2500 docs per minute, or approx 40 or so per
second.


I saw in Erik's talk on Friday that speeds of 250 docs/sec to 25000 docs/sec 
have been achieved.  Needless to say I am sure that performance numbers vary 
widely and are dependent on the domain, machine configurations, etc.


I am running on Windows 2003 server, with 4 GB RAM, dual core xeon.

Any tips on what I can do to speed this up?

Thanks,

Bill 



Re: Tips on speeding up indexing needed...

2009-10-10 Thread William Pierce
Oh, and one more thing... For historical reasons our apps run using msft
technologies, so using SolrJ would be next to impossible at the present
time.


Thanks in advance for your help!

-- Bill

--
From: "William Pierce" 
Sent: Saturday, October 10, 2009 5:47 PM
To: 
Subject: Tips on speeding up indexing needed...


Folks:

I have a corpus of approx 6 M documents each of approx 4K bytes. 
Currently, the way indexing is set up I read documents from a database and 
issue solr post requests in batches (batches are set up so that the 
maxPostSize of tomcat which is set to 2MB is adhered to).  This means that 
in each batch we write approx 600 or so documents to SOLR.  What I am 
seeing is that I am able to push about 2500 docs per minute or approx 40 
or so per second.


I saw in Erik's talk on Friday that speeds of 250 docs/sec to 25000 
docs/sec have been achieved.  Needless to say I am sure that performance 
numbers vary widely and are dependent on the domain, machine 
configurations, etc.


I am running on Windows 2003 server, with 4 GB RAM, dual core xeon.

Any tips on what I can do to speed this up?

Thanks,

Bill



Re: Tips on speeding up indexing needed...

2009-10-10 Thread Lance Norskog
A few things off the bat:
1) do not commit until the end.
2) use the DataImportHandler - it runs inside Solr and reads the
database. This cuts out the HTTP transfer/XML xlation overheads.
3) examine your schema. Some of the text analyzers are quite slow.
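
For point 1, that can be as simple as posting the <add> batches with no
commit and issuing a single commit at the end (host/port illustrative):

  POST http://localhost:8080/solr/update      (once per batch)
    <add><doc>...</doc><doc>...</doc></add>

  POST http://localhost:8080/solr/update      (once, after all batches)
    <commit/>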

Solr tips:
http://wiki.apache.org/solr/SolrPerformanceFactors

Lucene tips:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

And, what you don't want to hear: for jobs like this, Solr/Lucene is
disk-bound. The Windows NTFS file system is much slower than what is
available for Linux or the Mac, and these numbers are for those
machines.

Good luck!

Lance Norskog


On Sat, Oct 10, 2009 at 5:57 PM, William Pierce  wrote:
> Oh and one more thing...For historical reasons our apps run using msft
> technologies, so using SolrJ would be next to impossible at the present
> time
>
> Thanks in advance for your help!
>
> -- Bill
>
> --
> From: "William Pierce" 
> Sent: Saturday, October 10, 2009 5:47 PM
> To: 
> Subject: Tips on speeding up indexing needed...
>
>> Folks:
>>
>> I have a corpus of approx 6 M documents each of approx 4K bytes.
>> Currently, the way indexing is set up I read documents from a database and
>> issue solr post requests in batches (batches are set up so that the
>> maxPostSize of tomcat which is set to 2MB is adhered to).  This means that
>> in each batch we write approx 600 or so documents to SOLR.  What I am seeing
>> is that I am able to push about 2500 docs per minute or approx 40 or so per
>> second.
>>
>> I saw in Erik's talk on Friday that speeds of 250 docs/sec to 25000
>> docs/sec have been achieved.  Needless to say I am sure that performance
>> numbers vary widely and are dependent on the domain, machine configurations,
>> etc.
>>
>> I am running on Windows 2003 server, with 4 GB RAM, dual core xeon.
>>
>> Any tips on what I can do to speed this up?
>>
>> Thanks,
>>
>> Bill
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Facets with an IDF concept

2009-10-10 Thread Lance Norskog
In Solr a facet is assigned one number: the number of documents in
which it appears. The facets are sorted by that number.  Would your
use case be solved with a second number that is formulated from the
relevance of the associated documents? For example:

   facet relevance = count * sum(scores of documents) with
coefficients for each input?

To do this, for each document counted by the facet, you then have to
find that document in the result list and pull the score. This would
be much slower than the current "count the documents" algorithm. But
if you have limited the document list via filter, this could still be
fast enough for interactive use.
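
In pseudo-Java, that per-facet pass would look something like this (purely
illustrative; these collections are not an existing Solr API):

  float sum = 0f;
  int count = 0;
  for (int docId : docsMatchingFacetValue) {  // docs counted by the facet
    Float score = resultScores.get(docId);    // score pulled from result list
    if (score != null) { count++; sum += score; }
  }
  float facetRelevance = count * sum;         // the formula proposed above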

If I wanted to make a tag cloud, this is how I would do it.

On Fri, Oct 9, 2009 at 3:58 PM, Asif Rahman  wrote:
> Hi Wojtek:
>
> Sorry for the late, late reply.  I haven't implemented this yet, but it is
> on the (long) list of my todos.  Have you made any progress?
>
> Asif
>
> On Thu, Aug 13, 2009 at 5:42 PM, wojtekpia  wrote:
>
>>
>> Hi Asif,
>>
>> Did you end up implementing this as a custom sort order for facets? I'm
>> facing a similar problem, but not related to time. Given 2 terms:
>> A: appears twice in half the search results
>> B: appears once in every search result
>> I think term A is more "interesting". Using facets sorted by frequency,
>> term
>> B is more important (since it shows up first). To me, terms that appear in
>> all documents aren't really that interesting. I'm thinking of using a
>> combination of document count (in the result set, not globally) and term
>> frequency (in the result set, not globally) to come up with a facet sort
>> order.
>>
>> Wojtek
>> --
>> View this message in context:
>> http://www.nabble.com/Facets-with-an-IDF-concept-tp24071160p24959192.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>
>
> --
> Asif Rahman
> Lead Engineer - NewsCred
> a...@newscred.com
> http://platform.newscred.com
>



-- 
Lance Norskog
goks...@gmail.com


Re: Is negative boost possible?

2009-10-10 Thread ragi

If you don't want to do a pure negative query and just want to boost a few
documents down based on a matching criterion, try using the linear function
(one of the functions available as a boost function) with a negative m
(slope). We could solve our problem this way: we wanted to negatively boost
some documents based on certain keywords.
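
For example, with an illustrative field name (linear(x,m,c) computes m*x + c,
so the boost shrinks as the field value grows):

  defType=dismax&q=...&bf=linear(days_since_update,-1,100)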

Marc Sturlese wrote:
> 
> 
> :>the only way to "negative boost" is to "positively boost" the inverse...
> :>
> :>(*:* -field1:value_to_penalize)^10
> 
> This will do the job as well, since bq supports pure negative queries (at
> least in trunk):
> bq=-field1:value_to_penalize^10
> 
> http://wiki.apache.org/solr/SolrRelevancyFAQ#head-76e53db8c5fd31133dc3566318d1aad2bb23e07e
> 
> 
> hossman wrote:
>> 
>> 
>> : Use decimal figure less than 1, e.g. 0.5, to express less importance.
>> 
>> but that's stil la positive boost ... it still increases the scores of 
>> documents that match.
>> 
>> the only way to "negative boost" is to "positively boost" the inverse...
>> 
>>  (*:* -field1:value_to_penalize)^10
>> 
>> : > I am looking for a way to assign negative boost to a term in Solr
>> query.
>> : > Our use scenario is that we want to boost matching documents that are
>> : > updated recently and penalize those that have not been updated for a
>> long
>> : > time.  There are other terms in the query that would affect the
>> scores as
>> : > well.  For example we construct a query similar to this:
>> : > 
>> : > *:* field1:value1^2  field2:value2^2 lastUpdateTime:[NOW/DAY-90DAYS
>> TO *]^5
>> : > lastUpdateTime:[* TO NOW/DAY-365DAYS]^-3
>> : > 
>> : > I notice it's not possible to simply use a negative boosting factor
>> in the
>> : > query.  Is there any way to achieve such result?
>> : > 
>> : > Regards,
>> : > Shi Quan He
>> : > 
>> : >   
>> 
>> 
>> 
>> -Hoss
>> 
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Is-negative-boost-possible--tp25025775p25840621.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Problems with WordDelimiterFilterFactory

2009-10-10 Thread Shalin Shekhar Mangar
On Fri, Oct 9, 2009 at 3:33 AM, Patrick Jungermann <
patrick.jungerm...@googlemail.com> wrote:

> Hi Bern,
>
> the problem is the character sequence "--". A query is not allowed to
> have minus characters that immediately follow one another. Remove one
> minus character and the query will be parsed without problems.
>
>
Or you could escape the hyphen character. If you are using SolrJ, use
ClientUtils.escapeQueryChars on the query string.

-- 
Regards,
Shalin Shekhar Mangar.


Re: Default query parameter for one core

2009-10-10 Thread Shalin Shekhar Mangar
On Fri, Oct 9, 2009 at 7:56 PM, Michael  wrote:

> Hm... still no success.  Can anyone point me to a doc that explains
> how to define and reference core properties?  I've had no luck
> searching Google.
>
> Shalin, I gave an identical '<property name="shardsParam" value="..."/>'
> tag to each of my cores, and referenced ${solr.core.shardsParam} (with no
> default specified via a colon) in solrconfig.xml.  I get an error on
> startup:
>
>
I should have mentioned it earlier, but the property name in your case would
be just ${shardsParam}. The "solr.core." prefix is only for automatically
added properties such as name, instanceDir, dataDir, configName, and
schemaName.

-- 
Regards,
Shalin Shekhar Mangar.


Re: Default query parameter for one core

2009-10-10 Thread Shalin Shekhar Mangar
On Fri, Oct 9, 2009 at 9:39 PM, Michael  wrote:

> For posterity...
>
> After reading through http://wiki.apache.org/solr/SolrConfigXml and
> http://wiki.apache.org/solr/CoreAdmin and
> http://issues.apache.org/jira/browse/SOLR-646, I think there's no way
> for me to make only one core specify &shards=foo, short of duplicating
> my solrconfig.xml for that core and adding one line:
>
> - I can't use a variable like ${shardsParam} in a single shared
> solrconfig.xml, because the line
>    <str name="shards">${shardsParam}</str>
> has to be in there, and that forces a (possibly empty) &shards
> parameter onto cores that *don't* need one, causing a
> NullPointerException.
>
>
Well, we can fix the NPE :)

Please raise an issue.


> - I can't suck in just that one line via a SOLR-646-style import, like
> (roughly):
>    #solrconfig.xml
>    <import file="${shardsFile}">...</import>
>
>    #solr.xml
>    <core ...><property name="shardsFile" value="some_file"/></core>...
>    <core ...><property name="shardsFile" value="/dev/null"/></core>...
> because SOLR-646's <import> feature got cut.
>
> So I think my best bet is to make two mostly-identical
> solrconfig.xmls, and point core0 to the one specifying a &shards=
> parameter:
>
>    <core name="core0" instanceDir="core0" config="solrconfig-shards.xml"/>
>
> I don't like the duplication of config, but at least it accomplishes my
> goal!
>
>
There is another way too. Each plugin in Solr now supports a configuration
attribute named "enable" which can be true or false. You can control the
value (true/false) through a variable, so you can duplicate just the handler
instead of the complete solrconfig.xml.
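
For example (a sketch; the handler name and the useShards variable are made
up):

  <requestHandler name="/shardsearch" class="solr.SearchHandler"
                  enable="${useShards:false}">
    <lst name="defaults">
      <str name="shards">host1:8983/solr,host2:8983/solr</str>
    </lst>
  </requestHandler>

Cores that define useShards=true get the handler; the others don't.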

-- 
Regards,
Shalin Shekhar Mangar.


Re: Slave re-replication of index over and over

2009-10-10 Thread Shalin Shekhar Mangar
On Fri, Oct 9, 2009 at 9:49 PM, Moshe Cohen  wrote:

> Hi,
> I am using SOLR 1.4 (July 23rd nightly build) with a master-slave setup.
> I have twice encountered an occurrence of the slave recreating the indexes
> over and over again.
> Couldn't find any pointers in the log.
> Any help would be appreciated.
>
>
I vaguely remember a bug which caused the slave to loop. Can you upgrade to
the latest nightly and see if that solves the problem?

-- 
Regards,
Shalin Shekhar Mangar.


Using mincount with date facet in Solr 1.4

2009-10-10 Thread Aakash Dharmadhikari
hi,

  I am creating facets on a date field declared along these lines:

  <field name="daysForFilter" type="date" indexed="true" stored="true"
         multiValued="true"/>

  The field can contain any number of dates even 0. I am making a facet
query on the field with following query parameters:

  facet.date=daysForFilter
  facet.date.gap=%2B1DAY
  facet.date.end=2009-10-16T00:00:00Z
  facet=true
  facet.date.start=2009-10-11T00:00:00Z

  But I was getting facets even with count 0. So I tried the following
combinations of mincount parameters, as none was specified in the wiki
for date faceting:

  f.daysForFilter.facet.mincount=1
  facet.mincount=1
  f.date.mincount=1

  But none of these work. Could anyone please let me know how I can do this?

regards,
aakash