Re: How to improve Solr search performance

2008-04-09 Thread 李银松
2008/4/9, Chris Hostetter <[EMAIL PROTECTED]>:
>
>
> : most of time seems to be used for the writer getting and writing the
> docs
> : can those docs prefetched?
>
> as mentioned, the documentCache can help you out in the common case, but
> 1-4 seconds for just the XMLWriting seems pretty high ...


It really works, thanks.
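For reference, the documentCache Hoss mentions lives in solrconfig.xml; a typical entry looks like this (the sizes here are illustrative, not a recommendation — tune them to your own document counts and hit rates):

```xml
<!-- caches stored Documents so response writing need not re-read them from the index -->
<documentCache
    class="solr.LRUCache"
    size="512"
    initialSize="512"
    autowarmCount="0"/>
```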

> 1) how are you timing this (ie: what exactly are you measuring)


The QTime is the time used for Solr to find the docs
And I got the time from dispatchfilter received the request to
responsewriter write the response
It is much larger than QTime.

> 2) how many stored fields do each of your documents have? (not how many are
> in your schema.xml, how many do each of your docs really have in them)


7-9 fields
only one of the fields is text, rest of them are short string or int ...


> ...having *lots* of stored fields can slow down retrieval of the Document
> (and Document retrieval is delayed until response writing) so if you have
> thousands that might account for it.  If your use case is to only ever
> return the "ID" field, then not storing anything else will help keep your
> total index size smaller and should speed up the response writing.
>
>
>
>
> -Hoss
>
>


Can I find out which field matched?

2008-04-09 Thread Umar Shah
Hi,

If I throw a query at the Solr index, is there a mechanism where I can
find out which fields matched the query (and the score of each match)?

Example:
for fields A, B and C,
if query q has term1 term2 term3
Field A matches term1 term2
Field C matches term3

can i get component scores of the whole match as
MatchA
MatchB (0.0 perhaps)
MatchC

I will be using these scores from a custom plugin. What classes do I need to
use for such scores?

-umar
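One existing mechanism that gets close to this: Solr's debugQuery parameter returns an "explain" section that breaks each document's score into per-clause (and therefore per-field) contributions. A request like the following is a sketch, with hypothetical field names matching Umar's example:

```
http://localhost:8983/solr/select?q=A:term1+OR+B:term2+OR+C:term3&debugQuery=on
```

Programmatically, the same breakdown comes from Lucene's Searcher.explain(query, docId), which a custom plugin can call per result.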


Re: Snipets Solr/nutch

2008-04-09 Thread khirb7

thank you for your response.

I have another problem with snippets. Here is the problem:
I transform the HTML code into text, then I index all of the generated text
into one field called myText. Many pages share a common header with common
information (example: a web site about President Bush), and the word "bush"
appears in this header. If I highlight the field myText while searching for
the word "bush", I will always get the same highlighted sentence containing
"bush" (the sentence from the common header), because I have set fragsize to
150 and Solr returns, from the whole text, the first occurrence of the word
("bush") highlighted. How can I deal with that? I was told that NutchWAX
handles this problem; is that true? If true, how can I integrate the Nutch
classes into Solr?

thank you in advance.
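For reference, the highlighting setup described above corresponds to request parameters like these (hl.snippets is the knob that asks Solr for more than just the first matching fragment, which may be one way around the common-header problem):

```
http://localhost:8983/solr/select?q=myText:bush&hl=true&hl.fl=myText&hl.fragsize=150&hl.snippets=3
```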
-- 
View this message in context: 
http://www.nabble.com/Snipets-Solr-nutch-tp16537216p16585594.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Return the result only field A or field B is non-zero?

2008-04-09 Thread Walter Underwood
This would be trivial if you also stored boolean fields for
aiszero and biszero. That would also be fast, I expect.

wunder

On 4/8/08 11:53 PM, "Vinci" <[EMAIL PROTECTED]> wrote:

> 
> Hi all, 
> 
> I want to limit the search results by 2 numerical fields, A and B, where Solr
> returns a result only if the value in field A or B is non-zero. Is that possible,
> or do I need to change the document and schema? Or do I need to change the schema
> as well as the query?
> 
> Thank you,
> Vinci
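Walter's suggestion above, sketched as a filter query (aiszero/biszero are his proposed boolean fields, which you would populate at indexing time):

```
q=your+query&fq=aiszero:false+OR+biszero:false
```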



Re: Distributed Search

2008-04-09 Thread Yonik Seeley
On Wed, Apr 9, 2008 at 2:00 AM, oleg_gnatovskiy
<[EMAIL PROTECTED]> wrote:
>  We are using the Chain Collapse patch as well. Will that not work over a
>  distributed index?

Since there is no explicit distributed support for it, it would only
collapse per-shard.

-Yonik


Re: How to improve Solr search performance

2008-04-09 Thread Chris Hostetter

: 1) how are you timing this (ie: what exactly are you measuring)

: And I got the time from dispatchfilter received the request to
: responsewriter write the response
: It is much larger than QTime.

can you be more specific about what you mean when you say "And I got the 
time from dispatchfilter..." What *exactly* are you looking at (ie: is 
this a time you are seeing in a log file? if so, which log file? ... is this 
timing code you added to the dispatch filter yourself?  what *exactly* are 
you looking at?)

I ask because it's possible there is network IO overhead included in 
communicating with your client (I would be surprised if it was significant 
if you are only returning a single field for the first 50 results, but I 
know nothing about your network setup, or what your client code looks like 
-- so anything is possible).

: 7-9 fields
: only one of the fields is text, rest of them are short string or int ...

How big is the text field?  Are you talking about a few hundred chars or 
several KB of text per doc?

Is <enableLazyFieldLoading> set to true in your solrconfig.xml?

(I forgot we had that option when I sent my last email.) As long 
as you are using the "fl" param with just your uniqueKey field, document 
retrieval should be "fast enough" and fairly consistent.


-Hoss
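Concretely, the "fl" advice above amounts to a request shaped like this (assuming the uniqueKey field is named "id"):

```
http://localhost:8983/solr/select?q=...&fl=id&rows=50
```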



Re: Nightly build compile error?

2008-04-09 Thread Chris Hostetter
: 
: Hello everyone. I downloaded the latest nightly build from
: http://people.apache.org/builds/lucene/solr/nightly/. When I tried to
: compile it, I got the following errors:
: 
: [javac] Compiling 189 source files to
: /home/csweb/apache-solr-nightly/build/core
: [javac]
: 
/home/csweb/apache-solr-nightly/src/java/org/apache/solr/handler/admin/MultiCoreHandler.java:93:
: cannot find symbol
: [javac] symbol  : variable CREATE

I'm not sure how you managed to get that far ... because of some 
refactoring that was done a little while back, the nightly builds don't 
currently include all of the source, see SOLR-510.

The nightly builds do however already contain all the pre-built jars (and 
war) that you need to run Solr ... if you want to compile from source, I 
would just check out from subversion.



-Hoss



Re: Nightly build compile error?

2008-04-09 Thread oleg_gnatovskiy



hossman wrote:
> 
> : 
> : Hello everyone. I downloaded the latest nightly build from
> : http://people.apache.org/builds/lucene/solr/nightly/. When I tried to
> : compile it, I got the following errors:
> : 
> : [javac] Compiling 189 source files to
> : /home/csweb/apache-solr-nightly/build/core
> : [javac]
> :
> /home/csweb/apache-solr-nightly/src/java/org/apache/solr/handler/admin/MultiCoreHandler.java:93:
> : cannot find symbol
> : [javac] symbol  : variable CREATE
> 
> I'm not sure how you managed to get that far ... because of some 
> refactoring that was done a little while back, the nightly builds don't 
> currently include all of the source, see SOLR-510.
> 
> The nightly builds do however already contain all the pre-built jars (and 
> war) that you need to run Solr ... if you want to compile from source, I 
> would just check out from subversion.
> 
> 
> 
> -Hoss
> 
> 
> 
Yup, that works.
-- 
View this message in context: 
http://www.nabble.com/Nightly-build-compile-error--tp16577739p16592725.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Distributed Search

2008-04-09 Thread oleg_gnatovskiy

Do you have any suggestions as to how we would be able to implement chain
collapse over the entire distributed index? Our collection is 27 GB, 15
million documents. Do you think there is a way to optimize Solr performance
enough to not have to segment such a large collection?


Yonik Seeley wrote:
> 
> On Wed, Apr 9, 2008 at 2:00 AM, oleg_gnatovskiy
> <[EMAIL PROTECTED]> wrote:
>>  We are using the Chain Collapse patch as well. Will that not work over a
>>  distributed index?
> 
> Since there is no explicit distributed support for it, it would only
> collapse per-shard.
> 
> -Yonik
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Distributed-Search-tp16577204p16592826.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Distributed Search

2008-04-09 Thread Yonik Seeley
On Wed, Apr 9, 2008 at 1:57 PM, oleg_gnatovskiy
<[EMAIL PROTECTED]> wrote:
>  Do you have any suggestions as to how we would be able to implement chain
>  collapse over the entire distributed index? Our collection is 27 GB, 15
>  million documents. Do you think there is a way to optimize Solr performance
>  enough to not have to segment such a large collection?

What is the current performance bottleneck that is causing you to have
to segment in the first place?
15M docs is often doable on a single box I think, but it depends
heavily on what the queries are, what faceting is done, etc.

-Yonik


Re: Distributed Search

2008-04-09 Thread oleg_gnatovskiy



Yonik Seeley wrote:
> 
> On Wed, Apr 9, 2008 at 1:57 PM, oleg_gnatovskiy
> <[EMAIL PROTECTED]> wrote:
>>  Do you have any suggestions as to how we would be able to implement
>> chain
>>  collapse over the entire distributed index? Our collection is 27 GB, 15
>>  million documents. Do you think there is a way to optimize Solr
>> performance
>>  enough to not have to segment such a large collection?
> 
> What is the current performance bottleneck that is causing you to have
> to segment in the first place?
> 15M docs is often doable on a single box I think, but it depends
> heavily on what the queries are, what faceting is done, etc.
> 
> -Yonik
> 
> 

Well we are running some really heavy faceting, and searching up to 15
fields at a time for each query. The bottleneck was that a single query
either took 15 minutes, or died with a heap space error...

-- 
View this message in context: 
http://www.nabble.com/Distributed-Search-tp16577204p16595616.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Payloads in Solr

2008-04-09 Thread pgwillia

I started this thread back in November.  Recall that I'm indexing xml and
storing the xpath as a payload in each token.  I am not encoding or mapping
the xpath but storing the text directly as String.getBytes().  We're not
using this to query in any way, just to add context to our search results. 
Presently, I'm ready to bounce around some more ideas about encoding xpath
or strings in general.

Back in the day Grant said:

 
> From what I understand from Michael Busch, you can store the path at  
> each token, but this doesn't seem efficient to me.  I would think you  
> may want to come up with some more efficient encoding.  I am cc'ing  
> Michael on this thread to see if he is able to add any light to the  
> subject (he may not be able to b/c of employer reasons).   If he  
> can't, then we can brainstorm a bit more on how to do it most  
> efficiently.
> 

The word "encoding" in Grant's response brings to mind Huffman coding
(http://en.wikipedia.org/wiki/Huffman_coding).  This would not solve the
query on payload problem that Yonik pointed out because the encoding would
be document centric, but could reduce the amount of total bytes that I need
to store. 
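As a rough illustration of what Huffman coding could buy here — a standalone sketch, not Solr integration code (the class and method names are made up) — the following builds a per-document Huffman code over the characters of a stored xpath and compares the encoded size against raw String.getBytes():

```java
import java.util.*;

public class HuffmanSketch {
    // A Huffman tree node: leaves carry a character, internal nodes only a weight.
    static final class Node implements Comparable<Node> {
        final int weight; final char ch; final Node left, right;
        Node(char ch, int weight) { this.ch = ch; this.weight = weight; this.left = null; this.right = null; }
        Node(Node l, Node r) { this.ch = 0; this.weight = l.weight + r.weight; this.left = l; this.right = r; }
        public int compareTo(Node o) { return Integer.compare(weight, o.weight); }
    }

    // Build a code table (char -> bit string) from one document's character frequencies.
    static Map<Character, String> buildCodes(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : freq.entrySet())
            pq.add(new Node(e.getKey(), e.getValue()));
        if (pq.size() == 1) pq.add(new Node('\0', 0)); // degenerate single-symbol input
        while (pq.size() > 1) pq.add(new Node(pq.poll(), pq.poll()));
        Map<Character, String> codes = new HashMap<>();
        walk(pq.poll(), "", codes);
        return codes;
    }

    static void walk(Node n, String prefix, Map<Character, String> codes) {
        if (n.left == null) { codes.put(n.ch, prefix); return; }
        walk(n.left, prefix + "0", codes);
        walk(n.right, prefix + "1", codes);
    }

    // Total encoded size in bits, to compare with 8 bits per byte of String.getBytes().
    static int encodedBits(String text, Map<Character, String> codes) {
        int bits = 0;
        for (char c : text.toCharArray()) bits += codes.get(c).length();
        return bits;
    }

    public static void main(String[] args) {
        String xpath = "/article/body/section/section/p"; // repetitive structure compresses well
        Map<Character, String> codes = buildCodes(xpath);
        System.out.println("raw bits: " + xpath.getBytes().length * 8
                + ", huffman bits: " + encodedBits(xpath, codes));
    }
}
```

Because the code table is built per document, the same payload bytes mean different things in different documents — consistent with the observation above that this doesn't solve the query-on-payload problem, but it can shrink the stored payload bytes when xpaths are long and repetitive.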

Any ideas?

Tricia
-- 
View this message in context: 
http://www.nabble.com/Payloads-in-Solr-tp13812560p16599300.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: indexing slow, IO-bound?

2008-04-09 Thread Norberto Meijome
On Mon, 7 Apr 2008 16:37:48 -0400
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:

> On Mon, Apr 7, 2008 at 4:30 PM, Mike Klaas <[EMAIL PROTECTED]> wrote:
> >  'top', 'vmstat' tell exactly what's going on in terms of io and cpu on
> > unix.  Perhaps someone has gotten these to work under windows with cygwin.  
> 
> The windows task manager is a pretty good replacement of top... do
> "select columns" and you can get all sorts of stuff like number of
> threads, file handles, page faults, etc.  You can also simply see if
> things are CPU bound or not (sort by the CPU column, or go to the
> "Performance" tab.

I suggest you use the Performance Monitor tool - in server versions of Win32, 
it should be under Administrative Tools. You can also generate logs for later 
review (otherwise it only shows you the last x minutes of activity). You can 
mix and match different performance providers ... not sure if Java itself 
provides counters - you *may* be able to trace CPU / memory by application 
once the app is running, but I doubt you can do that for IO. 
If only you had dtrace in Windows ;)

B

_
{Beto|Norberto|Numard} Meijome

"Web2.0 is what you were doing while the rest of us were building businesses."
  The Reverend

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: Return the result only field A or field B is non-zero?

2008-04-09 Thread Vinci

Hi,

Thank you Underwood.
That still doesn't give me the solution... would I do the boolean operation for
every query (query AND (isAZero OR isBZero)) if I have the boolean fields?
Adding the booleans would require a large update to the document structure,
which may not be preferred... can Solr generate this field for me?

Thank you,
Vinci


Walter Underwood wrote:
> 
> This would be trivial if you also stored boolean fields for
> aiszero and biszero. That would also be fast, I expect.
> 
> wunder
> 
> On 4/8/08 11:53 PM, "Vinci" <[EMAIL PROTECTED]> wrote:
> 
>> 
>> Hi all, 
>> 
>> I want to limit the search results by 2 numerical fields, A and B, where
>> Solr returns a result only if the value in field A or B is non-zero. Is that
>> possible, or do I need to change the document and schema? Or do I need to
>> change the schema as well as the query?
>> 
>> Thank you,
>> Vinci
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Return-the-result-only-field-A-or-field-B-is-non-zero--tp16580681p16601353.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to improve Solr search performance

2008-04-09 Thread Eason . Lee
>
>
> can you be more specific about what you mean when you say "And I got the
> time from dispatchfilter..." What *exactly* are you looking at (ie: is
> this a time you are seeing in a log file? if so, which log file? ... is this
> timing code you added to the dispatch filter yourself?  what *exactly* are
> you looking at?)


the timing code was added by myself


I was just testing Solr performance,
and I found that the avg request time is much longer than QTime (T1).
So I added some code timing the whole SolrDispatchFilter.doFilter()
method (T2).
The time (T2-T1) is used by the responseWriter for writing the documents.
Digging into it, most of the time is spent here:
  public void writeDocs(boolean includeScore, Set fields) throws
IOException {
    SolrIndexSearcher searcher = request.getSearcher();
    DocIterator iterator = ids.iterator();
    int sz = ids.size();
    includeScore = includeScore && ids.hasScores();
    for (int i=0; i<sz; i++) {
      ...
    }
  }

> How big is the text field?  Are you talking about a few hundred chars or
> several KB of text per doc?


several KB

> Is <enableLazyFieldLoading> set to true in your solrconfig.xml?


It has been set to true.


Human Powered Search Module

2008-04-09 Thread Sushan Rungta

Hello Everybody,

I am a newbie to Lucene and I am from India, currently working on a 
search module for our classified website, clickindia.com. I have 
implemented the basic functionality of Solr/Lucene and am pretty 
happy with the results.


Search in India has its own share of nuances. 'Maruti' is spelt 
'Maruthi' in most of South India. People often spell 'Naukri' as 
'Naukari'; a loan request would simply appear in the query as 'need 
money'. These and many more such intricacies are typical of Indian 
users and require a special kind of module.


Is there any ready-made solution for this? Can I get access to such 
word variants as used in India, so that I could implement it?


regards,
Sushan Rungta
Mob: +91-9312098968
www.clickindia.com
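A sketch of the standard Solr mechanism for spelling variants like these: a SynonymFilterFactory in the field type's analyzer, fed by a variants file you build yourself (Solr does not ship an India-specific word list, so the entries below are purely illustrative):

```xml
<!-- in schema.xml, inside the relevant <fieldType>'s <analyzer> chain -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>
```

where synonyms.txt contains comma-separated equivalence groups:

```
maruti, maruthi
naukri, naukari
```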