solr working...

2010-08-18 Thread satya swaroop
hi all,
I am very interested in knowing how Solr works. Can anyone tell me
which modules or classes get invoked when we start the servlet
container (like Tomcat), or when we send any requests to Solr (like sending PDF
files)? And which files get invoked at the start of Solr?

regards,
satya


Solr data type for date faceting

2010-08-18 Thread Karthik K
I have a field storing timestamps as text (MMDDHHMM). Can I get the same
results as I get with date faceting? (July(30), August(54), etc.)
As far as I know, Solr currently doesn't support range faceting, and even if
it does in the future, text will not be recognized as integer/long.

I tried a workaround with facet.prefix, but it cannot give the desired
result.

Thanks,
Karthik


Phrase Highlighting with special characters

2010-08-18 Thread Kranti K K Parisa
Hi All,

I am trying out Solr highlighting. I have a problem highlighting phrases
that contain special characters.

For example, if I search for a phrase like "united. states. usa", then the
results displayed match both the exact phrase and the version without special
characters, "united states usa".

However, only the exact phrase "united. states. usa" gets highlighted; the
results without special characters ("united states usa") are not highlighted.

Can anyone give comments/suggestions?

Thanks, Kranti


Re: Solr data type for date faceting

2010-08-18 Thread Mark Allan
If you're storing the timestamp as MMDDHHMM, why don't you make it  
a trie-coded integer field (type 'tint') rather than text?  That way,  
I believe range queries would be more efficient.  You can then do a  
facet query, specifying your desired ranges as one facet query for  
each range.
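
For example (a sketch, assuming the MMDDHHMM values end up in a tint field
called 'timestamp'; the ranges below are just illustrative July/August buckets):

  ...&facet=true
     &facet.query=timestamp:[07010000 TO 07312359]
     &facet.query=timestamp:[08010000 TO 08312359]

Each facet.query gets its own count in the response, one per range.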


Note that I think you can also do facet queries with text fields, but  
in this instance, storing it as a number would probably be more  
efficient.  Your user interface can deal with translating it from  
MMDDHHMM to something more display-appropriate.


Mark

On 18 Aug 2010, at 9:28 am, Karthik K wrote:


I have a field storing timestamp as text (MMDDHHMM). Can i get the
results as i get with date faceting? (July(30),August(54) etc)
As per my knowledge Solr currently doesn't support range faceting,  
even if

it does in the future , text will not be recognized as integer/long.

Tried for a workaround with field.prefix, but it cannot give the  
desired

result.

Thanks,
Karthik



--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: Solr data type for date faceting

2010-08-18 Thread Karthik K
Thanks Mark. Yeah, storing it as 'tint' would be quite efficient. As I cannot
re-index the massive data, please let me know whether changes I make in the
schema apply to the already indexed data? I am not sure how type checking
happens in Solr.

You can then do a facet query, specifying your desired ranges as one facet
query for each range.

http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range says it's
not implemented yet.

Please let me know if there is any workaround in my case.

Thanks,
Karthik


Missing tokens

2010-08-18 Thread paul . moran

Hi, I'm having a problem with certain search terms not being found when I
do a query. I'm using Solrj to index a pdf document, and add the contents
to the 'contents' field. If I query the 'contents' field on the
SolrInputDocument doc object as below, I get 50k tokens.

StringTokenizer to = new StringTokenizer((String)doc.getFieldValue(
"contents"));
System.out.println( "Tokens:"  + to.countTokens() );

However, once the doc is indexed and I use Luke to analyse the index, it
has only 3300 tokens in that field. Where did the other 47k go?

I read some other threads that mention increasing the maxFieldLength in
solrconfig.xml, and my setting is below.

  <maxFieldLength>2147483647</maxFieldLength>

Any advice is appreciated,
Paul



Re: Function query to boost scores by a constant if all terms are present

2010-08-18 Thread Jan Høydahl / Cominvent
You can use the map() function for this, see 
http://wiki.apache.org/solr/FunctionQuery#map

q=a fox&defType=dismax&qf=allfields&bf=map(query($qq),0,0,0,100.0)&qq=allfields:(quick AND brown AND fence)

This adds a constant boost of 100.0 if the $qq query returns a non-zero score, 
which it does whenever all three terms match.

PS: You can achieve the same in a Lucene query, using q=a fox 
_val_:"map(query($qq),0,0,0,100.0)"

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 17. aug. 2010, at 22.48, Ahmet Arslan wrote:

>> Most of the time, items that match all three terms will
>> float to the top by
>> normal ranking, but sometimes there are only two terms that
>> are like a rash
>> across the record, and they end up with a higher score than
>> some items that
>> match all three query terms.
>> 
>> I'd like to boost items with all the query terms to the top
>> *without
>> changing their order*.
>> 
>> My first thought was to use a simple boost query
>> allfields:(a AND b AND c),
>> but the order of the set of records that contain all three
>> terms changes
>> when I do that. What I *think* I need to do is basically to
>> say, "Hey, all
>> the items with all three terms get an extra 40,000 points,
>> but change
>> nothing else".
> 
> This is a hard task, and I am not sure it is possible. But you need to change 
> similarity algorithm for that. Final score is composed of many factors. 
> coord, norm, tf-idf ... 
> 
> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
> 
> May be you can try to customize coord(q,d). But there can be always some 
> cases that you describe. For example very long document containing three 
> terms will be punished due to its length. A very short document with two 
> query terms can pop-up before it.
> 
> It is easy to "rank items with all three terms" so that they comes first, 
> (omitNorms="true" and omitTermFreqAndPositions="true" should almost do it) 
> but "change nothing else" part is not.
> 
> Easiest thing can be throw additional query with pure AND operator and 
> display these result in a special way.
> 
> 
> 
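
For what it's worth, a minimal sketch of the coord() customization Ahmet
mentions above (class name and boost value are made up; Lucene 2.9 / Solr 1.4
DefaultSimilarity assumed):

import org.apache.lucene.search.DefaultSimilarity;

// Sketch: give documents that match every query term a fixed coord boost,
// leaving the other scoring factors (tf-idf, norms, ...) untouched.
public class AllTermsCoordSimilarity extends DefaultSimilarity {
    @Override
    public float coord(int overlap, int maxOverlap) {
        // overlap == maxOverlap means this document matched all query terms
        return overlap == maxOverlap ? 10.0f : super.coord(overlap, maxOverlap);
    }
}

It could then be registered via the <similarity> element in schema.xml. As noted
above, this still cannot guarantee the "change nothing else" part.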



Re: Solr data type for date faceting

2010-08-18 Thread Jan Høydahl / Cominvent
If you want to change the schema on the live index, make sure you do a 
compatible change, as Solr does not do any type checking or schema change 
validation.

I would ADD a field with another name for the tint field.
Unfortunately you have to re-index to have an index built on this field.
May I suggest that you start re-feeding a portion of the index every day until 
finished. Use large batches between each commit(), and make sure to run an 
optimize every couple of days to get rid of "dead meat".

If you simply do not have the original data for re-feeding, perhaps it is possible 
to extract all string values offline from the index and somehow rebuild the 
index offline?

Andrzej, is that possible?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 18. aug. 2010, at 11.12, Karthik K wrote:

> Thanks Mark. Yeah, storing it as 'tint' would be quite efficient.As i cannot
> re-index the massive data, please let me know if the changes i make in
> schema reflect to the already indexed data? I am not  sure how type checking
> happens in solr.
> 
> You can then do a facet query, specifying your desired ranges as one facet
> query for each range.
> 
> http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range says its
> not implemented yet.
> 
> Please let me know if there is any workaround in my case.
> 
> Thanks,
> Karthik



Re: Missing tokens

2010-08-18 Thread Jan Høydahl / Cominvent
Hi,

Can you share with us how your schema looks for this field? What FieldType? 
What tokenizer and analyser?
How do you parse the PDF document? Before submitting to Solr? With what tool?
How do you do the query? Do you get the same results when doing the query from 
a browser, not SolrJ?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 18. aug. 2010, at 11.34, paul.mo...@dds.net wrote:

> 
> Hi, I'm having a problem with certain search terms not being found when I
> do a query. I'm using Solrj to index a pdf document, and add the contents
> to the 'contents' field. If I query the 'contents' field on the
> SolrInputDocument doc object as below, I get 50k tokens.
> 
> StringTokenizer to = new StringTokenizer((String)doc.getFieldValue(
> "contents"));
> System.out.println( "Tokens:"  + to.countTokens() );
> 
> However, once the doc is indexed and I use Luke to analyse the index, it
> has only 3300 tokens in that field. Where did the other 47k go?
> 
> I read some other threads mentioning to increase the maxfieldLength in
> solrconfig.xml, and my setting is below.
> 
>  2147483647
> 
> Any advice is appreciated,
> Paul
> 



Re: Phrase Highlighting with special characters

2010-08-18 Thread Kranti K K Parisa
Seems the following is working

query.setHighlight(true).setHighlightSnippets(1);
query.setHighlightSimplePre("");
query.setHighlightSimplePost("");
query.setHighlightFragsize(1000);
query.setParam("hl.fl", "");


also I was reading something about (I didn't use it yet)

true





On Wed, Aug 18, 2010 at 2:06 PM, Kranti K K Parisa
wrote:

> Hi All,
>
> I am trying with Solr Highlighting. I have problem in highlighting phrases
> consists of special characters
>
> for example, if I search for a phrase like  "united. states. usa"  then the
> results are displayed matching the exact phrase and also without special
> characters means "united states usa"
>
> but highlights only the exact phrase which is  "united. states. usa", the
> results without special characters"united states usa" is not highlighted.
>
> Can anyone give comments/suggestion.
>
> Thanks, Kranti
>


improving search response time

2010-08-18 Thread Muneeb Ali

Hi All,

I need some guidance on improving search response time for our catalog
search. We are using Solr 1.4.0 and have a master/slave setup (3
dedicated servers, one being the master and the other two slaves). The server
specs are as follows:
 
Quad Core 2.5Ghz 1333mhz
12GB Ram
2x250GB disks (SATA Enterprise HDD)

Our 60GB index consists of 14 million indexed documents.

I have done some of the configuration tweaks from this list:
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed, including reducing the
merge factor to 6 and minimizing the number of stored fields, apart from the
hardware suggestions.

I would appreciate it if anyone with a similar background could shed some light
on upgrading hardware in our situation, or could suggest any other configuration
tweak that is not on the above list.

Thanks,

-Muneeb
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/improving-search-response-time-tp1204491p1204491.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solrj ContentStreamUpdateRequest Slow

2010-08-18 Thread Tod

On 8/16/2010 6:12 PM, Chris Hostetter wrote:

: > I think your problem may be that StreamingUpdateSolrServer buffers up
: > commands and sends them in batches in a background thread.  if you want to
: > send individual updates in real time (and time them) you should just use
: > CommonsHttpSolrServer
: 
: My goal is to batch updates.  My content lives somewhere else so I was trying

: to find a way to tell Solr where the document lived so it could go out and
: stream it into the index for me.  That's where I thought
: StreamingUpdateSolrServer would help.

If your content lives on a machine which is not your "client" nor your 
"server" and you want your client to tell your server to go fetch it 
directly then the "stream.url" param is what you need -- that is unrelated 
to whether you use StreamingUpdateSolrServer or not.



Do you happen to have a code fragment laying around that demonstrates 
using CommonsHttpSolrServer and "stream.url"?  I've tried it in 
conjunction with ContentStreamUpdateRequest and I keep getting an 
annoying null pointer exception.  In the meantime I will check the 
examples...




Thinking about it some more, i suspect the reason you might be seeing a 
delay when using StreamingUpdateSolrServer is because of this bug...


   https://issues.apache.org/jira/browse/SOLR-1990

...if there are no actual documents in your UpdateRequest (because you are 
using the stream.url param) then the StreamingUpdateSolrServer blocks 
until all other requests are done, then delegates to the super class (so 
it never actually puts your indexing requests in a buffered queue, it just 
delays and then does them immediately)


Not sure of a good way around this off the top of my head, but I'll note 
it in SOLR-1990 as another problematic use case that needs to be dealt with.


Perhaps I can execute an initial update request using a benign file 
before making the "stream.url" call?


Also, to beat a dead horse, this:
'http://localhost:8080/solr/update/extract?stream.url=http://remote_server.mydomain.com/test.pdf&stream.contentType=application/pdf&literal.content_id=12342&commit=true'

... works fine - I just want to do it a LOT and as efficiently as 
possible.  If I have to I can wrap it in a perl script and run a cURL or 
LWP loop but I'd prefer to use SolrJ if I can.
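
For reference, a rough SolrJ sketch of that curl call (untested; the handler and
parameters are copied from the URL above):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class StreamUrlExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");

        // No local content stream: stream.url tells Solr to fetch the PDF itself.
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.setParam("stream.url", "http://remote_server.mydomain.com/test.pdf");
        req.setParam("stream.contentType", "application/pdf");
        req.setParam("literal.content_id", "12342");
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        server.request(req);
    }
}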


Thanks for all your help.


- Tod


Re: improving search response time

2010-08-18 Thread Jan Høydahl / Cominvent
Some questions:

a) What operating system?
b) What Java container (Tomcat/Jetty)
c) What JAVA_OPTIONS? I.e. memory, garbage collection etc.
d) Example queries? I.e. what features, how many facets, sort fields etc
e) How do you load balance queries between the slaves?
f) What is your search latency now and @ what QPS? Also, where do you measure 
time - on the API or on the end-user page?
g) How often do you replicate?
h) Are you using warm-up-queries?
i) Are you ever optimizing your index?
j) Are you using highlighting? If so, are you using the fast vector highlighter 
or the regex?
k) What other search components are you using?
l) Are you using a RAID setup for the disks? If so, what kind of RAID, what 
stripe-size and block size?

Have you benchmarked to see what the bottleneck is, i.e. what is taking the 
most time? Try to add &debugQuery=true and share the timing section with us. 
It includes timings for each component.

High latency could be caused by a number of different factors, and it is 
important to first isolate the bottleneck.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 18. aug. 2010, at 14.18, Muneeb Ali wrote:

> 
> Hi All,
> 
> I need some guidance over improving search response time for our catalog
> search. we are using solr 1.4.0 version and have master/slave setup (3
> dedicated servers, one being the master and other two slaves). The server
> specs are as follows:
> 
> Quad Core 2.5Ghz 1333mhz
> 12GB Ram
> 2x250GB disks (SATA Enterprise HDD)
> 
> Our 60GB index consists of 14million indexed documents. 
> 
> I have done some of the configuration tweaks from this list: 
> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed 
> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed   , including merge
> factor reduced to 6, minimizing number of stored fields, apart from hardware
> suggestions. 
> 
> I would appreciate if anyone with similar background could shed some light
> on upgrading hardware in our situation. Or if any other configuration tweak
> that is not on the above list. 
> 
> Thanks,
> 
> -Muneeb
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/improving-search-response-time-tp1204491p1204491.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Word order sensitivity query in Solr

2010-08-18 Thread Krzysztof Szalast
Hi,

First, I'm very sorry for my bad English :/

I am a new user of Apache Solr. I have read the documentation and checked
Google, but I cannot find a solution to my problem:
I need a "word order sensitive" query. For example, I have two
documents in the Solr index:
1. one something two something three something four
2. four something three something two something one
and I run a query:

a)
query: one AND two
I want in the result:
1. one something two something three something four - "one" is before
"two", just like in the query
2. four something three something two something one - I don't need
this result, but it doesn't hurt if it is returned

b)
query: two AND one
I want in the result:
1. four something three something two something one - "one" is after
"two", just like in the query
2. one something two something three something four - I don't need
this result, but it doesn't hurt if it is returned

Is it possible? If yes, how?

regards
 Krzysztof


Re: Word order sensitivity query in Solr

2010-08-18 Thread Ahmet Arslan
> First I very sorry form my bad English :/
> 
> I am new user of Apache Solr. I have read documentation and
> check
> Google, but I can not find solution of my problem:
> I need "words order sensitivity" query, for example I have
> two
> documents in Solr index:
> 1. one something two something three something four
> 2. four something three something two something one
> and I run query:
> 
> a)
> query: one AND two
> I want in result:
> 1. one something two something three something four - "one"
> is before
> "two", just like in query
> 2. four something three something two something one - I
> don't need
> this result, but it don't disturb if it is returned
> 
> b)
> query: two AND one
> I want in result:
> 1. four something three something two something one - "one"
> is after
> "two", just like in query
> 2. one something two something three something four - I
> don't need
> this result, but it don't disturb if it is returned
> 
> Is it possible? If yes, how?

It is possible but not out-of-the-box. You need to override SolrQueryParser so 
that it returns a SpanNearQuery instead of a PhraseQuery. SpanNearQuery can do 
ordered proximity search. With a slop of Integer.MAX_VALUE you can achieve what 
you want: "one two"~Integer.MAX_VALUE

Lucene in Action has a code listing for this task (replacing PhraseQuery with 
SpanNearQuery). You can copy and use it.
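
In case it helps, a minimal sketch of that SpanNearQuery (Lucene 2.9 API; the
field name "text" is made up):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class OrderedQueryExample {
    // Matches documents where "one" occurs before "two", any distance apart.
    public static Query orderedPair() {
        SpanQuery[] clauses = new SpanQuery[] {
            new SpanTermQuery(new Term("text", "one")),
            new SpanTermQuery(new Term("text", "two"))
        };
        // slop = Integer.MAX_VALUE, inOrder = true -> ordered proximity, unlimited distance
        return new SpanNearQuery(clauses, Integer.MAX_VALUE, true);
    }
}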


  


Solr's Index Live Updates

2010-08-18 Thread Gonzalo Payo Navarro
Hi everyone!

I have a question: is there a way to update a document in Solr (v. 1.4)
so that the document is ready for searches without re-indexing everything?

Let me put it this way: my index is filled with documents with, say,
DOC_ID, STATUS and TEXT fields. What if I want to update the TEXT
field and see that change immediately in Solr's index?

Thanks


Re: autocomplete: case-insensitive and middle word

2010-08-18 Thread Paul
Here's my solution. I'm posting it in case it is radically wrong; I
hope someone can help straighten me out. It seems to work fine, and
seems fast enough.

In schema.xml:


















Then I restarted solr to pick up the changes.

I then ran a script which reads each document out of the current
index, and adds the new field:

for each doc in my solr index:

doc['ac_name'] = doc['name'].split(' ')

and write the record back out.

Then, using rsolr, I make the following query:

response = @solr.select(:params => {
:q=> "ac_name:#{prefix}",
:start=> 0,
:rows=> 500,
:fl => "name"
})
matches = []
docs = response['response']['docs']
docs.each {|doc|
matches.push(doc['name'])
}

"matches" is now an array of the values I want to display.


Help Debugging Delta Query

2010-08-18 Thread Frank A
Hi,

I'm trying to use a delta query to update a specific entity by its
primary key.  The URL I'm using is:

http://localhost:8080/solr/core2/dataimport?command=delta-import&did=5&commit=true&debug=true

Where 5 is the PK.

In my db config I have:



When I run the URL above I see the following results:



0
0


db-data-config.xml


delta-import
idle


2
1
0
2010-08-18 10:06:17
2010-08-18 10:06:17
2010-08-18 10:06:17
2010-08-18 10:06:17
1
0
0:0:0.32




It looks like it's finding the row, but for some reason it's not
processing it - any ideas?

Thanks.
Frank


Re: Help Debugging Delta Query

2010-08-18 Thread Ahmet Arslan
> I'm trying to use a delta query to update a specific entity
> by its
> primary key.  The URL I'm using is:
> 
> http://localhost:8080/solr/core2/dataimport?command=delta-import&did=5&commit=true&debug=true
> 
> Where 5 is the PK.
> 
> In my db config I have:
> 
> <entity name="place"
>         query="select OurID,Name,City,State,lat,lng,cost from place"
>         deltaQuery="select OurID from place where OurID='${dataimporter.request.did}'"
>         deltaImportQuery="select OurID,Name,City,State,lat,lng,cost from place
>                           where OurID='${dataimporter.delta.id}'">
> 
> When I run the URL above I see the following results:
> 
> 
> 
> 0
> 0
> 
> 
> db-data-config.xml
> 
> 
> delta-import
> idle
> 
> 
> 2
> 1
> 0
> 2010-08-18
> 10:06:17
> 2010-08-18
> 10:06:17
> 2010-08-18
> 10:06:17
> 2010-08-18
> 10:06:17
> 1
> 0
> 0:0:0.32
> 
> 
> 
> 
> It looks like its finding the row, but for some reason it's
> not
> processing it - any ideas?

Shouldn't it be OurID='${dataimporter.delta.OurID}' in the deltaImportQuery?
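I.e. something like:

deltaImportQuery="select OurID,Name,City,State,lat,lng,cost from place
                  where OurID='${dataimporter.delta.OurID}'"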





Re: Help Debugging Delta Query

2010-08-18 Thread Frank A
Uhg... my mistake.

Thanks!


On Wed, Aug 18, 2010 at 10:22 AM, Ahmet Arslan  wrote:
>> I'm trying to use a delta query to update a specific entity
>> by its
>> primary key.  The URL I'm using is:
>>
>> http://localhost:8080/solr/core2/dataimport?command=delta-import&did=5&commit=true&debug=true
>>
>> Where 5 is the PK.
>>
>> In my db config I have:
>>
>>             > name="place"
>>
>>     query="select
>> OurID,Name,City,State,lat,lng,cost from place"
>>
>> deltaQuery="select OurID from place where OurID=
>> '${dataimporter.request.did}'"
>>
>> deltaImportQuery="select
>> OurID,Name,City,State,lat,lng,cost from place where
>> OurID='${dataimporter.delta.id}'"
>>
>> >
>>
>> When I run the URL above I see the following results:
>>
>> 
>> 
>> 0
>> 0
>> 
>> 
>> db-data-config.xml
>> 
>> 
>> delta-import
>> idle
>> 
>> 
>> 2
>> 1
>> 0
>> 2010-08-18
>> 10:06:17
>> 2010-08-18
>> 10:06:17
>> 2010-08-18
>> 10:06:17
>> 2010-08-18
>> 10:06:17
>> 1
>> 0
>> 0:0:0.32
>> 
>>
>> 
>>
>> It looks like its finding the row, but for some reason it's
>> not
>> processing it - any ideas?
>
> Shouldn't it be OurID='${dataimporter.delta.OurID}' in the deltaImportQuery?
>
>
>
>


Re: Missing tokens

2010-08-18 Thread paul . moran
Here's my field description. I mentioned 'contents' field in my original
post. I've changed it to a different field, 'summary'. It's using the
'text' fieldType as you can see below.

   



  








  
  







  


I parsed the pdf using pdfbox. I can see my alphanumeric search term 'OB10'
in the extracted text before I add it to the index. I can also go into Luke
and see the 'OB10' in the contents of the 'summary' field even though Luke
can't find it when I do a search.

I can also use the browser to do a search in http://localhost/solr/admin
and again that search term doesn't return any results. I thought it may be
an alphanumeric word-splitting issue, but that doesn't seem to be the case
since I can search on ME26, and it returns a doc, and in fact, I can see
the 'OB10' search term in the summary field of the doc returned.

Here's a snippet of the summary field from that returned doc

To produce a downloadable file using a format suitable
for OB10. 8-26 Profiles

I'm thinking that the extracted text from pdfbox may have hidden chars that
solr can't parse. However, before I go down that road, I just want to be
sure I'm not making schoolboy errors with my solr setup.

thanks
Paul



From:   Jan Høydahl / Cominvent 
To: solr-user@lucene.apache.org
Date:   18/08/2010 11:56
Subject:Re: Missing tokens



Hi,

Can you share with us how your schema looks for this field? What FieldType?
What tokenizer and analyser?
How do you parse the PDF document? Before submitting to Solr? With what
tool?
How do you do the query? Do you get the same results when doing the query
from a browser, not SolrJ?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 18. aug. 2010, at 11.34, paul.mo...@dds.net wrote:

>
> Hi, I'm having a problem with certain search terms not being found when I
> do a query. I'm using Solrj to index a pdf document, and add the contents
> to the 'contents' field. If I query the 'contents' field on the
> SolrInputDocument doc object as below, I get 50k tokens.
>
> StringTokenizer to = new StringTokenizer((String)doc.getFieldValue(
> "contents"));
> System.out.println( "Tokens:"  + to.countTokens() );
>
> However, once the doc is indexed and I use Luke to analyse the index, it
> has only 3300 tokens in that field. Where did the other 47k go?
>
> I read some other threads mentioning to increase the maxfieldLength in
> solrconfig.xml, and my setting is below.
>
>  2147483647
>
> Any advice is appreciated,
> Paul
>





Re: improving search response time

2010-08-18 Thread Gora Mohanty
On Wed, 18 Aug 2010 05:18:34 -0700 (PDT)
Muneeb Ali  wrote:

> 
> Hi All,
> 
> I need some guidance over improving search response time for our
> catalog search. 
[...]
> I would appreciate if anyone with similar background could shed
> some light on upgrading hardware in our situation. Or if any
> other configuration tweak that is not on the above list. 
[...]

It would probably help if you could post some benchmarks of what
your current search response times are (from the Solr back-end,
and not from any front-end in front of it), and what your desired
response times are. You could use Apache Bench (ab) and/or JMeter for
this.
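
For example, something like (illustrative numbers; point the URL at one of your
real queries):

  ab -n 1000 -c 50 "http://localhost:8983/solr/select?q=solr&rows=20"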

As a data point, while our index size/no. of documents is smaller
(~40GB/3.6 million documents), we are seeing a mean response
time/request of ~120ms for numeric fields, at 50 simulated
simultaneous connections, with a single Solr server having 8 ~2GHz
cores, and 12GB RAM. This measure however, *is* influenced by the
effects of Solr caching. Solr is close to a factor of 10 faster
than our front-end, even though that is pulling almost everything
from Solr. So, we are happy on that front :-)

Regards,
Gora


Re: improving search response time

2010-08-18 Thread Muneeb Ali

First, thanks very much for a prompt reply. Here is more info:

===

a) What operating system?
Debian GNU/Linux 5.0

b) What Java container (Tomcat/Jetty) 
Jetty

c) What JAVA_OPTIONS? I.e. memory, garbage collection etc. 
-Xmx9000m   -DDEBUG   -Djava.awt.headless=true  
-Dorg.mortbay.log.class=org.mortbay.log.StdErrLog  
-Dcom.sun.management.jmxremote.port=3000 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-XX:+UseCompressedOops -XX:+UseConcMarkSweepGC 
-javaagent:/usr/local/lib/newrelic/newrelic.jar

d) Example queries? I.e. what features, how many facets, sort fields etc 
/select?start=0&rows=20&fl=id&hl=true&hl.fl=title%2Cabstract%2Cauthors&hl.fragsize=300&hl.simple.pre=&hl.simple.post=<%2Fstrong>&qt=dismax&q=gene
therapy

We also get queries with filters examples:

/select?start=0&rows=20&fl=id&hl=true&hl.fl=title%2Cabstract%2Cauthors&hl.fragsize=300&hl.simple.pre=&hl.simple.post=<%2Fstrong>&qt=dismax&q=gene
therapy&fq=meshterm:(gene)&fq=author:(david)

e) How do you load balance queries between the slaves? 

proxy based load balance

f) What is your search latency now and @ what QPS? Also, where do you
measure time - on the API or on the end-user page?

Average response time: 2600 - 3000 ms  with average throughput: 4-6 rpm
(from 'new relic RPM' solr performance monitor)
 
g) How often do you replicate? 
Daily (indexer runs each night) and replicates after indexing completes at
master. However lately we are experiencing problems right after replication,
and have to restart Jetty (it's most likely that the slaves are running out of
memory).

h) Are you using warm-up-queries? 
Yes, using the autoWarmCount variable in the cache configuration; these are
specified as:

 
 


i) Are you ever optimizing your index? 

Yes, daily after indexing. We are not doing dynamic updates to the index, so I
guess it doesn't need to be done multiple times.

j) Are you using highlighting? If so, are you using the fast vector
highlighter or the regex? 

Yes, we are using the default highlight component, with default fragmenter
called 'gap' and not regex. solr.highlight.GapFragmenter, with fragsize=300.

k) What other search components are you using? 
spellcheck component, we will be using faceting in future soon.

i) Are you using RAID setup for the disks? If so, what kind of RAID, what
stripe-size and block size? 

Yes, RAID-0:
$> cat /proc/mdstat
Personalities : [raid0] 
md0 : active raid0 sda1[0] sdb1[1]
  449225344 blocks 64k chunks


==

I haven't benchmarked it as such yet; however, here is the debugQuery output
from the query results:


case study research
case study research

+(DisjunctionMaxQuery((tags:case^1.2 | authors:case^7.5 | title:case^65.5 |
matchAll:case | keywords:case^2.5 | meshterm:case^3.2 |
abstract1:case^9.5)~0.01) DisjunctionMaxQuery((tags:studi^1.2 |
authors:study^7.5 | title:study^65.5 | matchAll:study | keywords:studi^2.5 |
meshterm:studi^3.2 | abstract1:studi^9.5)~0.01)
DisjunctionMaxQuery((tags:research^1.2 | authors:research^7.5 |
title:research^65.5 | matchAll:research | keywords:research^2.5 |
meshterm:research^3.2 | abstract1:research^9.5)~0.01))
DisjunctionMaxQuery((tags:"case studi research"~50^1.2 | authors:"case study
research"~50^7.5 | title:"case study research"~50^65.5 | matchAll:case study
research | keywords:"case studi research"~50^2.5 | meshterm:"case studi
research"~50^3.2 | abstract1:"case studi research"~50^9.5)~0.01)
FunctionQuery((sum(sdouble(yearScore)))^1.1)
FunctionQuery((sum(sdouble(readerScore)))^2.0)


+((tags:case^1.2 | authors:case^7.5 | title:case^65.5 | matchAll:case |
keywords:case^2.5 | meshterm:case^3.2 | abstract1:case^9.5)~0.01
(tags:studi^1.2 | authors:study^7.5 | title:study^65.5 | matchAll:study |
keywords:studi^2.5 | meshterm:studi^3.2 | abstract1:studi^9.5)~0.01
(tags:research^1.2 | authors:research^7.5 | title:research^65.5 |
matchAll:research | keywords:research^2.5 | meshterm:research^3.2 |
abstract1:research^9.5)~0.01) (tags:"case studi research"~50^1.2 |
authors:"case study research"~50^7.5 | title:"case study research"~50^65.5 |
matchAll:case study research | keywords:"case studi research"~50^2.5 |
meshterm:"case studi research"~50^3.2 | abstract1:"case studi
research"~50^9.5)~0.01 (sum(sdouble(yearScore)))^1.1
(sum(sdouble(readerScore)))^2.0



9.473454 = (MATCH) sum of:
  2.247054 = (MATCH) sum of:
0.7535966 = (MATCH) max plus 0.01 times others of:
  0.7535966 = (MATCH) weight(title:case^65.5 in 6557735), product of:
0.29090396 = queryWeight(title:case^65.5), product of:
  65.5 = boost
  5.181068 = idf(docFreq=204956, maxDocs=13411507)
  8.5721357E-4 = queryNorm
2.590534 = (MATCH) fieldWeight(title:case in 6557735), product of:
  1.0 = tf(termFreq(title:case)=1)
  5.181068 = idf(docFreq=204956, maxDocs=13411507)
  0.5 = fieldNorm(field=title, doc=6557735)
0.5454388 = (MATCH) max plus 0.01 times others of:
  0.545

Re: improving search response time

2010-08-18 Thread Shawn Heisey
 Most of your time is spent doing the query itself, which in the light 
of other information provided, does not surprise me.  With 12GB of RAM 
and 9GB dedicated to the java heap, the available RAM for disk caching 
is pretty low, especially if Solr is actually using all 9GB.


Since your index is 60GB, the system is most likely I/O bound.  
Available memory for disk cache is the best way to make Solr fast.  If 
you increased to 16GB RAM, you'd probably see some performance 
increase.  Going to 32GB would be better, and 64GB would let your system 
load nearly the entire index into the disk cache.


Is matchAll possibly an aggregated field with information copied from 
the other fields that you are searching?  If so, especially since you 
are using dismax, you'd want to strongly consider dropping it entirely, 
which would make your index a lot smaller.  Check your schema for 
information that could be trimmed.  You might not need "stored" on some 
fields, especially if the original values are available from another 
source (like a database, or a central filesystem).  You may not need 
advanced features on everything, like termvectors, termpositions, etc.


If you can't make significant changes in server memory or index size, 
you might want to consider going distributed.  You'd need more servers.  
A few things (More Like This being the one that comes to mind) do not 
work in a distributed index.


Can you reduce the java heap size and still have Solr work correctly?  
You probably do not need your internal Solr caches to be so huge, and 
dropping them would greatly reduce your heap needs.  Here's my cache 
settings, with the numbers being size, initialsize, then autowarm count.


filterCache: 256, 256, 0
queryResultCache: 1024, 512, 128
documentCache: 16384, 4096, n/a
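
In solrconfig.xml terms that is roughly the following (assuming the stock
solr.LRUCache class):

<filterCache      class="solr.LRUCache" size="256"   initialSize="256"  autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="1024"  initialSize="512"  autowarmCount="128"/>
<documentCache    class="solr.LRUCache" size="16384" initialSize="4096"/>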

I'm using distributed search with six large shards that each take up 
nearly 13GB.  The machines (VMs) have 9GB of RAM and the java heap size 
is 1280MB.  I'm not using a lot of the advanced features like 
highlighting, so I'm not using termvectors.  Right now, we use facets 
for data mining, but not in production.  My average query time is about 
100 milliseconds, with each shard's average about half that.  
Autowarming usually takes about 10-20 seconds, though sometimes it 
balloons to about 45 seconds.  I started out with much larger cache 
numbers, but that just made my autowarm times huge.


Based on my experience, I imagine that your system takes several minutes 
to autowarm your caches when you do a commit or optimize.  If you are 
doing frequent updates, that would be a major drag on performance.


Two of your caches have a larger initialsize than size, with the former 
meaning the number of slots allocated immediately and the latter 
referring to the maximum size of the cache.  Apparently it's not leading 
to any disastrous problems, but you'll want to adjust accordingly.



On 8/18/2010 9:00 AM, Muneeb Ali wrote:

First, thanks very much for a prompt reply. Here is more info:

===

a) What operating system?
Debian GNU/Linux 5.0

b) What Java container (Tomcat/Jetty)
Jetty

c) What JAVA_OPTIONS? I.e. memory, garbage collection etc.
-Xmx9000m   -DDEBUG   -Djava.awt.headless=true
-Dorg.mortbay.log.class=org.mortbay.log.StdErrLog
-Dcom.sun.management.jmxremote.port=3000
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-XX:+UseCompressedOops -XX:+UseConcMarkSweepGC
-javaagent:/usr/local/lib/newrelic/newrelic.jar

d) Example queries? I.e. what features, how many facets, sort fields etc
/select?start=0&rows=20&fl=id&hl=true&hl.fl=title%2Cabstract%2Cauthors&hl.fragsize=300&hl.simple.pre=&hl.simple.post=<%2Fstrong>&qt=dismax&q=gene
therapy

We also get queries with filters examples:

/select?start=0&rows=20&fl=id&hl=true&hl.fl=title%2Cabstract%2Cauthors&hl.fragsize=300&hl.simple.pre=&hl.simple.post=<%2Fstrong>&qt=dismax&q=gene
therapy&fq=meshterm:(gene)&fq=author:(david)

e) How do you load balance queries between the slaves?

proxy based load balance

f) What is your search latency now and @ what QPS? Also, where do you
measure time - on the API or on the end-user page?

Average response time: 2600 - 3000 ms  with average throughput: 4-6 rpm
(from 'new relic RPM' solr performance monitor)

g) How often do you replicate?
Daily (indexer runs each night) and replicates after indexing completes at
master. However lately we are experiencing problems right after replication,
and have to restart jetty (its most likely that slaves are running out of
memory).

h) Are you using warm-up-queries?
Yes, using autoWarmCount variable in cache configuration/ these are
specified as:





i) Are you ever optimizing your index?

Yes, daily after indexing. We are not doing dynamic updates to index, so I
guess its not needed to be done multiple times.

j) Are you using highlighting? If so, are you using the fast vector
highlighter or the regex?

Yes, we are using the default highlight component, with default

How to use synonyms on a faceted field with multiple words

2010-08-18 Thread Scott Zientara
I am trying to use solr.SynonymFilterFactory on a faceted field in Solr 1.3. 
I am using Solr to index resources from a media library. The data is coming 
from various 
sources, some of which I do not have control over. I need to be able to map 
resource 
types in the data to common terms for faceting. For example:
video,audio => digital media
film,laser disc, vhs video => other

I am using solr.KeywordTokenizerFactory for the analyzer, but Solr will not 
treat 
multiple words as a single token. 
A single word to single word map (i.e. film => other) works perfectly .
A single to double word map (i.e. film => other stuff) becomes 2 terms which is 
unfit for 
faceting.
A double word to single word map (i.e. vhs video => videotape) doesn't seem to 
match at 
all.

I've tried this with and without the tokenizerFactory="solr.KeywordTokenizerFactory" 
attribute in the synonym filter element. I've tried to escape the space in the 
synonyms file (i.e. video => digital\bmedia).

Is it possible to use the synonym filter to map multi-word terms for a faceted 
field? If so, what am I missing?



tii RAM usage on startup

2010-08-18 Thread Rebecca Watson
hi,

I am running solr 1.4.1 and java 1.6 with 6GB heap and the following
GC settings:
gc_args="-XX:+UseConcMarkSweepGC
-XX:+CMSClassUnloadingEnabled   -XX:NewSize=2g -XX:MaxNewSize=2g
-XX:CMSInitiatingOccupancyFraction=60"

So 6GB total heap and 2GB allocated to eden space.

I have caching, autocommit and auto-warming commented out of
solrconfig.xml

After I index 500k docs and call commit/optimize (via URL after indexing
has completed) my RAM usage is only about 1.5GB, but then if I stop
and restart my Solr server over the same data the RAM immediately
jumps to about 4GB and I can't understand why there is a difference
here? As this is close to the old gen limit -- i quickly find that Solr
becomes unresponsive.

The following shows that tii files are being loaded from 26MB
files to consume over 200MB in RAM when I restart the server.

is this expected?

thanks for any help/advice in advance,

bec :)

-

Rebecca-Watsons-iMac:work iwatson$ jmap -histo:live 8992 | head -30

 num     #instances         #bytes  class name
--
   1:      18334714     1422732624  [C
   2:      18332491      733299640  java.lang.String
   3:       6104929      244197160  org.apache.lucene.index.TermInfo
   4:       6104929      244197160  org.apache.lucene.index.TermInfo
   5:       6104929      244197160  org.apache.lucene.index.TermInfo
   6:       6104921      195357472  org.apache.lucene.index.Term
   7:       6104921      195357472  org.apache.lucene.index.Term
   8:       6104921      195357472  org.apache.lucene.index.Term
   9:           224      146527408  [J
  10:            10       48839592  [Lorg.apache.lucene.index.TermInfo;
  11:            10       48839592  [Lorg.apache.lucene.index.Term;
  12:            10       48839592  [Lorg.apache.lucene.index.TermInfo;
  13:            10       48839592  [Lorg.apache.lucene.index.TermInfo;
  14:            10       48839592  [Lorg.apache.lucene.index.Term;
  15:            10       48839592  [Lorg.apache.lucene.index.Term;
  16:         41630        6264728
  17:         41630        5005104
  18:          4049        4596352
  19:          4049        3049984
  20:          3129        2580040
  21:         49713        2418496
  22:          4983        1067192  [B
  23:          4381         806104  java.lang.Class
  24:          5979         533064  [[I
  25:          6124         438080  [S
  26:          7951         381648  java.util.HashMap$Entry
  27:          2071         375744  [Ljava.util.HashMap$Entry;
Rebecca-Watsons-iMac:work iwatson$ ls
./mach-lcf/data/data-serv-lcf/artdoc1/index/*.tii
-rw-r--r--  1 iwatson  staff26M 18 Aug 23:44
./mach-lcf/data/data-serv-lcf/artdoc1/index/_36.tii
-rw-r--r--  1 iwatson  staff26M 19 Aug 00:06
./mach-lcf/data/data-serv-lcf/artdoc1/index/_69.tii
-rw-r--r--  1 iwatson  staff25M 19 Aug 00:26
./mach-lcf/data/data-serv-lcf/artdoc1/index/_9d.tii
-rw-r--r--  1 iwatson  staff24M 19 Aug 00:50
./mach-lcf/data/data-serv-lcf/artdoc1/index/_ch.tii
-rw-r--r--  1 iwatson  staff25M 19 Aug 01:11
./mach-lcf/data/data-serv-lcf/artdoc1/index/_fj.tii
-rw-r--r--  1 iwatson  staff   3.1M 19 Aug 01:12
./mach-lcf/data/data-serv-lcf/artdoc1/index/_fq.tii
-rw-r--r--  1 iwatson  staff   3.1M 19 Aug 01:12
./mach-lcf/data/data-serv-lcf/artdoc1/index/_g1.tii
-rw-r--r--  1 iwatson  staff   167B 19 Aug 01:10
./mach-lcf/data/data-serv-lcf/artdoc1/index/_gb.tii
-rw-r--r--  1 iwatson  staff   3.1M 19 Aug 01:11
./mach-lcf/data/data-serv-lcf/artdoc1/index/_gc.tii
-rw-r--r--  1 iwatson  staff   223K 19 Aug 01:23
./mach-lcf/data/data-serv-lcf/artdoc1/index/_gd.tii


Re: tii RAM usage on startup

2010-08-18 Thread Michael McCandless
I'm not sure why you see 1.5 GB before restart but then 4 GB after.

But seeing a 26 MB tii file --> 200 MB RAM is unfortunately expected;
in 3.x Lucene's in-RAM representation of the terms index is very
inefficient (three separate object instances (TermInfo, Term, String)
per indexed term, with each object having various fields, etc.).

This has been improved substantially in trunk with flexible indexing.

You can increase the terms index divisor when you open your
IndexReader.  EG, passing 2 (instead of the default 1) keeps every
other indexed term, halving the required RAM (but taking more time to
seek to a certain term).  I'm not sure how Solr exposes this
configuration though.
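
For reference, at the Lucene level it is the last argument of IndexReader.open()
(index path is made up; Lucene 2.9 signature assumed):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class TermIndexDivisorExample {
    public static void main(String[] args) throws Exception {
        // termInfosIndexDivisor = 2: only every 2nd indexed term is loaded into RAM
        IndexReader reader = IndexReader.open(
            FSDirectory.open(new File("/path/to/index")),
            null,   // IndexDeletionPolicy (use the default)
            true,   // readOnly
            2);     // termInfosIndexDivisor
        System.out.println("maxDoc=" + reader.maxDoc());
        reader.close();
    }
}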

Mike

On Wed, Aug 18, 2010 at 1:54 PM, Rebecca Watson  wrote:
> hi,
>
> I am running solr 1.4.1 and java 1.6 with 6GB heap and the following
> GC settings:
> gc_args="-XX:+UseConcMarkSweepGC
> -XX:+CMSClassUnloadingEnabled   -XX:NewSize=2g -XX:MaxNewSize=2g
> -XX:CMSInitiatingOccupancyFraction=60"
>
> So 6GB total heap and 2GB allocated to eden space.
>
> I have caching, autocommit and auto-warming commented out of
> solrconfig.xml
>
> After I index 500k docs and call commit/optimize (via URL after indexing
> has completed) my RAM usage is only about 1.5GB, but then if I stop
> and restart my Solr server over the same data the RAM immediately
> jumps to about 4GB and I can't understand why there is a difference
> here? As this is close to the old gen limit -- i quickly find that Solr
> becomes unresponsive.
>
> The following shows that tii files are being loaded from 26MB
> files to consume over 200MB in RAM when I restart the server.
>
> is this expected?
>
> thanks for any help/advice in advance,
>
> bec :)
>
> -
>
> Rebecca-Watsons-iMac:work iwatson$ jmap -histo:live 8992 | head -30
>
>  num     #instances         #bytes  class name
> --
>   1:      18334714     1422732624  [C
>   2:      18332491      733299640  java.lang.String
>   3:       6104929      244197160  org.apache.lucene.index.TermInfo
>   4:       6104929      244197160  org.apache.lucene.index.TermInfo
>   5:       6104929      244197160  org.apache.lucene.index.TermInfo
>   6:       6104921      195357472  org.apache.lucene.index.Term
>   7:       6104921      195357472  org.apache.lucene.index.Term
>   8:       6104921      195357472  org.apache.lucene.index.Term
>   9:           224      146527408  [J
>  10:            10       48839592  [Lorg.apache.lucene.index.TermInfo;
>  11:            10       48839592  [Lorg.apache.lucene.index.Term;
>  12:            10       48839592  [Lorg.apache.lucene.index.TermInfo;
>  13:            10       48839592  [Lorg.apache.lucene.index.TermInfo;
>  14:            10       48839592  [Lorg.apache.lucene.index.Term;
>  15:            10       48839592  [Lorg.apache.lucene.index.Term;
>  16:         41630        6264728  
>  17:         41630        5005104  
>  18:          4049        4596352  
>  19:          4049        3049984  
>  20:          3129        2580040  
>  21:         49713        2418496  
>  22:          4983        1067192  [B
>  23:          4381         806104  java.lang.Class
>  24:          5979         533064  [[I
>  25:          6124         438080  [S
>  26:          7951         381648  java.util.HashMap$Entry
>  27:          2071         375744  [Ljava.util.HashMap$Entry;
> Rebecca-Watsons-iMac:work iwatson$ ls
> ./mach-lcf/data/data-serv-lcf/artdoc1/index/*.tii
> -rw-r--r--  1 iwatson  staff    26M 18 Aug 23:44
> ./mach-lcf/data/data-serv-lcf/artdoc1/index/_36.tii
> -rw-r--r--  1 iwatson  staff    26M 19 Aug 00:06
> ./mach-lcf/data/data-serv-lcf/artdoc1/index/_69.tii
> -rw-r--r--  1 iwatson  staff    25M 19 Aug 00:26
> ./mach-lcf/data/data-serv-lcf/artdoc1/index/_9d.tii
> -rw-r--r--  1 iwatson  staff    24M 19 Aug 00:50
> ./mach-lcf/data/data-serv-lcf/artdoc1/index/_ch.tii
> -rw-r--r--  1 iwatson  staff    25M 19 Aug 01:11
> ./mach-lcf/data/data-serv-lcf/artdoc1/index/_fj.tii
> -rw-r--r--  1 iwatson  staff   3.1M 19 Aug 01:12
> ./mach-lcf/data/data-serv-lcf/artdoc1/index/_fq.tii
> -rw-r--r--  1 iwatson  staff   3.1M 19 Aug 01:12
> ./mach-lcf/data/data-serv-lcf/artdoc1/index/_g1.tii
> -rw-r--r--  1 iwatson  staff   167B 19 Aug 01:10
> ./mach-lcf/data/data-serv-lcf/artdoc1/index/_gb.tii
> -rw-r--r--  1 iwatson  staff   3.1M 19 Aug 01:11
> ./mach-lcf/data/data-serv-lcf/artdoc1/index/_gc.tii
> -rw-r--r--  1 iwatson  staff   223K 19 Aug 01:23
> ./mach-lcf/data/data-serv-lcf/artdoc1/index/_gd.tii
>


Re: How to use synonyms on a faceted field with multiple words

2010-08-18 Thread Scott Zientara
A quick and dirty workaround using Solr 1.4 is to replace spaces in the synonyms
file with some other character/pattern. I used ## (i.e. video => digital##media).
Then add solr.PatternReplaceFilterFactory after the synonym filter to replace the
pattern with a space. This works, but I'd love to know if there is a better way.
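
Roughly, the analyzer chain described above would look like this (the field type
name and synonyms file name are placeholders):

<fieldType name="resource_type" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"
            expand="false" tokenizerFactory="solr.KeywordTokenizerFactory"/>
    <!-- turn the ## placeholder back into a space after the synonym mapping -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="##" replacement=" " replace="all"/>
  </analyzer>
</fieldType>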

Send reply to:  solr-user@lucene.apache.org
From:   "Scott Zientara" 
Organization:   Tek Data
To: solr-user@lucene.apache.org
Date sent:  Wed, 18 Aug 2010 12:31:57 -0500
Subject:How to use synonms on a faceted field with multiple 
words
Send reply to:  sc...@tekdata.com
Priority:   normal


I am trying to use solr.SynonymFilterFactory on a faceted field in Solr 1.3. 
I am using Solr to index resources from a media library. The data is coming 
from various 
sources, some of which I do not have control over. I need to be able to map 
resource 
types in the data to common terms for faceting. For example:
video,audio => digital media
film,laser disc, vhs video => other

I am using solr.KeywordTokenizerFactory for the analyzer, but Solr will not 
treat 
multiple words as a single token. 
A single word to single word map (i.e. film => other) works perfectly .
A single to double word map (i.e. film => other stuff) becomes 2 terms which is 
unfit for 

faceting.
A double word to single word map (i.e. vhs video => videotape) doesn't seem to 
match at 
all.

I've tried this with and without the 
tokenizerFactory="solr.KeywordTokenizerFactory" 
attribute in the synonm filter element. I've tried to escape the space in the 
synonm file 

(i.e. video => digital\bmedia).

Is it possible to use the synonm filter to map multi-word terms for a facteted 
field? If 
so, what am I missing?



Re: queryResultCache has no hits for date boost function

2010-08-18 Thread Yonik Seeley
On Tue, Aug 17, 2010 at 6:29 PM, Peter Karich  wrote:
> my queryResultCache has no hits. But if I am removing one line from the
> bf section in my dismax handler all is fine. Here is the line:
> recip(ms(NOW,date),3.16e-11,1,1)

NOW has millisecond resolution, so it's actually a different query
each millisecond.
You could either use a date literal instead of NOW, or use date math
to round the date and get a better cache hit ratio... like NOW/hour or
something.
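
For example, the boost line above could be written as

  recip(ms(NOW/HOUR,date),3.16e-11,1,1)

so the function text only changes once an hour and the query result can be cached.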

-Yonik
http://www.lucidimagination.com


Jetty returning HTTP error code 413

2010-08-18 Thread Alexandre Rocco
Guys,

We are facing an issue executing a very large query (~4000 bytes in the URL)
in Solr.
When we execute the query, Solr (probably Jetty) returns an HTTP 413 error
(FULL HEAD).

I guess that this is related to the very big query being executed, and
currently we can't make it shorter.
Is there any configuration that needs to be tweaked on Jetty or another
component to make this query work?

Any advice is really appreciated.

Thanks!
Alexandre Rocco


Re: sort order of "missing" items

2010-08-18 Thread Yonik Seeley
On Tue, Aug 17, 2010 at 4:10 PM, Brad Dewar  wrote:
> When items are sorted, are all the docs with the sort field missing 
> considered "tied" in terms of their sort order, or are they "indeterminate", 
> or do they have some arbitrary order imposed on them (e.g. _docid_)?

If it's a numeric field, it sorts as if the value was 0.
If it's a string field, a missing value is less than other values.
All ties (regardless of missing or not) are broken by docid, and all
docs with a missing value are tied.

The "string" field from the solr example schema has
sortMissingLast="true" set, and so missing will sort after documents
with the value, regardless of sort order.  Here's the blurb from the
example schema:

  <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

> For example, would "b" be considered as part of the sort in the following 
> query, or would all the missing 'a' fields be in some kind of order already, 
> thus making the sort algorithm never check the 'b' field?
>
> /select/?q=-a:[* TO *]&sort=a asc,b asc
>
> And would sortMissingLast / sortMissingFirst affect the answer to that 
> question?
>
> I've been seeing weird behaviour in my index with queries (a little) like 
> this one, but I haven't pinpointed the problem yet.

Are you using Solr 1.4?  There was a bug with sortMissingLast/sortMissingFirst.
https://issues.apache.org/jira/browse/SOLR-1777

-Yonik
http://www.lucidimagination.com


Re: Jetty returning HTTP error code 413

2010-08-18 Thread didier deshommes
Hi Alexandre,
Have you tried setting a higher headerBufferSize?  Look in
etc/jetty.xml and search for 'headerBufferSize'; I think it controls
the size of the url. By default it is 8192.
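
For example, inside the connector definition in etc/jetty.xml (the value here is
just an illustration):

  <Set name="headerBufferSize">65536</Set>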

didier

On Wed, Aug 18, 2010 at 2:43 PM, Alexandre Rocco  wrote:
> Guys,
>
> We are facing an issue executing very large query (~4000 bytes in the URL)
> in Solr.
> When we execute the query, Solr (probably Jetty) returns a HTTP 413 error
> (FULL HEAD).
>
> I guess that this is related to the very big query being executed, and
> currently we can't make it short.
> Is there any configuration that need to be tweaked on Jetty or other
> component to make this query work?
>
> Any advice is really appreciated.
>
> Thanks!
> Alexandre Rocco
>


Re: Jetty returning HTTP error code 413

2010-08-18 Thread Yonik Seeley
Yep, or you can submit the query via POST, which has a much bigger
limit on the size of the body.
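
For example with curl (placeholder URL and query):

  curl http://localhost:8983/solr/select --data-urlencode "q=field:(your very long query)" --data-urlencode "rows=20"

POSTing the parameters keeps them out of the request line, so the header size
limit no longer applies.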

-Yonik
http://www.lucidimagination.com


On Wed, Aug 18, 2010 at 3:58 PM, didier deshommes  wrote:
> Hi Alexandre,
> Have you tried setting a higher headerBufferSize?  Look in
> etc/jetty.xml and search for 'headerBufferSize'; I think it controls
> the size of the url. By default it is 8192.
>
> didier
>
> On Wed, Aug 18, 2010 at 2:43 PM, Alexandre Rocco  wrote:
>> Guys,
>>
>> We are facing an issue executing very large query (~4000 bytes in the URL)
>> in Solr.
>> When we execute the query, Solr (probably Jetty) returns a HTTP 413 error
>> (FULL HEAD).
>>
>> I guess that this is related to the very big query being executed, and
>> currently we can't make it short.
>> Is there any configuration that need to be tweaked on Jetty or other
>> component to make this query work?
>>
>> Any advice is really appreciated.
>>
>> Thanks!
>> Alexandre Rocco
>>
>


Re: queryResultCache has no hits for date boost function

2010-08-18 Thread Peter Karich
Thanks a lot Yonik! Rounding makes sense.
Is there a date math for the 'LAST_COMMIT'?

Peter.

> On Tue, Aug 17, 2010 at 6:29 PM, Peter Karich  wrote:
>   
>> my queryResultCache has no hits. But if I am removing one line from the
>> bf section in my dismax handler all is fine. Here is the line:
>> recip(ms(NOW,date),3.16e-11,1,1)
>> 
> NOW has millisecond resolution, so it's actually a different query
> each millisecond.
> You could either use a date literal instead of NOW, or use date math
> to round the date and get a better cache hit ratio... like NOW/hour or
> something.
>
> -Yonik
> http://www.lucidimagination.com
>   


Re: queryResultCache has no hits for date boost function

2010-08-18 Thread Yonik Seeley
On Wed, Aug 18, 2010 at 4:34 PM, Peter Karich  wrote:
> Thanks a lot Yonik! Rounding makes sense.
> Is there a date math for the 'LAST_COMMIT'?

No - but it's an interesting idea!

-Yonik
http://www.lucidimagination.com


Re: Solr's Index Live Updates

2010-08-18 Thread Jan Høydahl / Cominvent
Hi,

I'm afraid you'll have to post the full document again, then do a commit.
But it WILL be lightning fast, as it is only the updated document which is 
indexed, all the other existing documents will not be re-indexed.
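
A minimal SolrJ sketch (URL and field values are made up, reusing the
DOC_ID/STATUS/TEXT fields from your example):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class LiveUpdateExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("DOC_ID", "42");          // same unique key as the existing document
        doc.addField("STATUS", "published");   // unchanged fields must be re-sent as well
        doc.addField("TEXT", "the updated text");

        server.add(doc);    // replaces the previous version of DOC_ID 42
        server.commit();    // the change is searchable as soon as the commit returns
    }
}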

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 18. aug. 2010, at 15.44, Gonzalo Payo Navarro wrote:

> Hi everyone!
> 
> I've a question: Is there a way to update a document in Solr (v. 1.4)
> and that document is ready for searches without a reindex?
> 
> Let me put it this way: My index is filled with documents like, say,
> DOC_ID, STATUS and TEXT fields. What if I want to update the TEXT
> field and see that change immediately in the Solr's Index.
> 
> Thanks



multiple values

2010-08-18 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Hello,

I can only display one author, which is the last one. It looks like it overwrites the others.

In xml, I have more than one name in 
. 

In data_config.xml, I put the . 

In schema.xml, I put . 

Please let me know if I did something wrong, or how I can display it in jsp.

I really appreciate your help!


Re: Missing tokens

2010-08-18 Thread Jan Høydahl / Cominvent
Cannot see anything obvious...

Try
http://localhost/solr/select?q=contents:OB10*
http://localhost/solr/select?q=contents:"OB 10"
http://localhost/solr/select?q=contents:"OB10.";
http://localhost/solr/select?q=contents:ob10

Also, go to the Analysis page in admin, type in your field name, enable 
verbose output, copy-paste the problematic sentence in the "Index" part, 
then enter OB10 in the "Query" part, and see how your doc and query get 
processed.

PS: Why don't you try this instead of doing the PDF extraction yourself: 
http://wiki.apache.org/solr/ExtractingRequestHandler ?
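
E.g. something along these lines (id, file name and field mapping are placeholders;
fmap.content sends Tika's extracted text into your 'summary' field):

  curl "http://localhost/solr/update/extract?literal.id=doc1&fmap.content=summary&commit=true" -F "file=@mydoc.pdf"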

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 18. aug. 2010, at 16.25, paul.mo...@dds.net wrote:

> Here's my field description. I mentioned 'contents' field in my original
> post. I've changed it to a different field, 'summary'. It's using the
> 'text' fieldType as you can see below.
> 
>   
> 
> 
> 
>  
>
>
>
>ignoreCase="true"
>words="stopwords.txt"
>enablePositionIncrements="true"
>/>
> "1" generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
>
> "protwords.txt"/>
>
>  
>  
>
> ignoreCase="true" expand="true"/>
> "stopwords.txt"/>
> "1" generateNumberParts="1" catenateWords="0" catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="1"/>
>
> "protwords.txt"/>
>
>  
>
> 
> I parsed the pdf using pdfbox. I can see my alphanumeric search term 'OB10'
> in the extracted text before I add it to the index. I can also go into Luke
> and see the 'OB10' in the contents of the 'summary' field even though Luke
> can't find it when I do a search.
> 
> I can also use the browser to do a search in http://localhost/solr/admin
> and again that search term doesn't return any results. I thought it may be
> an alphanumber word splitting issue, but that doesn't seem be be the case
> since I can search on ME26, and it returns a doc, and in fact, I can see
> the 'OB10' search term in the summary field of the doc returned.
> 
> Here's a snippet of the summary field from that returned doc
> 
> To produce a downloadable file using a format suitable
> for OB10. 8-26 Profiles
> 
> I'm thinking that the extracted text from pdfbox may have hidden chars that
> solr can't parse. However, before I go down that road, I just want to be
> sure I'm not making schoolboy errors with my solr setup.
> 
> thanks
> Paul
> 
> 
> 
> From: Jan Høydahl / Cominvent 
> To:   solr-user@lucene.apache.org
> Date: 18/08/2010 11:56
> Subject:  Re: Missing tokens
> 
> 
> 
> Hi,
> 
> Can you share with us how your schema looks for this field? What FieldType?
> What tokenizer and analyser?
> How do you parse the PDF document? Before submitting to Solr? With what
> tool?
> How do you do the query? Do you get the same results when doing the query
> from a browser, not SolrJ?
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
> 
> On 18. aug. 2010, at 11.34, paul.mo...@dds.net wrote:
> 
>> 
>> Hi, I'm having a problem with certain search terms not being found when I
>> do a query. I'm using Solrj to index a pdf document, and add the contents
>> to the 'contents' field. If I query the 'contents' field on the
>> SolrInputDocument doc object as below, I get 50k tokens.
>> 
>> StringTokenizer to = new StringTokenizer((String)doc.getFieldValue(
>> "contents"));
>> System.out.println( "Tokens:"  + to.countTokens() );
>> 
>> However, once the doc is indexed and I use Luke to analyse the index, it
>> has only 3300 tokens in that field. Where did the other 47k go?
>> 
>> I read some other threads mentioning to increase the maxfieldLength in
>> solrconfig.xml, and my setting is below.
>> 
>> <maxFieldLength>2147483647</maxFieldLength>
>> 
>> Any advice is appreciated,
>> Paul
>> 
> 
> 
> 



Re: queryResultCache has no hits for date boost function

2010-08-18 Thread Peter Karich
Hi Yonik,

would you point me to the Java classes where solr handles a commit or an
optimize and then the date math definitions?

Regards,
Peter.

> On Wed, Aug 18, 2010 at 4:34 PM, Peter Karich  wrote:
>   
>> Thanks a lot Yonik! Rounding makes sense.
>> Is there a date math for the 'LAST_COMMIT'?
>> 
> No - but it's an interesting idea!
>
> -Yonik
> http://www.lucidimagination.com
>
>   



Re: queryResultCache has no hits for date boost function

2010-08-18 Thread Peter Karich
Forgot to say: thanks again! Now the cache gets hits!

Regards,
Peter.

> On Wed, Aug 18, 2010 at 4:34 PM, Peter Karich  wrote:
>   
>> Thanks a lot Yonik! Rounding makes sense.
>> Is there a date math for the 'LAST_COMMIT'?
>> 
> No - but it's an interesting idea!
>
> -Yonik
> http://www.lucidimagination.com
>   
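
For anyone else hitting this: the trick is to round NOW in the boost function
so that the function text (and therefore the cache key) stays identical across
requests. A sketch, assuming a dismax handler and a date field named
last_modified (both illustrative):

  ...&qt=dismax&q=solr&bf=recip(ms(NOW/HOUR,last_modified),3.16e-11,1,1)

With NOW/HOUR the parameter only changes once per hour, so the queryResultCache
can be hit; an unrounded NOW changes every millisecond and defeats the cache.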


Re: tii RAM usage on startup

2010-08-18 Thread Koji Sekiguchi

 > I'm not sure how Solr exposes this configuration though.

this one?




Koji

--
http://www.rondhuit.com/en/
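
For reference, the Solr 1.4 example solrconfig.xml exposes the terms index
divisor through the IndexReaderFactory section; a sketch, with an illustrative
divisor value:

  <indexReaderFactory name="IndexReaderFactory"
                      class="org.apache.solr.core.StandardIndexReaderFactory">
    <!-- keep only every Nth indexed term in RAM: less memory, slower term seeks -->
    <int name="setTermIndexDivisor">12</int>
  </indexReaderFactory>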



(10/08/19 3:36), Michael McCandless wrote:

I'm not sure why you see 1.5 GB before restart but then 4 GB after.

But seeing a 26 MB tii file -->  200 MB RAM is unfortunately expected;
in 3.x Lucene's in-RAM representation of the terms index is very
inefficient (three separate object instances (TermInfo, Term, String)
per indexed term, with each object having various fields, etc.).

This has been improved substantially in trunk with flexible indexing.

You can increase the terms index divisor when you open your
IndexReader.  EG, passing 2 (instead of the default 1) keeps every
other indexed term, halving the required RAM (but taking more time to
seek to a certain term).  I'm not sure how Solr exposes this
configuration though.

Mike

On Wed, Aug 18, 2010 at 1:54 PM, Rebecca Watson  wrote:

hi,

I am running solr 1.4.1 and java 1.6 with 6GB heap and the following
GC settings:
gc_args="-XX:+UseConcMarkSweepGC
-XX:+CMSClassUnloadingEnabled   -XX:NewSize=2g -XX:MaxNewSize=2g
-XX:CMSInitiatingOccupancyFraction=60"

So 6GB total heap and 2GB allocated to eden space.

I have caching, autocommit and auto-warming commented out of
solrconfig.xml

After I index 500k docs and call commit/optimize (via URL after indexing
has completed) my RAM usage is only about 1.5GB, but then if I stop
and restart my Solr server over the same data the RAM immediately
jumps to about 4GB and I can't understand why there is a difference
here? As this is close to the old gen limit -- i quickly find that Solr
becomes unresponsive.

The following shows that tii files are being loaded from 26MB
files to consume over 200MB in RAM when I restart the server.

is this expected?

thanks for any help/advice in advance,

bec :)

-

Rebecca-Watsons-iMac:work iwatson$ jmap -histo:live 8992 | head -30

  num #instances #bytes  class name
--
   1:  18334714 1422732624  [C
   2:  18332491  733299640  java.lang.String
   3:   6104929  244197160  org.apache.lucene.index.TermInfo
   4:   6104929  244197160  org.apache.lucene.index.TermInfo
   5:   6104929  244197160  org.apache.lucene.index.TermInfo
   6:   6104921  195357472  org.apache.lucene.index.Term
   7:   6104921  195357472  org.apache.lucene.index.Term
   8:   6104921  195357472  org.apache.lucene.index.Term
   9:   224  146527408  [J
  10:10   48839592  [Lorg.apache.lucene.index.TermInfo;
  11:10   48839592  [Lorg.apache.lucene.index.Term;
  12:10   48839592  [Lorg.apache.lucene.index.TermInfo;
  13:10   48839592  [Lorg.apache.lucene.index.TermInfo;
  14:10   48839592  [Lorg.apache.lucene.index.Term;
  15:10   48839592  [Lorg.apache.lucene.index.Term;
  16: 416306264728
  17: 416305005104
  18:  40494596352
  19:  40493049984
  20:  31292580040
  21: 497132418496
  22:  49831067192  [B
  23:  4381 806104  java.lang.Class
  24:  5979 533064  [[I
  25:  6124 438080  [S
  26:  7951 381648  java.util.HashMap$Entry
  27:  2071 375744  [Ljava.util.HashMap$Entry;
Rebecca-Watsons-iMac:work iwatson$ ls
./mach-lcf/data/data-serv-lcf/artdoc1/index/*.tii
-rw-r--r--  1 iwatson  staff26M 18 Aug 23:44
./mach-lcf/data/data-serv-lcf/artdoc1/index/_36.tii
-rw-r--r--  1 iwatson  staff26M 19 Aug 00:06
./mach-lcf/data/data-serv-lcf/artdoc1/index/_69.tii
-rw-r--r--  1 iwatson  staff25M 19 Aug 00:26
./mach-lcf/data/data-serv-lcf/artdoc1/index/_9d.tii
-rw-r--r--  1 iwatson  staff24M 19 Aug 00:50
./mach-lcf/data/data-serv-lcf/artdoc1/index/_ch.tii
-rw-r--r--  1 iwatson  staff25M 19 Aug 01:11
./mach-lcf/data/data-serv-lcf/artdoc1/index/_fj.tii
-rw-r--r--  1 iwatson  staff   3.1M 19 Aug 01:12
./mach-lcf/data/data-serv-lcf/artdoc1/index/_fq.tii
-rw-r--r--  1 iwatson  staff   3.1M 19 Aug 01:12
./mach-lcf/data/data-serv-lcf/artdoc1/index/_g1.tii
-rw-r--r--  1 iwatson  staff   167B 19 Aug 01:10
./mach-lcf/data/data-serv-lcf/artdoc1/index/_gb.tii
-rw-r--r--  1 iwatson  staff   3.1M 19 Aug 01:11
./mach-lcf/data/data-serv-lcf/artdoc1/index/_gc.tii
-rw-r--r--  1 iwatson  staff   223K 19 Aug 01:23
./mach-lcf/data/data-serv-lcf/artdoc1/index/_gd.tii






Re: Solr data type for date faceting

2010-08-18 Thread Karthik K
Adding
facet.query=timestamp:[20100601+TO+201006312359]&facet.query=timestamp:[20100701+TO+201007312359]...
to the query should give the desired response without changing the schema or
re-indexing.


Re: Indexing Hanging during GC?

2010-08-18 Thread Rebecca Watson
hi all,

in case anyone is having similar issues now / in the future -- here's
what I think
is at least part of the problem:

once I commit the index, the RAM requirement jumps because the .tii files are
loaded at that point. Because I have a very large number of unique terms,
each tii file consumes 200MB+ of RAM even though it is only about 25MB on
disk. (Thanks to the people on the solr-user list who answered my question
on this -- search for the subject "tii RAM usage on startup".)

so when I had auto-commit on, my RAM was slowly disappearing and eventually
Solr hung because the tii files were too big to load into memory.

the suggestion from my other thread was to try Solr/Lucene trunk (I'm using
Solr 1.4.1, and the flexible indexing work in trunk has reduced the memory
footprint of the terms index) OR to increase the term index interval, so I
will try one or both of these and see whether that lets me index more
documents given my current hardware (6GB RAM), since these docs have a lot
of unique terms!

thanks :)

bec

On 13 August 2010 19:15, Rebecca Watson  wrote:
> hi,
>
> ok I have a theory about the cause of my problem -- java's GC failure
> I think is due
> to a solr memory leak caused from overlapping auto-commit calls --
> does that sound
> plausible?? (ducking for cover now...)
>
> I watched the log files and noticed that when the threads start to increase
> (from a stable 32 or so up to 72 before hanging!) there are two commit calls
> too close to each other + it looked like the index is in the process
> of merging at the time
> of the first commit call -- i.e. first was a long commit call with
> merge required
> then before that one finished another commit call was issued.
>
> i think this was due to the autocommit settings I had:
> 
>      50 
>      90 
>    
>
> and eventually, it seems these two different auto-commit settings
> would coincide!!
> a few times this seems to happen and not cause a problem -- but I think two
> eventually coincide where the first one is doing something heavy-duty
> like a merge
> over large index segments and so the system spirals downwards
>
> combined with the fact I was posting to Solr as fast as possible (LCF
> was waiting
> for Solr) --> i think this causes java to keel over and die.
>
> Two things were noticeable in Jconsole -
> 1) lots of threads were spawned with the two commit calls - the thread
> spawing started
> after the first commit call making me think it was a commit requiring
> an index merge...
> whereby threads overall went from the stable 32 used during
> indexing for the 2 hours prior to 72 or so within 15 minutes after the
> two commit calls
> were made...
>
> 2) both Old-gen/survivor heaps were almost totally full! so i think a
> memory leak
> is happening with overlapping commit calls + heavy duty lucene index 
> processing
> behind solr (like index merge!?)
>
> So if the overlapping commit call (second commit called before first
> one finished)
> caused a memory leak and with old-gen/survivor heaps full
> at that point, Solr became unresponsive and never recovered.
>
> is this expected when you use both autocommit settings / if concurrent commit
> calls are issued to Solr?
>
> This explains why it was happening even if without the use of my
> custom analysers
> ("text" field type used in place of mine) but took longer to happen
> --> my analysers
> are more expensive CPU/RAM-wise so the overlapping commit calls were less 
> likely
> to be forgiven as my system was already using a lot of RAM...
>
> Also, I played with the GC settings a bit where I could find settings
> that helped
> to postpone this issue as they were more forgiving to the increased
> RAM usage during
> overlapping commit calls (GC settings with increased eden heap space).
>
> Solr was hanging after about 14k files (each one an article with a set
> of docs that
> are each sentences in the article) with a total of about
> 7 million index documents.
>
> If i switch off both auto-commit settings I can get through my
> smallish 20k file set (10 million index docs) in 4 hours.
>
> I'm trying to run now on 100k articles (50 million index docs within
> 100k files)
> where I use LCF to crawl/post each file to Solr so i'll email an
> update about this.
>
> if this works ok i'm then going to try using only one auto-commit
> setting rather than two and see
> if this works ok.
>
> thanks :)
>
> bec
>
>
> On 13 August 2010 00:24, Rebecca Watson  wrote:
>> hi,
>>
>>> 1) I assume you are doing batching interspersed with commits
>>
>> as each file I crawl for are article-level each  contains all the
>> sentences for the article so they are naturally batched into the about
>> 500 documents per post in LCF.
>>
>> I use auto-commit in Solr:
>> 
>>     50 
>>     90 
>>   
>>
>>> 2) Why do you need sentence level Lucene docs?
>>
>> that's an application specific nee

Re: Integrating Solr's SynonymFilter in lucene

2010-08-18 Thread Arun Rangarajan
I think the Lucene WhitespaceTokenizer I am using inside Solr's
SynonymFilter is what prevents multi-word synonyms like "New York" from getting
mapped to the generic synonym name like CONCEPTcity. It appears to me that
an analyzer which recognizes the whitespace inside a synonym like
"New York" will be required. Do I need to implement one like this, or is
there already an analyzer I can use? It looks like I am missing something here,
since Solr's SynonymFilter is supposed to handle this. Can someone tell me
the correct way to integrate Solr's SynonymFilter within a custom
Lucene analyzer? Thanks.
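
For comparison, the nested toString output quoted below suggests the whole city
list went into a single add() call, so the map is waiting for the sequence
"New York Chicago Seattle New Orleans ..." rather than matching each city on
its own. A sketch of loading one entry per phrase instead, assuming the Solr
1.4 SynonymMap/Token APIs and the synonymMap instance from the original post
(the class and method names here are illustrative):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.solr.analysis.SynonymMap;

public class ConceptSynonymLoader {

    // Register each multi-word phrase as its own token sequence, e.g.
    // [New, York] -> CONCEPTcity, instead of one long sequence for all cities.
    public static void loadCityConcept(SynonymMap synonymMap, List<String> cityNames) {
        List<Token> replacement = new ArrayList<Token>();
        replacement.add(new Token("CONCEPTcity", 0, 0));

        for (String city : cityNames) {
            // Split on whitespace, matching what WhitespaceTokenizer produces at index time.
            List<String> singleMatch = Arrays.asList(city.split("\\s+"));
            synonymMap.add(singleMatch, replacement, true, true); // includeOrig, mergeExisting
        }
    }
}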

On Tue, Aug 17, 2010 at 4:51 PM, Arun Rangarajan
wrote:

> I am trying to have multi-word synonyms work in lucene using Solr's *
> SynonymFilter*.
>
> I need to match synonyms at index time, since many of the synonym lists are
> huge. Actually they are really not synonyms, but are words that belong to a
> concept. For example, I would like to map {"New York", "Los Angeles", "New
> Orleans", "Salt Lake City"...}, a bunch of city names, to the concept called
> "city". While searching, the user query for the concept "city" will be
> translated to a keyword like, say "CONCEPTcity", which is the synonym for
> any city name.
>
> Using lucene's SynonymAnalyzer, as explained in Lucene in Action (p. 131),
> all I could match for "CONCEPTcity" is single word city names like
> "Chicago", "Seattle", "Boston", etc., It would not match multi-word city
> names like "New York", "Los Angeles", etc.,
>
> I tried using Solr's SynonymFilter in tokenStream method in a custom
> Analyzer (that extends org.apache.lucene.analysis.
> Analyzer - lucene ver. 2.9.3) using:
>
> public TokenStream tokenStream(String fieldName, Reader reader) {
>   TokenStream result = new SynonymFilter(
>       new WhitespaceTokenizer(reader),
>       synonymMap);
>   return result;
> }
>
> where synonymMap is loaded with synonyms using
>
> synonymMap.add(conceptTerms, listOfTokens, true, true);
>
> where conceptTerms is an ArrayList of all the terms in a
> concept and listOfTokens is of type List<Token> and contains only the
> generic synonym identifier like CONCEPTcity.
>
> When I print synonymMap using synonymMap.toString(), I get the output like
>
> <{New York=<{Chicago=<{Seattle=<{New
> Orleans=<[(CATEGORYcity,0,0,type=SYNONYM),ORIG],null>}>}>}>}>
>
> so it looks like all the synonyms are loaded. But if I search for
> "CATEGORYcity" then it says no matches found. I am not sure whether I have
> loaded the synonyms correctly in the synonymMap.
>
> Any help will be deeply appreciated. Thanks!
>


Date sorting

2010-08-18 Thread kirsty

Hi I hope someone can point out what I am doing wrong.
I have a date field in my schema
<field name="PublishDate" type="date" indexed="true" stored="true"/>

and I am trying to do a sort on it
example url: 
...select/?sort=PublishDate
asc&qt=FinCompanyCodeSearch&rows=20&fq=CompanyCode:1TM&fl=CompanyCode%20Title%20PublishDate

This works for the most part, but as I keep looking down the results list
returned by the search, there is the odd out-of-place date and then it's back
in sorted order again.

http://lucene.472066.n3.nabble.com/file/n1219372/SolrSortIssue.jpg 

Looking at the above image you can see the dates seem to be in order, then I
get a 2010-08-17 date between the 2010-07 dates. This is just a sample..but
my full results continues like that.

I am not sure what I am doing wrong and would appreciate any help.
Thanks
Kirsty

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Date-sorting-tp1219372p1219372.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Date sorting

2010-08-18 Thread kirsty

Sorry forgot to mention that I am using SOLR 1.4 
and using the dismax query type.



kirsty wrote:
> 
> Hi I hope someone can point out what I am doing wrong.
> I have a date field in my schema
> 
> 
> and I am trying to do a sort on it
> example url: 
> ...select/?sort=PublishDate
> asc&qt=FinCompanyCodeSearch&rows=20&fq=CompanyCode:1TM&fl=CompanyCode%20Title%20PublishDate
> 
> This works for the most part, but if I keep looking down my results list
> returned by the search I have some random date and then it's back to
> sorting again. 
> 
>  http://lucene.472066.n3.nabble.com/file/n1219372/SolrSortIssue.jpg 
> 
> Looking at the above image you can see the dates seem to be in order, then
> I get a 2010-08-17 date between the 2010-07 dates. This is just a
> sample..but my full results continues like that.
> 
> I am not sure what I am doing wrong and would appreciate any help.
> Thanks
> Kirsty
> 
> 

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Date-sorting-tp1219372p1219377.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: improving search response time

2010-08-18 Thread Lance Norskog
More on this: you should give Solr enough memory to run comfortably,
then stop. Leave as much as you can for the OS to manage its disk
cache. The OS is better at this than Solr is. Also, it does not have
to do garbage collection.

Filter queries are a big help. You should create a set of your basic
filter queries, then compose them as needed. Filters AND together.
Lucene applies them very early in the search process, and they are
effective at cutting the amount of relevance/ranking calculation.
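
To make the composition point concrete: each fq is cached independently in the
filterCache and the cached sets are intersected, so requests that share a
filter reuse its cached entry (the field values below are borrowed from the
example queries quoted later in this thread):

  /select?qt=dismax&q=gene therapy&fq=meshterm:(gene)
  /select?qt=dismax&q=gene therapy&fq=meshterm:(gene)&fq=author:(david)

The second request only has to compute the author filter; the meshterm filter
comes straight from the cache.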

If you want to be really adventurous, there is a crazy new operating
system hack called 'giant pages'. You'll need IT experience to try
this. You'll have to do your own research, sorry.

On Wed, Aug 18, 2010 at 9:27 AM, Shawn Heisey  wrote:
>  Most of your time is spent doing the query itself, which in the light of
> other information provided, does not surprise me.  With 12GB of RAM and 9GB
> dedicated to the java heap, the available RAM for disk caching is pretty
> low, especially if Solr is actually using all 9GB.
>
> Since your index is 60GB, the system is most likely I/O bound.  Available
> memory for disk cache is the best way to make Solr fast.  If you increased
> to 16GB RAM, you'd probably see some performance increase.  Going to 32GB
> would be better, and 64GB would let your system load nearly the entire index
> into the disk cache.
>
> Is matchAll possibly an aggregated field with information copied from the
> other fields that you are searching?  If so, especially since you are using
> dismax, you'd want to strongly consider dropping it entirely, which would
> make your index a lot smaller.  Check your schema for information that could
> be trimmed.  You might not need "stored" on some fields, especially if the
> original values are available from another source (like a database, or a
> central filesystem).  You may not need advanced features on everything, like
> termvectors, termpositions, etc.
>
> If you can't make significant chances in server memory or index size, you
> might want to consider going distributed.  You'd need more servers.  A few
> things (More Like This being the one that comes to mind) do not work in a
> distributed index.
>
> Can you reduce the java heap size and still have Solr work correctly?  You
> probably do not need your internal Solr caches to be so huge, and dropping
> them would greatly reduce your heap needs.  Here's my cache settings, with
> the numbers being size, initialsize, then autowarm count.
>
> filterCache: 256, 256, 0
> queryResultCache: 1024, 512, 128
> documentCache: 16384, 4096, n/a
>
> I'm using distributed search with six large shards that each take up nearly
> 13GB.  The machines (VMs) have 9GB of RAM and the java heap size is 1280MB.
>  I'm not using a lot of the advanced features like highlighting, so I'm not
> using termvectors.  Right now, we use facets for data mining, but not in
> production.  My average query time is about 100 milliseconds, with each
> shard's average about half that.  Autowarming usually takes about 10-20
> seconds, though sometimes it balloons to about 45 seconds.  I started out
> with much larger cache numbers, but that just made my autowarm times huge.
>
> Based on my experience, I imagine that your system takes several minutes to
> autowarm your caches when you do a commit or optimize.  If you are doing
> frequent updates, that would be a major drag on performance.
>
> Two of your caches have a larger initialsize than size, with the former
> meaning the number of slots allocated immediately and the latter referring
> to the maximum size of the cache.  Apparently it's not leading to any
> disastrous problems, but you'll want to adjust accordingly.
>
>
> On 8/18/2010 9:00 AM, Muneeb Ali wrote:
>>
>> First, thanks very much for a prompt reply. Here is more info:
>>
>> ===
>>
>> a) What operating system?
>> Debian GNU/Linux 5.0
>>
>> b) What Java container (Tomcat/Jetty)
>> Jetty
>>
>> c) What JAVA_OPTIONS? I.e. memory, garbage collection etc.
>> -Xmx9000m   -DDEBUG   -Djava.awt.headless=true
>> -Dorg.mortbay.log.class=org.mortbay.log.StdErrLog
>> -Dcom.sun.management.jmxremote.port=3000
>> -Dcom.sun.management.jmxremote.authenticate=false
>> -Dcom.sun.management.jmxremote.ssl=false
>> -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC
>> -javaagent:/usr/local/lib/newrelic/newrelic.jar
>>
>> d) Example queries? I.e. what features, how many facets, sort fields etc
>>
>> /select?start=0&rows=20&fl=id&hl=true&hl.fl=title%2Cabstract%2Cauthors&hl.fragsize=300&hl.simple.pre=&hl.simple.post=<%2Fstrong>&qt=dismax&q=gene
>> therapy
>>
>> We also get queries with filters examples:
>>
>>
>> /select?start=0&rows=20&fl=id&hl=true&hl.fl=title%2Cabstract%2Cauthors&hl.fragsize=300&hl.simple.pre=&hl.simple.post=<%2Fstrong>&qt=dismax&q=gene
>> therapy&fq=meshterm:(gene)&fq=author:(david)
>>
>> e) How do you load balance queries between the slaves?
>>
>> proxy based load balance
>>
>> f) What is your search latency now and @ what QPS? Also, 

Re: Date sorting

2010-08-18 Thread Lance Norskog
Wow. Can you try upgrading to 1.4.1 and re-indexing?

On Wed, Aug 18, 2010 at 10:35 PM, kirsty  wrote:
>
> Sorry forgot to mention that I am using SOLR 1.4
> and using the dismax query type.
>
>
>
> kirsty wrote:
>>
>> Hi I hope someone can point out what I am doing wrong.
>> I have a date field in my schema
>> 
>>
>> and I am trying to do a sort on it
>> example url:
>> ...select/?sort=PublishDate
>> asc&qt=FinCompanyCodeSearch&rows=20&fq=CompanyCode:1TM&fl=CompanyCode%20Title%20PublishDate
>>
>> This works for the most part, but if I keep looking down my results list
>> returned by the search I have some random date and then it's back to
>> sorting again.
>>
>>  http://lucene.472066.n3.nabble.com/file/n1219372/SolrSortIssue.jpg
>>
>> Looking at the above image you can see the dates seem to be in order, then
>> I get a 2010-08-17 date between the 2010-07 dates. This is just a
>> sample..but my full results continues like that.
>>
>> I am not sure what I am doing wrong and would appreciate any help.
>> Thanks
>> Kirsty
>>
>>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Date-sorting-tp1219372p1219377.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: Solrj ContentStreamUpdateRequest Slow

2010-08-18 Thread Lance Norskog
'stream.url' is just a simple parameter. You should be able to just
add it directly.
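
An untested sketch of what "adding it directly" could look like with
CommonsHttpSolrServer and ContentStreamUpdateRequest -- the URL, field name and
parameter values are copied from the curl example quoted below, and this is an
assumption, not a verified fix for the NPE:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class RemoteStreamingExtract {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");

        // No local content stream is attached; stream.url tells Solr to fetch
        // the document itself from the remote server.
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.setParam("stream.url", "http://remote_server.mydomain.com/test.pdf");
        req.setParam("stream.contentType", "application/pdf");
        req.setParam("literal.content_id", "12342");
        req.setParam("commit", "true");

        server.request(req);
    }
}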

On Wed, Aug 18, 2010 at 5:35 AM, Tod  wrote:
> On 8/16/2010 6:12 PM, Chris Hostetter wrote:
>>
>> : > I think your problem may be that StreamingUpdateSolrServer buffers up
>> : > commands and sends them in batches in a background thread.  if you
>> want to
>> : > send individual updates in real time (and time them) you should just
>> use
>> : > CommonsHttpSolrServer
>> : : My goal is to batch updates.  My content lives somewhere else so I was
>> trying
>> : to find a way to tell Solr where the document lived so it could go out
>> and
>> : stream it into the index for me.  That's where I thought
>> : StreamingUpdateSolrServer would help.
>>
>> If your content lives on a machine which is not your "client" nor your
>> "server" and you want your client to tell your server to go fetch it
>> directly then the "stream.url" param is what you need -- that is unrelated
>> to wether you use StreamingUpdateSolrServer or not.
>
>
> Do you happen to have a code fragment laying around that demonstrates using
> CommonsHttpSolrServer and "stream.url"?  I've tried it in conjunction with
> ContentStreamUpdateRequest and I keep getting an annoying null pointer
> exception.  In the meantime I will check the examples...
>
>
>
>> Thinking about it some more, i suspect the reason you might be seeing a
>> delay when using StreamingUpdateSolrServer is because of this bug...
>>
>>   https://issues.apache.org/jira/browse/SOLR-1990
>>
>> ...if there are no actual documents in your UpdateRequest (because you are
>> using the stream.url param) then the StreamingUpdateSolrServer blocks until
>> all other requests are done, then delegates to the super class (so it never
>> actaully puts your indexing requests in a buffered queue, it just delays and
>> then does them immediately)
>>
>> Not sure of a good way arround this off the top of my head, but i'll note
>> it in SOLR-1990 as another problematic use case that needs dealt with.
>
> Perhaps I can execute an initial update request using a benign file before
> making the "stream.url" call?
>
> Also, to beat a dead horse, this:
> 'http://localhost:8080/solr/update/extract?stream.url=http://remote_server.mydomain.com/test.pdf&stream.contentType=application/pdf&literal.content_id=12342&commit=true'
>
> ... works fine - I just want to do it a LOT and as efficiently as possible.
>  If I have to I can wrap it in a perl script and run a cURL or LWP loop but
> I'd prefer to use SolrJ if I can.
>
> Thanks for all your help.
>
>
> - Tod
>



-- 
Lance Norskog
goks...@gmail.com


Re: Date sorting

2010-08-18 Thread Grijesh.singh

Please provide your schema.xml and solrconfig.xml so we can dig into the problem,
and tell us which version of Solr you used to index the data.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Date-sorting-tp1219372p1219534.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Date sorting

2010-08-18 Thread Grijesh.singh

Lance, is there any bug in Solr 1.4 related to sorting on a date field?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Date-sorting-tp1219372p1219537.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Date sorting

2010-08-18 Thread kirsty


Grijesh.singh wrote:
> 
> provide schema.xml and solrconfig.xml to dig the problem and by which
> version of solr u have indexed the data?
> 
My apologies, I have found my mistake! It looks like someone had already
added a sort on another date field into the requestHandler, which I was
not aware of, so that was causing the conflict!




-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Date-sorting-tp1219372p1219574.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: specifying the doc id in clustering component

2010-08-18 Thread Stanislaw Osinski
Hi Tommy,

 I'm using the clustering component with solr 1.4.
>
> The response is given by the id field in the doc array like:
>"labels":["Devices"],
>"docs":["200066",
> "195650",
> "204850",
> Is there a way to change the doc label to be another field?
>
> I couldn't find this option in http://wiki.apache.org/solr/ClusteringComponent


I'm not sure if I get you right. The "labels" field is generated by the
clustering engine, it's a description of the group (cluster) of documents.
The description is usually a phrase or a number of phrases. The "docs" field
lists the ids of documents that the algorithm assigned to the cluster.

Can you give an example of the input and output you'd expect?

Thanks!

Stanislaw


Re: Solr for multiple websites

2010-08-18 Thread Grijesh.singh

Using multicore is the right approach 
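
A minimal sketch of that layout: one core per website declared in solr.xml at
the Solr home, each core with its own conf/schema.xml and conf/solrconfig.xml
(core names and directories are illustrative):

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="site_a" instanceDir="site_a"/>
    <core name="site_b" instanceDir="site_b"/>
  </cores>
</solr>

Queries then go to /solr/site_a/select and /solr/site_b/select, keeping each
site's index and configuration separate.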
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-for-multiple-websites-tp1173220p1219772.html
Sent from the Solr - User mailing list archive at Nabble.com.