query about the server configuration

2011-06-19 Thread Jonty Rhods
Dear all,

I am quite new and have not worked with Solr under heavy request loads.

I have following server configuration:

16GB RAM
16 CPU

I need to update the index every minute, with at least 5000 docs per
day. The size of the data per day will be around 50 MB. I am expecting 10 to
30 concurrent hits on the server, which is about 2 million hits per day, and around 30 to
40 concurrent users at peak hour.

Right now I have configured a core and am using a static instance to call the Solr server
in SolrJ (SolrServer server = new HttpSolrServer();). I am worried that at
peak hour the static server instance in SolrJ will not be able to keep up with the
responses and it will become slow.

Is there any way to open more than one connection/server instance in
SolrJ, like the connection pooling we use for database connections
(Apache DBCP or Hibernate)?

Please help me configure the server for these heavy requirements.

thanks for your help.

regards
jonty


Re: about the SolrServer server = new CommonsHttpSolrServer(URL);

2011-06-19 Thread Jonty Rhods
For heavy use (30 to 40 concurrent users), will it work?
How can I open and maintain more connections at a time, like a connection pool, so
users can receive fast responses?

regards




On Fri, Jun 17, 2011 at 12:50 PM, Ahmet Arslan  wrote:

> > SolrServer server =  new CommonsHttpSolrServer(URL);
> >
> > through out the class. How can I improve the connection, in
> > my case: should I need to close the server after fetching
> > the result or CommonsHttpSolrServer(URL); will maintain at
> > their end. There is other way: I can make this as static and
> > can use through out the classes.
>
> As wiki [1] says, you must use the same instalce through out all of the
> classes.
>
> [1] http://wiki.apache.org/solr/Solrj#CommonsHttpSolrServer
>



Re: about the SolrServer server = new CommonsHttpSolrServer(URL);

2011-06-19 Thread Ahmet Arslan
> for heavy use (30 to 40 concurrent
> user) will it work.
> How to open and maintain more connection at a time like
> connection pool. So
> user cat receive fast response..

It uses HttpClient under the hood. You can pass httpClient to its constructor 
too. It seems that MultiThreadedHttpConnectionManager has 
setMaxConnectionsPerHost method.

String serverPath = "http://localhost:8983/solr";
HttpClient client = new HttpClient(new MultiThreadedHttpConnectionManager());
URL url = new URL(serverPath);
CommonsHttpSolrServer solrServer = new CommonsHttpSolrServer(url, client);
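For illustration, a rough sketch of how that pool could be tuned (the connection limits below are placeholders, not recommendations, and the Commons HttpClient 3.x / SolrJ imports are included for completeness):

import java.net.URL;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

// one shared, thread-safe connection manager behind all SolrJ requests
MultiThreadedHttpConnectionManager manager = new MultiThreadedHttpConnectionManager();
manager.getParams().setDefaultMaxConnectionsPerHost(20); // per-host cap (placeholder value)
manager.getParams().setMaxTotalConnections(20);          // total pool size (placeholder value)

HttpClient client = new HttpClient(manager);
CommonsHttpSolrServer solrServer =
    new CommonsHttpSolrServer(new URL("http://localhost:8983/solr"), client);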


Weird optimize performance degradation

2011-06-19 Thread Santiago Bazerque
Hello!

Here is a puzzling experiment:

I build an index of about 1.2MM documents using SOLR 3.1. The index has a
large number of dynamic fields (about 15.000). Each document has about 100
fields.

I add the documents in batches of 20, and every 50.000 documents I optimize
the index.

The first 10 optimizes (up to exactly 500k documents) take less than a
minute and a half.

But the 11th and all subsequent commits take north of 10 minutes. The commit
logs look identical (in the INFOSTREAM.txt file), but what used to be

   Jun 19, 2011 4:03:59 AM IW 13 [Sun Jun 19 04:03:59 EDT 2011; Lucene Merge
Thread #0]: merge: total 50 docs

Jun 19, 2011 4:04:37 AM IW 13 [Sun Jun 19 04:04:37 EDT 2011; Lucene Merge
Thread #0]: merge store matchedCount=2 vs 2


now eats a lot of time:


   Jun 19, 2011 4:37:06 AM IW 14 [Sun Jun 19 04:37:06 EDT 2011; Lucene Merge
Thread #0]: merge: total 55 docs

Jun 19, 2011 4:46:42 AM IW 14 [Sun Jun 19 04:46:42 EDT 2011; Lucene Merge
Thread #0]: merge store matchedCount=2 vs 2


What could be happening between those two lines that takes 10 minutes at
full CPU? (and with 50k fewer docs it used to take so much less?).


Thanks in advance,

Santiago


Re: Is it true that I cannot delete stored content from the index?

2011-06-19 Thread François Schiettecatte
That is correct, but you only need to commit; optimize is not a requirement
here.
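For example, with SolrJ (the id value below is only a placeholder), a delete by uniqueKey followed by a plain commit is all that's needed:

server.deleteById("doc-42");  // "server" is your existing SolrServer; deletes by uniqueKey
server.commit();              // the commit makes the deletion visible; no optimize required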

François

On Jun 18, 2011, at 11:54 PM, Mohammad Shariq wrote:

> I have define  in my solr and Deleting the docs from solr using
> this uniqueKey.
> and then doing optimization once in a day.
> is this right way to delete ???
> 
> On 19 June 2011 05:14, Erick Erickson  wrote:
> 
>> Yep, you've got to delete and re-add. Although if you have a
>>  defined you
>> can just re-add that document and Solr will automatically delete the
>> underlying
>> document.
>> 
>> You might have to optimize the index afterwards to get the data to really
>> disappear since the deletion process just marks the document as
>> deleted.
>> 
>> Best
>> Erick
>> 
>> On Sat, Jun 18, 2011 at 1:20 PM, Gabriele Kahlout
>>  wrote:
>>> Hello,
>>> 
>>> I've indexing with the content field stored. Now I'd like to delete all
>>> stored content, is there how to do that without re-indexing?
>>> 
>>> It seems not from lucene
>>> FAQ<
>> http://wiki.apache.org/lucene-java/LuceneFAQ#How_do_I_update_a_document_or_a_set_of_documents_that_are_already_indexed.3F
>>> 
>>> :
>>> How do I update a document or a set of documents that are already
>>> indexed? There
>>> is no direct update procedure in Lucene. To update an index incrementally
>>> you must first *delete* the documents that were updated, and *then
>>> re-add*them to the index.
>>> 
>>> --
>>> Regards,
>>> K. Gabriele
>>> 
>>> --- unchanged since 20/9/10 ---
>>> P.S. If the subject contains "[LON]" or the addressee acknowledges the
>>> receipt within 48 hours then I don't resend the email.
>>> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
>> time(x)
>>> < Now + 48h) ⇒ ¬resend(I, this).
>>> 
>>> If an email is sent by a sender that is not a trusted contact or the
>> email
>>> does not contain a valid code then the email is not received. A valid
>> code
>>> starts with a hyphen and ends with "X".
>>> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
>>> L(-[a-z]+[0-9]X)).
>>> 
>> 
> 
> 
> 
> -- 
> Thanks and Regards
> Mohammad Shariq



"site:" feature in Solr?

2011-06-19 Thread Gabriele Kahlout
Hello,

Besides creating an index with just the site in question, is it possible, like
with Google, to search for results only in a given domain?

-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).


Re: "site:" feature in Solr?

2011-06-19 Thread Ahmet Arslan
> Beside creating an index with just the site in question, is
> it possible like
> with Google to search for results only in a given domain?

If you have an appropriate field that is indexed, yes. fq=site:foo.com
http://wiki.apache.org/solr/CommonQueryParameters#fq
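For instance, assuming an indexed string field named "site" that holds each document's domain, a SolrJ sketch might look like:

SolrQuery query = new SolrQuery("your search terms");
query.addFilterQuery("site:foo.com");     // restrict results to the one domain
QueryResponse rsp = server.query(query);  // "server" is an existing SolrServer instance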


example doesnt run from source?

2011-06-19 Thread Jason Toy
I'm trying to run the example app from the svn source, but it doesn't seem
to work. I am able to run :
java -jar start.jar
and Jetty starts with:
INFO::Started SocketConnector@0.0.0.0:8983

But then when I go to my browser and go to this address:
http://localhost:8983/solr/
I get a 404 error.  What else do I need to do to be able to run the example
from source?


Re: example doesnt run from source?

2011-06-19 Thread Stefan Matheis

Jason,

which source did you use for the checkout and how did you build solr?

Regards
Stefan

Am 19.06.2011 15:00, schrieb Jason Toy:

I'm trying to run the example app from the svn source, but it doesn't seem
to work. I am able to run :
java -jar start.jar
and Jetty starts with:
INFO::Started SocketConnector@0.0.0.0:8983

But then when I go to my browser and go to this address:
http://localhost:8983/solr/
I get a 404 error.  What else do I need to do to be able to run the example
from source?



Re: Multiple indexes

2011-06-19 Thread lee carroll
your data is being used to build an inverted index rather than being
stored as a set of records. de-normalising is fine in most cases. what
is your use case which requires a normalised set of indices ?

2011/6/18 François Schiettecatte :
> You would need to run two independent searches and then 'join' the results.
>
> It is best not to apply a 'sql' mindset to SOLR when it comes to 
> (de)normalization, whereas you strive for normalization in sql, that is 
> usually counter-productive in SOLR. For example, I am working on a project 
> with 30+ normalized tables, but only 4 cores.
>
> Perhaps describing what you are trying to achieve would give us greater 
> insight and thus be able to make more concrete recommendation?
>
> Cheers
>
> François
>
> On Jun 18, 2011, at 2:36 PM, shacky wrote:
>
>> Il 18 giugno 2011 20:27, François Schiettecatte
>>  ha scritto:
>>> Sure.
>>
>> So I can have some searches similar to JOIN on MySQL?
>> The problem is that I need at least two tables in which search data..
>
>


Re: Weird optimize performance degradation

2011-06-19 Thread Erick Erickson
First, there's absolutely no reason to optimize this often, if at all. Older
versions of Lucene would search faster on an optimized index, but
this is no longer necessary. Optimize will reclaim data from
deleted documents, but is generally recommended to be performed
fairly rarely, often at off-peak hours.

Note that optimize will re-write your entire index into a single new segment,
so following your pattern it'll take longer and longer each time.

But the speed change happening at 500,000 documents is suspiciously
close to the default mergeFactor of 10 X 50,000. Do subsequent
optimizes (i.e. on the 750,000th document) still take that long? But
this doesn't make sense because if you're optimizing instead of
committing, each optimize should reduce your index to 1 segment and
you'll never hit a merge.

So I'm a little confused. If you're really optimizing every 50K docs, what
I'd expect to see is successively longer times, and at the end of each
optimize I'd expect there to be only one segment in your index.

Are you sure you're not just seeing successively longer times on each
optimize and just noticing it after 10?

Best
Erick

On Sun, Jun 19, 2011 at 6:04 AM, Santiago Bazerque  wrote:
> Hello!
>
> Here is a puzzling experiment:
>
> I build an index of about 1.2MM documents using SOLR 3.1. The index has a
> large number of dynamic fields (about 15.000). Each document has about 100
> fields.
>
> I add the documents in batches of 20, and every 50.000 documents I optimize
> the index.
>
> The first 10 optimizes (up to exactly 500k documents) take less than a
> minute and a half.
>
> But the 11th and all subsequent commits take north of 10 minutes. The commit
> logs look identical (in the INFOSTREAM.txt file), but what used to be
>
>   Jun 19, 2011 4:03:59 AM IW 13 [Sun Jun 19 04:03:59 EDT 2011; Lucene Merge
> Thread #0]: merge: total 50 docs
>
> Jun 19, 2011 4:04:37 AM IW 13 [Sun Jun 19 04:04:37 EDT 2011; Lucene Merge
> Thread #0]: merge store matchedCount=2 vs 2
>
>
> now eats a lot of time:
>
>
>   Jun 19, 2011 4:37:06 AM IW 14 [Sun Jun 19 04:37:06 EDT 2011; Lucene Merge
> Thread #0]: merge: total 55 docs
>
> Jun 19, 2011 4:46:42 AM IW 14 [Sun Jun 19 04:46:42 EDT 2011; Lucene Merge
> Thread #0]: merge store matchedCount=2 vs 2
>
>
> What could be happening between those two lines that takes 10 minutes at
> full CPU? (and with 50k docs less used to take so much less?).
>
>
> Thanks in advance,
>
> Santiago
>


Re: Is it true that I cannot delete stored content from the index?

2011-06-19 Thread Erick Erickson
That'll work, but you could just as easily simply add the document. Solr
will automatically take care of deleting any other documents with the same
uniqueKey as the document being added.

Optimizing once a day is reasonable, but note that about all you're
doing here is
reclaiming some space. So if you only do a few deletes a day (by a few
I'm thinking
several thousand), you may be able to reduce that to once a week. But there's
no particular reason to make it less frequent if you're satisfied with
how it works
now.

Best
Erick

On Sat, Jun 18, 2011 at 11:54 PM, Mohammad Shariq  wrote:
> I have define  in my solr and Deleting the docs from solr using
> this uniqueKey.
> and then doing optimization once in a day.
> is this right way to delete ???
>
> On 19 June 2011 05:14, Erick Erickson  wrote:
>
>> Yep, you've got to delete and re-add. Although if you have a
>>  defined you
>> can just re-add that document and Solr will automatically delete the
>> underlying
>> document.
>>
>> You might have to optimize the index afterwards to get the data to really
>> disappear since the deletion process just marks the document as
>> deleted.
>>
>> Best
>> Erick
>>
>> On Sat, Jun 18, 2011 at 1:20 PM, Gabriele Kahlout
>>  wrote:
>> > Hello,
>> >
>> > I've indexing with the content field stored. Now I'd like to delete all
>> > stored content, is there how to do that without re-indexing?
>> >
>> > It seems not from lucene
>> > FAQ<
>> http://wiki.apache.org/lucene-java/LuceneFAQ#How_do_I_update_a_document_or_a_set_of_documents_that_are_already_indexed.3F
>> >
>> > :
>> > How do I update a document or a set of documents that are already
>> > indexed? There
>> > is no direct update procedure in Lucene. To update an index incrementally
>> > you must first *delete* the documents that were updated, and *then
>> > re-add*them to the index.
>> >
>> > --
>> > Regards,
>> > K. Gabriele
>> >
>> > --- unchanged since 20/9/10 ---
>> > P.S. If the subject contains "[LON]" or the addressee acknowledges the
>> > receipt within 48 hours then I don't resend the email.
>> > subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
>> time(x)
>> > < Now + 48h) ⇒ ¬resend(I, this).
>> >
>> > If an email is sent by a sender that is not a trusted contact or the
>> email
>> > does not contain a valid code then the email is not received. A valid
>> code
>> > starts with a hyphen and ends with "X".
>> > ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
>> > L(-[a-z]+[0-9]X)).
>> >
>>
>
>
>
> --
> Thanks and Regards
> Mohammad Shariq
>


Re: example doesnt run from source?

2011-06-19 Thread Erick Erickson
Right, run "ant example" first to build the example code.
You have to run it from the /solr directory.

Best
Erick

On Sun, Jun 19, 2011 at 9:00 AM, Jason Toy  wrote:
> I'm trying to run the example app from the svn source, but it doesn't seem
> to work. I am able to run :
> java -jar start.jar
> and Jetty starts with:
> INFO::Started SocketConnector@0.0.0.0:8983
>
> But then when I go to my browser and go to this address:
> http://localhost:8983/solr/
> I get a 404 error.  What else do I need to do to be able to run the example
> from source?
>


Re: Optimize taking two steps and extra disk space

2011-06-19 Thread Michael McCandless
With LogXMergePolicy (the default before 3.2), optimize respects
mergeFactor, so it's doing 2 steps because you have 37 segments but 35
mergeFactor.

With TieredMergePolicy (default on 3.2 and after), there is now a
separate merge factor used for optimize (maxMergeAtOnceExplicit)... so
you could eg set this factor higher and more often get a single merge
for the optimize.

Mike McCandless

http://blog.mikemccandless.com

On Sat, Jun 18, 2011 at 6:45 PM, Shawn Heisey  wrote:
> I've noticed something odd in Solr 3.2 when it does an optimize.  One of my
> shards (freshly built via DIH full-import) had 37 segments, totalling
> 17.38GB of disk space.  13 of those segments were results of merges during
> initial import, the other 24 were untouched after creation.  Starting at _0,
> the final segment before optimizing is _co.  The mergefactor on the index is
> 35, chosen because it makes merged segments line up nicely on "z"
> boundaries.
>
> The optmization process created a _cp segment of 14.4GB, followed by a _cq
> segment at the final 17.27GB size, so at the peak, it took 49GB of disk
> space to hold the index.
>
> Is there any way to make it do the optimize in one pass?  Is there a
> compelling reason why it does it this way?
>
> Thanks,
> Shawn
>
>


Re: Weird optimize performance degradation

2011-06-19 Thread Santiago Bazerque
Hello Erick, thanks for your answer!

Yes, our over-optimization is mainly due to paranoia over these strange
commit times. The long optimize time persisted in all the subsequent
commits, and this is consistent with what we are seeing in other production
indexes that have the same problem. Once the anomaly shows up, it never
commits quickly again.

I combed through the last 50k documents that were added before the first
slow commit. I found one with a larger than usual number of fields (didn't
write down the number, but it was a few thousand).

I deleted it, and the following optimize was normal again (110 seconds). So
I'm pretty sure a document with lots of fields is the cause of the slowdown.

If that would be useful, I can do some further testing to confirm this
hypothesis and send the document to the list.

Thanks again for your answer.

Best,
Santiago

On Sun, Jun 19, 2011 at 10:21 AM, Erick Erickson wrote:

> First, there's absolutely no reason to optimize this often, if at all.
> Older
> versions of Lucene would search faster on an optimized index, but
> this is no longer necessary. Optimize will reclaim data from
> deleted documents, but is generally recommended to be performed
> fairly rarely, often at off-peak hours.
>
> Note that optimize will re-write your entire index into a single new
> segment,
> so following your pattern it'll take longer and longer each time.
>
> But the speed change happening at 500,000 documents is suspiciously
> close to the default mergeFactor of 10 X 50,000. Do subsequent
> optimizes (i.e. on the 750,000th document) still take that long? But
> this doesn't make sense because if you're optimizing instead of
> committing, each optimize should reduce your index to 1 segment and
> you'll never hit a merge.
>
> So I'm a little confused. If you're really optimizing every 50K docs, what
> I'd expect to see is successively longer times, and at the end of each
> optimize I'd expect there to be only one segment in your index.
>
> Are you sure you're not just seeing successively longer times on each
> optimize and just noticing it after 10?
>
> Best
> Erick
>
> On Sun, Jun 19, 2011 at 6:04 AM, Santiago Bazerque 
> wrote:
> > Hello!
> >
> > Here is a puzzling experiment:
> >
> > I build an index of about 1.2MM documents using SOLR 3.1. The index has a
> > large number of dynamic fields (about 15.000). Each document has about
> 100
> > fields.
> >
> > I add the documents in batches of 20, and every 50.000 documents I
> optimize
> > the index.
> >
> > The first 10 optimizes (up to exactly 500k documents) take less than a
> > minute and a half.
> >
> > But the 11th and all subsequent commits take north of 10 minutes. The
> commit
> > logs look identical (in the INFOSTREAM.txt file), but what used to be
> >
> >   Jun 19, 2011 4:03:59 AM IW 13 [Sun Jun 19 04:03:59 EDT 2011; Lucene
> Merge
> > Thread #0]: merge: total 50 docs
> >
> > Jun 19, 2011 4:04:37 AM IW 13 [Sun Jun 19 04:04:37 EDT 2011; Lucene Merge
> > Thread #0]: merge store matchedCount=2 vs 2
> >
> >
> > now eats a lot of time:
> >
> >
> >   Jun 19, 2011 4:37:06 AM IW 14 [Sun Jun 19 04:37:06 EDT 2011; Lucene
> Merge
> > Thread #0]: merge: total 55 docs
> >
> > Jun 19, 2011 4:46:42 AM IW 14 [Sun Jun 19 04:46:42 EDT 2011; Lucene Merge
> > Thread #0]: merge store matchedCount=2 vs 2
> >
> >
> > What could be happening between those two lines that takes 10 minutes at
> > full CPU? (and with 50k docs less used to take so much less?).
> >
> >
> > Thanks in advance,
> >
> > Santiago
> >
>


Solr Multithreading

2011-06-19 Thread Rahul Warawdekar
Hi,

I am currently working on a search based project which involves
indexing data from a SQL Server database including attachments using
DIH.
For indexing attachments (varbinary DB objects), I am using TikaEntityProcessor.

I am trying to use multithreading to speed up the indexing, but it
seems to fail when indexing attachments, even after applying a few Solr
fix patches.

My question is: is the current multithreading feature stable in Solr
3.1, or does it need further enhancements?

-- 
Thanks and Regards
Rahul A. Warawdekar


fq vs adding to query

2011-06-19 Thread Jamie Johnson
Are there any hard and fast rules about when to use fq vs adding to the
query?  For instance if I started with a search of

camera

then wanted to add another keyword say digital, is it better to do

q=camera AND digital

or

q=camera&fq=digital

I know that fq isn't taken into account when doing highlighting, so what I
am currently doing is when there are facet based queries I am doing fqs but
everything else is being added to the query, so in the case above I would
have done q=camera AND digital.  If however there was a field called
category with values standard or digital I would have done
q=camera&fq=category:digital.  Any guidance would be appreciated.


Re: fq vs adding to query

2011-06-19 Thread Mohammad Shariq
fq is a filter query: filter on category, timestamp, language, etc., but I
don't see any performance improvement if you use a plain keyword in fq.

useCases :
fq=lang:English&q=camera AND digital
OR
fq=time:[13023567 TO 13023900]&q=camera AND digital


On 19 June 2011 20:17, Jamie Johnson  wrote:

> Are there any hard and fast rules about when touse fq vs adding to the
> query?  For instance if I started with a search of
> camera
>
> then wanted to add another keyword say digital, is it better to do
>
> q=camera AND digital
>
> or
>
> q=camera&fq=digital
>
> I know that fq isn't taken into account when doing highlighting, so what I
> am currently doing is when there are facet based queries I am doing fqs but
> everything else is being added to the query, so in the case above I would
> have done q=camera AND digital.  If however there was a field called
> category with values standard or digital I would have done
> q=camera&fq=category:digital.  Any guidance would be appreciated.
>



-- 
Thanks and Regards
Mohammad Shariq


Building Solr 3.2 from sources - can't get war

2011-06-19 Thread Yuriy Akopov

Hi,

This is my first post here so excuse me please if it is not really related.

At the moment I'm using Solr 1.4.1 with SOLR-236 
(https://issues.apache.org/jira/browse/SOLR-236) patch applied to support field 
collapsing.

One of the mandatory fields of documents indexed is generated from the 
*.doc/*.docx/*.pdf files uploaded by users, so Solr Cell is also heavily used 
in the project for the purpose of parsing documents to store their plain text 
content. Unfortunately, it can't parse correctly all the documents but in most 
cases it works well enough.

Recently I learned 
(http://stackoverflow.com/questions/6369214/solr-cell-extractingrequesthandler-cannot-parse-some-doc-files/)
 that Solr Cell I'm using is old so by using its up-to-date version I can get 
more documents parsed correctly. As I am using apache-solr-cell-1.4.1.jar in my 
lib folder, first thing I tried was to replace it with apache-solr-cell-3.2.jar 
from the latest distribution without changing anything else (e.g. war file). 
After Solr instance was restarted, it worked (I managed to fetch the content of 
the parsed document) but after a number of requests crashed.

Then, I decided that in order to use *-3.2 libraries properly I need to use 3.2 
core war file as well. But as I need the collapsing functionality, I need to 
build a custom patched version of it as I did before with 1.4.1.

--
So the first question is if I was really right in my assumption here - maybe it 
is possible to upgrade Solr Cell / Tika to the latest version while still using 
1.4.1 Solr core? If that's possible, my following questions can be skipped.
--

And the problem I am facing is that I can't build the 3.2 version war file. I mean, 
when I get the source from 
http://svn.apache.org/repos/asf/lucene/solr/tags/1.4.1/release-1.4.1 among the 
build options there is the "dist-war" target which allows building the core war and a 
set of standard libraries. Everything is simple if you need to build the 1.4.1 
core.

For 3.2, I can't see a similar build option. First, there is no release-3.2 
folder, so I tried to checkout http://svn.apache.org/repos/asf/lucene/dev/trunk 
supposing this is the latest stable release (and I might be wrong there). 
However, there is no "dist-war" build option and I only get various jar files 
when building that branch with no war file at all.

--
So the second question is what exactly am I doing wrong - am I checking out the 
wrong branch (and what is the correct one then?) or am I building it 
improperly (maybe I need to modify build.xml somehow)?
--

Many thanks in advance. Feel free to ask for more details if that matters - I 
am a total noob in Java programming so very likely I've missed something here.

--
Yuriy Akopov

  

Re: Optimize taking two steps and extra disk space

2011-06-19 Thread Shawn Heisey

On 6/19/2011 7:32 AM, Michael McCandless wrote:

With LogXMergePolicy (the default before 3.2), optimize respects
mergeFactor, so it's doing 2 steps because you have 37 segments but 35
mergeFactor.

With TieredMergePolicy (default on 3.2 and after), there is now a
separate merge factor used for optimize (maxMergeAtOnceExplicit)... so
you could eg set this factor higher and more often get a single merge
for the optimize.


This makes sense.  The default for maxMergeAtOnceExplicit is 30 
according to LUCENE-854, so it merges the first 30 segments, then it 
goes back and merges the new one plus the other 7 that remain.  To 
counteract this behavior, I've put this in my solrconfig.xml, to test 
next week.

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnceExplicit">70</int>
</mergePolicy>

I figure that twice the mergeFactor (35) will likely cover every possible 
outcome.  Is that a correct thought?


Thanks,
Shawn



Re: Weird optimize performance degradation

2011-06-19 Thread Mohammad Shariq
I also have a Solr index with around 100 million docs.
I optimize once a week, and it takes around 1 hour 30 minutes to
complete.


On 19 June 2011 20:02, Santiago Bazerque  wrote:

> Hello Erick, thanks for your answer!
>
> Yes, our over-optimization is mainly due to paranoia over these strange
> commit times. The long optimize time persisted in all the subsequent
> commits, and this is consistent with what we are seeing in other production
> indexes that have the same problem. Once the anomaly shows up, it never
> commits quickly again.
>
> I combed through the last 50k documents that were added before the first
> slow commit. I found one with a larger than usual number of fields (didn't
> write down the number, but it was a few thousands).
>
> I deleted it, and the following optimize was normal again (110 seconds). So
> I'm pretty sure a document with lots of fields is the cause of the
> slowdown.
>
> If that would be useful, I can do some further testing to confirm this
> hypothesis and send the document to the list.
>
> Thanks again for your answer.
>
> Best,
> Santiago
>
> On Sun, Jun 19, 2011 at 10:21 AM, Erick Erickson  >wrote:
>
> > First, there's absolutely no reason to optimize this often, if at all.
> > Older
> > versions of Lucene would search faster on an optimized index, but
> > this is no longer necessary. Optimize will reclaim data from
> > deleted documents, but is generally recommended to be performed
> > fairly rarely, often at off-peak hours.
> >
> > Note that optimize will re-write your entire index into a single new
> > segment,
> > so following your pattern it'll take longer and longer each time.
> >
> > But the speed change happening at 500,000 documents is suspiciously
> > close to the default mergeFactor of 10 X 50,000. Do subsequent
> > optimizes (i.e. on the 750,000th document) still take that long? But
> > this doesn't make sense because if you're optimizing instead of
> > committing, each optimize should reduce your index to 1 segment and
> > you'll never hit a merge.
> >
> > So I'm a little confused. If you're really optimizing every 50K docs,
> what
> > I'd expect to see is successively longer times, and at the end of each
> > optimize I'd expect there to be only one segment in your index.
> >
> > Are you sure you're not just seeing successively longer times on each
> > optimize and just noticing it after 10?
> >
> > Best
> > Erick
> >
> > On Sun, Jun 19, 2011 at 6:04 AM, Santiago Bazerque 
> > wrote:
> > > Hello!
> > >
> > > Here is a puzzling experiment:
> > >
> > > I build an index of about 1.2MM documents using SOLR 3.1. The index has
> a
> > > large number of dynamic fields (about 15.000). Each document has about
> > 100
> > > fields.
> > >
> > > I add the documents in batches of 20, and every 50.000 documents I
> > optimize
> > > the index.
> > >
> > > The first 10 optimizes (up to exactly 500k documents) take less than a
> > > minute and a half.
> > >
> > > But the 11th and all subsequent commits take north of 10 minutes. The
> > commit
> > > logs look identical (in the INFOSTREAM.txt file), but what used to be
> > >
> > >   Jun 19, 2011 4:03:59 AM IW 13 [Sun Jun 19 04:03:59 EDT 2011; Lucene
> > Merge
> > > Thread #0]: merge: total 50 docs
> > >
> > > Jun 19, 2011 4:04:37 AM IW 13 [Sun Jun 19 04:04:37 EDT 2011; Lucene
> Merge
> > > Thread #0]: merge store matchedCount=2 vs 2
> > >
> > >
> > > now eats a lot of time:
> > >
> > >
> > >   Jun 19, 2011 4:37:06 AM IW 14 [Sun Jun 19 04:37:06 EDT 2011; Lucene
> > Merge
> > > Thread #0]: merge: total 55 docs
> > >
> > > Jun 19, 2011 4:46:42 AM IW 14 [Sun Jun 19 04:46:42 EDT 2011; Lucene
> Merge
> > > Thread #0]: merge store matchedCount=2 vs 2
> > >
> > >
> > > What could be happening between those two lines that takes 10 minutes
> at
> > > full CPU? (and with 50k docs less used to take so much less?).
> > >
> > >
> > > Thanks in advance,
> > >
> > > Santiago
> > >
> >
>



-- 
Thanks and Regards
Mohammad Shariq


Re: fq vs adding to query

2011-06-19 Thread Markus Jelsma
If you want to make good use of the filter cache then use filter queries.

> fq is filter-query, search based on category, timestamp, language etc. but
> I dont see any performance improvement if use 'keyword' in fq.
> 
> useCases :
> fq=lang:English&q=camera AND digital
> OR
> fq=time:[13023567 TO 13023900]&q=camera AND digital
> 
> On 19 June 2011 20:17, Jamie Johnson  wrote:
> > Are there any hard and fast rules about when touse fq vs adding to the
> > query?  For instance if I started with a search of
> > camera
> > 
> > then wanted to add another keyword say digital, is it better to do
> > 
> > q=camera AND digital
> > 
> > or
> > 
> > q=camera&fq=digital
> > 
> > I know that fq isn't taken into account when doing highlighting, so what
> > I am currently doing is when there are facet based queries I am doing
> > fqs but everything else is being added to the query, so in the case
> > above I would have done q=camera AND digital.  If however there was a
> > field called category with values standard or digital I would have done
> > q=camera&fq=category:digital.  Any guidance would be appreciated.


Re: Building Solr 3.2 from sources - can't get war

2011-06-19 Thread Shawn Heisey

On 6/19/2011 9:32 AM, Yuriy Akopov wrote:

For 3.2, I can't see a similar build option. First, there is no release-3.2 folder, so I 
tried to checkout http://svn.apache.org/repos/asf/lucene/dev/trunk supposing this is the 
latest stable release (and I might be wrong there). However, there is no 
"dist-war" build option and I only get various jar files when building that 
branch with no war file at all.


I don't know the answer to your first question about Tika, but I can 
tackle the second.


In the checked out lucene (either trunk or one of the 3.x branches) 
source is a solr/ directory.  You just cd into that directory, and 
dist-war becomes a build option.  I tend to build solr with "ant dist" 
which also builds all the contrib jars.  If you are using the 
dataimporthandler, you'll want the contrib jars.  DIH has always been a 
contrib module, and in 3.1 it was removed from the .war file.


Building dist succeeds, but I just tried dist-war on my checked out 3.2 
and it failed, ending with the following error:


BUILD FAILED
/opt/ncindex/src/orig_3_2/solr/build.xml:620: 
/opt/ncindex/src/orig_3_2/solr/build/web not found.


Shawn



Re: fq vs adding to query

2011-06-19 Thread Shawn Heisey

On 6/19/2011 10:00 AM, Markus Jelsma wrote:

If you wan't to make good use of the filter cache then use filter queries.


Additionally, information in filter queries will not affect relevancy 
ranking.  If you want the terms you are using to affect the document 
scores, include them in the main query.  Filter queries are intended for 
just that -- filtering.  They do it very efficiently, especially if you 
reuse them frequently, which hits the filter cache as Markus said.  It's 
often good practice to break up your filter queries into multiple fq 
statements so that there's more likelihood that they will use the cache.
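A small SolrJ sketch of that practice, reusing the field names from the examples above; each fq becomes its own filter-cache entry, so a filter shared by many searches is only computed once:

SolrQuery q = new SolrQuery("camera");   // scored, affects relevancy
q.addFilterQuery("category:digital");    // cached and reused independently
q.addFilterQuery("lang:English");        // cached and reused independently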


Thanks,
Shawn



Re: query about the server configuration

2011-06-19 Thread Ranveer

Please help, I am also in the same situation.

regards


On Sunday 19 June 2011 12:59 PM, Jonty Rhods wrote:

Dear all,

I am quite new and not work on solr for heavy request.

I have following server configuration:

16GB RAM
16 CPU

I need to index update in every minutes and at least more than 5000 docs per
day. Size of the data per day will be around 50 MB. I am expecting 10 to
30 concurrent hit on server which is 2 million hits per day and around 30 to
40 concurrent user at peak our.

Right now I had configure core and using static method to call solr server
in solrj (SolrServer server = new HttpSolrServer();). I am worried that at
peak our static instance of the server in solrj will not able to perform the
response and it will become slow.

Is there any way to open more then one connection of server instance in the
SolrJ like connection pool which we are using in Database related connection
pooling (Apache DBCP or Hibernate).

Please help me to configure the server as my heavy requirements.

thanks for your help.

regards
jonty





Re: about the SolrServer server = new CommonsHttpSolrServer(URL);

2011-06-19 Thread Ranveer

Thanks..
However, a few more queries:
How do I maintain connection threads (max and min settings)?
What would be the ideal setting for max in the setMaxConnectionsPerHost method?
Will it be OK for 30 to 40 concurrent users? How will threads be maintained by the
MultiThreadedHttpConnectionManager class?



On Sunday 19 June 2011 02:04 PM, Ahmet Arslan wrote:

for heavy use (30 to 40 concurrent
user) will it work.
How to open and maintain more connection at a time like
connection pool. So
user cat receive fast response..

It uses HttpClient under the hood. You can pass httpClient to its constructor 
too. It seems that MultiThreadedHttpConnectionManager has 
setMaxConnectionsPerHost method.

String serverPath = "http://localhost:8983/solr";
HttpClient client = new HttpClient(new MultiThreadedHttpConnectionManager());
URL url = new URL(serverPath);
CommonsHttpSolrServer solrServer = new CommonsHttpSolrServer(url, client);




Re: Building Solr 3.2 from sources - can't get war

2011-06-19 Thread Yuriy Akopov
In the checked out lucene (either trunk or one of the 3.x branches) source 
is a solr/ directory.  You just cd into that directory, and dist-war 
becomes a build option.


Thanks, Shawn! That worked and by invoking dist-war build I have received 
apache-solr-4.0-SNAPSHOT.war file successfully - but judging by its name it 
is a current 4.0 snapshot rather than stable 3.2.


Alas, 4.0 doesn't suit me for two reasons: first, it is still experimental 
and hasn't been released yet (at least as far as I know) and second, it 
supports field collapsing natively, so it doesn't need to be patched. The 
problem is that the parameters Solr 4.0 uses to control collapsing are not 
compatible with the ones added by SOLR-236 patch so I have to rewrite my 
client application as well. Which is surely inevitable sooner or later but 
until 4.0 is released I'd prefer to stick to an earlier version.


So I need advice once again - which folder do I need to check out to get the 3.2 
source code? It is clear for 1.4.1 (.../tags/release-1.4.1 is obvious 
enough), but ...dev/trunk turned out to contain 4.0. Surely my question is 
silly but I can't figure out how to get buildable Solr 3.2 source code.


-y.



RE: Building Solr 3.2 from sources - can't get war

2011-06-19 Thread Steven A Rowe
https://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_2/

> -Original Message-
> From: Yuriy Akopov [mailto:ako...@hotmail.co.uk]
> Sent: Sunday, June 19, 2011 4:38 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Building Solr 3.2 from sources - can't get war
> 
> > In the checked out lucene (either trunk or one of the 3.x branches)
> source
> > is a solr/ directory.  You just cd into that directory, and dist-war
> > becomes a build option.
> 
> Thanks, Shawn! That worked and by invoking dist-war build I have received
> apache-solr-4.0-SNAPSHOT.war file successfully - but judging by its name
> it
> is a current 4.0 snapshot rather than stable 3.2.
> 
> Alas, 4.0 doesn't suit me for two reasons: first, it is still
> experimental
> and hasn't been released yet (at least as far as I know) and second, it
> supports field collapsing natively, so it doesn't need to be patched. The
> problem is that the parameters Solr 4.0 uses to control collapsing are
> not
> compatible with the ones added by SOLR-236 patch so I have to rewrite my
> client application as well. Which is surely inevitable sooner or later
> but
> until 4.0 is released I'd prefer stick to earlier version.
> 
> So I need an advice once again - which folder I need to checkout to get
> 3.2
> source code? Is it clear with 1.4.1 (.../tags/release-1.4.1 is obvious
> enough), and ...dev/trunk turned out to contain 4.0. Surely my question
> is
> silly but I can't figure out how can I get Solr 3.2 buildable source
> code.
> 
> -y.



Re: Solr and Tag Cloud

2011-06-19 Thread Alexey Serba
Suppose you have a multivalued field _tag_ on every document in
your corpus. Then you can build a tag cloud for the whole data set or for a
specific query by retrieving facets on the _tag_ field for "*:*" or any
other query. You'll get a list of popular _tag_ values relevant to
that query, with occurrence counts.

If you want to build a tag cloud over general analyzed text fields you
can still do it the same way, but note that you may hit
some performance/memory problems if you have a significant data set and
huge text fields. You should probably use stop words to filter out popular
general terms.
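A rough SolrJ sketch of that approach (the field name "tag" and the limit of 50 are illustrative assumptions):

SolrQuery q = new SolrQuery("*:*");          // or any user query to scope the cloud
q.setFacet(true);
q.addFacetField("tag");                      // the multivalued tag field
q.setFacetLimit(50);                         // top 50 terms for the cloud
q.setFacetMinCount(1);
QueryResponse rsp = server.query(q);
for (FacetField.Count c : rsp.getFacetField("tag").getValues()) {
    System.out.println(c.getName() + " (" + c.getCount() + ")");  // term and its occurrence count
}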

On Sat, Jun 18, 2011 at 8:12 AM, Jamie Johnson  wrote:
> Does anyone have details of how to generate a tag cloud of popular terms
> across an entire data set and then also across a query?
>


paging and maintaingin a cursor just like ScrollableResultSet

2011-06-19 Thread Hiller, Dean x66079
As you probably know, using Query in hibernate/JPA gets slower and slower each 
page since it starts all over on the index tree :( WHILE ScrollableResultSet 
does NOT because the database maintains a cursor into the index that just picks 
up where it left off so as you go to the next page, next page, the speed stays 
linearly the same

Does something like that exist in solr?

I was looking at the api and all the examples are just for returning all 
results from what I could tell.

I went into Lucene and it looks like it can do it kind of if you code up your 
own Collector and unfortunately make the Collector.collect(int doc) block on a 
lock while waiting for the client to ask for the next page(or ask to release 
the resource since it is complete).

Ie. ScrollableResultSet obviously has to be closed when complete and so would 
this method as well.

Any ideas on how to achieve this? My client is a computer, not a webapp with a 
human clicking next page, and we want the result-set paging to stay linear, as the 
current behaviour really hurts our performance.

Thanks,
Dean

This message and any attachments are intended only for the use of the addressee 
and
may contain information that is privileged and confidential. If the reader of 
the 
message is not the intended recipient or an authorized representative of the
intended recipient, you are hereby notified that any dissemination of this
communication is strictly prohibited. If you have received this communication in
error, please notify us immediately by e-mail and delete the message and any
attachments from your system.



Re: Why are not query keywords treated as a set?

2011-06-19 Thread lee carroll
do you mean a phrase query? "past past"
can you give some more detail?

On 18 June 2011 13:02, Gabriele Kahlout  wrote:
> q=past past
>
> 1.0 = (MATCH) sum of:
> *  0.5 = (MATCH) fieldWeight(content:past in 0), product of:*
>   1.0 = tf(termFreq(content:past)=1)
>   1.0 = idf(docFreq=1, maxDocs=2)
>   0.5 = fieldNorm(field=content, doc=0)
> *  0.5 = (MATCH) fieldWeight(content:past in 0), product of:*
>   1.0 = tf(termFreq(content:past)=1)
>   1.0 = idf(docFreq=1, maxDocs=2)
>   0.5 = fieldNorm(field=content, doc=0)
>
> Is there how I can treat the query keywords as a set?
>
> --
> Regards,
> K. Gabriele
>
> --- unchanged since 20/9/10 ---
> P.S. If the subject contains "[LON]" or the addressee acknowledges the
> receipt within 48 hours then I don't resend the email.
> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
> < Now + 48h) ⇒ ¬resend(I, this).
>
> If an email is sent by a sender that is not a trusted contact or the email
> does not contain a valid code then the email is not received. A valid code
> starts with a hyphen and ends with "X".
> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
> L(-[a-z]+[0-9]X)).
>


Re: paging and maintaingin a cursor just like ScrollableResultSet

2011-06-19 Thread Michael Sokolov
One technique I've used to page through huge result sets that could 
help: if you have a sortable key (like an id), you can just fetch all 
docs, sorted by the key, and then on subsequent page requests use the 
last value from the previous page as a filter in a range term like:


id:[<last-key> TO *]

where you substitute <last-key> with the last key value from the previous page.
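A minimal SolrJ sketch of that idea, assuming "id" is the sortable unique key (all names here are placeholders):

// Fetch one page, resuming after the last key seen on the previous page.
// Pass lastId = null for the first page; "server" is an existing SolrServer.
public SolrDocumentList fetchPage(SolrServer server, String lastId, int rows)
        throws SolrServerException {
    SolrQuery q = new SolrQuery("*:*");
    q.setSortField("id", SolrQuery.ORDER.asc);   // stable order by the key
    q.setRows(rows);
    if (lastId != null) {
        // the range filter resumes where the previous page left off; the boundary
        // document is returned again and can be skipped client-side
        q.addFilterQuery("id:[" + lastId + " TO *]");
    }
    return server.query(q).getResults();
}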

there may be a better approach though...

-Mike

On 6/19/2011 6:02 PM, Hiller, Dean x66079 wrote:

As you probably know, using Query in hibernate/JPA gets slower and slower each 
page since it starts all over on the index tree :( WHILE ScrollableResultSet 
does NOT because the database maintains a cursor into the index that just picks 
up where it left off so as you go to the next page, next page, the speed stays 
linearly the same

Does something like that exist in solr?

I was looking at the api and all the examples are just for returning all 
results from what I could tell.

I went into Lucene and it looks like it can do it kind of if you code up your 
own Collector and unfortunately make the Collector.collect(int doc) block on a 
lock while waiting for the client to ask for the next page(or ask to release 
the resource since it is complete).

Ie. ScrollableResultSet obviously has to be closed when complete and so would 
this method as well.

Any ideas on how to achieve this as my client is a computer not a webapp with a 
human clicking next page and we want the resultset paging to be linear as it 
really hurts our performance.

Thanks,
Dean







Re: solr highliting feature

2011-06-19 Thread Jan Høydahl
Hi,

First, you should consider SolrJ API if you're working from Java/JSP.

Then, say you want to highlight title. In your loop across the N hits, instead 
of pulling the title from the hits themselves, check if you find a highlighted 
result with the same ID in the highlighting section.
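A rough SolrJ sketch of that lookup (the field name "title" and uniqueKey "id" are assumptions):

QueryResponse rsp = server.query(query);   // query was built with hl=true and hl.fl=title
Map<String, Map<String, List<String>>> hl = rsp.getHighlighting();
for (SolrDocument doc : rsp.getResults()) {
    String id = (String) doc.getFieldValue("id");
    List<String> snippets = (hl.get(id) != null) ? hl.get(id).get("title") : null;
    // prefer the highlighted fragment, fall back to the stored field value
    String title = (snippets != null && !snippets.isEmpty())
            ? snippets.get(0)
            : (String) doc.getFieldValue("title");
    System.out.println(title);
}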

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 18. juni 2011, at 11.26, Romi wrote:

> I want to highlight some search result value. i used solr for this. as i
> suppose solr provides highlighting feature. i used it i configure
> highlighting in solr-config.xml. i set hl="true" and hl.fl="somefield" at
> query time in my url when i run the url it gives me a xml representation of
> search results where i got a tag .
> 
> further i am parsing this xml response to show result in a jsp page. but i
> ma not getting how can i high lite the fields in jsp page
> 
> -
> Thanks & Regards
> Romi
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/solr-highliting-feature-tp3079239p3079239.html
> Sent from the Solr - User mailing list archive at Nabble.com.



why too many open files?

2011-06-19 Thread Jason, Kim
Hi, All

I have 12 shards and ramBufferSizeMB=512, mergeFactor=5.
But Solr raises java.io.FileNotFoundException (Too many open files).
The mergeFactor is just 5. How can this happen?
Below are the segments of one shard. That is far more segments than the mergeFactor.
What's wrong, and how should I set the mergeFactor?

==
[root@solr solr]# ls indexData/multicore-us/usn02/data/index/
_0.fdt   _gs.fdt  _h5.tii  _hl.nrm  _i1.nrm  _kn.nrm  _l1.nrm  _lq.tii
_0.fdx   _gs.fdx  _h5.tis  _hl.prx  _i1.prx  _kn.prx  _l1.prx  _lq.tis
_3i.fdt  _gs.fnm  _h7.fnm  _hl.tii  _i1.tii  _kn.tii  _l1.tii 
lucene-2de7b31b5eabdff0b6ec7fd32eecf8c7-write.lock
_3i.fdx  _gs.frq  _h7.frq  _hl.tis  _i1.tis  _kn.tis  _l1.tis  _lu.fnm
_3s.fnm  _gs.nrm  _h7.nrm  _hn.fnm  _j7.fdt  _kp.fnm  _l2.fnm  _lu.frq
_3s.frq  _gs.prx  _h7.prx  _hn.frq  _j7.fdx  _kp.frq  _l2.frq  _lu.nrm
_3s.nrm  _gs.tii  _h7.tii  _hn.nrm  _kb.fnm  _kp.nrm  _l2.nrm  _lu.prx
_3s.prx  _gs.tis  _h7.tis  _hn.prx  _kb.frq  _kp.prx  _l2.prx  _lu.tii
_3s.tii  _gu.fnm  _h9.fnm  _hn.tii  _kb.nrm  _kp.tii  _l2.tii  _lu.tis
_3s.tis  _gu.frq  _h9.frq  _hn.tis  _kb.prx  _kp.tis  _l2.tis  _ly.fnm
_48.fdt  _gu.nrm  _h9.nrm  _hp.fnm  _kb.tii  _kq.fnm  _l6.fnm  _ly.frq
_48.fdx  _gu.prx  _h9.prx  _hp.frq  _kb.tis  _kq.frq  _l6.frq  _ly.nrm
_4d.fnm  _gu.tii  _h9.tii  _hp.nrm  _kc.fnm  _kq.nrm  _l6.nrm  _ly.prx
_4d.frq  _gu.tis  _h9.tis  _hp.prx  _kc.frq  _kq.prx  _l6.prx  _ly.tii
_4d.nrm  _gw.fnm  _hb.fnm  _hp.tii  _kc.nrm  _kq.tii  _l6.tii  _ly.tis
_4d.prx  _gw.frq  _hb.frq  _hp.tis  _kc.prx  _kq.tis  _l6.tis  _m3.fnm
_4d.tii  _gw.nrm  _hb.nrm  _hr.fnm  _kc.tii  _kr.fnm  _la.fnm  _m3.frq
_4d.tis  _gw.prx  _hb.prx  _hr.frq  _kc.tis  _kr.frq  _la.frq  _m3.nrm
_5b.fdt  _gw.tii  _hb.tii  _hr.nrm  _kf.fdt  _kr.nrm  _la.nrm  _m3.prx
_5b.fdx  _gw.tis  _hb.tis  _hr.prx  _kf.fdx  _kr.prx  _la.prx  _m3.tii
_5b.fnm  _gy.fnm  _he.fdt  _hr.tii  _kf.fnm  _kr.tii  _la.tii  _m3.tis
_5b.frq  _gy.frq  _he.fdx  _hr.tis  _kf.frq  _kr.tis  _la.tis  _m8.fnm
_5b.nrm  _gy.nrm  _he.fnm  _ht.fnm  _kf.nrm  _kt.fnm  _le.fnm  _m8.frq
_5b.prx  _gy.prx  _he.frq  _ht.frq  _kf.prx  _kt.frq  _le.frq  _m8.nrm
_5b.tii  _gy.tii  _he.nrm  _ht.nrm  _kf.tii  _kt.nrm  _le.nrm  _m8.prx
_5b.tis  _gy.tis  _he.prx  _ht.prx  _kf.tis  _kt.prx  _le.prx  _m8.tii
_5m.fnm  _h0.fnm  _he.tii  _ht.tii  _kg.fnm  _kt.tii  _le.tii  _m8.tis
_5m.frq  _h0.frq  _he.tis  _ht.tis  _kg.frq  _kt.tis  _le.tis  _md.fnm
_5m.nrm  _h0.nrm  _hh.fnm  _hv.fnm  _kg.nrm  _kw.fnm  _li.fnm  _md.frq
_5m.prx  _h0.prx  _hh.frq  _hv.frq  _kg.prx  _kw.frq  _li.frq  _md.nrm
_5m.tii  _h0.tii  _hh.nrm  _hv.nrm  _kg.tii  _kw.nrm  _li.nrm  _md.prx
_5m.tis  _h0.tis  _hh.prx  _hv.prx  _kg.tis  _kw.prx  _li.prx  _md.tii
_5n.fnm  _h2.fnm  _hh.tii  _hv.tii  _kj.fdt  _kw.tii  _li.tii  _md.tis
_5n.frq  _h2.frq  _hh.tis  _hv.tis  _kj.fdx  _kw.tis  _li.tis  _mi.fnm
_5n.nrm  _h2.nrm  _hk.fnm  _hz.fdt  _kj.fnm  _ky.fnm  _lm.fnm  _mi.frq
_5n.prx  _h2.prx  _hk.frq  _hz.fdx  _kj.frq  _ky.frq  _lm.frq  _mi.nrm
_5n.tii  _h2.tii  _hk.nrm  _hz.fnm  _kj.nrm  _ky.nrm  _lm.nrm  _mi.prx
_5n.tis  _h2.tis  _hk.prx  _hz.frq  _kj.prx  _ky.prx  _lm.prx  _mi.tii
_5x.fnm  _h5.fdt  _hk.tii  _hz.nrm  _kj.tii  _ky.tii  _lm.tii  _mi.tis
_5x.frq  _h5.fdx  _hk.tis  _hz.prx  _kj.tis  _ky.tis  _lm.tis  segments_1
_5x.nrm  _h5.fnm  _hl.fdt  _hz.tii  _kn.fdt  _l1.fdt  _lq.fnm  segments.gen
_5x.prx  _h5.frq  _hl.fdx  _hz.tis  _kn.fdx  _l1.fdx  _lq.frq
_5x.tii  _h5.nrm  _hl.fnm  _i1.fnm  _kn.fnm  _l1.fnm  _lq.nrm
_5x.tis  _h5.prx  _hl.frq  _i1.frq  _kn.frq  _l1.frq  _lq.prx
==

Thanks in advance.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/why-too-many-open-files-tp3084407p3084407.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: query about the server configuration

2011-06-19 Thread Jonty Rhods
I forgot an important point: I need to commit to the server every 2 to 5
minutes.

Please help..

regards


On Sun, Jun 19, 2011 at 11:29 PM, Ranveer  wrote:

> Please help I am also in same situation.
>
> regards
>
>
>
> On Sunday 19 June 2011 12:59 PM, Jonty Rhods wrote:
>
>> Dear all,
>>
>> I am quite new and not work on solr for heavy request.
>>
>> I have following server configuration:
>>
>> 16GB RAM
>> 16 CPU
>>
>> I need to index update in every minutes and at least more than 5000 docs
>> per
>> day. Size of the data per day will be around 50 MB. I am expecting 10 to
>> 30 concurrent hit on server which is 2 million hits per day and around 30
>> to
>> 40 concurrent user at peak our.
>>
>> Right now I had configure core and using static method to call solr server
>> in solrj (SolrServer server = new HttpSolrServer();). I am worried that at
>> peak our static instance of the server in solrj will not able to perform
>> the
>> response and it will become slow.
>>
>> Is there any way to open more then one connection of server instance in
>> the
>> SolrJ like connection pool which we are using in Database related
>> connection
>> pooling (Apache DBCP or Hibernate).
>>
>> Please help me to configure the server as my heavy requirements.
>>
>> thanks for your help.
>>
>> regards
>> jonty
>>
>>
>


score of Infinity on dismax query

2011-06-19 Thread Chris Book
Hello, I have a solr search server running and in at least one very rare
case, I'm seeing a strange scoring result.  The following example will cause
solr to return a score of "Infinity":

Query: {!dismax tie=0.1 qf=lyrics pf=lyrics ps=5}drugs the drugs

Here is the debug output:
Infinity = (MATCH) sum of:
  0.0758089 = (MATCH) sum of:
0.03790445 = (MATCH) weight(lyrics:drug in 0), product of:
  0.40824828 = queryWeight(lyrics:drug), product of:
0.30685282 = idf(docFreq=1, maxDocs=1)
1.3304368 = queryNorm
  0.09284656 = (MATCH) fieldWeight(lyrics:drug in 0), product of:
3.8729835 = tf(termFreq(lyrics:drug)=15)
0.30685282 = idf(docFreq=1, maxDocs=1)
0.078125 = fieldNorm(field=lyrics, doc=0)
0.03790445 = (MATCH) weight(lyrics:drug in 0), product of:
  0.40824828 = queryWeight(lyrics:drug), product of:
0.30685282 = idf(docFreq=1, maxDocs=1)
1.3304368 = queryNorm
  0.09284656 = (MATCH) fieldWeight(lyrics:drug in 0), product of:
3.8729835 = tf(termFreq(lyrics:drug)=15)
0.30685282 = idf(docFreq=1, maxDocs=1)
0.078125 = fieldNorm(field=lyrics, doc=0)
  Infinity = (MATCH) weight(lyrics:"drug ? drug"~5 in 0), product of:
0.81649655 = queryWeight(lyrics:"drug ? drug"~5), product of:
  0.61370564 = idf(lyrics: drug=1 drug=1)
  1.3304368 = queryNorm
Infinity = fieldWeight(lyrics:"drug drug" in 0), product of:
  Infinity = tf(phraseFreq=Infinity)
  0.61370564 = idf(lyrics: drug=1 drug=1)
  0.078125 = fieldNorm(field=lyrics, doc=0)

Here is the text of the 'lyrics' field entry that gives the Infinity score:
http://pastebin.com/JcN5hM8c



There seems to be some kind of issue with the search query containing the
same word twice with a stopword ("the") between them.  To me it looks
like a bug but I wanted to check here first.  I'm seeing this in both 1.4.1
and 3.1.0.

Regards,
Chris


Re: Why are not query keywords treated as a set?

2011-06-19 Thread Gabriele Kahlout
past past
*past past*
*content:past content:past*

I was expecting the query to get parsed into content:past only and not
content:past content:past.

On Mon, Jun 20, 2011 at 12:12 AM, lee carroll
wrote:

> do you mean a phrase query? "past past"
> can you give some more detail?
>
> On 18 June 2011 13:02, Gabriele Kahlout  wrote:
> > q=past past
> >
> > 1.0 = (MATCH) sum of:
> > *  0.5 = (MATCH) fieldWeight(content:past in 0), product of:*
> >   1.0 = tf(termFreq(content:past)=1)
> >   1.0 = idf(docFreq=1, maxDocs=2)
> >   0.5 = fieldNorm(field=content, doc=0)
> > *  0.5 = (MATCH) fieldWeight(content:past in 0), product of:*
> >   1.0 = tf(termFreq(content:past)=1)
> >   1.0 = idf(docFreq=1, maxDocs=2)
> >   0.5 = fieldNorm(field=content, doc=0)
> >
> > Is there how I can treat the query keywords as a set?
> >
> > --
> > Regards,
> > K. Gabriele
> >
> > --- unchanged since 20/9/10 ---
> > P.S. If the subject contains "[LON]" or the addressee acknowledges the
> > receipt within 48 hours then I don't resend the email.
> > subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
> time(x)
> > < Now + 48h) ⇒ ¬resend(I, this).
> >
> > If an email is sent by a sender that is not a trusted contact or the
> email
> > does not contain a valid code then the email is not received. A valid
> code
> > starts with a hyphen and ends with "X".
> > ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
> > L(-[a-z]+[0-9]X)).
> >
>



-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).


Re: score of Infinity on dismax query

2011-06-19 Thread Robert Muir
This is a bug, thanks for including all the information necessary to reproduce!

https://issues.apache.org/jira/browse/LUCENE-3215

On Sun, Jun 19, 2011 at 10:24 PM, Chris Book  wrote:
> Hello, I have a solr search server running and in at least one very rare
> case, I'm seeing a strange scoring result.  The following example will cause
> solr to return a score of "Infinity":
>
> Query: {!dismax tie=0.1 qf=lyrics pf=lyrics ps=5}drugs the drugs
>
> Here is the debug output:
> Infinity = (MATCH) sum of:
>  0.0758089 = (MATCH) sum of:
>    0.03790445 = (MATCH) weight(lyrics:drug in 0), product of:
>      0.40824828 = queryWeight(lyrics:drug), product of:
>        0.30685282 = idf(docFreq=1, maxDocs=1)
>        1.3304368 = queryNorm
>      0.09284656 = (MATCH) fieldWeight(lyrics:drug in 0), product of:
>        3.8729835 = tf(termFreq(lyrics:drug)=15)
>        0.30685282 = idf(docFreq=1, maxDocs=1)
>        0.078125 = fieldNorm(field=lyrics, doc=0)
>    0.03790445 = (MATCH) weight(lyrics:drug in 0), product of:
>      0.40824828 = queryWeight(lyrics:drug), product of:
>        0.30685282 = idf(docFreq=1, maxDocs=1)
>        1.3304368 = queryNorm
>      0.09284656 = (MATCH) fieldWeight(lyrics:drug in 0), product of:
>        3.8729835 = tf(termFreq(lyrics:drug)=15)
>        0.30685282 = idf(docFreq=1, maxDocs=1)
>        0.078125 = fieldNorm(field=lyrics, doc=0)
>  Infinity = (MATCH) weight(lyrics:"drug ? drug"~5 in 0), product of:
>    0.81649655 = queryWeight(lyrics:"drug ? drug"~5), product of:
>      0.61370564 = idf(lyrics: drug=1 drug=1)
>      1.3304368 = queryNorm
>    Infinity = fieldWeight(lyrics:"drug drug" in 0), product of:
>      Infinity = tf(phraseFreq=Infinity)
>      0.61370564 = idf(lyrics: drug=1 drug=1)
>      0.078125 = fieldNorm(field=lyrics, doc=0)
>
> Here is the text of the 'lyrics' field entry that gives the Infinity score:
> http://pastebin.com/JcN5hM8c
>
>
>
> There seems to be some kind of issue when the search query contains the same
> word twice in a row with a stopword ("the") between them.  To me it looks
> like a bug, but I wanted to check here first.  I'm seeing this in both 1.4.1
> and 3.1.0.
>
> Regards,
> Chris
>


Re: solr highliting feature

2011-06-19 Thread Romi
yes, I find the title in the highlighting section. If I am getting results, say by
parsing the JSON object, then do I need to parse the highlighting section?

-
Thanks & Regards
Romi
--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-highliting-feature-tp3079239p3084890.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr highliting feature

2011-06-19 Thread Jan Høydahl
Perhaps I don't understand your question right, but if you're working with the 
json response format, yes, you need to pull the highlighted version of the 
field from the highlighting section.
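
For what it's worth, the same lookup in SolrJ terms looks roughly like the sketch below (the
'id' unique key and a highlighted 'title' field are assumptions). When parsing the raw JSON
yourself the path is the same: highlighting, then the document's unique key, then the field name.

import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class HighlightLookup {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("title:ring");
    q.setHighlight(true);
    q.addHighlightField("title");
    QueryResponse rsp = server.query(q);
    // highlighting is keyed by the unique key of each hit, then by field name
    Map<String, Map<String, List<String>>> hl = rsp.getHighlighting();
    for (SolrDocument doc : rsp.getResults()) {
      String id = String.valueOf(doc.getFieldValue("id"));
      List<String> snippets = hl.containsKey(id) ? hl.get(id).get("title") : null;
      // fall back to the stored field when no snippet was produced for this document
      Object display = (snippets != null && !snippets.isEmpty())
          ? snippets.get(0) : doc.getFieldValue("title");
      System.out.println(id + " -> " + display);
    }
  }
}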

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 20. juni 2011, at 07.22, Romi wrote:

> yes, I find the title in the highlighting section. If I am getting results, say by
> parsing the JSON object, then do I need to parse the highlighting section?
> 
> -
> Thanks & Regards
> Romi
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/solr-highliting-feature-tp3079239p3084890.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Request handle solrconfig.xml Spellchecker

2011-06-19 Thread Romi
I am trying to set up the spellchecker according to the Solr documentation, but
when I test it I don't get any suggestions. My configuration follows:



<searchComponent name="spellcheck" class="solr.SpellCheckComponent">

  <str name="queryAnalyzerFieldType">textSpell</str>

  <lst name="spellchecker">
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="name">default</str>
    <str name="field">name</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>
</searchComponent>

<requestHandler name="/spell" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <!-- dictionary (spellchecker name) to use -->
    <str name="spellcheck.dictionary">default</str>
    <!-- suggest only more popular terms -->
    <str name="spellcheck.onlyMorePopular">false</str>
    <!-- return extended result details -->
    <str name="spellcheck.extendedResults">false</str>
    <!-- number of suggestions to return -->
    <str name="spellcheck.count">1</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

I build the dictionary with:
http://localhost:8983/solr/select/?q=*:*&spellcheck=true&spellcheck.build=true


but when I run the query I am not getting any suggestions:
http://localhost:8983/solr/select?q=komputer&spellcheck=true
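
For reference, the equivalent request through SolrJ, assuming the request handler above is
registered as /spell (as in the wiki example this configuration appears to follow); the handler
name and the existence of a built dictionary are assumptions, so treat this as a sketch:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;

public class SpellcheckSketch {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("komputer");
    q.set("qt", "/spell");              // route to the handler that has the spellcheck component
    q.set("spellcheck", "true");
    q.set("spellcheck.build", "true");  // only needed once, to build ./spellchecker
    QueryResponse rsp = server.query(q);
    SpellCheckResponse sc = rsp.getSpellCheckResponse();
    if (sc != null) {
      for (SpellCheckResponse.Suggestion s : sc.getSuggestions()) {
        System.out.println(s.getToken() + " -> " + s.getAlternatives());
      }
    }
  }
}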

-
Thanks & Regards
Romi
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Request-handle-solrconfig-xml-Spellchecker-tp3085053p3085053.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: why too many open files?

2011-06-19 Thread Mark Schoy
Hi,

have you checked the maximum number of open files allowed by your OS?

see: http://lj4newbies.blogspot.com/2007/04/too-many-open-files.html
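
Besides raising the per-process limit (ulimit -n on Linux), switching the index to the compound
file format cuts the number of files per segment down to essentially one. A sketch of the
relevant solrconfig.xml setting, using the values quoted below and otherwise assumed:

<indexDefaults>
  <!-- pack each segment's files (.fnm, .frq, .prx, .tii, .tis, .nrm, ...) into a single .cfs file -->
  <useCompoundFile>true</useCompoundFile>
  <ramBufferSizeMB>512</ramBufferSizeMB>
  <mergeFactor>5</mergeFactor>
</indexDefaults>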



2011/6/20 Jason, Kim 

> Hi, All
>
> I have 12 shards and ramBufferSizeMB=512, mergeFactor=5.
> But Solr raises java.io.FileNotFoundException (Too many open files).
> mergeFactor is just 5. How can this happen?
> Below are the segments of one shard. That is far more segments than the mergeFactor.
> What's wrong, and how should I set the mergeFactor?
>
> ==
> [root@solr solr]# ls indexData/multicore-us/usn02/data/index/
> _0.fdt   _gs.fdt  _h5.tii  _hl.nrm  _i1.nrm  _kn.nrm  _l1.nrm  _lq.tii
> _0.fdx   _gs.fdx  _h5.tis  _hl.prx  _i1.prx  _kn.prx  _l1.prx  _lq.tis
> _3i.fdt  _gs.fnm  _h7.fnm  _hl.tii  _i1.tii  _kn.tii  _l1.tii
> lucene-2de7b31b5eabdff0b6ec7fd32eecf8c7-write.lock
> _3i.fdx  _gs.frq  _h7.frq  _hl.tis  _i1.tis  _kn.tis  _l1.tis  _lu.fnm
> _3s.fnm  _gs.nrm  _h7.nrm  _hn.fnm  _j7.fdt  _kp.fnm  _l2.fnm  _lu.frq
> _3s.frq  _gs.prx  _h7.prx  _hn.frq  _j7.fdx  _kp.frq  _l2.frq  _lu.nrm
> _3s.nrm  _gs.tii  _h7.tii  _hn.nrm  _kb.fnm  _kp.nrm  _l2.nrm  _lu.prx
> _3s.prx  _gs.tis  _h7.tis  _hn.prx  _kb.frq  _kp.prx  _l2.prx  _lu.tii
> _3s.tii  _gu.fnm  _h9.fnm  _hn.tii  _kb.nrm  _kp.tii  _l2.tii  _lu.tis
> _3s.tis  _gu.frq  _h9.frq  _hn.tis  _kb.prx  _kp.tis  _l2.tis  _ly.fnm
> _48.fdt  _gu.nrm  _h9.nrm  _hp.fnm  _kb.tii  _kq.fnm  _l6.fnm  _ly.frq
> _48.fdx  _gu.prx  _h9.prx  _hp.frq  _kb.tis  _kq.frq  _l6.frq  _ly.nrm
> _4d.fnm  _gu.tii  _h9.tii  _hp.nrm  _kc.fnm  _kq.nrm  _l6.nrm  _ly.prx
> _4d.frq  _gu.tis  _h9.tis  _hp.prx  _kc.frq  _kq.prx  _l6.prx  _ly.tii
> _4d.nrm  _gw.fnm  _hb.fnm  _hp.tii  _kc.nrm  _kq.tii  _l6.tii  _ly.tis
> _4d.prx  _gw.frq  _hb.frq  _hp.tis  _kc.prx  _kq.tis  _l6.tis  _m3.fnm
> _4d.tii  _gw.nrm  _hb.nrm  _hr.fnm  _kc.tii  _kr.fnm  _la.fnm  _m3.frq
> _4d.tis  _gw.prx  _hb.prx  _hr.frq  _kc.tis  _kr.frq  _la.frq  _m3.nrm
> _5b.fdt  _gw.tii  _hb.tii  _hr.nrm  _kf.fdt  _kr.nrm  _la.nrm  _m3.prx
> _5b.fdx  _gw.tis  _hb.tis  _hr.prx  _kf.fdx  _kr.prx  _la.prx  _m3.tii
> _5b.fnm  _gy.fnm  _he.fdt  _hr.tii  _kf.fnm  _kr.tii  _la.tii  _m3.tis
> _5b.frq  _gy.frq  _he.fdx  _hr.tis  _kf.frq  _kr.tis  _la.tis  _m8.fnm
> _5b.nrm  _gy.nrm  _he.fnm  _ht.fnm  _kf.nrm  _kt.fnm  _le.fnm  _m8.frq
> _5b.prx  _gy.prx  _he.frq  _ht.frq  _kf.prx  _kt.frq  _le.frq  _m8.nrm
> _5b.tii  _gy.tii  _he.nrm  _ht.nrm  _kf.tii  _kt.nrm  _le.nrm  _m8.prx
> _5b.tis  _gy.tis  _he.prx  _ht.prx  _kf.tis  _kt.prx  _le.prx  _m8.tii
> _5m.fnm  _h0.fnm  _he.tii  _ht.tii  _kg.fnm  _kt.tii  _le.tii  _m8.tis
> _5m.frq  _h0.frq  _he.tis  _ht.tis  _kg.frq  _kt.tis  _le.tis  _md.fnm
> _5m.nrm  _h0.nrm  _hh.fnm  _hv.fnm  _kg.nrm  _kw.fnm  _li.fnm  _md.frq
> _5m.prx  _h0.prx  _hh.frq  _hv.frq  _kg.prx  _kw.frq  _li.frq  _md.nrm
> _5m.tii  _h0.tii  _hh.nrm  _hv.nrm  _kg.tii  _kw.nrm  _li.nrm  _md.prx
> _5m.tis  _h0.tis  _hh.prx  _hv.prx  _kg.tis  _kw.prx  _li.prx  _md.tii
> _5n.fnm  _h2.fnm  _hh.tii  _hv.tii  _kj.fdt  _kw.tii  _li.tii  _md.tis
> _5n.frq  _h2.frq  _hh.tis  _hv.tis  _kj.fdx  _kw.tis  _li.tis  _mi.fnm
> _5n.nrm  _h2.nrm  _hk.fnm  _hz.fdt  _kj.fnm  _ky.fnm  _lm.fnm  _mi.frq
> _5n.prx  _h2.prx  _hk.frq  _hz.fdx  _kj.frq  _ky.frq  _lm.frq  _mi.nrm
> _5n.tii  _h2.tii  _hk.nrm  _hz.fnm  _kj.nrm  _ky.nrm  _lm.nrm  _mi.prx
> _5n.tis  _h2.tis  _hk.prx  _hz.frq  _kj.prx  _ky.prx  _lm.prx  _mi.tii
> _5x.fnm  _h5.fdt  _hk.tii  _hz.nrm  _kj.tii  _ky.tii  _lm.tii  _mi.tis
> _5x.frq  _h5.fdx  _hk.tis  _hz.prx  _kj.tis  _ky.tis  _lm.tis  segments_1
> _5x.nrm  _h5.fnm  _hl.fdt  _hz.tii  _kn.fdt  _l1.fdt  _lq.fnm  segments.gen
> _5x.prx  _h5.frq  _hl.fdx  _hz.tis  _kn.fdx  _l1.fdx  _lq.frq
> _5x.tii  _h5.nrm  _hl.fnm  _i1.fnm  _kn.fnm  _l1.fnm  _lq.nrm
> _5x.tis  _h5.prx  _hl.frq  _i1.frq  _kn.frq  _l1.frq  _lq.prx
> ==
>
> Thanks in advance.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/why-too-many-open-files-tp3084407p3084407.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Why are not query keywords treated as a set?

2011-06-19 Thread lee carroll
this might help in your analysis chain

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.RemoveDuplicatesTokenFilterFactory
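
A sketch of where that filter would sit in schema.xml (the field type name and the rest of the
chain are assumptions). Note that the filter only drops a token whose text matches the previous
token at the same position, so whether it collapses "past past" depends on the rest of your
analysis chain:

<fieldType name="text_dedup" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- removes duplicate tokens that occur at the same position -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>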



On 20 June 2011 04:21, Gabriele Kahlout  wrote:
> past past
> *past past*
> *content:past content:past*
>
> I was expecting the query to get parsed into content:past only and not
> content:past content:past.
>
> On Mon, Jun 20, 2011 at 12:12 AM, lee carroll
> wrote:
>
>> do you mean a phrase query? "past past"
>> can you give some more detail?
>>
>> On 18 June 2011 13:02, Gabriele Kahlout  wrote:
>> > q=past past
>> >
>> > 1.0 = (MATCH) sum of:
>> > *  0.5 = (MATCH) fieldWeight(content:past in 0), product of:*
>> >   1.0 = tf(termFreq(content:past)=1)
>> >   1.0 = idf(docFreq=1, maxDocs=2)
>> >   0.5 = fieldNorm(field=content, doc=0)
>> > *  0.5 = (MATCH) fieldWeight(content:past in 0), product of:*
>> >   1.0 = tf(termFreq(content:past)=1)
>> >   1.0 = idf(docFreq=1, maxDocs=2)
>> >   0.5 = fieldNorm(field=content, doc=0)
>> >
>> > Is there how I can treat the query keywords as a set?
>> >
>> > --
>> > Regards,
>> > K. Gabriele
>> >
>> > --- unchanged since 20/9/10 ---
>> > P.S. If the subject contains "[LON]" or the addressee acknowledges the
>> > receipt within 48 hours then I don't resend the email.
>> > subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
>> time(x)
>> > < Now + 48h) ⇒ ¬resend(I, this).
>> >
>> > If an email is sent by a sender that is not a trusted contact or the
>> email
>> > does not contain a valid code then the email is not received. A valid
>> code
>> > starts with a hyphen and ends with "X".
>> > ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
>> > L(-[a-z]+[0-9]X)).
>> >
>>
>
>
>
> --
> Regards,
> K. Gabriele
>
> --- unchanged since 20/9/10 ---
> P.S. If the subject contains "[LON]" or the addressee acknowledges the
> receipt within 48 hours then I don't resend the email.
> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
> < Now + 48h) ⇒ ¬resend(I, this).
>
> If an email is sent by a sender that is not a trusted contact or the email
> does not contain a valid code then the email is not received. A valid code
> starts with a hyphen and ends with "X".
> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
> L(-[a-z]+[0-9]X)).
>