Re: Any tips for indexing large amounts of data?

2007-11-02 Thread Brendan Grainger
Thanks so much for your suggestions. I am attempting to index 550K
docs at once, but have found I've had to break them up into smaller
batches. Indexing seems to stall at around 47K docs (the index reaches
264MB in size at this point). The index itself eventually grows to
about 2GB. I am using embedded Solr and adding a document with code
very similar to this:




private void addModel(Model model) throws IOException {
    UpdateHandler updateHandler = solrCore.getUpdateHandler();
    AddUpdateCommand addcmd = new AddUpdateCommand();

    DocumentBuilder builder = new DocumentBuilder(solrCore.getSchema());

    builder.startDoc();
    builder.addField("id", "Model:" + model.getUuid());
    builder.addField("class", "Model");
    builder.addField("uuid", model.getUuid());
    builder.addField("one_facet", model.getOneFacet());
    builder.addField("another_facet", model.getAnotherFacet());

    // .. other fields

    addcmd.doc = builder.getDoc();
    addcmd.allowDups = false;
    addcmd.overwritePending = true;
    addcmd.overwriteCommitted = true;
    updateHandler.addDoc(addcmd);
}

I have other 'Model' objects I'm adding also.

Thanks

On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:



: I would think you would see better performance by allowing auto commit
: to handle the commit size instead of reopening the connection all the time.

if your goal is "fast" indexing, don't use autoCommit at all ... just
index everything, and don't commit until you are completely done.

autoCommitting will slow your indexing down (the benefit being that more
results will be visible to searchers as you proceed)
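[Editor's sketch] Following Hoss's advice, a minimal sketch of the relevant solrconfig.xml fragment (element names as in the stock Solr 1.x example config; the commented-out values are illustrative): keep autoCommit disabled during the bulk load and issue one explicit commit when everything is indexed.

```xml
<!-- solrconfig.xml: leave autoCommit disabled for bulk loads -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Kept commented out on purpose: no intermediate commits while indexing.
  <autoCommit>
    <maxDocs>10000</maxDocs>
  </autoCommit>
  -->
</updateHandler>
```

With embedded Solr, the final commit would then be a single CommitUpdateCommand sent through the UpdateHandler once the batch finishes; over HTTP it is one POST of `<commit/>` to the update handler.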




-Hoss





Re: Phrase Query Performance Question

2007-11-02 Thread Walter Underwood
He means "extremely frequent" and I agree. --wunder

On 11/2/07 1:51 AM, "Haishan Chen" <[EMAIL PROTECTED]> wrote:

> Thanks for the advice. You certainly have a point. I believe you mean a query
> term that appears in 5-10% of an index in a natural language corpus is
> extremely INFREQUENT?  



RE: Phrase Query Performance Question

2007-11-02 Thread Haishan Chen




> Date: Fri, 2 Nov 2007 07:32:30 -0700
> Subject: Re: Phrase Query Performance Question
> From: [EMAIL PROTECTED]
> To: solr-user@lucene.apache.org
>
> He means "extremely frequent" and I agree. --wunder
 
 
Then it means a PHRASE (a combination of terms, excluding stopwords) that appears in
5% to 10% of an index should NOT be that frequent? I guess I get the idea.
 
 
> > On 11/2/07 1:51 AM, "Haishan Chen" <[EMAIL PROTECTED]> wrote:
> >
> > > Thanks for the advice. You certainly have a point. I believe you mean a query
> > > term that appears in 5-10% of an index in a natural language corpus is
> > > extremely INFREQUENT?

Solr and Lucene Indexing Performance

2007-11-02 Thread Jae Joo
Hi,

I have 6 million articles to be indexed by Solr and need your
recommendation.

I need to parse the articles and generate Solr XML files to post them. How
about using Lucene directly?
From a short test, it looks like Solr-based indexing is faster than
direct indexing through Lucene.

Did I do something wrong, and/or does Solr use multiple threads or
something else to get its good indexing performance?

Thanks

Jae Joo


RE: Phrase Query Performance Question

2007-11-02 Thread Haishan Chen




> From: [EMAIL PROTECTED]
> Subject: Re: Phrase Query Performance Question
> Date: Thu, 1 Nov 2007 11:25:26 -0700
> To: solr-user@lucene.apache.org
>
> On 31-Oct-07, at 11:54 PM, Haishan Chen wrote:
>
> >> Date: Wed, 31 Oct 2007 17:54:53 -0700
> >> Subject: Re: Phrase Query Performance Question
> >> From: [EMAIL PROTECTED]
> >> To: [EMAIL PROTECTED]
> >>
> >> "hurricane katrina" is a very expensive query against a collection
> >> focused on Hurricane Katrina. There will be many matches in many
> >> documents. If you want to measure worst-case, this is fine.
> >>
> >> I'd try other things, like:
> >>
> >> * ninth ward
> >> * Ray Nagin
> >> * Audubon Park
> >> * Canal Street
> >> * French Quarter
> >> * FEMA mistakes
> >> * storm surge
> >> * Jackson Square
> >>
> >> Of course, real query logs are the only real test.
> >>
> >> wunder
> >
> > These terms are not frequent in my index. I believe they are going
> > to be fast. The thing is that I feel 2 million documents is a small
> > index.
> >
> > 100,000 or 200,000 hits is a small set and should always have sub
> > second query performance. Now I am only querying one field and the
> > response is almost one second. I feel I can't achieve sub second
> > performance if I add a bit more complexity to the query.
> >
> > Many of the category terms in my index will appear in more than 5%
> > of the documents and those category terms are very popular search
> > terms. So the examples I gave were not extreme cases for my index.
>
> I think that you are somewhat misguided about what constitutes a
> small set. A query term that appears in 5-10% of the index in a
> natural language corpus is _extremely_ frequent. Not quite on the
> order of stopwords, but getting there. As a comparison, on an
> extremely large corpus that I have handy, documents containing both
> the words 'auto' and 'repair' (not necessarily adjacent) constitute
> 0.1% of the index. The frequency of the phrase "auto repair" is 0.025%.
>
> Your 200k docs would be the response rate from an 800-million-doc corpus.
>
> What data are you indexing, and what is the intended effect of the
> phrase queries you are performing? Perhaps getting at the issue from
> this end would be more productive than hammering at the phrase-query
> performance question.
 
 
 
 
Thanks for the advice. You certainly have a point. I believe you mean a query 
term that appears in 5-10% of an index in a  natural language corpus is 
extremely INFREQUENT?  
 
 
 
 
> > When I start tomcat I saw this message:
> >
> > The Apache Tomcat Native library which allows optimal performance
> > in production environments was not found on the java.library.path
> >
> > Does that mean that if I use the Apache Tomcat Native library the query
> > performance will be better? Anyone has experience on that?
>
> Unlikely, though it might help you slightly at a high query rate with
> high cache hit ratios.
>
> -Mike
 
I have tried the Apache Tomcat Native library on my Windows machine, and you are right:
no obvious difference in query performance.
 
 
 
I have tried the index on a Linux machine.
The Windows machine: Windows 2003, one Intel(R) Xeon(TM) CPU, 3.00 GHz
(quad-core), 4GB RAM.
The Linux machine: (not sure which version of Linux), two Intel(R) Xeon(R)
E5310 CPUs, 1.6 GHz (quad-core), 4GB RAM.
 
Both systems have RAID 5, but I don't know the difference between the arrays.
 
I found a substantial indexing performance improvement on the Linux machine.
On the Windows machine indexing took more than 5 hours, but it took only one
hour to index 2 million documents on the Linux system. I am really happy to
see that. I guess both Linux and the extra CPU contributed to the improvement.
 
Query performance is almost the same, though. The CPU on the Linux machine is
slower, so I think that if the Linux system were using the same CPU as the
Windows system, query performance would improve too. Both indexing and
querying are CPU-bound, if I am right.
 
I guess I have learned enough on this question, but I still want to try
solr-trunk. Will update everyone later.
 
 
 
Thanks
-Haishan
 
 
 
 
 
 
 
 
 
 
 
 
 

Re: Phrase Query Performance Question

2007-11-02 Thread Mike Klaas

On 2-Nov-07, at 10:03 AM, Haishan Chen wrote:






> Date: Fri, 2 Nov 2007 07:32:30 -0700
> Subject: Re: Phrase Query Performance Question
> From: [EMAIL PROTECTED]
> To: solr-user@lucene.apache.org
>
> > He means "extremely frequent" and I agree. --wunder
>
> Then it means a PHRASE (a combination of terms, excluding stopwords) that
> appears in 5% to 10% of an index should NOT be that frequent? I guess I
> get the idea.


Phrases should be rarer than individual keywords.  5-10% is  
moderately high even for a _single_ keyword, let alone the  
conjunction of two keywords, let alone the _exact phrase_ of two  
keywords (non stopwords in all of this discussion).


As I mentioned, the 'natural' rate of 'auto'+'repair' on a corpus  
100's of times bigger than yours (web documents) is .1%, and the rate  
of the phrase 'auto repair' is .025%.
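[Editor's sketch] As a sanity check on Mike's numbers (the corpus size and rates are taken straight from his message), a 0.025% phrase rate over an 800-million-doc corpus does land right at 200k hits:

```java
public class CorpusMath {
    public static void main(String[] args) {
        long corpusDocs = 800_000_000L;  // Mike's large web corpus
        double conjunctionRate = 0.001;  // 'auto' AND 'repair' in the same doc: 0.1%
        double phraseRate = 0.00025;     // exact phrase "auto repair": 0.025%

        long conjunctionHits = Math.round(corpusDocs * conjunctionRate);
        long phraseHits = Math.round(corpusDocs * phraseRate);

        System.out.println(conjunctionHits); // 800000
        System.out.println(phraseHits);      // 200000
    }
}
```

So 200k hits at the phrase's natural rate already implies a corpus hundreds of times larger than a 2-million-doc index.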


It still feels to me that you are trying to do something unique with
your phrase queries.  Unfortunately, you still haven't said what you
are trying to do in general terms, which makes it very difficult for
people to help you.


-Mike


Re: Solr and Lucene Indexing Performance

2007-11-02 Thread Mike Klaas

On 2-Nov-07, at 11:41 AM, Jae Joo wrote:


Hi,

I have 6 million articles to be indexed by Solr and need your
recommendation.

I need to parse the articles and generate Solr XML files to post them. How
about using Lucene directly?
From a short test, it looks like Solr-based indexing is faster than
direct indexing through Lucene.


I wouldn't recommend that.  If you use persistent connections,
multiple threads and >1 docs/update you should achieve comparable
performance (about 10 docs/request is about the right balance for
web-sized docs).
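[Editor's sketch] The batching Mike describes amounts to putting several documents inside one `<add>` update message per request, rather than one request per document. A minimal illustration of building such a message (the `id` field name and batch size are illustrative, following Solr's XML update format):

```java
import java.util.List;

public class BatchedUpdate {
    /** Build a single <add> message carrying several documents. */
    static String buildAddMessage(List<String> ids) {
        StringBuilder xml = new StringBuilder("<add>");
        for (String id : ids) {
            xml.append("<doc>")
               .append("<field name=\"id\">").append(id).append("</field>")
               .append("</doc>");
        }
        return xml.append("</add>").toString();
    }

    public static void main(String[] args) {
        // One request body for a 10-doc batch; this is what would be POSTed
        // to /solr/update over a persistent (keep-alive) connection.
        String body = buildAddMessage(
            List.of("a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9", "a10"));
        System.out.println(body.split("<doc>", -1).length - 1); // 10
    }
}
```

Each HTTP round trip then carries ~10 documents instead of one, which is where most of the gap to raw Lucene indexing closes.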


If you want to index directly, use embedded Solr, not Lucene directly
(see the wiki).



Did I do something wrong, and/or does Solr use multiple threads or
something else to get its good indexing performance?


It does use multiple threads if you connect to Solr using multiple  
threads.  But it doesn't do it behind the scenes if you aren't using  
multiple threads.
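[Editor's sketch] In other words, the parallelism has to come from the feeding client. A stub of that pattern (the actual POST of each batch is replaced by a counter here; thread count and batch size are illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelFeeder {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger docsIndexed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(4); // 4 feeder threads

        for (int batch = 0; batch < 100; batch++) {
            pool.submit(() -> {
                // Stand-in for: POST one 10-doc <add> message to /solr/update.
                docsIndexed.addAndGet(10);
            });
        }

        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        System.out.println(docsIndexed.get()); // 1000
    }
}
```

Solr will service these concurrent update requests in parallel; a single-threaded feeder gets no such benefit.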


Some possible differences:

1. Solr has more aggressive default buffering settings
(maxBufferedDocs, mergeFactor)
2. Solr trunk (if that is what you are using) uses a more recent
version of Lucene than the released 2.2


-Mike


Re: Phrase Query Performance Question

2007-11-02 Thread Chris Hostetter

: It still feels to me that you are trying to do something unique with your
: phrase queries.  Unfortunately, you still haven't said what you are trying to
: do in general terms, which makes it very difficult for people to help you.

Agreed.  This seems like a very special case, but we don't know what the case is.

If there are specific phrases you know in advance that you will care 
about, and those phrases occur as frequently as the individual 
"words", then the best way to deal with them is to index each "phrase" as 
a single Term (and ignore the individual words)
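[Editor's sketch] One client-side way to realize Hoss's suggestion (the underscore join is an illustrative convention, not a Solr API): normalize each known phrase to a single token, applied identically at index time and query time, so "hurricane katrina" becomes one Term lookup instead of a positional phrase check:

```java
import java.util.Locale;

public class PhraseAsTerm {
    /** Collapse a known phrase into one indexable token,
     *  e.g. "Hurricane Katrina" -> "hurricane_katrina". */
    static String asSingleTerm(String phrase) {
        return phrase.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", "_");
    }

    public static void main(String[] args) {
        // The same normalization must run when indexing the field
        // and when building the query, or nothing will match.
        System.out.println(asSingleTerm("Hurricane Katrina")); // hurricane_katrina
        System.out.println(asSingleTerm("auto  repair"));      // auto_repair
    }
}
```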

Speaking more generally to Mike's point...

http://people.apache.org/~hossman/#xyproblem
Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341





-Hoss



RE: Phrase Query Performance Question

2007-11-02 Thread Haishan Chen


> Date: Fri, 2 Nov 2007 12:31:29 -0700
> From: [EMAIL PROTECTED]
> To: solr-user@lucene.apache.org
> Subject: Re: Phrase Query Performance Question
>
> : It still feels to me that you are trying to do something unique with your
> : phrase queries. Unfortunately, you still haven't said what you are trying to
> : do in general terms, which makes it very difficult for people to help you.
>
> Agreed. This seems like a very special case, but we don't know what the case is.
>
> If there are specific phrases you know in advance that you will care
> about, and those phrases occur as frequently as the individual
> "words", then the best way to deal with them is to index each "phrase" as
> a single Term (and ignore the individual words)
>
> Speaking more generally to Mike's point...
>
> http://people.apache.org/~hossman/#xyproblem
> Your question appears to be an "XY Problem" ... that is: you are dealing
> with "X", you are assuming "Y" will help you, and you are asking about "Y"
> without giving more details about the "X" so that we can understand the
> full issue. Perhaps the best solution doesn't involve "Y" at all?
> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>
> -Hoss
I think the documents I was indexing cannot be considered natural language
documents. They are constructed following certain rules and then fed into the
indexing process. I guess that because of those rules, many of the target search
terms have a high document frequency. I am not under any obligation to achieve
quarter-second performance; I am just interested to see whether it is achievable.
 
Thanks everyone for offering advice
-Haishan
 
 
 
 
 
 
 

Re: Solr production live implementation

2007-11-02 Thread Otis Gospodnetic
Hi Tim (switching to the more appropriate solr-user list)

It's hard to tell and depends on things like the integration of search into the rest 
of the site, the placement of the search field/form, the exposure, etc.  The 
corpus/index does not sound large, but the mention of Windows scares me, as 
does the 2GB of RAM (this won't be enough: your index is likely going to be too 
big to fit in RAM, causing a lot of disk IO).

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Tim Archambault <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Friday, November 2, 2007 11:50:21 AM
Subject: Solr production live implementation

If this is the wrong email forum, I apologize in advance.

Looking to use Solr as the PRIMARY search engine for our newspaper website.
The index will initially hold between 200,000 - 500,000 documents.

I'm not sure what analytic data you'd need to help me with my question, but
I can tell you our website incurs roughly 4 million page views monthly and
about 30,000 absolute unique visitors per month. Our website traffic is
concentrated between 8am - 12noon, so we have a lot of off-peak time on our
server.

I am currently trying Solr out on my dedicated Windows server (IIS 5)
with Jetty.  My server has 2GB RAM and tons of space. What is the likelihood
that this environment is "good enough" for my production environment?

Any feedback is greatly appreciated.

Tim