Re: Any tips for indexing large amounts of data?
Thanks so much for your suggestions. I am attempting to index 550K docs at once, but have found I've had to break them up into smaller batches. Indexing seems to stop at around 47K docs (the index reaches 264MB in size at this point). The index itself eventually grows to about 2GB. I am using embedded Solr and adding each document with code very similar to this:

    private void addModel(Model model) throws IOException {
        UpdateHandler updateHandler = solrCore.getUpdateHandler();
        AddUpdateCommand addcmd = new AddUpdateCommand();
        DocumentBuilder builder = new DocumentBuilder(solrCore.getSchema());
        builder.startDoc();
        builder.addField("id", "Model:" + model.getUuid());
        builder.addField("class", "Model");
        builder.addField("uuid", model.getUuid());
        builder.addField("one_facet", model.getOneFacet());
        builder.addField("another_facet", model.getAnotherFacet());
        // ... other fields
        addcmd.doc = builder.getDoc();
        addcmd.allowDups = false;
        addcmd.overwritePending = true;
        addcmd.overwriteCommitted = true;
        updateHandler.addDoc(addcmd);
    }

I have other 'Model' objects I'm adding also. Thanks

On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:

: I would think you would see better performance by allowing auto commit
: to handle the commit size instead of reopening the connection all the
: time.

if your goal is "fast" indexing, don't use autoCommit at all ... just
index everything, and don't commit until you are completely done.

autoCommitting will slow your indexing down (the benefit being that
more results will be visible to searchers as you proceed)

-Hoss
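For reference, a minimal sketch of the "commit only once at the end" approach with the embedded update handler might look like the following. It assumes the same solrCore field as addModel() above, and it assumes CommitUpdateCommand takes a single optimize flag; treat both as unverified assumptions rather than a prescribed API:

    // Minimal sketch: add everything, then commit exactly once at the end.
    // Assumes the same solrCore as addModel() above; the
    // CommitUpdateCommand(boolean optimize) signature is an assumption.
    private void indexAll(List<Model> models) throws IOException {
        for (Model model : models) {
            addModel(model); // addDoc() only; no commit inside the loop
        }
        CommitUpdateCommand commit = new CommitUpdateCommand(false); // false = commit without optimizing
        solrCore.getUpdateHandler().commit(commit); // one commit makes all docs searchable
    }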
Re: Phrase Query Performance Question
He means "extremely frequent" and I agree.

--wunder

On 11/2/07 1:51 AM, "Haishan Chen" <[EMAIL PROTECTED]> wrote:
> Thanks for the advice. You certainly have a point. I believe you mean a query
> term that appears in 5-10% of an index in a natural language corpus is
> extremely INFREQUENT?
RE: Phrase Query Performance Question
> Date: Fri, 2 Nov 2007 07:32:30 -0700
> Subject: Re: Phrase Query Performance Question
> From: [EMAIL PROTECTED]
> To: solr-user@lucene.apache.org
>
> He means "extremely frequent" and I agree. --wunder

Then it means a PHRASE (a combination of terms, excluding stopwords) appearing in 5% to 10% of an index should NOT be that frequent? I guess I get the idea.

> On 11/2/07 1:51 AM, "Haishan Chen" <[EMAIL PROTECTED]> wrote:
> > Thanks for the advice. You certainly have a point. I believe you mean a query
> > term that appears in 5-10% of an index in a natural language corpus is
> > extremely INFREQUENT?
Solr and Lucene Indexing Performance
Hi, I have 6 million articles to be indexed by Solr and need your recommendation. I need to parse the articles and generate Solr-based XML files to post them. How about using Lucene directly? From a short test, it looks like Solr-based indexing is faster than indexing directly through Lucene. Did I do something wrong, and/or does Solr use multiple threads or something else to get the good indexing performance?

Thanks

Jae Joo
RE: Phrase Query Performance Question
> From: [EMAIL PROTECTED]
> Subject: Re: Phrase Query Performance Question
> Date: Thu, 1 Nov 2007 11:25:26 -0700
> To: solr-user@lucene.apache.org
>
> On 31-Oct-07, at 11:54 PM, Haishan Chen wrote:
> > > Date: Wed, 31 Oct 2007 17:54:53 -0700
> > > Subject: Re: Phrase Query Performance Question
> > > From: [EMAIL PROTECTED]
> > > To: solr-user@lucene.apache.org
> > >
> > > "hurricane katrina" is a very expensive query against a collection
> > > focused on Hurricane Katrina. There will be many matches in many
> > > documents. If you want to measure worst-case, this is fine.
> > >
> > > I'd try other things, like:
> > > * ninth ward
> > > * Ray Nagin
> > > * Audubon Park
> > > * Canal Street
> > > * French Quarter
> > > * FEMA mistakes
> > > * storm surge
> > > * Jackson Square
> > >
> > > Of course, real query logs are the only real test.
> > >
> > > wunder
> >
> > These terms are not frequent in my index. I believe they are going to be
> > fast. The thing is that I feel 2 million documents is a small index.
> > 100,000 or 200,000 hits is a small set and should always have sub-second
> > query performance. Now I am only querying one field and the response is
> > almost one second. I feel I can't achieve sub-second performance if I add
> > a bit more complexity to the query.
> >
> > Many of the category terms in my index will appear in more than 5% of the
> > documents and those category terms are very popular search terms. So the
> > examples I gave were not extreme cases for my index.
>
> I think that you are somewhat misguided about what constitutes a small set.
> A query term that appears in 5-10% of the index in a natural language corpus
> is _extremely_ frequent. Not quite on the order of stopwords, but getting
> there. As a comparison, on an extremely large corpus that I have handy,
> documents containing both the word 'auto' and 'repair' (not necessarily
> adjacent) constitute 0.1% of the index. The frequency of the phrase
> "auto repair" is 0.025%.
>
> At that rate, 200k docs would be the response from an 800-million-doc corpus.
>
> What data are you indexing, and what is the intended effect of the phrase
> queries you are performing? Perhaps getting at the issue from this end would
> be more productive than hammering at the phrase-query performance question.

Thanks for the advice. You certainly have a point. I believe you mean a query term that appears in 5-10% of an index in a natural language corpus is extremely INFREQUENT?

> > When I start tomcat I saw this message:
> > "The Apache Tomcat Native library which allows optimal performance in
> > production environments was not found on the java.library.path"
> >
> > Does that mean that if I use the Apache Tomcat Native library the query
> > performance will be better? Anyone have experience with that?
>
> Unlikely, though it might help you slightly at a high query rate with
> high cache hit ratios.
>
> -Mike

I have tried the Apache Tomcat Native library on my Windows machine and you are right: no obvious difference in query performance. I have also tried the index on a Linux machine.

The Windows machine: Windows 2003, one Intel(R) Xeon(TM) CPU 3.00 GHz (quad-core CPU), 4GB RAM. The Linux machine: (not sure what version of Linux), two Intel(R) Xeon(R) CPU E5310 1.6 GHz (quad-core CPUs), 4GB RAM. Both systems have RAID 5 but I don't know the difference between them.

I found a substantial indexing performance improvement on the Linux machine. On the Windows machine indexing took more than 5 hours, but it took only one hour to index 2 million documents on the Linux system. I am really happy to see that. I guess both Linux and the extra CPU contributed to the improvement. Query performance is almost the same, though. The CPUs on the Linux machine are slower, so I think that if the Linux system were using the same CPUs as the Windows system, query performance would improve too. Both indexing and querying are CPU bound, if I am right.

I guess I have got enough on this question, but I still want to try solr-trunk. Will update everyone later. Thanks

-Haishan
Re: Phrase Query Performance Question
On 2-Nov-07, at 10:03 AM, Haishan Chen wrote:

> > Date: Fri, 2 Nov 2007 07:32:30 -0700
> > Subject: Re: Phrase Query Performance Question
> > From: [EMAIL PROTECTED]
> > To: solr-user@lucene.apache.org
> >
> > He means "extremely frequent" and I agree. --wunder
>
> Then it means a PHRASE (a combination of terms, excluding stopwords)
> appearing in 5% to 10% of an index should NOT be that frequent? I guess I
> get the idea.

Phrases should be rarer than individual keywords. 5-10% is moderately high even for a _single_ keyword, let alone the conjunction of two keywords, let alone the _exact phrase_ of two keywords (non-stopwords in all of this discussion). As I mentioned, the 'natural' rate of 'auto'+'repair' on a corpus hundreds of times bigger than yours (web documents) is 0.1%, and the rate of the phrase 'auto repair' is 0.025%.

It still feels to me that you are trying to do something unique with your phrase queries. Unfortunately, you still haven't said what you are trying to do in general terms, which makes it very difficult for people to help you.

-Mike
Re: Solr and Lucene Indexing Performance
On 2-Nov-07, at 11:41 AM, Jae Joo wrote:

> Hi, I have 6 million articles to be indexed by Solr and need your
> recommendation. I need to parse the articles and generate Solr-based XML
> files to post them. How about using Lucene directly? From a short test, it
> looks like Solr-based indexing is faster than indexing directly through
> Lucene.

I wouldn't recommend that. If you use persistent connections, multiple threads, and more than one doc per update, you should achieve comparable performance (about 10 docs/request is about the right balance for web-sized docs). If you want to index directly, use embedded Solr, not Lucene directly (see the wiki).

> Did I do something wrong, and/or does Solr use multiple threads or
> something else to get the good indexing performance?

It does use multiple threads if you connect to Solr using multiple threads, but it doesn't do so behind the scenes if you aren't using multiple threads. Some possible differences:

1. Solr has more aggressive default buffering settings (maxBufferedDocs, mergeFactor)
2. solr trunk (if that is what you are using) uses a more recent version of Lucene than the released 2.2

-Mike
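To illustrate the "persistent connections, multiple threads, more than one doc per update" advice, here is a rough Java sketch that posts batches of ten documents from several threads using HttpURLConnection (which keeps HTTP/1.1 connections alive by default). The URL, thread count, batch size, and loadDocsSomehow() helper are assumptions for illustration only, not values taken from this thread:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class BatchPoster {
        private static final String UPDATE_URL = "http://localhost:8983/solr/update"; // assumed
        private static final int THREADS = 4;      // assumed; tune to your hardware
        private static final int BATCH_SIZE = 10;  // ~10 docs/request for web-sized docs

        public static void main(String[] args) throws Exception {
            List<String> docs = loadDocsSomehow(); // hypothetical: one "<doc>...</doc>" per entry
            ExecutorService pool = Executors.newFixedThreadPool(THREADS);
            for (int i = 0; i < docs.size(); i += BATCH_SIZE) {
                final List<String> batch =
                    docs.subList(i, Math.min(i + BATCH_SIZE, docs.size()));
                pool.execute(new Runnable() {
                    public void run() {
                        try {
                            post(batch);
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            // once all batches are in, POST a single "<commit/>" body the same way
        }

        // Wrap a batch of <doc> elements in a single <add> and POST it.
        static void post(List<String> batch) throws Exception {
            StringBuilder xml = new StringBuilder("<add>");
            for (String doc : batch) xml.append(doc);
            xml.append("</add>");
            HttpURLConnection conn =
                (HttpURLConnection) new URL(UPDATE_URL).openConnection();
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
            OutputStream out = conn.getOutputStream();
            out.write(xml.toString().getBytes("UTF-8"));
            out.close();
            conn.getResponseCode();      // read the response so the connection can be reused
            conn.getInputStream().close();
        }

        static List<String> loadDocsSomehow() { return new ArrayList<String>(); } // placeholder
    }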
Re: Phrase Query Performance Question
: It still feels to me that you are trying to do something unique with your
: phrase queries. Unfortunately, you still haven't said what you are trying to
: do in general terms, which makes it very difficult for people to help you.

Agreed. This seems like a very special case, but we don't know what the case is.

If there are specific phrases you know in advance that you will care about, and those phrases occur as frequently as the individual "words", then the best way to deal with them is to index each "phrase" as a single Term (and ignore the individual words).

Speaking more generally to Mike's point...

http://people.apache.org/~hossman/#xyproblem
Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue. Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341

-Hoss
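As one possible reading of the "index each phrase as a single Term" idea, a minimal sketch might rewrite known phrases into single tokens before the text reaches the analyzer, and apply the same rewrite to incoming queries. The phrase list and the underscore-joining convention below are assumptions for illustration, not a built-in Solr mechanism:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class PhraseCollapser {
        // Hypothetical list of phrases known in advance; insertion order is
        // preserved, so put longer phrases first if they share prefixes.
        private static final Map<String, String> PHRASES = new LinkedHashMap<String, String>();
        static {
            PHRASES.put("hurricane katrina", "hurricane_katrina");
            PHRASES.put("auto repair", "auto_repair");
        }

        // Replace each known phrase with its single-token form. Running the
        // same rewrite at query time keeps index and query terms in agreement.
        public static String collapse(String text) {
            String result = text.toLowerCase();
            for (Map.Entry<String, String> e : PHRASES.entrySet()) {
                result = result.replace(e.getKey(), e.getValue());
            }
            return result;
        }

        public static void main(String[] args) {
            System.out.println(collapse("Cheap auto repair after Hurricane Katrina"));
            // prints: cheap auto_repair after hurricane_katrina
        }
    }

With an underscore-joined token in place of the separate words, the phrase query collapses to a single term lookup, which avoids intersecting position lists entirely.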
RE: Phrase Query Performance Question
> Date: Fri, 2 Nov 2007 12:31:29 -0700
> From: [EMAIL PROTECTED]
> To: solr-user@lucene.apache.org
> Subject: Re: Phrase Query Performance Question
>
> : It still feels to me that you are trying to do something unique with your
> : phrase queries. Unfortunately, you still haven't said what you are trying
> : to do in general terms, which makes it very difficult for people to help
> : you.
>
> Agreed. This seems like a very special case, but we don't know what the
> case is.
>
> If there are specific phrases you know in advance that you will care about,
> and those phrases occur as frequently as the individual "words", then the
> best way to deal with them is to index each "phrase" as a single Term (and
> ignore the individual words).
>
> Speaking more generally to Mike's point...
>
> http://people.apache.org/~hossman/#xyproblem
> Your question appears to be an "XY Problem" ... that is: you are dealing
> with "X", you are assuming "Y" will help you, and you are asking about "Y"
> without giving more details about the "X" so that we can understand the
> full issue. Perhaps the best solution doesn't involve "Y" at all?
> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>
> -Hoss

I think the documents I was indexing cannot be considered natural language documents. They are constructed following certain rules and then fed into the indexing process. I guess that because of those rules many of the target search terms have a high document frequency. I am not under any obligation to achieve quarter-second performance; I am just interested to see whether it is achievable.

Thanks everyone for offering advice

-Haishan
Re: Solr production live implementation
Hi Tim (switching to the more appropriate solr-user list),

It's hard to tell, and it depends on things like the integration of search into the rest of the site, the placement of the search field/form, the exposure, etc. The corpus/index does not sound large, but the mention of Windows scares me, as does 2GB of RAM (this won't be enough: your index is likely going to be too big to fit in RAM, causing a lot of disk IO).

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Tim Archambault <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Friday, November 2, 2007 11:50:21 AM
Subject: Solr production live implementation

If this is the wrong email forum, I apologize in advance.

Looking to use Solr as the PRIMARY search engine for our newspaper website. The index will initially hold between 200,000 and 500,000 documents. I'm not sure what analytic data you'd need to help me with my question, but I can tell you our website incurs roughly 4 million page views monthly and about 30,000 absolute unique visitors per month. Our website traffic is concentrated between 8am and 12 noon, so we have a lot of off-peak time on our server.

I am currently trying Solr out on my dedicated Windows server (IIS 5) with Jetty. My server has 2GB RAM and tons of space. What is the likelihood that this environment is "good enough" for my production environment?

Any feedback is greatly appreciated.

Tim