indexing excel file
Hi, I want to index an Excel file and I get the following error: http://dev.torrez.us/public/2006/pundit/java/src/plugin/parse-msexcel/sample/test.xls: failed(2,0): Can't be handled as Microsoft document. java.lang.ArrayIndexOutOfBoundsException: No cell at position col1, row 0. I have already added msexcel to plugin.includes: plugin.includes = protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword|mspowerpoint|msexcel)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic) I don't know where the problem is. Help, please.
Re: Solr PHP client
You can use curl (www.php.net/curl) to interface with Solr, it's a piece of cake! -Nick On 11/20/07, SDIS M. Beauchamp <[EMAIL PROTECTED]> wrote: > I use the PHP and PHP-serialized writers to query Solr from PHP. > > It's very easy to use. > > But it's not so easy to update Solr from PHP (that's why my crawlers are not > written in PHP). > > Florent BEAUCHAMP > > -Original Message- > From: Jonathan Ariel [mailto:[EMAIL PROTECTED] > Sent: Tuesday, November 20, 2007 02:49 > To: solr-user@lucene.apache.org > Subject: Solr PHP client > > Hi! > I'm wondering if someone is using a PHP client for Solr. Actually I'm not > sure if there is one out there. > Would you be interested in having a SolrJ port for PHP? > > Thanks, > > Jonathan Leibiusky > >
Re: Solr on Windows / Linux
On Tue, 20 Nov 2007 10:55:04 +0530 "Eswar K" <[EMAIL PROTECTED]> wrote: > Is there any difference in the way any of Solr's features work on > Windows/Linux. Hi Eswar, I am developing on FreeBSD 6.2 and 7, testing on a VM with Windows 2003 Server, and deploying, for now, on Win32 too. We will very possibly deploy to *nix servers at a later stage (when we start to have dedicated servers for the Solr component). I haven't found any issues across platforms, other than the newline - for some reason, Solr didn't seem to like the newline as read from the system environment under Win32 - defaulting to Unix's \n was enough to fix it. > Ideally it should not, as it's a Java implementation. I was > looking at CollectionDistribution and its documentation ( > http://wiki.apache.org/solr/CollectionDistribution). It appeared that it > uses rsync, which is specific to Unix-like systems. Use Cygwin for this. I haven't used it for Solr, but I've used it extensively elsewhere when I have to endure Win32. cheers, B _ {Beto|Norberto|Numard} Meijome Exhilaration is that feeling you get just after a great idea hits you, and just before you realize what is wrong with it. I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
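The thread doesn't show the actual fix; a minimal sketch of the "\n" workaround, assuming it amounted to overriding the JVM's line.separator property (which defaults to "\r\n" on Win32) before anything reads it:

public class NewlineFix {
    public static void main(String[] args) {
        // Assumption: the Win32 trouble came from code picking up the
        // platform line separator; pin it to Unix-style "\n" up front,
        // before the servlet container / Solr starts reading it.
        System.setProperty("line.separator", "\n");
        System.out.println("line.separator is now "
                + System.getProperty("line.separator").replace("\n", "\\n"));
    }
}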
Re: Performance of Solr on different Platforms
Eswar, This link would give you a fair idea of how Solr is used by some of the sites/companies - http://wiki.apache.org/solr/SolrPerformanceData Rishabh On Nov 20, 2007 10:49 AM, Eswar K <[EMAIL PROTECTED]> wrote: > In our case, the load is kind of distributed. On average, the QPS would > be much less than that. 1000 qps is the peak load we could ever expect to > reach. However, the number of documents is going to be in the range of 2 - > 20 million. > > We would possibly distribute the indexes to different Solr instances and > possibly direct traffic accordingly to reduce the QPS. > > - Eswar > > On Nov 20, 2007 10:42 AM, Walter Underwood <[EMAIL PROTECTED]> wrote: > > > 1000 qps is a lot of load, at least 30M queries/day. > > > > We are running dual CPU Power P5 machines and getting about 80 qps > > with worst case response times of 5 seconds. 90% of responses are > > under 70 msec. > > > > Our expected peak load is 300 qps on our back-end Solr farm. > > We execute multiple back-end queries for each query page. > > > > With N+1 sizing (full throughput with one server down), we > > have five servers to do that. We have a separate server > > for indexing and use the Solr distribution scripts. > > > > We have a relatively small index, about 250K docs. > > > > wunder > > > > > > On 11/19/07 8:48 PM, "Eswar K" <[EMAIL PROTECTED]> wrote: > > > > > It's not going to hit 1000 all the time, it's the expected peak value. > > > > > > I guess for distributing the load we should be using collections, and I was > > > looking at the collections documentation ( > > > http://wiki.apache.org/solr/CollectionDistribution) . > > > > > > - Eswar > > > On Nov 20, 2007 12:07 AM, Matthew Runo <[EMAIL PROTECTED]> wrote: > > > > > >> I'd think that any platform that can run Java would be fine to run > > >> SOLR on. Maybe this is more a question of preferred platforms for Java > > >> deployments? That is quite the load for SOLR though; you may find that > > >> you want more than one server. > > >> > > >> Do you mean that you're expecting about 1000 QPS over an index with up > > >> to 20 million documents? > > >> > > >> --Matthew > > >> > > >> On Nov 19, 2007, at 6:00 AM, Eswar K wrote: > > >> > > >>> All, > > >>> > > >>> Can you give some information on this, or at least let me know where I > > >>> can find this information if it's already listed out anywhere. > > >>> > > >>> Regards, > > >>> Eswar > > >>> > > >>> On Nov 18, 2007 9:45 PM, Eswar K <[EMAIL PROTECTED]> wrote: > > >>> > > Hi, > > > > I understand that Solr can be used on different Linux flavors. Is there > > any preferred flavor (like Red Hat, Ubuntu, etc.)? > > Also what kind of hardware configuration (processors, RAM, etc.) > > would be best suited for the install? > > We expect to load it with millions of documents (varying from 2 - 20 > > million). There might be around 1000 concurrent users. > > > > Your help in this regard will be appreciated. > > > > Regards, > > Eswar > > > > > > >> > > >> > > > > >
Re: Invalid value 'explicit' for echoParams parameter
-Original e-mail message- From: Ryan McKinley [EMAIL PROTECTED] Date: Tue, 20 Nov 2007 07:16:53 +0200 To: solr-user@lucene.apache.org Subject: Re: Invalid value 'explicit' for echoParams parameter > AHMET ARSLAN wrote: > > I am a newbie at Solr. I have done everything in the Solr tutorial section. > > I am using the latest versions of both the JDK (1.6.0_03) and Solr (2.2). I can see > > the Solr admin page http://localhost:8983/solr/admin/ But when I hit the > > search button I receive an HTTP error: > > > > HTTP ERROR: 400 > > > > Invalid value 'explicit' for echoParams parameter, use 'EXPLICIT' or 'ALL' > > RequestURI=/solr/select/ > > > > I also tried to run Solr under Tomcat but again I was unsuccessful. > > > > Any solutions or document links will be appreciated. > > > > Thanks for your help... > > > > what is the URL when you get this error? > > Have you edited the solrconfig.xml? What happens if you put: > &echoParams=explicit in the query? > > The URL is http://localhost:8983/solr/select/?q=solr&version=2.2&start=0&rows=10&indent=on When I added &echoParams=explicit to the query, nothing changed. But when I found and replaced the word 'explicit' with uppercase 'EXPLICIT' in solrconfig.xml, it worked. The problem is solved. Thanks for your help.
Re: Invalid value 'explicit' for echoParams parameter
The URL is http://localhost:8983/solr/select/?q=solr&version=2.2&start=0&rows=10&indent=on When I added &echoParams=explicit to the query, nothing changed. But when I found and replaced the word 'explicit' with uppercase 'EXPLICIT' in solrconfig.xml, it worked. The problem is solved. Thanks for your help. hmmm. what version are you using? I'm confident that /trunk accepts any case: v = v.toUpperCase(); if( v.equals( "EXPLICIT" ) ) { return EXPLICIT; } ryan
Re: rows=VERY_LARGE_VALUE throws exception, and error in some cases
I recently fixed this in the trunk. -Yonik On Nov 20, 2007 10:31 AM, Rishabh Joshi <[EMAIL PROTECTED]> wrote: > Hi, > > We are using Solr 1.2 for our project and have come across the following > exception and error: > > Exception: > SEVERE: java.lang.OutOfMemoryError: Java heap space > at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:36) > > Steps to reproduce: > 1. Restart your Web Server. > 2. Enter a query with VERY_LARGE_VALUE for the "rows" field. For example: > http://xx.xx.xx.xx:8080/solr/select?q=unix&%20start=0&fl=id&indent=off&rows=99999999 > 3. Press enter or click on the 'Go' button in the browser. > > NOTE: > 1. This exception is thrown if '9999999' (seven digits) < > VERY_LARGE_VALUE < '999999999' (nine digits). > 2. The exception DOES NOT APPEAR AGAIN if we change the VERY_LARGE_VALUE to > <= '9999999', execute the query and then change the VERY_LARGE_VALUE back > to its original value and execute the query again. > 3. If the VERY_LARGE_VALUE >= '9999999999' (ten digits) we get the following > error: > > Error: > HTTP Status 400 - For input string: "9999999999" > > Has anyone come across this scenario before? > > Regards, > Rishabh >
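The actual trunk fix isn't quoted in the thread. For background: the hit collector pre-allocates a priority queue sized from the requested rows, so an eight-digit rows value asks for an enormous array before a single document is scored (hence the immediate OutOfMemoryError), while a ten-digit value no longer fits in an int (hence the 400 on parsing). A hedged sketch of the kind of guard that prevents the allocation - names are illustrative, not the real patch:

public class RowsGuard {
    // Never size the top-hits queue larger than the index itself.
    static int safeQueueSize(int start, int rows, int maxDoc) {
        // start + rows can overflow int for huge rows, so widen to long first
        long wanted = (long) start + (long) rows;
        return (int) Math.min(wanted, (long) maxDoc);
    }

    public static void main(String[] args) {
        // e.g. rows=99999999 against a 1M-doc index allocates 1M slots, not 99M
        System.out.println(safeQueueSize(0, 99999999, 1000000)); // prints 1000000
    }
}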
Re: Invalid value 'explicit' for echoParams parameter
: I'm confident that /trunk accepts any case: : : v = v.toUpperCase(); that's in Solr 1.2 as well ... hmmm. Ahmet: what is the default Locale of your JVM? String.toUpperCase() does use the default Locale ... i guess maybe we should start being more strict about using "compareToIgnoreCase" (or use toUpperCase(Locale.ENGLISH)) in cases like this where we want to test input strings against expected constants. -Hoss
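That Locale hunch matches a classic pitfall: under a Turkish default locale, "explicit".toUpperCase() produces "EXPLİCİT" (dotted capital I, U+0130), which fails equals("EXPLICIT"). A small demonstration of the two locale-safe alternatives mentioned above (equalsIgnoreCase compares per character without consulting the default Locale):

import java.util.Locale;

public class EchoParamsCheck {
    static final String EXPLICIT = "EXPLICIT";

    public static void main(String[] args) {
        String v = "explicit";
        // Locale-safe option 1: case-insensitive equality
        System.out.println(EXPLICIT.equalsIgnoreCase(v));                   // true everywhere
        // Locale-safe option 2: pin the Locale used for the conversion
        System.out.println(v.toUpperCase(Locale.ENGLISH).equals(EXPLICIT)); // true everywhere
        // The fragile path: default-Locale conversion
        System.out.println(v.toUpperCase().equals(EXPLICIT));               // false under tr_TR
    }
}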
Solr cluster topology.
Hi All! I just started reading about Solr a couple of days ago (not full time, of course) and it looks like a pretty impressive set of technologies... I still have a few questions I have not clearly found answers to: Q: On a cluster, as I understand it, one and only one machine is a master, and N servers could be slaves... Do the clients all talk to the master for indexing and to a load balancer for searching? Is one particular machine configured to know it is the master? Or is it only the settings for replicating the index that matter? Or does one post reindex petitions to any of the slaves and they will forward them to the master? How can we have failover of the master? Is it a correct assumption that slaves could always be a bit out of sync with the master? A matter of minutes perhaps... Thanks in advance for your responses!
Re: Pagination with Solr
: What I'm trying is to parse the response for "numFound:" : and if this number is greater than the "rows" parameter, I send another : search request to Solr with a new "start" parameter. Is there a better : way to do this? Specifically, is there another way to obtain the : "numFound" rather than parsing the response stream/string? i really don't understand your question ... how do you get any useful information from Solr unless you parse the responses to your requests? -Hoss
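If the worry is fragile string parsing, a client API can hand back numFound directly. A minimal sketch with the SolrJ API of the era (class and method names as found in the 1.3-dev nightlies - treat the exact names as an assumption):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PaginationSketch {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("solr");
        q.setStart(0);   // offset of the first hit to return
        q.setRows(10);   // page size
        QueryResponse rsp = server.query(q);
        // total matches, independent of the page actually returned
        long numFound = rsp.getResults().getNumFound();
        System.out.println("numFound=" + numFound);
    }
}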
rows=VERY_LARGE_VALUE throws exception, and error in some cases
Hi, We are using Solr 1.2 for our project and have come across the following exception and error: Exception: SEVERE: java.lang.OutOfMemoryError: Java heap space at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:36) Steps to reproduce: 1. Restart your Web Server. 2. Enter a query with VERY_LARGE_VALUE for the "rows" field. For example: http://xx.xx.xx.xx:8080/solr/select?q=unix&%20start=0&fl=id&indent=off&rows=99999999 3. Press enter or click on the 'Go' button in the browser. NOTE: 1. This exception is thrown if '9999999' (seven digits) < VERY_LARGE_VALUE < '999999999' (nine digits). 2. The exception DOES NOT APPEAR AGAIN if we change the VERY_LARGE_VALUE to <= '9999999', execute the query and then change the VERY_LARGE_VALUE back to its original value and execute the query again. 3. If the VERY_LARGE_VALUE >= '9999999999' (ten digits) we get the following error: Error: HTTP Status 400 - For input string: "9999999999" Has anyone come across this scenario before? Regards, Rishabh
SolrJ "commit" problem
Hi, I've got a problem with SolrJ from the nightly build (from 2007-11-12). I have this code: solrClient = new CommonsHttpSolrServer(new URL(indexServerUrl)); and after the "add" operation I fire solrClient.commit(true, true); But the commit operation is not processed in Solr, as I can see in the log files (though I can see in debug mode that status 200 is returned after executing getHttpConnection().executeMethod(method); in the SolrJ client class file). A command from the console actually does the trick: [EMAIL PROTECTED] ~]$ curl http://traut-base:/-solr-network/update -H "Content-Type: text/xml" --data-binary '<commit/>' I must say that I'm trying to use the SolrJ client from the nightly build with the Solr server release 1.2. Most likely that is actually the root of the problem, so: can I use Solr release 1.2 with the nightly-build SolrJ client? Are there any problems? What can you say about my "commit" problem? Thank you in advance -- Best regards, Traut
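For reference, a minimal end-to-end sketch of the add-then-commit sequence with the nightly SolrJ client. Mixing a nightly client with a 1.2 server is exactly the version mismatch suspected above, so this assumes client and server come from the same build:

import java.net.URL;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CommitSketch {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solrClient =
                new CommonsHttpSolrServer(new URL("http://localhost:8983/solr"));
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");   // assumes "id" is the schema's uniqueKey
        solrClient.add(doc);
        // waitFlush=true, waitSearcher=true: block until the new searcher is live
        solrClient.commit(true, true);
    }
}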
Re: Weird memory error.
Can you recommend one? I am not familiar with how to profile under Java. Yonik Seeley schrieb: Can you try a profiler to see where the memory is being used? -Yonik On Nov 20, 2007 11:16 AM, Brian Carmalt <[EMAIL PROTECTED]> wrote: Hello all, I started looking into the scalability of Solr, and have started getting weird results. I am getting the following error: Exception in thread "btpool0-3" java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:574) at org.mortbay.thread.BoundedThreadPool.newThread(BoundedThreadPool.java:377) at org.mortbay.thread.BoundedThreadPool.dispatch(BoundedThreadPool.java:94) at org.mortbay.jetty.bio.SocketConnector$Connection.dispatch(SocketConnector.java:187) at org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:101) at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:516) at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442) This only occurs when I send docs to the server in batches of around 10 as separate processes. If I send them serially, the heap grows up to 1200M with no errors. When I observe the VM during its operation, it doesn't seem to run out of memory. The VM starts with 1024M and can allocate up to 1800M. I start getting the error listed above when the memory usage is right around 1 G. I have been using the Jconsole program on Windows to observe the Jetty server by using the com.sun.management.jmxremote* functions on the server side. The number of threads is always around 30, and Jetty can create up to 250, so I don't think that's the problem. I can't really imagine that the monitoring process is using the other 800M of the allowable heap memory, but it could be. But the problem occurs without monitoring, even when the VM heap is set to 1500M. Does anyone have an idea as to why this error is occurring? Thanks, Brian
Re: Weird memory error.
Can you try a profiler to see where the memory is being used? -Yonik On Nov 20, 2007 11:16 AM, Brian Carmalt <[EMAIL PROTECTED]> wrote: > Hello all, > > I started looking into the scalability of Solr, and have started getting > weird results. > I am getting the following error: > > Exception in thread "btpool0-3" java.lang.OutOfMemoryError: unable to > create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:574) > at > org.mortbay.thread.BoundedThreadPool.newThread(BoundedThreadPool.java:377) > at > org.mortbay.thread.BoundedThreadPool.dispatch(BoundedThreadPool.java:94) > at > org.mortbay.jetty.bio.SocketConnector$Connection.dispatch(SocketConnector.java:187) > at > org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:101) > at > org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:516) > at > org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442) > > This only occurs when I send docs to the server in batches of around 10 > as separate processes. > If I send them serially, the heap grows up to 1200M with no errors. > > When I observe the VM during its operation, it doesn't seem to run out > of memory. The VM starts with 1024M and can allocate up to 1800M. I start getting the error > listed above when the memory usage is right around 1 G. I have been using the Jconsole program on > Windows to observe the Jetty server by using the com.sun.management.jmxremote* functions on the > server side. The number of threads is always around 30, and Jetty can create up to 250, so I don't think > that's the problem. I can't really imagine that the monitoring process is using the other 800M of the allowable heap > memory, but it could be. > But the problem occurs without monitoring, even when the VM heap is set > to 1500M. > > Does anyone have an idea as to why this error is occurring? > > Thanks, > Brian >
Weird memory error.
Hello all, I started looking into the scalability of Solr, and have started getting weird results. I am getting the following error: Exception in thread "btpool0-3" java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:574) at org.mortbay.thread.BoundedThreadPool.newThread(BoundedThreadPool.java:377) at org.mortbay.thread.BoundedThreadPool.dispatch(BoundedThreadPool.java:94) at org.mortbay.jetty.bio.SocketConnector$Connection.dispatch(SocketConnector.java:187) at org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:101) at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:516) at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442) This only occurs when I send docs to the server in batches of around 10 as separate processes. If I send them serially, the heap grows up to 1200M with no errors. When I observe the VM during its operation, it doesn't seem to run out of memory. The VM starts with 1024M and can allocate up to 1800M. I start getting the error listed above when the memory usage is right around 1 G. I have been using the Jconsole program on Windows to observe the Jetty server by using the com.sun.management.jmxremote* functions on the server side. The number of threads is always around 30, and Jetty can create up to 250, so I don't think that's the problem. I can't really imagine that the monitoring process is using the other 800M of the allowable heap memory, but it could be. But the problem occurs without monitoring, even when the VM heap is set to 1500M. Does anyone have an idea as to why this error is occurring? Thanks, Brian
Re: Weird memory error.
I'm using the Eclipse TPTP platform and I'm very happy with it. You will also find good howto or tutorial pages on the web. - simon On Nov 20, 2007 5:29 PM, Brian Carmalt <[EMAIL PROTECTED]> wrote: > Can you recommend one? I am not familiar with how to profile under Java. > > Yonik Seeley schrieb: > > Can you try a profiler to see where the memory is being used? > > -Yonik > > > > On Nov 20, 2007 11:16 AM, Brian Carmalt <[EMAIL PROTECTED]> wrote: > > > >> Hello all, > >> > >> I started looking into the scalability of Solr, and have started getting > >> weird results. > >> I am getting the following error: > >> > >> Exception in thread "btpool0-3" java.lang.OutOfMemoryError: unable to > >> create new native thread > >> at java.lang.Thread.start0(Native Method) > >> at java.lang.Thread.start(Thread.java:574) > >> at > >> org.mortbay.thread.BoundedThreadPool.newThread(BoundedThreadPool.java:377) > >> at > >> org.mortbay.thread.BoundedThreadPool.dispatch(BoundedThreadPool.java:94) > >> at > >> org.mortbay.jetty.bio.SocketConnector$Connection.dispatch(SocketConnector.java:187) > >> at > >> org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:101) > >> at > >> org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:516) > >> at > >> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442) > >> > >> This only occurs when I send docs to the server in batches of around 10 > >> as separate processes. > >> If I send them serially, the heap grows up to 1200M with no errors. > >> > >> When I observe the VM during its operation, it doesn't seem to run out > >> of memory. The VM starts > >> with 1024M and can allocate up to 1800M. I start getting the error > >> listed above when the memory > >> usage is right around 1 G. I have been using the Jconsole program on > >> Windows to observe the > >> Jetty server by using the com.sun.management.jmxremote* functions on the > >> server side. The number of threads > >> is always around 30, and Jetty can create up to 250, so I don't think > >> that's the problem. I can't really imagine that > >> the monitoring process is using the other 800M of the allowable heap > >> memory, but it could be. > >> But the problem occurs without monitoring, even when the VM heap is set > >> to 1500M. > >> > >> Does anyone have an idea as to why this error is occurring? > >> > >> Thanks, > >> Brian > >> > >> > > > > > >
Re: Weird memory error.
On Nov 20, 2007 11:29 AM, Brian Carmalt <[EMAIL PROTECTED]> wrote: > Can you recommend one? I am not familiar with how to profile under Java. Netbeans has one for free: http://www.netbeans.org/products/profiler/ -Yonik
OR-ing together filter queries
Hello all, I am writing my own handler, and I would like to pre-filter the results based on a field. I'm calling searcher.getDocList() with a custom-constructed query and filters list, but the filters always seem to be ANDed together. My question is this: how can I construct the List of filters to make them OR together (documents are included in the results if they match *any* of my filters)? For reference, here's how I'm constructing my filters: List filters = new LinkedList(); . . . while (fieldIter.hasNext()) { String filterStr = fieldIter.next(); filters.add(new TermQuery(new Term(accessField, filterStr))); // accessField is known ahead of time } . . . results.docList = s.getDocList(finalQuery, filters.size() != 0 ? filters : null, Sort.RELEVANCE, start, indexSize, SolrIndexSearcher.GET_SCORES); Thanks for any help. Anthony PS: you might notice that I'm asking for ALL of the results in that search. Never fear - I do a lot of post-processing myself, and return a sane (~1000) number of results in JSON.
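One way to get OR semantics with getDocList() as-is: collapse the per-term filters into a single BooleanQuery of SHOULD clauses and pass that one query as the only entry in the filter list. Solr intersects the entries of the list, but clauses inside a single filter can union. A sketch against the Lucene 2.x API:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class AnyOfFilter {
    // Build ONE filter query matching documents that contain ANY of the values.
    public static Query build(String accessField, Iterable<String> values) {
        BooleanQuery any = new BooleanQuery();
        for (String v : values) {
            // SHOULD clauses union: a doc passes if it matches at least one term
            any.add(new TermQuery(new Term(accessField, v)), BooleanClause.Occur.SHOULD);
        }
        return any;
    }
}

Then pass Collections.singletonList(AnyOfFilter.build(accessField, values)) as the filters argument instead of a list of separate TermQuerys.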
BooleanQuery exception
I am trying to run a very simple query via the Admin interface and receive the exception below. The query is: description_t:guard AND title_t:help I am using dynamic fields (hence the underscored suffix). Any ideas? Thanks in advance /cody Nov 19, 2007 3:01:31 PM org.apache.solr.core.SolrException log SEVERE: java.lang.NoSuchMethodError: org.apache.lucene.search.BooleanQuery.clauses()Ljava/util/List; at org.apache.solr.search.QueryUtils.isNegative(QueryUtils.java:38) at org.apache.solr.search.QueryUtils.makeQueryable(QueryUtils.java:92) at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:827) at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:805) at org.apache.solr.search.SolrIndexSearcher.getDocList(SolrIndexSearcher.java:698) at org.apache.solr.request.StandardRequestHandler.handleRequestBody(StandardRequestHandler.java:122) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77) at org.apache.solr.core.SolrCore.execute(SolrCore.java:658) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874) at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665) at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528) at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81) at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689) at java.lang.Thread.run(Thread.java:619)
Re: Payloads, Tokenizers, and Filters. Oh My!
: I apologize for cross-posting but I believe both Solr and Lucene users and : developers should be concerned with this. I am not aware of a better way to : reach both communities. some of these questions strike me as being largely unrelated. if anyone wishes to follow up on them further, let's do it in (new) separate threads for each topic, on the specific list appropriate to the topic... :* Do TokenFilters belong in the Solr code base at all? Yes, in so much as any java code belongs in the Solr code base (or the nutch code base for that matter). They are separate projects with separate communities and separate needs -- that doesn't mean that there isn't code in Solr which could be useful to the broader community of lucene-java; in that case the appropriate course of action is to open a LUCENE issue to "promote" the code up into lucene-java, and a dependent issue in SOLR to deprecate the current code and use the newer code instead. as some people may be aware, there was a discussion about this sort of thing at ApacheCon during the Lucene BOF -- some reasons this doesn't happen as often as it seems like it should are: * the code may have subtle dependency tendrils that make it hard to refactor from one code base to the other. * the tests are frequently harder to "promote" than the code (in the case of most Solr tests that use the TestHarness, it's probably easier to write new tests from scratch) * when promoting the code, it's the best time to consider whether the existing API is really the "best" API before a lot of new people start using it (compare Solr's FunctionQuery and Lucene's CustomScoreQuery for example) * someone needs to care enough to follow through on the promotion. ...further discussion is best suited for java-dev since the topic is not Solr specific (there's a lot of Nutch code out there that people have asked about promoting as well) :* How to deal with TokenFilters that add new Tokens to the stream? This is specifically regarding Payloads, yes? also a pretty clear-cut java-dev discussion (and one possibly already being discussed in the monolithic Payload API thread i haven't started reading yet). lucene-java sets the API and the semantics ... Solr code will follow them. :* How to patch TokenFilters and Tokenizers using the model of : LUCENE-969 in the Solr code base and in Lucene contrib? open SOLR issues containing patches for any Solr code that needs changing, and LUCENE issues containing patches for contrib code that needs changing. : I thought it might be useful to figure out which existing TokenFilters need to : know about Payloads. To this end I have taken an inventory of the : TokenFilters out there. I think it is fair to categorize them by Add (A), : Delete (D), Modify (M), Observe (O): again: this is a straightforward lucene-java question ... once the semantics have been worked out, then there can be a Solr-specific discussion about following them. (which is not to say that the Solr classes/use-cases shouldn't be considered in the discussion, just that java-dev is the right place to have the conversation) -Hoss
Re: Solr cluster topology.
Yes. The clients will always be a minute or two behind the master. I like the way some people are doing it - make them all masters! Just post your updates to each of them - you lose a bit of performance perhaps, but it doesn't matter if a server bombs out or you have to upgrade them, since they're all exactly the same. --Matthew On Nov 20, 2007, at 7:43 AM, Alexander Wallace wrote: Hi All! I just started reading about Solr a couple of days ago (not full time, of course) and it looks like a pretty impressive set of technologies... I still have a few questions I have not clearly found answers to: Q: On a cluster, as I understand it, one and only one machine is a master, and N servers could be slaves... Do the clients all talk to the master for indexing and to a load balancer for searching? Is one particular machine configured to know it is the master? Or is it only the settings for replicating the index that matter? Or does one post reindex petitions to any of the slaves and they will forward them to the master? How can we have failover of the master? Is it a correct assumption that slaves could always be a bit out of sync with the master? A matter of minutes perhaps... Thanks in advance for your responses!
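A hedged sketch of the "all masters" write path - the client simply replays the same update POST to every node in a list (URLs and the document are placeholders):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class MultiMasterPost {
    // Placeholder node list; each node carries a full, independent index.
    static final String[] SERVERS = {
        "http://solr1:8983/solr/update",
        "http://solr2:8983/solr/update",
    };

    public static void main(String[] args) throws Exception {
        String xml = "<add><doc><field name=\"id\">doc-1</field></doc></add>";
        for (String server : SERVERS) {
            HttpURLConnection con =
                    (HttpURLConnection) new URL(server).openConnection();
            con.setRequestMethod("POST");
            con.setRequestProperty("Content-Type", "text/xml");
            con.setDoOutput(true);
            OutputStream out = con.getOutputStream();
            out.write(xml.getBytes("UTF-8"));
            out.close();
            // A real client would queue and retry a node that is down;
            // here we just report the status per node.
            System.out.println(server + " -> HTTP " + con.getResponseCode());
        }
    }
}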
Re: Weird memory error.
On 20-Nov-07, at 8:16 AM, Brian Carmalt wrote: Hello all, I started looking into the scalability of Solr, and have started getting weird results. I am getting the following error: Exception in thread "btpool0-3" java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:574) at org.mortbay.thread.BoundedThreadPool.newThread(BoundedThreadPool.java:377) at org.mortbay.thread.BoundedThreadPool.dispatch(BoundedThreadPool.java:94) at org.mortbay.jetty.bio.SocketConnector$Connection.dispatch(SocketConnector.java:187) at org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:101) at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:516) at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442) This only occurs when I send docs to the server in batches of around 10 as separate processes. If I send them serially, the heap grows up to 1200M with no errors. Could be running out of stack space (which is used by other things as well as threads). But it's hard to imagine that happening at 30 threads. -Mike
RE: Solr cluster topology.
http://wiki.apache.org/solr/CollectionDistribution http://wiki.apache.org/solr/SolrCollectionDistributionScripts http://wiki.apache.org/solr/SolrCollectionDistributionStatusStats http://wiki.apache.org/solr/SolrOperationsTools http://wiki.apache.org/solr/SolrCollectionDistributionOperationsOutline http://wiki.apache.org/solr/CollectionRebuilding http://wiki.apache.org/solr/SolrAdminGUI -Original Message- From: Matthew Runo [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 20, 2007 10:54 AM To: solr-user@lucene.apache.org Subject: Re: Solr cluster topology. Yes. The clients will always be a minute or two behind the master. I like the way some people are doing it - make them all masters! Just post your updates to each of them - you lose a bit of performance perhaps, but it doesn't matter if a server bombs out or you have to upgrade them, since they're all exactly the same. --Matthew On Nov 20, 2007, at 7:43 AM, Alexander Wallace wrote: > Hi All! > > I just started reading about Solr a couple of days ago (not full time > of course) and it looks like a pretty impressive set of > technologies... I still have a few questions I have not clearly found answers to: > > Q: On a cluster, as I understand it, one and only one machine is a > master, and N servers could be slaves... Do the clients all > talk to the master for indexing and to a load balancer for > searching? Is one particular machine configured to know it is the > master? Or is it only the settings for replicating the index that > matter? Or does one post reindex petitions to any of the slaves > and they will forward them to the master? > > How can we have failover of the master? > > Is it a correct assumption that slaves could always be a bit out of > sync with the master? A matter of minutes perhaps... > > Thanks in advance for your responses! > >
RE: Weird memory error.
AppPerfect has a free-for-noncommercial-use version of their tools. I've used them before and was very impressed. http://www.appperfect.com/products/devtest.html#versions -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Tuesday, November 20, 2007 9:12 AM To: solr-user@lucene.apache.org Subject: Re: Weird memory error. On Nov 20, 2007 11:29 AM, Brian Carmalt <[EMAIL PROTECTED]> wrote: > Can you recommend one? I am not familar with how to profile under Java. Netbeans has one for free: http://www.netbeans.org/products/profiler/ -Yonik
Re: Solr cluster topology.
Thanks for the response! Interesting, this ALL MASTERS mode... I guess you don't do any replication then... In the single master, several slaves mode, I'm assuming the client still writes to one and reads from the others... right? On Nov 20, 2007, at 12:54 PM, Matthew Runo wrote: Yes. The clients will always be a minute or two behind the master. I like the way some people are doing it - make them all masters! Just post your updates to each of them - you lose a bit of performance perhaps, but it doesn't matter if a server bombs out or you have to upgrade them, since they're all exactly the same. --Matthew On Nov 20, 2007, at 7:43 AM, Alexander Wallace wrote: Hi All! I just started reading about Solr a couple of days ago (not full time, of course) and it looks like a pretty impressive set of technologies... I still have a few questions I have not clearly found answers to: Q: On a cluster, as I understand it, one and only one machine is a master, and N servers could be slaves... Do the clients all talk to the master for indexing and to a load balancer for searching? Is one particular machine configured to know it is the master? Or is it only the settings for replicating the index that matter? Or does one post reindex petitions to any of the slaves and they will forward them to the master? How can we have failover of the master? Is it a correct assumption that slaves could always be a bit out of sync with the master? A matter of minutes perhaps... Thanks in advance for your responses!
facet - associated fields
Hi, Can anyone help me with how to facet and/or search on associated fields? [The sample document's XML markup was stripped in archiving; its field values were: 1234; Baseball hall of Fame opens Jackie Robinson exhibit; Description about the new JR hall of fame exhibit; 20071114; 200711; 0; press; Sports / Baseball / Major League Baseball; Arts and Culture / Culture / Heritage Sites.] Thanks, Jae
Help with Debian solr/jetty install?
Hi, I've successfully run as far as the example admin page on Debian linux 2.6. So I installed the solr-jetty packaged for Debian testing which gives me Jetty 5.1.14-1 and Solr 1.2.0+ds1-1. Jetty starts fine and so does the Solr home page at http://localhost:8280/solr But I get an error when I try to run http://localhost:8280/solr/admin HTTP ERROR: 500 No Java compiler available I have sun-java6-jre and sun-java6-jdk packages installed. I'm new to servlet containers and java webapps. What should I be looking for to fix this or what information could I provide the list to get me moving forward from here? I've included the trace from the Jetty log, and the java properties dump from the example below. Thanks, Phil --- Java properties (from the example): -- sun.boot.library.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386 java.vm.version = 1.6.0-b105 java.vm.name = Java HotSpot(TM) Client VM user.dir = /tmp/apache-solr-1.2.0/example java.runtime.version = 1.6.0-b105 os.arch = i386 java.io.tmpdir = /tmp java.library.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386/client:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib java.class.version = 50.0 jetty.home = /tmp/apache-solr-1.2.0/example sun.management.compiler = HotSpot Client Compiler os.version = 2.6.22-2-686 java.class.path = /tmp/apache-solr-1.2.0/example:/tmp/apache-solr-1.2.0/example/lib/jetty-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jetty-util-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/servlet-api-2.5-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/ant-1.6.5.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/core-3.1.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-2.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-api-2.1.jar:/usr/share/ant/lib/ant.jar java.home = /usr/lib/jvm/java-6-sun-1.6.0.00/jre java.version = 1.6.0 java.ext.dirs = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/ext:/usr/java/packages/lib/ext sun.boot.class.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/resources.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/rt.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jsse.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jce.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/charsets.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/classes Jetty log (from the error under Debian Solr/Jetty): org.apache.jasper.JasperException: No Java compiler available at org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:460) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:367) at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329) at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265) at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428) at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473) at org.mortbay.jetty.servlet.Dispatcher.dispatch(Dispatcher.java:286) at org.mortbay.jetty.servlet.Dispatcher.forward(Dispatcher.java:171) at org.mortbay.jetty.servlet.Default.handleGet(Default.java:302) at org.mortbay.jetty.servlet.Default.service(Default.java:223) at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428) at 
org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:830) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:185) at org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:821) at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:471) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:568) at org.mortbay.http.HttpContext.handle(HttpContext.java:1530) at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:633) at org.mortbay.http.HttpContext.handle(HttpContext.java:1482) at org.mortbay.http.HttpServer.service(HttpServer.java:909) at org.mortbay.http.HttpConnection.service(HttpConnection.java:820) at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:986) at org.mortbay.http.HttpConnection.handle(HttpConnection.java:837) at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:245) at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357) at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534) getRootCause(): java.lang.IllegalStateException: No Java compiler available at org.apache.
Re: Solr cluster topology.
On Tue, 20 Nov 2007 16:26:27 -0600 Alexander Wallace <[EMAIL PROTECTED]> wrote: > Interesting, this ALL MASTERS mode... I guess you don't do any > replication then... correct > In the single master, several slaves mode, I'm assuming the client > still writes to one and reads from the others... right? Correct again. There is also another approach which I think in SOLR is called FederatedSearch , where a front end queries a number of index servers (each with overlapping or non-overlapping data sets) and puts together 1 result stream for the answer. There was some discussion on the list, http://www.mail-archive.com/solr-user@lucene.apache.org/msg06081.html is the earliest link in the archive i can find . B _ {Beto|Norberto|Numard} Meijome "People demand freedom of speech to make up for the freedom of thought which they avoid. " Soren Aabye Kierkegaard I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: facet - associated fields
On Tue, 20 Nov 2007 17:39:58 -0500 "Jae Joo" <[EMAIL PROTECTED]> wrote: > Hi, > Can anyone help me how to facet and/or search for associated fields? - http://wiki.apache.org/solr/SimpleFacetParameters _ {Beto|Norberto|Numard} Meijome Fear not the path of truth for the lack of people walking on it. I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: Problems with Basic Install (newbie question)
: As far as I know, I do have a full JDK. I'm on OS X and it should come with : a full JDK: : http://developer.apple.com/java/ well, 1) it depends on which version of "OS X" you are running (10.1? 10.2? 10.3? 10.4? 10.5?) but i don't think that's your problem ... you said you could see the admin screen before (so JSPs were working), then you installed Solr-Drupal, and then you couldn't get the admin screens to work and you were getting this exception. did you manually start jetty when you saw this error? or did drupal? ... either way might be the cause of the problem ... if drupal started jetty, it may be trying to use a "nobody" user, and can't write to jetty's tmp dir for JSPs since you already created it as yourself. you might also be seeing a variant of this issue... https://issues.apache.org/jira/browse/SOLR-118 -Hoss
Finding the right place to start ...
I'm trying to find the right place to start in this community. I recently posted a question in the thread on SOLR-236. In that posting I mentioned that I was hoping to persuade my management to move from a FAST installation to a Solr-based one. The changeover was approved in principle today. Our application is a large Rails application. I integrated Solr and created a proof-of-concept that covered almost all existing functionality and projected new functionality for 2008. So, I have a few requests for information and possibly help. I will need the result collapsing described in SOLR-236 to deploy Solr. It's an absolute requirement. I understand that it's to be available in Solr 1.3. Is there updated information on the timetable for Solr 1.3, and what's to be included? I would also very much like to have SOLR-103 (SQL Upload plugin) available, though I think I have a workaround if it isn't in Solr 1.3. I would be happy to offer help in any way I can - e.g. with testing. If someone can point me to the places I need to look to find information that bears on these questions, I'm happy to go and dig. Thanks for any help. Tracy Flynn
Re: Help with Debian solr/jetty install?
Phillip, I won't go into details, but I'll point out that the Java compiler is called javac and if memory serves me well, it is defined in one of Jetty's XML config files in its etc/ dir. The java compiler is used to compile JSPs that Solr uses for the admin UI. So, make sure you have javac and make sure Jetty can find it. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Phillip Farber <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Tuesday, November 20, 2007 5:55:27 PM Subject: Help with Debian solr/jetty install? Hi, I've successfully run as far as the example admin page on Debian linux 2.6. So I installed the solr-jetty packaged for Debian testing which gives me Jetty 5.1.14-1 and Solr 1.2.0+ds1-1. Jetty starts fine and so does the Solr home page at http://localhost:8280/solr But I get an error when I try to run http://localhost:8280/solr/admin HTTP ERROR: 500 No Java compiler available I have sun-java6-jre and sun-java6-jdk packages installed. I'm new to servlet containers and java webapps. What should I be looking for to fix this or what information could I provide the list to get me moving forward from here? I've included the trace from the Jetty log, and the java properties dump from the example below. Thanks, Phil --- Java properties (from the example): -- sun.boot.library.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386 java.vm.version = 1.6.0-b105 java.vm.name = Java HotSpot(TM) Client VM user.dir = /tmp/apache-solr-1.2.0/example java.runtime.version = 1.6.0-b105 os.arch = i386 java.io.tmpdir = /tmp java.library.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386/client:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib java.class.version = 50.0 jetty.home = /tmp/apache-solr-1.2.0/example sun.management.compiler = HotSpot Client Compiler os.version = 2.6.22-2-686 java.class.path = /tmp/apache-solr-1.2.0/example:/tmp/apache-solr-1.2.0/example/lib/jetty-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jetty-util-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/servlet-api-2.5-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/ant-1.6.5.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/core-3.1.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-2.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-api-2.1.jar:/usr/share/ant/lib/ant.jar java.home = /usr/lib/jvm/java-6-sun-1.6.0.00/jre java.version = 1.6.0 java.ext.dirs = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/ext:/usr/java/packages/lib/ext sun.boot.class.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/resources.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/rt.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jsse.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jce.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/charsets.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/classes Jetty log (from the error under Debian Solr/Jetty): org.apache.jasper.JasperException: No Java compiler available at org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:460) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:367) at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329) at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265) at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428) at 
org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473) at org.mortbay.jetty.servlet.Dispatcher.dispatch(Dispatcher.java:286) at org.mortbay.jetty.servlet.Dispatcher.forward(Dispatcher.java:171) at org.mortbay.jetty.servlet.Default.handleGet(Default.java:302) at org.mortbay.jetty.servlet.Default.service(Default.java:223) at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428) at org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:830) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:185) at org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:821) at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:471) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:568) at org.mortbay.http.HttpContext.handle(HttpContext.java:1530) at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:633) at org.mortbay.http.HttpContext.handle(HttpContext.java:1482) at org.mortbay.http.HttpServer.service(HttpServer.java
Re: Any tips for indexing large amounts of data?
Mike is right about the occasional slow-down, which appears as a pause and is due to large Lucene index segment merging. This should go away with newer versions of Lucene where this is happening in the background. That said, we just indexed about 20MM documents on a single 8-core machine with 8 GB of RAM, resulting in nearly 20 GB index. The whole process took a little less than 10 hours - that's over 550 docs/second. The vanilla approach before some of our changes apparently required several days to index the same amount of data. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Mike Klaas <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Monday, November 19, 2007 5:50:19 PM Subject: Re: Any tips for indexing large amounts of data? There should be some slowdown in larger indices as occasionally large segment merge operations must occur. However, this shouldn't really affect overall speed too much. You haven't really given us enough data to tell you anything useful. I would recommend trying to do the indexing via a webapp to eliminate all your code as a possible factor. Then, look for signs to what is happening when indexing slows. For instance, is Solr high in cpu, is the computer thrashing, etc? -Mike On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote: > Hi, > > Thanks for answering this question a while back. I have made some > of the suggestions you mentioned. ie not committing until I've > finished indexing. What I am seeing though, is as the index get > larger (around 1Gb), indexing is taking a lot longer. In fact it > slows down to a crawl. Have you got any pointers as to what I might > be doing wrong? > > Also, I was looking at using MultiCore solr. Could this help in > some way? > > Thank you > Brendan > > On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote: > >> >> : I would think you would see better performance by allowing auto >> commit >> : to handle the commit size instead of reopening the connection >> all the >> : time. >> >> if your goal is "fast" indexing, don't use autoCommit at all ... just >> index everything, and don't commit until you are completely done. >> >> autoCommitting will slow your indexing down (the benefit being >> that more >> results will be visible to searchers as you proceed) >> >> >> >> >> -Hoss >> >
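Combining the advice in this thread into one sketch - add in batches, never commit mid-run, commit once at the end (batch size and field names are illustrative, and the SolrJ class names are those of the 1.3-dev nightlies):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);                // field names illustrative
            doc.addField("text", "body of document " + i);
            batch.add(doc);
            if (batch.size() == 500) {   // batch size is a tuning knob
                server.add(batch);       // one HTTP round-trip per batch
                batch.clear();
            }
        }
        if (!batch.isEmpty()) server.add(batch);
        server.commit();                 // single commit at the very end
    }
}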
Re: Any tips for indexing large amounts of data?
Hi Otis, I understand this is a slightly off-track question, but I am just curious to know the performance of search on a 20 GB index file. What has been your observation? Regards, Eswar On Nov 21, 2007 12:33 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > Mike is right about the occasional slow-down, which appears as a pause and > is due to large Lucene index segment merging. This should go away with > newer versions of Lucene where this is happening in the background. > > That said, we just indexed about 20MM documents on a single 8-core machine > with 8 GB of RAM, resulting in nearly 20 GB index. The whole process took a > little less than 10 hours - that's over 550 docs/second. The vanilla > approach before some of our changes apparently required several days to > index the same amount of data. > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > - Original Message > From: Mike Klaas <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Monday, November 19, 2007 5:50:19 PM > Subject: Re: Any tips for indexing large amounts of data? > > There should be some slowdown in larger indices as occasionally large > segment merge operations must occur. However, this shouldn't really > affect overall speed too much. > > You haven't really given us enough data to tell you anything useful. > I would recommend trying to do the indexing via a webapp to eliminate > all your code as a possible factor. Then, look for signs to what is > happening when indexing slows. For instance, is Solr high in cpu, is > the computer thrashing, etc? > > -Mike > > On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote: > > > Hi, > > > > Thanks for answering this question a while back. I have made some > > of the suggestions you mentioned. ie not committing until I've > > finished indexing. What I am seeing though, is as the index get > > larger (around 1Gb), indexing is taking a lot longer. In fact it > > slows down to a crawl. Have you got any pointers as to what I might > > be doing wrong? > > > > Also, I was looking at using MultiCore solr. Could this help in > > some way? > > > > Thank you > > Brendan > > > > On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote: > > > >> > >> : I would think you would see better performance by allowing auto > >> commit > >> : to handle the commit size instead of reopening the connection > >> all the > >> : time. > >> > >> if your goal is "fast" indexing, don't use autoCommit at all ... just > >> index everything, and don't commit until you are completely done. > >> > >> autoCommitting will slow your indexing down (the benefit being > >> that more > >> results will be visible to searchers as you proceed) > >> > >> > >> > >> > >> -Hoss > >> > > > > > > >
Re: Near Duplicate Documents
To whoever started this thread: look at Nutch. I believe something related to this already exists in Nutch for near-duplicate detection. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Mike Klaas <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Sunday, November 18, 2007 11:08:38 PM Subject: Re: Near Duplicate Documents On 18-Nov-07, at 8:17 AM, Eswar K wrote: > Is there any idea implementing that feature in the upcoming releases? Not currently. Feel free to contribute something if you find a good solution. -Mike > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote: > >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote: >>> We have a scenario, where we want to find out documents which are >> similar in >>> content. To elaborate a little more on what we mean here, let's >>> take an >>> example. >>> >>> The example of this email chain in which we are interacting on, >>> can be >> best >>> used for illustrating the concept of near dupes (We are not getting >> confused >>> with threads, they are two different things.). Each email in this >>> thread >> is >>> treated as a document by the system. A reply to the original mail >>> also >>> includes the original mail in which case it becomes a near >>> duplicate of >> the >>> orginal mail (depending on the percentage of similarity). >>> Similarly it >> goes >>> on. The near dupes need not be limited to emails. >> >> I think this is what's known as "shingling." See >> http://en.wikipedia.org/wiki/W-shingling >> Lucene (and therefore Solr) does not implement shingling. The >> "MoreLikeThis" query might be close enough, however. >> >> -Stuart >>
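For the curious, a toy sketch of the w-shingling idea from the Wikipedia link above: slide a window of w tokens over each document, collect the shingle sets, and compare documents by Jaccard similarity; near-duplicates score close to 1.0. (This is not Nutch's actual implementation, which hashes shingles to scale.)

import java.util.HashSet;
import java.util.Set;

public class Shingles {
    // All w-token shingles of a text (toy whitespace tokenizer).
    static Set<String> shingles(String text, int w) {
        String[] tok = text.toLowerCase().split("\\s+");
        Set<String> out = new HashSet<String>();
        for (int i = 0; i + w <= tok.length; i++) {
            StringBuilder sb = new StringBuilder(tok[i]);
            for (int j = 1; j < w; j++) sb.append(' ').append(tok[i + j]);
            out.add(sb.toString());
        }
        return out;
    }

    // Jaccard similarity: |A intersect B| / |A union B|.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<String>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<String>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> a = shingles("the quick brown fox jumps over the lazy dog", 3);
        Set<String> b = shingles("the quick brown fox jumped over the lazy dog", 3);
        // Prints 0.4: 4 of the 10 distinct shingles are shared. For long
        // documents, a one-word change leaves the score very close to 1.0.
        System.out.println(jaccard(a, b));
    }
}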
Re: Performance question: Solr 64 bit java vs 32 bit mode.
Solr runs equally well on both 64-bit and 32-bit systems. Your 15-second problem could be caused by an IO bottleneck (not likely if your index is small and fits in RAM), could be concurrency (esp. if you are using the compound index format), could be something else on production killing your CPU, could be the JVM being busy sweeping the garbage out, etc. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Robert Purdy <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Thursday, November 15, 2007 4:05:00 PM Subject: Performance question: Solr 64 bit java vs 32 bit mode. Would anyone know if Solr runs better in 64-bit Java vs 32-bit, and could answer another possibly related question? I currently have two servers running Solr under identical Tomcat installations. One is the production server and is under heavy user load; the other is under no load at all because it is a test box. I was looking in the logs on the production server and noticed some queries were taking about 15 seconds, and this is after auto-warming. So I decided to execute that same query on the other server with nothing in the caches and found that it only took 2 seconds to complete. My question is: why would a dual Intel Core Duo Xserve server in 64-bit Java mode with 8GB of RAM allocated to the Tomcat server be slower than a dual PowerPC G5 server running in 32-bit mode with only 2GB of RAM allocated? Is it because of the load/concurrency issues on the production server that made the time next to the query in the log greater on the production server? If so, what is the best way to configure Tomcat to deal with that issue? Thanks, Robert.
Re: Any tips for indexing large amounts of data?
That's great. At what size of the index do you think we should look at partitioning the index file? Eswar On Nov 21, 2007 12:57 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > Just tried a search for "web" on this index - 1.1 seconds. This matches > about 1MM of about 20MM docs. Redo the search, and it's 1 ms (cached). > This is without any load nor serious benchmarking, clearly. > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > - Original Message > From: Eswar K <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Wednesday, November 21, 2007 2:11:07 AM > Subject: Re: Any tips for indexing large amounts of data? > > Hi Otis, > > I understand this is a slightly off-track question, but I am just curious to > know the performance of search on a 20 GB index file. What has been your > observation? > > Regards, > Eswar > > On Nov 21, 2007 12:33 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > > > Mike is right about the occasional slow-down, which appears as a pause and > > is due to large Lucene index segment merging. This should go away with > > newer versions of Lucene where this is happening in the background. > > > > That said, we just indexed about 20MM documents on a single 8-core machine > > with 8 GB of RAM, resulting in nearly 20 GB index. The whole process took a > > little less than 10 hours - that's over 550 docs/second. The vanilla > > approach before some of our changes apparently required several days to > > index the same amount of data. > > > > Otis > > -- > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > - Original Message > > From: Mike Klaas <[EMAIL PROTECTED]> > > To: solr-user@lucene.apache.org > > Sent: Monday, November 19, 2007 5:50:19 PM > > Subject: Re: Any tips for indexing large amounts of data? > > > > There should be some slowdown in larger indices as occasionally large > > segment merge operations must occur. However, this shouldn't really > > affect overall speed too much. > > > > You haven't really given us enough data to tell you anything useful. > > I would recommend trying to do the indexing via a webapp to eliminate > > all your code as a possible factor. Then, look for signs to what is > > happening when indexing slows. For instance, is Solr high in cpu, is > > the computer thrashing, etc? > > > > -Mike > > > > On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote: > > > > > Hi, > > > > > > Thanks for answering this question a while back. I have made some > > > of the suggestions you mentioned. ie not committing until I've > > > finished indexing. What I am seeing though, is as the index get > > > larger (around 1Gb), indexing is taking a lot longer. In fact it > > > slows down to a crawl. Have you got any pointers as to what I might > > > be doing wrong? > > > > > > Also, I was looking at using MultiCore solr. Could this help in > > > some way? > > > > > > Thank you > > > Brendan > > > > > > On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote: > > > > > >> > > >> : I would think you would see better performance by allowing auto > > >> commit > > >> : to handle the commit size instead of reopening the connection > > >> all the > > >> : time. > > >> > > >> if your goal is "fast" indexing, don't use autoCommit at all ... > > just > > >> index everything, and don't commit until you are completely done. 
> > >> > > >> autoCommitting will slow your indexing down (the benefit being > > >> that more > > >> results will be visible to searchers as you proceed) > > >> > > >> > > >> > > >> > > >> -Hoss > > >> > > > > > > > > > > > > > > > > >
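For anyone following Hoss's "index everything, commit once" advice above, here is a bare-bones sketch of what that looks like against Solr's XML update handler. The URL, document count, and field name are placeholders for illustration; adjust them for your schema and install:

    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class BulkIndexer {
        // Example update URL; adjust host, port, and path for your install.
        private static final String UPDATE_URL = "http://localhost:8983/solr/update";

        static void post(String xml) throws Exception {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(UPDATE_URL).openConnection();
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
            Writer w = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
            w.write(xml);
            w.close();
            if (conn.getResponseCode() != 200) {
                throw new RuntimeException("Update failed: HTTP " + conn.getResponseCode());
            }
            conn.getInputStream().close();
        }

        public static void main(String[] args) throws Exception {
            for (int i = 0; i < 100000; i++) {
                // Add each doc (batching several docs per <add> is even better),
                // but do NOT commit inside the loop.
                post("<add><doc><field name=\"id\">" + i + "</field></doc></add>");
            }
            post("<commit/>"); // one commit at the very end
        }
    }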
Re: Any tips for indexing large amounts of data?
Just tried a search for "web" on this index - 1.1 seconds. This matches about 1MM of about 20MM docs. Redo the search, and it's 1 ms (cached). This is without any load or serious benchmarking, clearly.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
From: Eswar K <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, November 21, 2007 2:11:07 AM
Subject: Re: Any tips for indexing large amounts of data?

Hi Otis,

I understand this is a slightly off-track question, but I am just curious to know the performance of search on a 20 GB index file. What has been your observation?

Regards,
Eswar

On Nov 21, 2007 12:33 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

> Mike is right about the occasional slow-down, which appears as a pause and
> is due to large Lucene index segment merging. This should go away with
> newer versions of Lucene, where this happens in the background.
>
> That said, we just indexed about 20MM documents on a single 8-core machine
> with 8 GB of RAM, resulting in a nearly 20 GB index. The whole process took
> a little less than 10 hours - that's over 550 docs/second. The vanilla
> approach before some of our changes apparently required several days to
> index the same amount of data.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> - Original Message
> From: Mike Klaas <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Monday, November 19, 2007 5:50:19 PM
> Subject: Re: Any tips for indexing large amounts of data?
>
> There should be some slowdown in larger indices as occasionally large
> segment merge operations must occur. However, this shouldn't really
> affect overall speed too much.
>
> You haven't really given us enough data to tell you anything useful.
> I would recommend trying to do the indexing via a webapp to eliminate
> all your code as a possible factor. Then, look for signs of what is
> happening when indexing slows. For instance, is Solr high in CPU, is
> the computer thrashing, etc.?
>
> -Mike
>
> On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:
>
> > Hi,
> >
> > Thanks for answering this question a while back. I have made some of the
> > suggestions you mentioned, i.e. not committing until I've finished
> > indexing. What I am seeing, though, is that as the index gets larger
> > (around 1GB), indexing takes a lot longer. In fact it slows down to a
> > crawl. Have you got any pointers as to what I might be doing wrong?
> >
> > Also, I was looking at using MultiCore Solr. Could this help in some way?
> >
> > Thank you
> > Brendan
> >
> > On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:
> >
> >> : I would think you would see better performance by allowing auto
> >> : commit to handle the commit size instead of reopening the connection
> >> : all the time.
> >>
> >> if your goal is "fast" indexing, don't use autoCommit at all ... just
> >> index everything, and don't commit until you are completely done.
> >>
> >> autoCommitting will slow your indexing down (the benefit being that more
> >> results will be visible to searchers as you proceed)
> >>
> >> -Hoss
Re: Performance of Solr on different Platforms
Most of Sematext's customers seem to be RH fans. I've seen some Ubuntu, some Debian, and some SuSE users. RH feels "safe". :) Some use Solaris. Some are going crazy with Xen, putting everything in VMs.

RAM - as much as you can afford, as usual. CPU - AMD Opterons performed the best the last time I benchmarked a bunch of different types of hardware - stay away from Sun Niagara servers for Lucene/Solr.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
From: Eswar K <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Sunday, November 18, 2007 11:15:48 AM
Subject: Performance of Solr on different Platforms

Hi,

I understand that Solr can be used on different Linux flavors. Is there any preferred flavor (like Red Hat, Ubuntu, etc.)? Also, what kind of hardware configuration (processors, RAM, etc.) would be best suited for the install? We expect to load it with millions of documents (varying from 2 - 20 million). There might be around 1000 concurrent users.

Your help in this regard will be appreciated.

Regards,
Eswar
Re: Near Duplicate Documents
Otis,

Thanks for your response. I just took a quick look at the Nutch forum and found that there is an implementation for de-duplicating documents/pages, but none for near-duplicate documents. Can you guide me a little further as to where exactly under Nutch I should be concentrating regarding near-duplicate documents?

Regards,
Rishabh

On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

> To whomever started this thread: look at Nutch. I believe something
> related to this already exists in Nutch for near-duplicate detection.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> - Original Message
> From: Mike Klaas <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Sunday, November 18, 2007 11:08:38 PM
> Subject: Re: Near Duplicate Documents
>
> On 18-Nov-07, at 8:17 AM, Eswar K wrote:
>
> > Is there any plan to implement that feature in the upcoming releases?
>
> Not currently. Feel free to contribute something if you find a good
> solution.
>
> -Mike
>
> > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> >
> >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> >>> We have a scenario where we want to find documents which are similar
> >>> in content. To elaborate a little more on what we mean here, let's
> >>> take an example.
> >>>
> >>> This email chain in which we are interacting can best be used to
> >>> illustrate the concept of near dupes (not to be confused with
> >>> threads; they are two different things). Each email in this thread is
> >>> treated as a document by the system. A reply to the original mail
> >>> also includes the original mail, in which case it becomes a near
> >>> duplicate of the original mail (depending on the percentage of
> >>> similarity). Similarly it goes on. The near dupes need not be limited
> >>> to emails.
> >>
> >> I think this is what's known as "shingling." See
> >> http://en.wikipedia.org/wiki/W-shingling
> >> Lucene (and therefore Solr) does not implement shingling. The
> >> "MoreLikeThis" query might be close enough, however.
> >>
> >> -Stuart
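For anyone wondering what the w-shingling Stuart mentions looks like in practice, here is a rough standalone sketch (not Lucene/Solr code; the whitespace tokenization, shingle size of 4, and 0.9 threshold are arbitrary choices): build the set of k-word shingles for each document, then call two documents near-duplicates when the Jaccard overlap of their shingle sets passes a threshold.

    import java.util.HashSet;
    import java.util.Set;

    public class Shingler {
        // Build the set of k-word shingles for a document.
        static Set<String> shingles(String text, int k) {
            String[] tokens = text.toLowerCase().split("\\W+");
            Set<String> result = new HashSet<String>();
            for (int i = 0; i + k <= tokens.length; i++) {
                StringBuilder sb = new StringBuilder();
                for (int j = 0; j < k; j++) {
                    if (j > 0) sb.append(' ');
                    sb.append(tokens[i + j]);
                }
                result.add(sb.toString());
            }
            return result;
        }

        // Jaccard similarity: intersection size over union size.
        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> intersection = new HashSet<String>(a);
            intersection.retainAll(b);
            Set<String> union = new HashSet<String>(a);
            union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
        }

        public static void main(String[] args) {
            Set<String> a = shingles("a reply includes the original mail plus a short answer", 4);
            Set<String> b = shingles("a reply includes the original mail plus a long answer", 4);
            // Treat, say, similarity above 0.9 as a near-duplicate.
            System.out.println("similarity = " + jaccard(a, b));
        }
    }

For large collections you would not compare all pairs of full shingle sets directly; the usual trick is to hash the shingles and compare small sketches of each set instead.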
Re: two solr instances - index and commit
Uh, avoid NFS with Lucene/Solr unless you really, really don't care about performance. We recently benchmarked Lucene indexing+searching+... on 1) local disk, 2) SAN, and 3) NFS. You have the right to a single guess - which of the three was the slowest?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
From: Kasi Sankaralingam <[EMAIL PROTECTED]>
To: "solr-user@lucene.apache.org"
Sent: Tuesday, November 13, 2007 6:48:03 PM
Subject: RE: two solr instances - index and commit

This works; the only thing you need to be aware of is the NFS problem if you are running in a distributed environment sharing an NFS partition.

a) Index and commit on one instance (typically partitioned as an index server)
b) Issue a commit on the search server (like a read-only mode)

Things to watch out for: you will get the stale NFS file handle problem. I replaced the Lucene core that ships with Solr with the latest one and it works.

-Original Message-
From: Jae Joo [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 13, 2007 9:06 AM
To: solr-user
Subject: two solr instances - index and commit

Hi,

I have two Solr instances running under different Tomcat environments. One instance is for indexing, and I would like to commit to the other instance. This is what I tried, but it failed: using post.sh (without commit), the docs are indexed in the solr-1 instance. After indexing, I call the commit command against solr-2.

Can anyone help me?

Jae
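For the record, the "issue a commit on the search server" step Kasi describes is just an HTTP POST of <commit/> to that instance's update handler, which makes it reopen its searcher on the shared index. A minimal sketch; the host, port, and path are examples:

    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RemoteCommit {
        public static void main(String[] args) throws Exception {
            // Example URL of the search (read-only) instance's update handler.
            URL url = new URL("http://search-host:8080/solr/update");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
            Writer w = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
            w.write("<commit/>"); // asks the instance to open a new searcher
            w.close();
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }

The same thing can be done from a shell with curl, which may be simpler to call from post-indexing scripts.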