RE: Solr replication
Hi Bill,

I have some questions regarding Solr collection distribution.

1) Is it possible to add index operations on the slave server using Solr collection distribution and still have the master server updated with these changes?

2) I have a requirement of having more than one Solr instance (with a corresponding data directory for each Solr core). Is it possible to maintain different Solr cores and still achieve Solr collection distribution for each of these cores independently? If yes, then how?

Regards,
Dilip

-----Original Message-----
From: Bill Au [mailto:[EMAIL PROTECTED]
Sent: Monday, January 14, 2008 9:40 PM
To: [EMAIL PROTECTED]
Subject: Re: Solr replication

Yes, you need the same changes in scripts.conf on the slave server, but you don't need the post-commit hook enabled on the slave server. The post-commit hook is used to create snapshots: you will see a new snapshot in the data directory every time you do a commit on the master server. There is no need to create snapshots on the slave server, as the slave server copies the snapshots from the master server.

The scripts are designed to run under Unix/Linux. They use symbolic links and Unix/Linux commands such as scp, ssh, rsync, and cp. I don't know much about Windows, so I don't know for sure whether all the Unix/Linux tools used by the scripts are available on Windows or not.

Bill

On 1/14/08, Dilip.TS <[EMAIL PROTECTED]> wrote:

Hi Bill,
I am trying to use the Solr collection distribution and have made the following changes:

1) Changes made on the master server on Linux, in the scripts.conf file:

user=
solr_hostname=localhost
solr_port=8983
rsyncd_port=18983
data_dir=/usr/solr/data/data_tenantID_1
webapp_name=solr
master_host=192.168.168.50
master_data_dir=/usr/solr/data/data_tenantID_1
master_status_dir=/usr/solr/logs

2) Enabled the postCommit listener in solrconfig.xml:

<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">/usr/solr/bin/snapshooter</str>
  <str name="dir">/usr/solr/bin</str>
  <bool name="wait">true</bool>
</listener>

I ran the embedded Solr folder and added a document to it, and did a search for a word on the same server. I found the following observations in the console:

INFO: query parser default operator is OR
Jan 14, 2008 3:37:38 PM org.apache.solr.schema.IndexSchema readSchema
INFO: unique key field: id
Jan 14, 2008 3:37:38 PM org.apache.solr.core.SolrCore
INFO: Opening new SolrCore at //usr//solr/, dataDir=//usr//solr//data//data_tenantID_1
Jan 14, 2008 3:37:38 PM org.apache.solr.core.SolrCore parseListener
INFO: Searching for listeners: //[EMAIL PROTECTED]"firstSearcher"]
Jan 14, 2008 3:37:38 PM org.apache.solr.core.SolrCore parseListener
INFO: Searching for listeners: //[EMAIL PROTECTED]"newSearcher"]
Jan 14, 2008 3:37:39 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created xslt: org.apache.solr.request.XSLTResponseWriter
Jan 14, 2008 3:37:39 PM org.apache.solr.request.XSLTResponseWriter init
INFO: xsltCacheLifetimeSeconds=5
Jan 14, 2008 3:37:39 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created standard: org.apache.solr.handler.StandardRequestHandler
.
.
.
.
INFO: Opening [EMAIL PROTECTED] main
Jan 14, 2008 3:37:39 PM org.apache.solr.core.SolrCore registerSearcher
INFO: Registered new searcher [EMAIL PROTECTED] main
Jan 14, 2008 3:37:39 PM org.apache.solr.update.UpdateHandler parseEventListeners
INFO: added SolrEventListener for postCommit: org.apache.solr.core.RunExecutableListener{exe=/usr/solr/bin/snapshooter,dir=/usr/solr/bin,wait=true,env=[]}
Jan 14, 2008 3:37:39 PM org.apache.solr.update.DirectUpdateHandler2$CommitTracker
INFO: AutoCommit: disabled

In the console output above I can see the postCommit listener, org.apache.solr.core.RunExecutableListener{exe=/usr/solr/bin/snapshooter,dir=/usr/solr/bin,wait=true,env=[]}, being invoked after a commit. This is the scenario for an add/search done on the same master server on Linux.

1) I would like to know whether we need similar entries in scripts.conf, and the postCommit hook enabled in solrconfig.xml, on the slave server too. If yes, should these entries on the slave server be identical to those on the master, or different?

2) Also, can the Linux machine act as the master server while the slave runs on a Windows machine?

Thanks in advance.

Regards,
Dilip

-----Original Message-----
From: Bill Au [mailto:[EMAIL PROTECTED]]
Sent: Saturday, December 15, 2007 1:08 AM
To: solr-user@lucene.apache.org; [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: Solr replication

On Dec 14, 2007 7:00 AM, Dilip.TS <[EMAIL PROTECTED]> wrote:
> Hi,
> I have the following requirement for SOLR Collection Distribution using
> Embedded Solr with the Jetty server:
>
Problem with dismax handler when searching Solr along with field
When I search with a query such as

http://localhost:8983/solr/select/?q=category&qt=dismax

it gives results, but when I want to search on the basis of a field name, like

http://localhost:8983/solr/select/?q=maincategory:Cars&qt=dismax

it does not give results. However,

http://localhost:8983/solr/select/?q=maincategory:Cars

returns results for cars from the field "maincategory".

--
View this message in context: http://www.nabble.com/Problem-with-dismax-handler-when-searching-Solr-along-with-field-tp14878239p14878239.html
Sent from the Solr - User mailing list archive at Nabble.com.
Indexing two sets of details
Hi,

In the web application we are developing we have two sets of details: the personal details and the resume details. We allow 5 different resumes per user, but we want the personal details to remain the same across all 5 resumes. The problem is that when the personal details change, we have to update all 5 resumes.

I was thinking that if we indexed the personal-details fields separately, we would only have to change/update those fields. But the problem is searching for users using fields from both the personal details and the resume: then I have to manually combine both searches, and what if one search gives more results than the other?

I would really appreciate any suggestions on how to tackle this problem.

Thanks,
--
Gavin Selvaratnam, Project Leader
hSenid Mobile Solutions
Phone: +94-11-2446623/4
Fax: +94-11-2307579
Web: http://www.hSenidMobile.com
Make it happen
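One way to handle this, sketched below, is to denormalize: copy the personal-details fields into each of the user's five resume documents and simply re-post those five documents whenever the personal details change. Re-indexing five small documents per change is usually cheaper than running two separate searches and merging them by hand. This is only an illustration, not something proposed elsewhere in this thread; the field names, the id convention (userId_resumeNo), and the plain XML-update approach are all assumptions.

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: build one <add> message containing all 5 resume documents for a user,
// with the personal-details fields copied into each document. Field names
// (full_name, city, resume_text, ...) are hypothetical.
public class ResumeDocBuilder {

    public static String buildAddXml(String userId, Map<String, String> personal,
                                     String[] resumeTexts) {
        StringBuilder xml = new StringBuilder("<add>\n");
        for (int i = 0; i < resumeTexts.length; i++) {
            xml.append("  <doc>\n");
            // Unique key combines the user and the resume slot.
            xml.append(field("id", userId + "_" + i));
            // Personal details are duplicated into every resume document.
            for (Map.Entry<String, String> e : personal.entrySet()) {
                xml.append(field(e.getKey(), e.getValue()));
            }
            xml.append(field("resume_text", resumeTexts[i]));
            xml.append("  </doc>\n");
        }
        return xml.append("</add>").toString();
    }

    private static String field(String name, String value) {
        // Minimal XML escaping, enough for the example.
        String escaped = value.replace("&", "&amp;").replace("<", "&lt;");
        return "    <field name=\"" + name + "\">" + escaped + "</field>\n";
    }

    public static void main(String[] args) {
        Map<String, String> personal = new LinkedHashMap<String, String>();
        personal.put("full_name", "Jane Doe");
        personal.put("city", "Colombo");
        String[] resumes = { "Java developer ...", "Project lead ...", "QA ...", "Support ...", "Trainer ..." };
        System.out.println(buildAddXml("user42", personal, resumes));
    }
}

The generated message is what you would POST to /solr/update (followed by a commit) whenever the personal details change.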
Re: Solr in a distributed multi-machine high-performance environment
Look at http://issues.apache.org/jira/browse/SOLR-303

Please note that it is still a work in progress, so you may not be able to use it immediately.

On Jan 16, 2008 10:53 AM, Srikant Jakilinki <[EMAIL PROTECTED]> wrote:

> Hi All,
>
> There is a requirement in our group of indexing and searching several millions of documents (TREC) in real-time and with millisecond responses. For the moment we are preferring scale-out (throw more commodity machines) approaches rather than scale-up (faster disks, more RAM). This is in turn inspired by the "Scale-out vs. Scale-up" paper (mail me if you want a copy), in which it was proven that this kind of distribution scales better and is more resilient.
>
> So, are there any resources available (wiki, tutorials, slides, README etc.) that throw light and guide newbies on how to run Solr in a multi-machine scenario? I have gone through the mailing lists and site but could not really find any answers or hands-on material. An ad-hoc guideline to get things working with 2 machines might just be enough, but for the sake of thinking out loud and soliciting responses from the list, here are my questions:
>
> 1) Solr that has to handle a fairly large index which has to be split up on multiple disks (using multicore?)
>    - Space is not a problem since we can use NFS, but that is not recommended as we would only exploit 1 processor
> 2) Solr that has to handle a large collective index which has to be split up on multiple machines
>    - The index is ever increasing (TB scale) and dynamic, and all of it has to be searched at any point
> 3) Solr that has to exploit multi-machines because we have plenty of them in a tightly coupled P2P scenario
>    - Machines are not a problem, but will they be if they are of varied configurations (PIII to Core2; Linux to Vista; 32-bit to 64-bit; J2SE 1.1 to 1.6)?
> 4) Solr that has to distribute load on several machines
>    - The index(es) could be common, though, like say using a distributed filesystem (Hadoop?)
>
> In each of the above cases (we might use all of these strategies for various use cases) the application should use Solr as a strict backend and named service (IP or host:port) so that we can expose this application (and the service) to the web or an intranet. Machine failures should be tolerated too. Also, does Solr manage load balancing out of the box if it is indeed configured to work with multiple machines?
>
> Maybe it is superfluous, but are Solr and/or Nutch the only way to use Lucene in a multi-machine environment? Or is there some hidden document/project somewhere that makes it possible by exposing a regular Lucene process over the network using RMI or something? It is my understanding (could be wrong) that Nutch and, to some extent, Solr do not perform well when there is a lot of indexing activity in parallel with search. Batch processing is also there, and perhaps we can use Nutch/Solr there. Even so, we need multi-machine directions.
>
> I am sure that multi-machines make possible a lot of other approaches which might solve the goal better and that others have practical experience with. So, any advice and tips are also very welcome. We intend to document things and do some benchmarking along the way in the open spirit.
>
> Really sorry for the length, but I hope some answers are forthcoming.
>
> Cheers,
> Srikant

--
Regards,
Shalin Shekhar Mangar.
Re: Solr replication
My answers inline...

On Jan 16, 2008 3:51 AM, Dilip.TS <[EMAIL PROTECTED]> wrote:

> Hi Bill,
> I have some questions regarding the SOLR collection distribution.
> 1) Is it possible to add the index operations on the slave server using
> SOLR collection distribution and still have the master server updated with
> these changes?

No. The replication process is only one way, from the master to the slaves. The idea behind it is that the slave servers are for query only, and the number of slaves can be increased or decreased according to traffic load.

> 2) I have a requirement of having more than one solr instance (with a
> corresponding data directory for each solr core). Is it possible to maintain
> different solr cores and still achieve SOLR collection distribution for all
> of these cores independently? If yes, then how?

Does each Solr instance have its own solr home? If so, you can use replication within each instance by simply adjusting the parameters in scripts.conf for each instance. Even if they all share a single solr home, the replication-related scripts all have command-line options to override the values set in scripts.conf:

http://wiki.apache.org/solr/SolrCollectionDistributionScripts

So you can invoke the scripts for each instance by setting the data directory on the command line.

> Regards,
> Dilip
>
> -----Original Message-----
> From: Bill Au [mailto:[EMAIL PROTECTED]
> Sent: Monday, January 14, 2008 9:40 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Solr replication
>
> Yes, you need the same changes in scripts.conf on the slave server but you
> don't need the post commit hook enabled on the slave server.
> The post commit hook is used to create snapshots. You will see a new
> snapshot in the data directory every time you do a commit on the master
> server. There is no need to create snapshots on the slave server as the
> slave server copies the snapshots from the master server.
>
> The scripts are designed to run under Unix/Linux. They use symbolic links
> and Unix/Linux commands like scp, ssh, rsync, cp. I don't know much about
> Windows so I don't know for sure if all the Unix/Linux stuff used by the
> scripts is available in Windows or not.
>
> Bill
>
> On 1/14/08, Dilip.TS <[EMAIL PROTECTED]> wrote:
> Hi Bill,
> I am trying to use the solr collection distribution
> and have made the following changes:
>
> 1) Changes done on the master server on Linux.
> In the scripts.conf file:
>
> user=
> solr_hostname=localhost
> solr_port=8983
> rsyncd_port=18983
> data_dir=/usr/solr/data/data_tenantID_1
> webapp_name=solr
> master_host=192.168.168.50
> master_data_dir=/usr/solr/data/data_tenantID_1
> master_status_dir=/usr/solr/logs
>
> 2) Enabled the postCommit listener in solrconfig.xml:
>
> <listener event="postCommit" class="solr.RunExecutableListener">
>   <str name="exe">/usr/solr/bin/snapshooter</str>
>   <str name="dir">/usr/solr/bin</str>
>   <bool name="wait">true</bool>
> </listener>
>
> I run the embedded Solr folder and added a document to it,
> and did a search for a word on the same server.
>I found the following observations in the console: > >INFO: query parser default operator is OR >Jan 14, 2008 3:37:38 PM org.apache.solr.schema.IndexSchema readSchema >INFO: unique key field: id >Jan 14, 2008 3:37:38 PM org.apache.solr.core.SolrCore >INFO: Opening new SolrCore at //usr//solr/, >dataDir=//usr//solr//data//data_tenantID_1 >Jan 14, 2008 3:37:38 PM org.apache.solr.core.SolrCore parseListener >INFO: Searching for listeners: //[EMAIL PROTECTED]"firstSearcher"] >Jan 14, 2008 3:37:38 PM org.apache.solr.core.SolrCore parseListener >INFO: Searching for listeners: //[EMAIL PROTECTED]"newSearcher"] >Jan 14, 2008 3:37:39 PM > org.apache.solr.util.plugin.AbstractPluginLoader >load >INFO: created xslt: org.apache.solr.request.XSLTResponseWriter >Jan 14, 2008 3:37:39 PM org.apache.solr.request.XSLTResponseWriter init >INFO: xsltCacheLifetimeSeconds=5 >Jan 14, 2008 3:37:39 PM > org.apache.solr.util.plugin.AbstractPluginLoader >load >INFO: created standard: org.apache.solr.handler.StandardRequestHandler >. >. >. >. >INFO: Opening [EMAIL PROTECTED] main >Jan 14, 2008 3:37:39 PM org.apache.solr.core.SolrCore registerSearcher >INFO: Registered new searcher [EMAIL PROTECTED] main >Jan 14, 2008 3:37:39 PM org.apache.solr.update.UpdateHandler >parseEventListeners >INFO: added SolrEventListener for postCommit: > > org.apache.solr.core.RunExecutableListener{exe=/usr/solr/bin/snapshooter > ,dir >=/usr/solr/bin,wait=true,env=[]} >Jan 14, 2008 3:37:39 PM >org.apache.solr.update.DirectUpdateHandler2$CommitTracker >INFO: AutoCommit: disabled > > >In the above console i find "postCommit: > > org.apache.solr.core.RunExecutableListener{exe=/usr/solr/bin/snapshooter > ,dir >=/usr/solr/bin,wait=true,env=[]}" >command being called after doing a commit. >This is a scenario for the add/search do
Re: Indexing very large files.
All,
I just found a thread about this on the mailing list archives because I'm troubleshooting the same problem. The kicker is that it doesn't take such large files to kill the StringBuilder. I have discovered the following:

By using a text file made up of 3,443,464 bytes or less, I get no error.

At 3,443,465 bytes:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.String.<init>(String.java:208)
    at java.lang.StringBuilder.toString(StringBuilder.java:431)
    at org.junit.Assert.format(Assert.java:321)
    at org.junit.ComparisonFailure$ComparisonCompactor.compact(ComparisonFailure.java:80)
    at org.junit.ComparisonFailure.getMessage(ComparisonFailure.java:37)
    at java.lang.Throwable.getLocalizedMessage(Throwable.java:267)
    at java.lang.Throwable.toString(Throwable.java:344)
    at java.lang.String.valueOf(String.java:2615)
    at java.io.PrintWriter.print(PrintWriter.java:546)
    at java.io.PrintWriter.println(PrintWriter.java:683)
    at java.lang.Throwable.printStackTrace(Throwable.java:510)
    at org.apache.tools.ant.util.StringUtils.getStackTrace(StringUtils.java:96)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.getFilteredTrace(JUnitTestRunner.java:856)
    at org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.formatError(XMLJUnitResultFormatter.java:280)
    at org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.addError(XMLJUnitResultFormatter.java:255)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner$4.addError(JUnitTestRunner.java:988)
    at junit.framework.TestResult.addError(TestResult.java:38)
    at junit.framework.JUnit4TestAdapterCache$1.testFailure(JUnit4TestAdapterCache.java:51)
    at org.junit.runner.notification.RunNotifier$4.notifyListener(RunNotifier.java:96)
    at org.junit.runner.notification.RunNotifier$SafeNotifier.run(RunNotifier.java:37)
    at org.junit.runner.notification.RunNotifier.fireTestFailure(RunNotifier.java:93)
    at org.junit.internal.runners.TestMethodRunner.addFailure(TestMethodRunner.java:104)
    at org.junit.internal.runners.TestMethodRunner.runUnprotected(TestMethodRunner.java:87)
    at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
    at org.junit.internal.runners.TestMethodRunner.runMethod(TestMethodRunner.java:75)
    at org.junit.internal.runners.TestMethodRunner.run(TestMethodRunner.java:45)
    at org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod(TestClassMethodsRunner.java:71)
    at org.junit.internal.runners.TestClassMethodsRunner.run(TestClassMethodsRunner.java:35)
    at org.junit.internal.runners.TestClassRunner$1.runUnprotected(TestClassRunner.java:42)
    at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
    at org.junit.internal.runners.TestClassRunner.run(TestClassRunner.java:52)
    at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:32)

At 3,443,466 bytes (or more):

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:99)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:393)
    at java.lang.StringBuilder.append(StringBuilder.java:120)
    at org.junit.Assert.format(Assert.java:321)
    at org.junit.ComparisonFailure$ComparisonCompactor.compact(ComparisonFailure.java:80)
    at org.junit.ComparisonFailure.getMessage(ComparisonFailure.java:37)
    at java.lang.Throwable.getLocalizedMessage(Throwable.java:267)
    at java.lang.Throwable.toString(Throwable.java:344)
    at java.lang.String.valueOf(String.java:2615)
    at java.io.PrintWriter.print(PrintWriter.java:546)
    at java.io.PrintWriter.println(PrintWriter.java:683)
    at java.lang.Throwable.printStackTrace(Throwable.java:510)
    at org.apache.tools.ant.util.StringUtils.getStackTrace(StringUtils.java:96)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.getFilteredTrace(JUnitTestRunner.java:856)
    at org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.formatError(XMLJUnitResultFormatter.java:280)
    at org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.addError(XMLJUnitResultFormatter.java:255)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner$4.addError(JUnitTestRunner.java:988)
    at junit.framework.TestResult.addError(TestResult.java:38)
    at junit.framework.JUnit4TestAdapterCache$1.testFailure(JUnit4TestAdapterCache.java:51)
    at org.junit.runner.notification.RunNotifier$4.notifyListener(RunNotifier.java:96)
    at org.junit.runner.notificatio
Cache size and Heap size
Hello,

I have relatively large RAM (10 GB) on my server, which is running Solr. I increased the cache settings and started to see OutOfMemory exceptions, especially on facet search.

Does anybody have suggestions on how the cache settings relate to memory consumption? What are the optimal settings? How can they be calculated?

Thank you for any advice,
Gene
Re: Cache size and Heap size
Hi Gene,

Have you set your app server / servlet container to allocate some of this memory? You can define the maximum and minimum heap size by adding/replacing some parameters on the app server initialization:

-Xmx1536m -Xms1536m

Which app server / servlet container are you using?

Regards,
Daniel Alheiros

On 16/1/08 15:23, "Evgeniy Strokin" <[EMAIL PROTECTED]> wrote:

> Hello,
> I have relatively large RAM (10Gb) on my server which is running Solr. I
> increased Cache settings and start to see OutOfMemory exceptions, specially on
> facet search.
> Is anybody has some suggestions how Cache settings related to Memory
> consumptions? What are optimal settings? How they could be calculated?
>
> Thank you for any advise,
> Gene
conceptual issues with solr
Hi there,

It seems that Lucene accepts any kind of XML document, but Solr accepts only flat name/value pairs inside a document to be indexed. You'll find below what I'd like to do. Thanks for help of any kind!

Phil

I need to index products (hotels) which have a price per date, then search them by date or date range and price range. Is there a way to do that with Solr?

At the moment I have a document for each hotel, with values such as: http:///yyy, 1, Hotel Opera, 4 stars.

I would need to add my date/price values as nested values inside that document, but that is forbidden in Solr indexing. Otherwise I could define a default field (being an integer) and have as many fields as dates, holding values like 200 and 150; indexing would accept it, but I think I would not be able to search or sort by date.

The only solution I have found so far is to create a document for each date/price pair, e.g. one document with http:///yyy, 1, Hotel Opera, 30/01/2008, 200 and another with http:///yyy, 1, Hotel Opera, 31/01/2008, 150. Then I'll have many documents for one hotel, and in order to search by date range I would need even more documents, such as: 28/01/2008 to 31/01/2008, 29/01/2008 to 31/01/2008, 30/01/2008 to 31/01/2008.

Since I need to index a lot of other information about a hotel (address, telephone, amenities, etc.), I wouldn't like to duplicate too much information, and I think it would not be scalable to search first in a dates index and then in a hotels index to retrieve the hotel information.

Any idea?
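For the "as many fields as dates" idea mentioned above, a common trick is to encode the date into the field name and declare a matching dynamic field in the schema. The sketch below only illustrates that idea: the price_YYYYMMDD naming, and the assumption that the schema declares something like <dynamicField name="price_*" type="sint" indexed="true" stored="true"/> so that range queries behave numerically, are both hypothetical.

import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch of a "one integer field per date" convention, e.g. price_20080130.
public class PriceFieldExample {

    private static final SimpleDateFormat FMT = new SimpleDateFormat("yyyyMMdd");

    // Field that holds the price for one specific night.
    static String priceField(Date night) {
        return "price_" + FMT.format(night);
    }

    // Range query for hotels priced between min and max on that night,
    // e.g. price_20080130:[100 TO 250]
    static String priceRangeQuery(Date night, int min, int max) {
        return priceField(night) + ":[" + min + " TO " + max + "]";
    }

    public static void main(String[] args) {
        Date night = new Date();
        System.out.println(priceField(night));
        System.out.println(priceRangeQuery(night, 100, 250));
    }
}

The trade-off is one extra field per calendar date on each hotel document, which keeps a single document per hotel; whether that beats one document per hotel/date depends on how many dates need to be searchable at once.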
Re: Indexing very large files.
I don't think this is a StringBuilder limitation, but rather your Java JVM doesn't start with enough memory. i.e. -Xmx. In raw Lucene, I've indexed 240M files Best Erick On Jan 16, 2008 10:12 AM, David Thibault <[EMAIL PROTECTED]> wrote: > All, > I just found a thread about this on the mailing list archives because I'm > troubleshooting the same problem. The kicker is that it doesn't take such > large files to kill the StringBuilder. I have discovered the following: > > By using a text file made up of 3,443,464 bytes or less, I get no error. > > AT 3,443,465 bytes: > > > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space > >at java.lang.String.(String.java:208) > >at java.lang.StringBuilder.toString(StringBuilder.java:431) > >at org.junit.Assert.format(Assert.java:321) > >at org.junit.ComparisonFailure$ComparisonCompactor.compact( > ComparisonFailure.java:80) > >at org.junit.ComparisonFailure.getMessage(ComparisonFailure.java > :37) > >at java.lang.Throwable.getLocalizedMessage(Throwable.java:267) > >at java.lang.Throwable.toString(Throwable.java:344) > >at java.lang.String.valueOf(String.java:2615) > >at java.io.PrintWriter.print(PrintWriter.java:546) > >at java.io.PrintWriter.println(PrintWriter.java:683) > >at java.lang.Throwable.printStackTrace(Throwable.java:510) > >at org.apache.tools.ant.util.StringUtils.getStackTrace( > StringUtils.java:96) > >at > > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.getFilteredTrace > (JUnitTestRunner.java:856) > >at > > org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.formatError > (XMLJUnitResultFormatter.java:280) > >at > > org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.addError > (XMLJUnitResultFormatter.java:255) > >at > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner$4.addError( > JUnitTestRunner.java:988) > >at junit.framework.TestResult.addError(TestResult.java:38) > >at junit.framework.JUnit4TestAdapterCache$1.testFailure( > JUnit4TestAdapterCache.java:51) > >at org.junit.runner.notification.RunNotifier$4.notifyListener( > RunNotifier.java:96) > >at org.junit.runner.notification.RunNotifier$SafeNotifier.run( > RunNotifier.java:37) > >at org.junit.runner.notification.RunNotifier.fireTestFailure( > RunNotifier.java:93) > >at org.junit.internal.runners.TestMethodRunner.addFailure( > TestMethodRunner.java:104) > >at org.junit.internal.runners.TestMethodRunner.runUnprotected( > TestMethodRunner.java:87) > >at org.junit.internal.runners.BeforeAndAfterRunner.runProtected( > BeforeAndAfterRunner.java:34) > >at org.junit.internal.runners.TestMethodRunner.runMethod( > TestMethodRunner.java:75) > >at org.junit.internal.runners.TestMethodRunner.run( > TestMethodRunner.java:45) > >at > org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod( > TestClassMethodsRunner.java:71) > >at org.junit.internal.runners.TestClassMethodsRunner.run( > TestClassMethodsRunner.java:35) > >at org.junit.internal.runners.TestClassRunner$1.runUnprotected( > TestClassRunner.java:42) > >at org.junit.internal.runners.BeforeAndAfterRunner.runProtected( > BeforeAndAfterRunner.java:34) > >at org.junit.internal.runners.TestClassRunner.run( > TestClassRunner.java:52) > >at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:32) > > > > AT 3,443,466 byes (or more) : > > > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space > >at java.lang.AbstractStringBuilder.expandCapacity( > AbstractStringBuilder.java:99) > >at java.lang.AbstractStringBuilder.append( > 
AbstractStringBuilder.java > :393) > >at java.lang.StringBuilder.append(StringBuilder.java:120) > >at org.junit.Assert.format(Assert.java:321) > >at org.junit.ComparisonFailure$ComparisonCompactor.compact( > ComparisonFailure.java:80) > >at org.junit.ComparisonFailure.getMessage(ComparisonFailure.java > :37) > >at java.lang.Throwable.getLocalizedMessage(Throwable.java:267) > >at java.lang.Throwable.toString(Throwable.java:344) > >at java.lang.String.valueOf(String.java:2615) > >at java.io.PrintWriter.print(PrintWriter.java:546) > >at java.io.PrintWriter.println(PrintWriter.java:683) > >at java.lang.Throwable.printStackTrace(Throwable.java:510) > >at org.apache.tools.ant.util.StringUtils.getStackTrace( > StringUtils.java:96) > >at > > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.getFilteredTrace > (JUnitTestRunner.java:856) > >at > > org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.formatError > (XMLJUnitResultFormatter.java:280) > >at > > org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFo
Re: Indexing very large files.
P.S. Lucene by default limits the maximum field length to 10K tokens, so you have to bump that for large files. Erick On Jan 16, 2008 11:04 AM, Erick Erickson <[EMAIL PROTECTED]> wrote: > I don't think this is a StringBuilder limitation, but rather your Java > JVM doesn't start with enough memory. i.e. -Xmx. > > In raw Lucene, I've indexed 240M files > > Best > Erick > > > On Jan 16, 2008 10:12 AM, David Thibault <[EMAIL PROTECTED]> > wrote: > > > All, > > I just found a thread about this on the mailing list archives because > > I'm > > troubleshooting the same problem. The kicker is that it doesn't take > > such > > large files to kill the StringBuilder. I have discovered the following: > > > > > > By using a text file made up of 3,443,464 bytes or less, I get no > > error. > > > > AT 3,443,465 bytes: > > > > > > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space > > > >at java.lang.String .(String.java:208) > > > >at java.lang.StringBuilder.toString(StringBuilder.java:431) > > > >at org.junit.Assert.format(Assert.java:321) > > > >at org.junit.ComparisonFailure$ComparisonCompactor.compact ( > > ComparisonFailure.java:80) > > > >at org.junit.ComparisonFailure.getMessage(ComparisonFailure.java > > :37) > > > >at java.lang.Throwable.getLocalizedMessage(Throwable.java:267) > > > >at java.lang.Throwable.toString (Throwable.java:344) > > > >at java.lang.String.valueOf(String.java:2615) > > > >at java.io.PrintWriter.print(PrintWriter.java:546) > > > >at java.io.PrintWriter.println(PrintWriter.java:683) > > > >at java.lang.Throwable.printStackTrace(Throwable.java:510) > > > >at org.apache.tools.ant.util.StringUtils.getStackTrace( > > StringUtils.java:96) > > > >at > > > > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.getFilteredTrace > > (JUnitTestRunner.java:856) > > > >at > > > > org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.formatError > > (XMLJUnitResultFormatter.java:280) > > > >at > > > > org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.addError > > (XMLJUnitResultFormatter.java:255) > > > >at > > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner$4.addError( > > JUnitTestRunner.java:988) > > > >at junit.framework.TestResult.addError(TestResult.java :38) > > > >at junit.framework.JUnit4TestAdapterCache$1.testFailure( > > JUnit4TestAdapterCache.java:51) > > > >at org.junit.runner.notification.RunNotifier$4.notifyListener( > > RunNotifier.java:96) > > > >at org.junit.runner.notification.RunNotifier$SafeNotifier.run( > > RunNotifier.java:37) > > > >at org.junit.runner.notification.RunNotifier.fireTestFailure( > > RunNotifier.java:93) > > > >at org.junit.internal.runners.TestMethodRunner.addFailure ( > > TestMethodRunner.java:104) > > > >at org.junit.internal.runners.TestMethodRunner.runUnprotected( > > TestMethodRunner.java:87) > > > >at org.junit.internal.runners.BeforeAndAfterRunner.runProtected( > > BeforeAndAfterRunner.java:34) > > > >at org.junit.internal.runners.TestMethodRunner.runMethod( > > TestMethodRunner.java:75) > > > >at org.junit.internal.runners.TestMethodRunner.run( > > TestMethodRunner.java :45) > > > >at > > org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod( > > TestClassMethodsRunner.java:71) > > > >at org.junit.internal.runners.TestClassMethodsRunner.run( > > TestClassMethodsRunner.java :35) > > > >at org.junit.internal.runners.TestClassRunner$1.runUnprotected( > > TestClassRunner.java:42) > > > >at org.junit.internal.runners.BeforeAndAfterRunner.runProtected( > > 
BeforeAndAfterRunner.java:34) > > > >at org.junit.internal.runners.TestClassRunner.run( > > TestClassRunner.java:52) > > > >at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java > > :32) > > > > > > > > AT 3,443,466 byes (or more) : > > > > > > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space > > > >at java.lang.AbstractStringBuilder.expandCapacity( > > AbstractStringBuilder.java:99) > > > >at java.lang.AbstractStringBuilder.append ( > > AbstractStringBuilder.java > > :393) > > > >at java.lang.StringBuilder.append(StringBuilder.java:120) > > > >at org.junit.Assert.format(Assert.java:321) > > > >at org.junit.ComparisonFailure$ComparisonCompactor.compact ( > > ComparisonFailure.java:80) > > > >at org.junit.ComparisonFailure.getMessage(ComparisonFailure.java > > :37) > > > >at java.lang.Throwable.getLocalizedMessage(Throwable.java:267) > > > >at java.lang.Throwable.toString (Throwable.java:344) > > > >at java.lang.String.valueOf(String.java:2615) > > > >at java.io.PrintWriter.print(PrintWriter.java:546) > > > >at java.io.PrintWriter.println(PrintW
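To make Erick's PS above concrete for the raw-Lucene case: the 10,000-token limit can be raised on the IndexWriter. This is only a sketch against the Lucene 2.x API of that era, with a made-up index path; in Solr the equivalent knob is the maxFieldLength setting in solrconfig.xml.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

// Raise Lucene's per-field token limit before indexing large documents.
// Anything past the limit is silently dropped, so this only matters if the
// whole file needs to be searchable.
public class RaiseFieldLimit {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("/tmp/test-index"),   // hypothetical path
                new StandardAnalyzer(),
                true /* create */);
        writer.setMaxFieldLength(Integer.MAX_VALUE);           // default is 10,000 tokens
        // ... addDocument() calls ...
        writer.close();
    }
}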
Re: Indexing very large files.
I think your PS might do the trick. My JVM doesn't seem to be the issue, because I've set it to -Xmx512m -Xms256m. I will track down the solr config parameter you mentioned and try that. Thanks for the quick response! Dave On 1/16/08, Erick Erickson <[EMAIL PROTECTED]> wrote: > > P.S. Lucene by default limits the maximum field length > to 10K tokens, so you have to bump that for large files. > > Erick > > On Jan 16, 2008 11:04 AM, Erick Erickson <[EMAIL PROTECTED]> wrote: > > > I don't think this is a StringBuilder limitation, but rather your Java > > JVM doesn't start with enough memory. i.e. -Xmx. > > > > In raw Lucene, I've indexed 240M files > > > > Best > > Erick > > > > > > On Jan 16, 2008 10:12 AM, David Thibault <[EMAIL PROTECTED]> > > wrote: > > > > > All, > > > I just found a thread about this on the mailing list archives because > > > I'm > > > troubleshooting the same problem. The kicker is that it doesn't take > > > such > > > large files to kill the StringBuilder. I have discovered the > following: > > > > > > > > > By using a text file made up of 3,443,464 bytes or less, I get no > > > error. > > > > > > AT 3,443,465 bytes: > > > > > > > > > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space > > > > > >at java.lang.String .(String.java:208) > > > > > >at java.lang.StringBuilder.toString(StringBuilder.java:431) > > > > > >at org.junit.Assert.format(Assert.java:321) > > > > > >at org.junit.ComparisonFailure$ComparisonCompactor.compact ( > > > ComparisonFailure.java:80) > > > > > >at org.junit.ComparisonFailure.getMessage( > ComparisonFailure.java > > > :37) > > > > > >at java.lang.Throwable.getLocalizedMessage(Throwable.java:267) > > > > > >at java.lang.Throwable.toString (Throwable.java:344) > > > > > >at java.lang.String.valueOf(String.java:2615) > > > > > >at java.io.PrintWriter.print(PrintWriter.java:546) > > > > > >at java.io.PrintWriter.println(PrintWriter.java:683) > > > > > >at java.lang.Throwable.printStackTrace(Throwable.java:510) > > > > > >at org.apache.tools.ant.util.StringUtils.getStackTrace( > > > StringUtils.java:96) > > > > > >at > > > > > > > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.getFilteredTrace > > > (JUnitTestRunner.java:856) > > > > > >at > > > > > > > org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.formatError > > > (XMLJUnitResultFormatter.java:280) > > > > > >at > > > > > > > org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.addError > > > (XMLJUnitResultFormatter.java:255) > > > > > >at > > > > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner$4.addError( > > > JUnitTestRunner.java:988) > > > > > >at junit.framework.TestResult.addError(TestResult.java :38) > > > > > >at junit.framework.JUnit4TestAdapterCache$1.testFailure( > > > JUnit4TestAdapterCache.java:51) > > > > > >at org.junit.runner.notification.RunNotifier$4.notifyListener( > > > RunNotifier.java:96) > > > > > >at org.junit.runner.notification.RunNotifier$SafeNotifier.run( > > > RunNotifier.java:37) > > > > > >at org.junit.runner.notification.RunNotifier.fireTestFailure( > > > RunNotifier.java:93) > > > > > >at org.junit.internal.runners.TestMethodRunner.addFailure ( > > > TestMethodRunner.java:104) > > > > > >at org.junit.internal.runners.TestMethodRunner.runUnprotected( > > > TestMethodRunner.java:87) > > > > > >at org.junit.internal.runners.BeforeAndAfterRunner.runProtected > ( > > > BeforeAndAfterRunner.java:34) > > > > > >at org.junit.internal.runners.TestMethodRunner.runMethod( > > > 
TestMethodRunner.java:75) > > > > > >at org.junit.internal.runners.TestMethodRunner.run( > > > TestMethodRunner.java :45) > > > > > >at > > > org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod( > > > TestClassMethodsRunner.java:71) > > > > > >at org.junit.internal.runners.TestClassMethodsRunner.run( > > > TestClassMethodsRunner.java :35) > > > > > >at org.junit.internal.runners.TestClassRunner$1.runUnprotected( > > > TestClassRunner.java:42) > > > > > >at org.junit.internal.runners.BeforeAndAfterRunner.runProtected > ( > > > BeforeAndAfterRunner.java:34) > > > > > >at org.junit.internal.runners.TestClassRunner.run( > > > TestClassRunner.java:52) > > > > > >at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java > > > :32) > > > > > > > > > > > > AT 3,443,466 byes (or more) : > > > > > > > > > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space > > > > > >at java.lang.AbstractStringBuilder.expandCapacity( > > > AbstractStringBuilder.java:99) > > > > > >at java.lang.AbstractStringBuilder.append ( > > > AbstractStringBuilder.java > > > :393) > > > > > >at java.lang.StringBuilder.append(StringBuilder.jav
Re: Indexing very large files.
I tried raising the 10000 maxFieldLength limit under indexDefaults as well as mainIndex, and still no luck. I'm trying to upload a text file that is about 8 MB in size. I think the following stack trace still points to some sort of overflowed String issue. Thoughts?

Solr returned an error: Java heap space

java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232)
    at java.lang.StringCoding.encode(StringCoding.java:272)
    at java.lang.String.getBytes(String.java:947)
    at org.apache.lucene.index.FieldsWriter.addDocument(FieldsWriter.java:98)
    at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:107)
    at org.apache.lucene.index.IndexWriter.buildSingleDocSegment(IndexWriter.java:977)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:965)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:947)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:270)
    at org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:166)
    at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
    at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
    at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
    at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
    at java.lang.Thread.run(Thread.java:619)

java.io.IOException: Server returned HTTP response code: 500 for URL: http://solr:8080/solr/update
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1170)
    at com.itstrategypartners.sents.solrUpload.SimplePostTool.postData(SimplePostTool.java:134)
    at com.itstrategypartners.sents.solrUpload.SimplePostTool.postFile(SimplePostTool.java:87)
    at com.itstrategypartners.sents.solrUpload.Uploader.uploadFile(Uploader.java:97)
    at com.itstrategypartners.sents.solrUpload.UploaderTest.uploadFile(UploaderTest.java:95)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.junit.internal.runners.TestMethodRunner.executeMethodBody(TestMethodRunner.java:99)
    at org.junit.internal.runners.TestMethodRunner.runUnprotected(TestMethodRunner.java:81)
    at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
    at org.junit.internal.runners.TestMethodRunner.runMethod(TestMethodRunner.java:75)
    at org.junit.internal.runners.TestMethodRunner.run(TestMethodRunner.java:45)
    at org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod(TestClassMethodsRunner.java:71)
    at org.junit.internal.runners.TestClassMethodsRunner.run(TestClassMethodsRunner.java:35)
    at org.junit.internal.runners.TestClassRunner$1.runUnprotected(TestClassRunner.java:42)
    at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
    at org.junit.internal.runners.TestClassRunner.run(TestClassRunner.java:52)
    at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:32)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:421)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:912)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:766)

On 1/16/08, David Thibault <[EMAIL PROT
Re: Indexing very large files.
The PS really wasn't related to your OOM, and raising that shouldn't have changed the behavior. All that happens if you go beyond 10,000 tokens is that the rest gets thrown away. But we're beyond my real knowledge level about SOLR, so I'll defer to others. A very quick-n-dirty test as to whether you're actually allocation more memory to the process you *think* you are would be to bump it ridiculously higher. I'm completely unclear about what process gets the increased memory relative to the server. [EMAIL PROTECTED] On Jan 16, 2008 11:33 AM, David Thibault <[EMAIL PROTECTED]> wrote: > I tried raising the 1 under > as well as and still no luck. I'm trying to > upload a text file that is about 8 MB in size. I think the following > stack > trace still points to some sort of overflowed String issue. Thoughts? > Solr returned an error: Java heap space java.lang.OutOfMemoryError: Java > heap space > at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232) > at java.lang.StringCoding.encode(StringCoding.java:272) > at java.lang.String.getBytes(String.java:947) > at org.apache.lucene.index.FieldsWriter.addDocument(FieldsWriter.java:98) > at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java > :107) > > at org.apache.lucene.index.IndexWriter.buildSingleDocSegment( > IndexWriter.java:977) > at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:965) > at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:947) > at org.apache.solr.update.DirectUpdateHandler2.addDoc( > DirectUpdateHandler2.java:270) > at org.apache.solr.handler.XmlUpdateRequestHandler.update( > XmlUpdateRequestHandler.java:166) > at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody( > XmlUpdateRequestHandler.java:84) > at org.apache.solr.handler.RequestHandlerBase.handleRequest( > RequestHandlerBase.java:77) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:658) at > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java > :191) > > at org.apache.solr.servlet.SolrDispatchFilter.doFilter( > SolrDispatchFilter.java:159) > at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter( > ApplicationFilterChain.java:215) > at org.apache.catalina.core.ApplicationFilterChain.doFilter( > ApplicationFilterChain.java:188) > at org.apache.catalina.core.StandardWrapperValve.invoke( > StandardWrapperValve.java:213) > at org.apache.catalina.core.StandardContextValve.invoke( > StandardContextValve.java:174) > at org.apache.catalina.core.StandardHostValve.invoke( > StandardHostValve.java > :127) > at org.apache.catalina.valves.ErrorReportValve.invoke( > ErrorReportValve.java > :117) > at org.apache.catalina.core.StandardEngineValve.invoke( > StandardEngineValve.java:108) > at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java > :151) > > at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java > :874) > > at > > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection > (Http11BaseProtocol.java:665) > at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket( > PoolTcpEndpoint.java:528) > at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt( > LeaderFollowerWorkerThread.java:81) > at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run( > ThreadPool.java:689) > at java.lang.Thread.run(Thread.java:619) > > java.io.IOException: Server returned HTTP response code: 500 for URL: > http://solr:8080/solr/update >at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream( > HttpURLConnection.java:1170) >at com.itstrategypartners.sents.solrUpload.SimplePostTool.postData( > SimplePostTool.java:134) >at com.itstrategypartners.sents.solrUpload.SimplePostTool.postFile( > SimplePostTool.java:87) >at com.itstrategypartners.sents.solrUpload.Uploader.uploadFile( > Uploader.java:97) >at com.itstrategypartners.sents.solrUpload.UploaderTest.uploadFile( > UploaderTest.java:95) >at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >at sun.reflect.NativeMethodAccessorImpl.invoke( > NativeMethodAccessorImpl.java:39) >at sun.reflect.DelegatingMethodAccessorImpl.invoke( > DelegatingMethodAccessorImpl.java:25) >at java.lang.reflect.Method.invoke(Method.java:585) >at org.junit.internal.runners.TestMethodRunner.executeMethodBody( > TestMethodRunner.java:99) >at org.junit.internal.runners.TestMethodRunner.runUnprotected( > TestMethodRunner.java:81) >at org.junit.internal.runners.BeforeAndAfterRunner.runProtected( > BeforeAndAfterRunner.java:34) >at org.junit.internal.runners.TestMethodRunner.runMethod( > TestMethodRunner.java:75) >at org.junit.internal.runners.TestMethodRunner.run( > TestMethodRunner.java:45) >at > org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod( > TestClassMethodsRunner.java:71) >at org.junit.internal.runne
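A minimal version of the "are you actually giving the memory to the process you think you are" check Erick suggests above: have the JVM that runs Solr report its own heap limit (for example from a small JSP or servlet inside the webapp) and compare it with the -Xmx you believe was passed to Tomcat. The class below is just a sketch of the two calls involved.

// Report how much heap the current JVM can and does use.
public class HeapCheck {
    public static void main(String[] args) {
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        long totalMb = Runtime.getRuntime().totalMemory() / (1024 * 1024);
        System.out.println("Max heap this JVM will use: " + maxMb + " MB");
        System.out.println("Heap currently allocated:   " + totalMb + " MB");
    }
}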
Re: Indexing very large files.
This error means that the JVM has run out of heap space. Increase the heap space. That is an option on the "java" command. I set my heap to 200 Meg and do it this way with Tomcat 6: JAVA_OPTS="-Xmx600M" tomcat/bin/startup.sh wunder On 1/16/08 8:33 AM, "David Thibault" <[EMAIL PROTECTED]> wrote: > java.lang.OutOfMemoryError: Java heap space
Re: Indexing very large files.
Nice signature...=) On 1/16/08, Erick Erickson <[EMAIL PROTECTED]> wrote: > > The PS really wasn't related to your OOM, and raising that shouldn't > have changed the behavior. All that happens if you go beyond 10,000 > tokens is that the rest gets thrown away. > > But we're beyond my real knowledge level about SOLR, so I'll defer > to others. A very quick-n-dirty test as to whether you're actually > allocation more memory to the process you *think* you are would be > to bump it ridiculously higher. I'm completely unclear about what > process gets the increased memory relative to the server. > > [EMAIL PROTECTED] > > > On Jan 16, 2008 11:33 AM, David Thibault <[EMAIL PROTECTED]> > wrote: > > > I tried raising the 1 under > > as well as and still no luck. I'm trying to > > upload a text file that is about 8 MB in size. I think the following > > stack > > trace still points to some sort of overflowed String issue. Thoughts? > > Solr returned an error: Java heap space java.lang.OutOfMemoryError: > Java > > heap space > > at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232) > > at java.lang.StringCoding.encode(StringCoding.java:272) > > at java.lang.String.getBytes(String.java:947) > > at org.apache.lucene.index.FieldsWriter.addDocument(FieldsWriter.java > :98) > > at org.apache.lucene.index.DocumentWriter.addDocument( > DocumentWriter.java > > :107) > > > > at org.apache.lucene.index.IndexWriter.buildSingleDocSegment( > > IndexWriter.java:977) > > at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:965) > > at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:947) > > at org.apache.solr.update.DirectUpdateHandler2.addDoc( > > DirectUpdateHandler2.java:270) > > at org.apache.solr.handler.XmlUpdateRequestHandler.update( > > XmlUpdateRequestHandler.java:166) > > at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody( > > XmlUpdateRequestHandler.java:84) > > at org.apache.solr.handler.RequestHandlerBase.handleRequest( > > RequestHandlerBase.java:77) > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:658) at > > org.apache.solr.servlet.SolrDispatchFilter.execute( > SolrDispatchFilter.java > > :191) > > > > at org.apache.solr.servlet.SolrDispatchFilter.doFilter( > > SolrDispatchFilter.java:159) > > at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter( > > ApplicationFilterChain.java:215) > > at org.apache.catalina.core.ApplicationFilterChain.doFilter( > > ApplicationFilterChain.java:188) > > at org.apache.catalina.core.StandardWrapperValve.invoke( > > StandardWrapperValve.java:213) > > at org.apache.catalina.core.StandardContextValve.invoke( > > StandardContextValve.java:174) > > at org.apache.catalina.core.StandardHostValve.invoke( > > StandardHostValve.java > > :127) > > at org.apache.catalina.valves.ErrorReportValve.invoke( > > ErrorReportValve.java > > :117) > > at org.apache.catalina.core.StandardEngineValve.invoke( > > StandardEngineValve.java:108) > > at org.apache.catalina.connector.CoyoteAdapter.service( > CoyoteAdapter.java > > :151) > > > > at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java > > :874) > > > > at > > > > > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection > > (Http11BaseProtocol.java:665) > > at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket( > > PoolTcpEndpoint.java:528) > > at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt( > > LeaderFollowerWorkerThread.java:81) > > at 
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run( > > ThreadPool.java:689) > > at java.lang.Thread.run(Thread.java:619) > > > > java.io.IOException: Server returned HTTP response code: 500 for URL: > > http://solr:8080/solr/update > >at sun.net.www.protocol.http.HttpURLConnection.getInputStream( > > HttpURLConnection.java:1170) > >at > com.itstrategypartners.sents.solrUpload.SimplePostTool.postData( > > SimplePostTool.java:134) > >at > com.itstrategypartners.sents.solrUpload.SimplePostTool.postFile( > > SimplePostTool.java:87) > >at com.itstrategypartners.sents.solrUpload.Uploader.uploadFile( > > Uploader.java:97) > >at > com.itstrategypartners.sents.solrUpload.UploaderTest.uploadFile( > > UploaderTest.java:95) > >at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > >at sun.reflect.NativeMethodAccessorImpl.invoke( > > NativeMethodAccessorImpl.java:39) > >at sun.reflect.DelegatingMethodAccessorImpl.invoke( > > DelegatingMethodAccessorImpl.java:25) > >at java.lang.reflect.Method.invoke(Method.java:585) > >at org.junit.internal.runners.TestMethodRunner.executeMethodBody( > > TestMethodRunner.java:99) > >at org.junit.internal.runners.TestMethodRunner.runUnprotected( > > TestMethodRunner.java:81) > >at org.junit.internal.runners.BeforeAndAfterRunner.runProtected( > > BeforeAndAfterRunner.java:34) > >at org.junit.intern
Re: Indexing very large files.
Walter and all,

I had been bumping up the heap for my Java app (running outside of Tomcat), but I hadn't yet tried bumping up my Tomcat heap. That seems to have helped me upload the 8 MB file, but it's crashing while uploading a 32 MB file now. I just bumped Tomcat to 1024 MB of heap, so I'm not sure what the problem is now. I suspect Walter was on to something, since it sort of fixed my problem. I will keep troubleshooting the Tomcat memory and go from there.

Best,
Dave

On 1/16/08, Walter Underwood <[EMAIL PROTECTED]> wrote:
>
> This error means that the JVM has run out of heap space. Increase the
> heap space. That is an option on the "java" command. I set my heap to
> 200 Meg and do it this way with Tomcat 6:
>
> JAVA_OPTS="-Xmx600M" tomcat/bin/startup.sh
>
> wunder
>
> On 1/16/08 8:33 AM, "David Thibault" <[EMAIL PROTECTED]> wrote:
>
> > java.lang.OutOfMemoryError: Java heap space
>
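A rough back-of-the-envelope on why the file size matters so much (the figures are estimates, not measurements from this thread): the server-side trace earlier dies in String.getBytes() inside FieldsWriter.addDocument(), i.e. while the whole document body is held as one Java String. A 32 MB text file is roughly 32 million characters, so about 64 MB of char data, and during the update the server may also hold the raw POST body, the XML parser's copy of the field, and the byte[] produced by getBytes(), so peak per-request usage can easily reach several times the file size. If a 1024 MB heap still fails on a 32 MB upload, it is worth confirming (for example with the Runtime.maxMemory() check sketched above) that the new -Xmx is actually reaching the Tomcat JVM, and how much of that heap the caches and searchers are already using.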
Re: Solr in a distributed multi-machine high-performance environment
Thanks for that Shalin. Looks like I have to wait and keep track of developments.

Forgetting about indexes that cannot fit on a single machine (distributed search), are there any links on running Solr in a 2-machine environment? I want to measure how much improvement there will be in performance with the addition of machines for computation (space later), and I need a 2-machine setup for that.

Thanks,
Srikant

Shalin Shekhar Mangar wrote:

> Look at http://issues.apache.org/jira/browse/SOLR-303
>
> Please note that it is still work in progress. So you may not be able to use
> it immediately.
Re: Cache size and Heap size
I'm using Tomcat. I set Max Size = 5 GB and I checked in a profiler that it actually uses the whole memory. There is no significant memory use by other applications. The whole change was that I increased the size of the cache to:

LRU Cache(maxSize=1048576, initialSize=1048576, autowarmCount=524288, [EMAIL PROTECTED])

I know this is a lot and I'm going to decrease it; I was just experimenting. But I need some guidelines on how to calculate the right size of the cache.

Thank you,
Gene

----- Original Message ----
From: Daniel Alheiros <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, January 16, 2008 10:48:50 AM
Subject: Re: Cache size and Heap size

Hi Gene.

Have you set your app server / servlet container to allocate some of this memory to be used? You can define the maximum and minimum heap size adding/replacing some parameters on the app server initialization:

-Xmx1536m -Xms1536m

Which app server / servlet container are you using?

Regards,
Daniel Alheiros

On 16/1/08 15:23, "Evgeniy Strokin" <[EMAIL PROTECTED]> wrote:

> Hello,
> I have relatively large RAM (10Gb) on my server which is running Solr. I
> increased Cache settings and start to see OutOfMemory exceptions, specially on
> facet search.
> Is anybody has some suggestions how Cache settings related to Memory
> consumptions? What are optimal settings? How they could be calculated?
>
> Thank you for any advise,
> Gene
Re: Solr in a distributed multi-machine high-performance environment
Solr provides a few scripts to create a multiple-machine deployment. One box is setup as the master (used primarily for writes) and others as slaves. Slaves are added as per application requirements. The index is transferred using rsync. Look at http://wiki.apache.org/solr/CollectionDistribution for details. You can put the slaves behind a load balancer or share the slaves among your front-end servers to measure performance. On Jan 17, 2008 12:39 AM, Srikant Jakilinki <[EMAIL PROTECTED]> wrote: > Thanks for that Shalin. Looks like I have to wait and keep track of > developments. > > Forgetting about indexes that cannot be fit on a single machine > (distributed search), any links to have Solr running in a 2-machine > environment? I want to measure how much improvement there will be in > performance with the addition of machines for computation (space later) > and I need a 2-machine setup for that. > > Thanks > Srikant > > Shalin Shekhar Mangar wrote: > > Look at http://issues.apache.org/jira/browse/SOLR-303 > > > > Please note that it is still work in progress. So you may not be able to > use > > it immeadiately. > > > > -- > Find out how you can get spam free email. > http://www.bluebottle.com/tag/3 > > -- Regards, Shalin Shekhar Mangar.
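A minimal sketch of the "share the slaves among your front-end servers" option mentioned above (the slave host names are made up, and a real load balancer with health checking would replace this in production): each front end simply picks a query slave round-robin.

import java.util.concurrent.atomic.AtomicInteger;

// Round-robin selection of a query slave on the front end.
public class SlavePicker {
    private final String[] slaves = {
        "http://slave1:8983/solr/select",   // hypothetical hosts
        "http://slave2:8983/solr/select"
    };
    private final AtomicInteger next = new AtomicInteger(0);

    public String pick() {
        int i = Math.abs(next.getAndIncrement() % slaves.length);
        return slaves[i];
    }
}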
Re: Solr in a distributed multi-machine high-performance environment
On 16-Jan-08, at 11:09 AM, Srikant Jakilinki wrote:

> Thanks for that Shalin. Looks like I have to wait and keep track of developments.
>
> Forgetting about indexes that cannot be fit on a single machine (distributed search), any links to have Solr running in a 2-machine environment? I want to measure how much improvement there will be in performance with the addition of machines for computation (space later) and I need a 2-machine setup for that.

If you are looking for automatic replication and load-balancing across multiple machines, Solr does not provide that. The typical strategy is as follows: index half the documents on one machine and half on another. Execute both queries simultaneously (using threads, f.i.), and combine the results. You should observe a speed up.

-Mike
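A bare-bones sketch of "execute both queries simultaneously and combine the results" (the host names and the query are made up, and the merging/re-sorting of the two result sets is left to the application; nothing in this setup does it for you):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Send the same query to two Solr instances in parallel and collect the raw responses.
public class TwoMachineQuery {

    static String fetch(String url) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), "UTF-8"));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            sb.append(line).append('\n');
        }
        in.close();
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        final String params = "?q=solr&rows=10";                 // hypothetical query
        String[] shards = { "http://box1:8983/solr/select",      // half the docs
                            "http://box2:8983/solr/select" };    // the other half

        ExecutorService pool = Executors.newFixedThreadPool(shards.length);
        List<Future<String>> responses = new ArrayList<Future<String>>();
        for (final String shard : shards) {
            responses.add(pool.submit(new Callable<String>() {
                public String call() throws Exception {
                    return fetch(shard + params);
                }
            }));
        }
        for (Future<String> f : responses) {
            System.out.println(f.get());   // combine/re-rank the two result sets here
        }
        pool.shutdown();
    }
}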
Re: Solr in a distributed multi-machine high-performance environment
On 15-Jan-08, at 9:23 PM, Srikant Jakilinki wrote: 2) Solr that has to handle a large collective index which has to be split up on multi-machines - The index is ever increasing (TB scale) and dynamic and all of it has to be searched at any point This will require significant development on your part. Nutch may be able to provide more of what you need OOB. 3) Solr that has to exploit multi-machines because we have plenty of them in a tightly coupled P2P scenario - Machines are not a problem but will they be if they are of varied configurations (PIII to Core2; Linux to Vista; 32-bit to 64-bit; J2SE 1.1 to 1.6) Solr requires java 1.5, lucene requires java 1.4. Also, there is certainly no point mixing PIII's and modern cpus: trying to achieve the appropriate balance between machines of such disparate capability will take much more effort than will be gained out of using them. -Mike
Re: Cache size and Heap size
On 16-Jan-08, at 11:15 AM, [EMAIL PROTECTED] wrote: I'm using Tomcat. I set Max Size = 5Gb and I checked in profiler that it's actually uses whole memory. There is no significant memory use by other applications. Whole change was I increased the size of cache to: LRU Cache(maxSize=1048576, initialSize=1048576, autowarmCount=524288, [EMAIL PROTECTED]) autowarmcount > maxSize certainly doesn't make sense. I know this is a lot and I'm going to decrease it, I was just experimenting, but I need some guidelines of how to calculate the right size of the cache. Each filter that matches more than ~3000 documents will occupy maxDocs/8 bytes of memory. Certain kinds of faceting require one entry per unique value in a field. The best way to tune this is to monitor your cache hit/expunge statistics for the filter cache (on the solr admin statistics screen). -Mike
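To make the maxDocs/8 rule concrete, here is a back-of-the-envelope estimate for the filterCache Gene configured, using a hypothetical 10-million-document index (the real maxDocs isn't given in the thread):

```java
// Rough filterCache sizing based on the maxDocs/8 rule above.
// The 10M-document index size is a made-up example; use the maxDocs
// reported on your own statistics page.
public class FilterCacheEstimate {
    public static void main(String[] args) {
        long maxDocs = 10000000L;          // hypothetical index size
        long cacheEntries = 1048576L;      // the maxSize Gene configured
        long bytesPerFilter = maxDocs / 8; // bitset cost of one cached filter
        long worstCaseBytes = cacheEntries * bytesPerFilter;
        System.out.println("per filter: " + (bytesPerFilter / 1024) + " KB");
        System.out.println("worst case: " + (worstCaseBytes / (1024L * 1024 * 1024)) + " GB");
        // ~1.2 MB per filter, and over a thousand gigabytes if the cache ever
        // filled up (ignoring the small-set optimization for filters matching
        // fewer than ~3000 docs).
    }
}
```

Even as a crude upper bound, this shows why a maxSize of 1048576 can't be justified by available RAM; the hit/expunge statistics Mike mentions are the right way to pick the size.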
Re: Problem with dismax handler when searching Solr along with field
On 16-Jan-08, at 3:15 AM, farhanali wrote: when I search a query, for example http://localhost:8983/solr/select/?q=category&qt=dismax it gives results, but when I want to search on the basis of a field name like http://localhost:8983/solr/select/?q=maincategory:Cars&qt=dismax it does not give results; however http://localhost:8983/solr/select/?q=maincategory:Cars returns results of cars from the field named "maincategory" Anyone have some idea??? The dismax handler does not allow you to use lucene query syntax. The qf parameter must be used to select the fields to query (alternatively, you can provide a lucene-style query in an fq filter). See the documentation here: http://wiki.apache.org/solr/DisMaxRequestHandler -Mike
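For illustration, the two fixes Mike describes could look like the requests below (passing maincategory via qf, or via an fq filter, is an assumption built on farhanali's example URLs, not something taken from his configuration):

```
http://localhost:8983/solr/select/?q=Cars&qt=dismax&qf=maincategory
http://localhost:8983/solr/select/?q=category&qt=dismax&fq=maincategory:Cars
```

The first searches the maincategory field through dismax's qf; the second keeps the dismax query as-is and applies the lucene-syntax restriction as a cached filter.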
Spell checker index rebuild
Having another weird spell checker index issue. Starting off from a clean index and spell check index, I'll index everything in example/exampledocs. After the first rebuild of the spellchecker index using the query below, the word 'blackjack' exists in the spellchecker index. Great, no problems. Rebuild it again and the word 'blackjack' does not exist any more. http://localhost:8983/solr/core0/select?q=blackjack&qt=spellchecker&cmd=rebuild Any ideas? This is with a Solr trunk build from yesterday. doug
IOException: read past EOF during optimize phase
I am using the embedded Solr API for my indexing process. I created a brand new index with my application without any problem. I then ran my indexer in incremental mode. This process copies the working index to a temporary Solr location, adds/updates any records, optimizes the index, and then copies it back to the working location. There are currently not any instances of Solr reading this index. Also, I commit after every 10 rows. The schema.xml and solrconfig.xml files have not changed. Here is my function call. protected void optimizeProducts() throws IOException { UpdateHandler updateHandler = m_SolrCore.getUpdateHandler(); CommitUpdateCommand commitCmd = new CommitUpdateCommand(true); commitCmd.optimize = true; updateHandler.commit(commitCmd); log.info("Optimized index"); } So, during the optimize phase, I get the following stack trace: java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:89) at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34) at org.apache.lucene.store.IndexInput.readChars(IndexInput.java:107) at org.apache.lucene.store.IndexInput.readString(IndexInput.java:93) at org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:211) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:119) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:323) at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:206) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1835) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1195) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:508) at ... There are no exceptions or anything else that appears to be incorrect during the adds or commits. After this, the index files are still non-optimized. I know there is not a whole lot to go on here. Anything in particular that I should look at?
Re: Big number of conditions of the search
I see,.. but I really need to run it on Solr. We have already indexed everything. I don't really want to construct a query with 1K OR conditions, and send it to Solr to parse first and run after. Maybe there is a way to go directly to Lucene, or Solr, and run such a query from Java, passing an array of IDs, or something like this? Could anybody give me some advice on how to do this in a better way? Thank you Gene - Original Message From: Otis Gospodnetic <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Friday, January 11, 2008 12:26:14 AM Subject: Re: Big number of conditions of the search Evgeniy - sounds like a problem best suited for an RDBMS, really. You can run such an OR query, but you'll have to manually increase the max number of clauses allowed (in one of the configs) and make sure the JVM has plenty of memory. But again, this is best done in an RDBMS with some count(*) and GROUP BY selects. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Evgeniy Strokin <[EMAIL PROTECTED]> To: Solr User Sent: Thursday, January 10, 2008 4:39:44 PM Subject: Big number of conditions of the search Hello, I don't know how to formulate this right, so I'll give an example: I have 20 million documents with a unique ID indexed. I have a list of IDs stored somewhere. I need to run a query which will take documents with IDs from my list and give me some statistics. For example: my documents are addresses with a unique ID. I have a list which contains 10 thousand IDs of some addresses. I need to find how many addresses from my list are in NJ. Or another scenario: give me all states my addresses are from and how many addresses are in each state (only addresses from my list)? So I was thinking I could run a facet search by the field "State", but my query would be like this: ID:123 OR ID:23987 OR ID:294343 10K such OR conditions in a row, which is ridiculous and not even possible, I think. Could somebody suggest some solution for this? Thank you Gene
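As a rough illustration of Otis's suggestion (run the big OR query after raising the clause limit), the sketch below sends the ID list as a single filter query and facets on the State field with rows=0. It assumes the 1.3-dev SolrJ client, Gene's field names (ID, State), and that maxBooleanClauses in solrconfig.xml has been raised above the number of IDs; loadIdsFromSomewhere() is a placeholder for wherever the list really lives.

```java
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetByIdList {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

        List<String> ids = loadIdsFromSomewhere(); // ~10K IDs from the external list

        // Build one big field-grouped OR clause: ID:(123 OR 23987 OR ...)
        StringBuilder fq = new StringBuilder("ID:(");
        for (int i = 0; i < ids.size(); i++) {
            if (i > 0) fq.append(" OR ");
            fq.append(ids.get(i));
        }
        fq.append(")");

        SolrQuery q = new SolrQuery("*:*");
        q.addFilterQuery(fq.toString()); // ends up in the filterCache after first use
        q.setRows(0);                    // only the facet counts are needed
        q.setFacet(true);
        q.addFacetField("State");

        QueryResponse rsp = server.query(q);
        for (FacetField.Count c : rsp.getFacetField("State").getValues()) {
            System.out.println(c.getName() + ": " + c.getCount());
        }
    }

    private static List<String> loadIdsFromSomewhere() {
        throw new UnsupportedOperationException("placeholder for the real ID source");
    }
}
```

Because the fq is cached, repeating the query with different facet fields (State, then Last_Name, etc.) only pays the cost of the huge boolean query once.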
Re: Indexing very large files.
OK, I have now bumped my tomcat JVM up to 1024MB min and 1500MB max. For some reason Walter's suggestion helped me get past the 8MB file upload to Solr but it's still choking on a 32MB file. Is there a way to set per-webapp JVM settings in tomcat, or is the overall tomcat JVM sufficient to set? I can't see anything in the tomcat manager to suggest that there are smaller memory limitations for solr or any other webapp (all the demo webapps that tomcat comes with are still there right now). Here's the trace I get when I try to upload the 32MB file: java.lang.OutOfMemoryError: Java heap space at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java :95) at sun.net.www.http.PosterOutputStream.write(PosterOutputStream.java :61) at sun.nio.cs.StreamEncoder$CharsetSE.writeBytes(StreamEncoder.java :336) at sun.nio.cs.StreamEncoder$CharsetSE.implWrite(StreamEncoder.java :395) at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:136) at java.io.OutputStreamWriter.write(OutputStreamWriter.java:191) at com.itstrategypartners.sents.solrUpload.SimplePostTool.pipe( SimplePostTool.java:167) at com.itstrategypartners.sents.solrUpload.SimplePostTool.postData( SimplePostTool.java:125) at com.itstrategypartners.sents.solrUpload.SimplePostTool.postFile( SimplePostTool.java:87) at com.itstrategypartners.sents.solrUpload.Uploader.uploadFile( Uploader.java:97) at com.itstrategypartners.sents.solrUpload.UploaderTest.uploadFile( UploaderTest.java:95) Any more thoughts on possible causes? Best, Dave On 1/16/08, David Thibault <[EMAIL PROTECTED]> wrote: > > Walter and all, > > I had been bumping up the heap for my Java app (running outside of Tomcat) > but I hadn't yet tried bumping up my Tomcat heap. That seems to have helped > me upload the 8MB file, but it's crashing while uploading a 32MB file now. I > Just bumped tomcat to 1024MB of heap, so I'm not sure what the problem is > now. I suspect Walter was on to something, since it sort of fixed my > problem. I will keep troubleshooting the Tomcat memory and go from there.. > > > Best, > Dave > > On 1/16/08, Walter Underwood < [EMAIL PROTECTED]> wrote: > > > > This error means that the JVM has run out of heap space. Increase the > > heap space. That is an option on the "java" command. I set my heap to > > 200 Meg and do it this way with Tomcat 6: > > > > JAVA_OPTS="-Xmx600M" tomcat/bin/startup.sh > > > > wunder > > > > On 1/16/08 8:33 AM, "David Thibault" < [EMAIL PROTECTED]> > > wrote: > > > > > java.lang.OutOfMemoryError: Java heap space > > > > > >
RE: Indexing very large files.
I think you should try isolating the problem. It may turn out that the problem isn't really to do with Solr, but file uploading. I'm no expert, but that's what I'd try out in such situation. Cheers, Timothy Wonil Lee http://timundergod.blogspot.com/ http://www.google.com/reader/shared/16849249410805339619 -Original Message- From: David Thibault [mailto:[EMAIL PROTECTED] Sent: Thursday, 17 January 2008 8:30 AM To: solr-user@lucene.apache.org Subject: Re: Indexing very large files.
Re: IOException: read past EOF during optimize phase
Kevin, Don't have the answer to EOF but I'm wondering why the index moving. You don't need to do that as far as Solr is concerned. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Kevin Osborn <[EMAIL PROTECTED]> To: Solr Sent: Wednesday, January 16, 2008 3:07:23 PM Subject: IOException: read past EOF during optimize phase I am using the embedded Solr API for my indexing process. I created a brand new index with my application without any problem. I then ran my indexer in incremental mode. This process copies the working index to a temporary Solr location, adds/updates any records, optimizes the index, and then copies it back to the working location. There are currently not any instances of Solr reading this index. Also, I commit after every 10 rows. The schema.xml and solrconfig.xml files have not changed. Here is my function call. protected void optimizeProducts() throws IOException { UpdateHandler updateHandler = m_SolrCore.getUpdateHandler(); CommitUpdateCommand commitCmd = new CommitUpdateCommand(true); commitCmd.optimize = true; updateHandler.commit(commitCmd); log.info("Optimized index"); } So, during the optimize phase, I get the following stack trace: java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:89) at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34) at org.apache.lucene.store.IndexInput.readChars(IndexInput.java:107) at org.apache.lucene.store.IndexInput.readString(IndexInput.java:93) at org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:211) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:119) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:323) at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:206) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1835) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1195) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:508) at ... There are no exceptions or anything else that appears to be incorrect during the adds or commits. After this, the index files are still non-optimized. I know there is not a whole lot to go on here. Anything in particular that I should look at?
Re: Spell checker index rebuild
Do you trust the spellchecker 100% (not looking at its source now). I'd peek at the index with Luke (Luke I trust :)) and see if that term is really there first. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Doug Steigerwald <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 2:56:35 PM Subject: Spell checker index rebuild Having another weird spell checker index issue. Starting off from a clean index and spell check index, I'll index everything in example/exampledocs. On the first rebuild of the spellchecker index using the query below says the word 'blackjack' exists in the spellchecker index. Great, no problems. Rebuild it again and the word 'blackjack' does not exist any more. http://localhost:8983/solr/core0/select?q=blackjack&qt=spellchecker&cmd=rebuild Any ideas? This is with a Solr trunk build from yesterday. doug
Re: IOException: read past EOF during optimize phase
It is more of a file structure thing for our application. We build in one place and do our index syncing in a different place. I doubt it is relevant to this issue, but figured I would include this information anyway. - Original Message From: Otis Gospodnetic <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 2:21:31 PM Subject: Re: IOException: read past EOF during optimize phase Kevin, Don't have the answer to EOF but I'm wondering why the index moving. You don't need to do that as far as Solr is concerned. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Kevin Osborn <[EMAIL PROTECTED]> To: Solr Sent: Wednesday, January 16, 2008 3:07:23 PM Subject: IOException: read past EOF during optimize phase I am using the embedded Solr API for my indexing process. I created a brand new index with my application without any problem. I then ran my indexer in incremental mode. This process copies the working index to a temporary Solr location, adds/updates any records, optimizes the index, and then copies it back to the working location. There are currently not any instances of Solr reading this index. Also, I commit after every 10 rows. The schema.xml and solrconfig.xml files have not changed. Here is my function call. protected void optimizeProducts() throws IOException { UpdateHandler updateHandler = m_SolrCore.getUpdateHandler(); CommitUpdateCommand commitCmd = new CommitUpdateCommand(true); commitCmd.optimize = true; updateHandler.commit(commitCmd); log.info("Optimized index"); } So, during the optimize phase, I get the following stack trace: java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:89) at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34) at org.apache.lucene.store.IndexInput.readChars(IndexInput.java:107) at org.apache.lucene.store.IndexInput.readString(IndexInput.java:93) at org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:211) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:119) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:323) at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:206) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1835) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1195) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:508) at ... There are no exceptions or anything else that appears to be incorrect during the adds or commits. After this, the index files are still non-optimized. I know there is not a whole lot to go on here. Anything in particular that I should look at?
Re: Indexing very large files.
>From your stack trace, it looks like it's your client running out of memory, right? SimplePostTool was meant as a command-line replacement to curl to remove that dependency, not as a recommended way to talk to Solr. -Yonik On Jan 16, 2008 4:29 PM, David Thibault <[EMAIL PROTECTED]> wrote: > OK, I have now bumped my tomcat JVM up to 1024MB min and 1500MB max. For > some reason Walter's suggestion helped me get past the 8MB file upload to > Solr but it's still choking on a 32MB file. Is there a way to set > per-webapp JVM settings in tomcat, or is the overall tomcat JVM sufficient > to set? I can't see anything in the tomcat manager to suggest that there > are smaller memory limitations for solr or any other webapp (all the demo > webapps that tomcat comes with are still there right now). > Here's the trace I get when I try to upload the 32MB file: > > > java.lang.OutOfMemoryError: Java heap space > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java > :95) > at sun.net.www.http.PosterOutputStream.write(PosterOutputStream.java > :61) > at sun.nio.cs.StreamEncoder$CharsetSE.writeBytes(StreamEncoder.java > :336) > at sun.nio.cs.StreamEncoder$CharsetSE.implWrite(StreamEncoder.java > :395) > at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:136) > at java.io.OutputStreamWriter.write(OutputStreamWriter.java:191) > at com.itstrategypartners.sents.solrUpload.SimplePostTool.pipe( > SimplePostTool.java:167) > at com.itstrategypartners.sents.solrUpload.SimplePostTool.postData( > SimplePostTool.java:125) > at com.itstrategypartners.sents.solrUpload.SimplePostTool.postFile( > SimplePostTool.java:87) > at com.itstrategypartners.sents.solrUpload.Uploader.uploadFile( > Uploader.java:97) > at com.itstrategypartners.sents.solrUpload.UploaderTest.uploadFile( > UploaderTest.java:95) > > Any more thoughts on possible causes? > > Best, > Dave
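Since the trace shows the client buffering the whole POST body (PosterOutputStream writing into a ByteArrayOutputStream), one way to keep client heap flat regardless of file size is to stream the upload with chunked transfer encoding. The sketch below is only an assumption about how the custom SimplePostTool derivative could be reworked; the update URL and content type are placeholders, not values taken from this thread.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class StreamingPost {
    public static void post(File file, String solrUpdateUrl) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(solrUpdateUrl).openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        // Stream the body instead of buffering it: heap use stays at buffer size.
        con.setChunkedStreamingMode(8192);

        InputStream in = new FileInputStream(file);
        OutputStream out = con.getOutputStream();
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            out.flush();
        } finally {
            in.close();
            out.close();
        }
        if (con.getResponseCode() != HttpURLConnection.HTTP_OK) {
            throw new IOException("Solr returned HTTP " + con.getResponseCode());
        }
    }
}
```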
Re: IOException: read past EOF during optimize phase
Kevin, Perhaps you want to look at how Solr can be used in a master-slave setup. This will separate your indexing from searching. Don't have the URL, but it's on zee Wiki. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Kevin Osborn <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 5:25:34 PM Subject: Re: IOException: read past EOF during optimize phase It is more of a file structure thing for our application. We build in one place and do our index syncing in a different place. I doubt it is relevant to this issue, but figured I would include this information anyway. - Original Message From: Otis Gospodnetic <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 2:21:31 PM Subject: Re: IOException: read past EOF during optimize phase Kevin, Don't have the answer to EOF but I'm wondering why the index moving. You don't need to do that as far as Solr is concerned. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Kevin Osborn <[EMAIL PROTECTED]> To: Solr Sent: Wednesday, January 16, 2008 3:07:23 PM Subject: IOException: read past EOF during optimize phase I am using the embedded Solr API for my indexing process. I created a brand new index with my application without any problem. I then ran my indexer in incremental mode. This process copies the working index to a temporary Solr location, adds/updates any records, optimizes the index, and then copies it back to the working location. There are currently not any instances of Solr reading this index. Also, I commit after every 10 rows. The schema.xml and solrconfig.xml files have not changed. Here is my function call. protected void optimizeProducts() throws IOException { UpdateHandler updateHandler = m_SolrCore.getUpdateHandler(); CommitUpdateCommand commitCmd = new CommitUpdateCommand(true); commitCmd.optimize = true; updateHandler.commit(commitCmd); log.info("Optimized index"); } So, during the optimize phase, I get the following stack trace: java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:89) at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34) at org.apache.lucene.store.IndexInput.readChars(IndexInput.java:107) at org.apache.lucene.store.IndexInput.readString(IndexInput.java:93) at org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:211) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:119) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:323) at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:206) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1835) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1195) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:508) at ... There are no exceptions or anything else that appears to be incorrect during the adds or commits. After this, the index files are still non-optimized. I know there is not a whole lot to go on here. Anything in particular that I should look at?
Re: Indexing very large files.
David, I bet you can quickly identify the source using YourKit or another Java profiler jmap command line tool might also give you some direction. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: David Thibault <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 1:31:23 PM Subject: Re: Indexing very large files. Walter and all, I had been bumping up the heap for my Java app (running outside of Tomcat) but I hadn't yet tried bumping up my Tomcat heap. That seems to have helped me upload the 8MB file, but it's crashing while uploading a 32MB file now. I Just bumped tomcat to 1024MB of heap, so I'm not sure what the problem is now. I suspect Walter was on to something, since it sort of fixed my problem. I will keep troubleshooting the Tomcat memory and go from there.. Best, Dave On 1/16/08, Walter Underwood <[EMAIL PROTECTED]> wrote: > > This error means that the JVM has run out of heap space. Increase the > heap space. That is an option on the "java" command. I set my heap to > 200 Meg and do it this way with Tomcat 6: > > JAVA_OPTS="-Xmx600M" tomcat/bin/startup.sh > > wunder > > On 1/16/08 8:33 AM, "David Thibault" <[EMAIL PROTECTED]> wrote: > > > java.lang.OutOfMemoryError: Java heap space > >
Re: IOException: read past EOF during optimize phase
Our basic setup is master/slave. We just want to make sure that we are not syncing against an index that is in the middle of a large rebuild. But, I think these issues are still separate from what I am experiencing. I also tried this same scenario in a different development environment. No problems there. - Original Message From: Otis Gospodnetic <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 2:33:03 PM Subject: Re: IOException: read past EOF during optimize phase Kevin, Perhaps you want to look at how Solr can be used in a master-slave setup. This will separate your indexing from searching. Don't have the URL, but it's on zee Wiki. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Kevin Osborn <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 5:25:34 PM Subject: Re: IOException: read past EOF during optimize phase It is more of a file structure thing for our application. We build in one place and do our index syncing in a different place. I doubt it is relevant to this issue, but figured I would include this information anyway. - Original Message From: Otis Gospodnetic <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 2:21:31 PM Subject: Re: IOException: read past EOF during optimize phase Kevin, Don't have the answer to EOF but I'm wondering why the index moving. You don't need to do that as far as Solr is concerned. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Kevin Osborn <[EMAIL PROTECTED]> To: Solr Sent: Wednesday, January 16, 2008 3:07:23 PM Subject: IOException: read past EOF during optimize phase I am using the embedded Solr API for my indexing process. I created a brand new index with my application without any problem. I then ran my indexer in incremental mode. This process copies the working index to a temporary Solr location, adds/updates any records, optimizes the index, and then copies it back to the working location. There are currently not any instances of Solr reading this index. Also, I commit after every 10 rows. The schema.xml and solrconfig.xml files have not changed. Here is my function call. protected void optimizeProducts() throws IOException { UpdateHandler updateHandler = m_SolrCore.getUpdateHandler(); CommitUpdateCommand commitCmd = new CommitUpdateCommand(true); commitCmd.optimize = true; updateHandler.commit(commitCmd); log.info("Optimized index"); } So, during the optimize phase, I get the following stack trace: java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:89) at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34) at org.apache.lucene.store.IndexInput.readChars(IndexInput.java:107) at org.apache.lucene.store.IndexInput.readString(IndexInput.java:93) at org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:211) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:119) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:323) at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:206) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1835) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1195) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:508) at ... 
There are no exceptions or anything else that appears to be incorrect during the adds or commits. After this, the index files are still non-optimized. I know there is not a whole lot to go on here. Anything in particular that I should look at?
dojo and solr
has anyone done any work integrating dojo based applications with solr? I am pretty new to both but I wondered if anyone had developed an xsl for solr that returns solr queries in dojo data store format - json, but a specific format of json. I am not even sure if this is sensible/possible. Sean
Re: IOException: read past EOF during optimize phase
This may be a Lucene bug... IIRC, I saw at least one other lucene user with a similar stack trace. I think the latest lucene version (2.3 dev) should fix it if that's the case. -Yonik On Jan 16, 2008 3:07 PM, Kevin Osborn <[EMAIL PROTECTED]> wrote: > I am using the embedded Solr API for my indexing process. I created a brand > new index with my application without any problem. I then ran my indexer in > incremental mode. This process copies the working index to a temporary Solr > location, adds/updates any records, optimizes the index, and then copies it > back to the working location. There are currently not any instances of Solr > reading this index. Also, I commit after every 10 rows. The schema.xml > and solrconfig.xml files have not changed. > > Here is my function call. > protected void optimizeProducts() throws IOException { > UpdateHandler updateHandler = m_SolrCore.getUpdateHandler(); > CommitUpdateCommand commitCmd = new CommitUpdateCommand(true); > commitCmd.optimize = true; > > updateHandler.commit(commitCmd); > > log.info("Optimized index"); > } > > So, during the optimize phase, I get the following stack trace: > java.io.IOException: read past EOF > at > org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:89) > at > org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34) > at org.apache.lucene.store.IndexInput.readChars(IndexInput.java:107) > at org.apache.lucene.store.IndexInput.readString(IndexInput.java:93) > at > org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:211) > at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:119) > at > org.apache.lucene.index.SegmentReader.document(SegmentReader.java:323) > at > org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:206) > at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96) > at > org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1835) > at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1195) > at > org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:508) > at ... > > There are no exceptions or anything else that appears to be incorrect during > the adds or commits. After this, the index files are still non-optimized. > > I know there is not a whole lot to go on here. Anything in particular that I > should look at? > >
Re: Indexing very large files.
Yonik, I pulled SimplePostTool apart, pulled out the main() and the postFiles() and just use it directly in Java via postFile() -> postData(). It seems to work OK. Maybe I should upgrade to v1.3 and try doing things directly through Solrj. Is 1.3 stable yet? Might that be a better plan altogether? Dave On 1/16/08, Yonik Seeley <[EMAIL PROTECTED]> wrote: > > From your stack trace, it looks like it's your client running out of > memory, right? > > SimplePostTool was meant as a command-line replacement to curl to > remove that dependency, not as a recommended way to talk to Solr. > > -Yonik > > On Jan 16, 2008 4:29 PM, David Thibault <[EMAIL PROTECTED]> > wrote: > > OK, I have now bumped my tomcat JVM up to 1024MB min and 1500MB > max. For > > some reason Walter's suggestion helped me get past the 8MB file upload > to > > Solr but it's still choking on a 32MB file. Is there a way to set > > per-webapp JVM settings in tomcat, or is the overall tomcat JVM > sufficient > > to set? I can't see anything in the tomcat manager to suggest that > there > > are smaller memory limitations for solr or any other webapp (all the > demo > > webapps that tomcat comes with are still there right now). > > Here's the trace I get when I try to upload the 32MB file: > > > > > > java.lang.OutOfMemoryError: Java heap space > > at java.io.ByteArrayOutputStream.write( > ByteArrayOutputStream.java > > :95) > > at sun.net.www.http.PosterOutputStream.write( > PosterOutputStream.java > > :61) > > at sun.nio.cs.StreamEncoder$CharsetSE.writeBytes( > StreamEncoder.java > > :336) > > at sun.nio.cs.StreamEncoder$CharsetSE.implWrite( > StreamEncoder.java > > :395) > > at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:136) > > at java.io.OutputStreamWriter.write(OutputStreamWriter.java:191) > > at com.itstrategypartners.sents.solrUpload.SimplePostTool.pipe( > > SimplePostTool.java:167) > > at > com.itstrategypartners.sents.solrUpload.SimplePostTool.postData( > > SimplePostTool.java:125) > > at > com.itstrategypartners.sents.solrUpload.SimplePostTool.postFile( > > SimplePostTool.java:87) > > at com.itstrategypartners.sents.solrUpload.Uploader.uploadFile( > > Uploader.java:97) > > at > com.itstrategypartners.sents.solrUpload.UploaderTest.uploadFile( > > UploaderTest.java:95) > > > > Any more thoughts on possible causes? > > > > Best, > > Dave >
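If the move to SolrJ happens (it ships with the 1.3-dev trunk builds), the basic add/commit path looks roughly like the sketch below. The field names are invented for illustration, and the extracted text still has to fit in the client heap as a String, so this by itself does not remove the large-file concern.

```java
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrjUpload {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "file-001");
        doc.addField("filename", "bigfile.txt");
        doc.addField("body", readFileAsString("bigfile.txt")); // extracted text

        server.add(doc);
        server.commit(); // or commit once after a whole batch of adds
    }

    private static String readFileAsString(String path) {
        // placeholder -- read and return the file contents
        throw new UnsupportedOperationException("not implemented in this sketch");
    }
}
```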
Re: Indexing very large files.
Thanks, Otis. I will take a look at those profiling tools. Best, Dave On 1/16/08, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > > David, > I bet you can quickly identify the source using YourKit or another Java > profiler jmap command line tool might also give you some direction. > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > - Original Message > From: David Thibault <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Wednesday, January 16, 2008 1:31:23 PM > Subject: Re: Indexing very large files. > > Walter and all, > I had been bumping up the heap for my Java app (running outside of > Tomcat) > but I hadn't yet tried bumping up my Tomcat heap. That seems to have > helped > me upload the 8MB file, but it's crashing while uploading a 32MB file > now. I > Just bumped tomcat to 1024MB of heap, so I'm not sure what the problem > is > now. I suspect Walter was on to something, since it sort of fixed my > problem. I will keep troubleshooting the Tomcat memory and go from > there.. > > Best, > Dave > > On 1/16/08, Walter Underwood <[EMAIL PROTECTED]> wrote: > > > > This error means that the JVM has run out of heap space. Increase the > > heap space. That is an option on the "java" command. I set my heap to > > 200 Meg and do it this way with Tomcat 6: > > > > JAVA_OPTS="-Xmx600M" tomcat/bin/startup.sh > > > > wunder > > > > On 1/16/08 8:33 AM, "David Thibault" <[EMAIL PROTECTED]> > wrote: > > > > > java.lang.OutOfMemoryError: Java heap space > > > > > > > >
Re: IOException: read past EOF during optimize phase
I did see that bug, which made me suspect Lucene. In my case, I tracked down the problem. It was my own application. I was using Java's FileChannel.transferTo functions to copy my index from one location to another. One of the files is bigger than 2^31-1 bytes. So, one of my files was corrupted during the copy because I was just doing one pass. I now loop the copy function until the entire file is copied and everything works fine. DOH! - Original Message From: Yonik Seeley <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 4:57:08 PM Subject: Re: IOException: read past EOF during optimize phase This may be a Lucene bug... IIRC, I saw at least one other lucene user with a similar stack trace. I think the latest lucene version (2.3 dev) should fix it if that's the case. -Yonik On Jan 16, 2008 3:07 PM, Kevin Osborn <[EMAIL PROTECTED]> wrote: > I am using the embedded Solr API for my indexing process. I created a brand new index with my application without any problem. I then ran my indexer in incremental mode. This process copies the working index to a temporary Solr location, adds/updates any records, optimizes the index, and then copies it back to the working location. There are currently not any instances of Solr reading this index. Also, I commit after every 10 rows. The schema.xml and solrconfig.xml files have not changed. > > Here is my function call. > protected void optimizeProducts() throws IOException { > UpdateHandler updateHandler = m_SolrCore.getUpdateHandler(); > CommitUpdateCommand commitCmd = new CommitUpdateCommand(true); > commitCmd.optimize = true; > > updateHandler.commit(commitCmd); > > log.info("Optimized index"); > } > > So, during the optimize phase, I get the following stack trace: > java.io.IOException: read past EOF > at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:89) > at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34) > at org.apache.lucene.store.IndexInput.readChars(IndexInput.java:107) > at org.apache.lucene.store.IndexInput.readString(IndexInput.java:93) > at org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:211) > at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:119) > at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:323) > at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:206) > at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96) > at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1835) > at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1195) > at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:508) > at ... > > There are no exceptions or anything else that appears to be incorrect during the adds or commits. After this, the index files are still non-optimized. > > I know there is not a whole lot to go on here. Anything in particular that I should look at? > >
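For reference, the looping copy Kevin describes might look like the sketch below: FileChannel.transferTo may move fewer bytes than requested (and a single pass silently truncated the >2GB segment file), so the call is repeated until the whole file has been transferred. This is a generic illustration, not Kevin's actual code.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

public class SafeCopy {
    public static void copy(File src, File dst) throws IOException {
        FileChannel in = new FileInputStream(src).getChannel();
        FileChannel out = new FileOutputStream(dst).getChannel();
        try {
            long size = in.size();
            long position = 0;
            while (position < size) {
                // transferTo returns the number of bytes actually moved this pass
                position += in.transferTo(position, size - position, out);
            }
        } finally {
            in.close();
            out.close();
        }
    }
}
```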
Logging in Solr
All, I'm new to Solr and Tomcat and I'm trying to track down some odd errors. How do I set up Tomcat to do fine-grained Solr-specific logging? I have looked around enough to know that it should be possible to do per-webapp logging in Tomcat 5.5, but the details are hard to follow for a newbie. Any suggestions would be greatly appreciated. Best, Dave
Re: conceptual issues with solr
On Wed, 16 Jan 2008 16:54:56 +0100 "Philippe Guillard" <[EMAIL PROTECTED]> wrote: > Hi here, > > It seems that Lucene accepts any kind of XML document but Solr accepts only > flat name/value pairs inside a document to be indexed. > You'll find below what I'd like to do, Thanks for help of any kind ! > > Phil > Hey Phil, > > I need to index products (hotels) which have a price by date, then search > them by date or date range and price range. > Is there a way to do that with Solr? yes - look at the data types definition (somewhere in the wiki of the sample schema.xml) about data-types for indexing dates and integers,etc There are some caveats about using date data type fields (too much resolution, can slow down too much..) > > At the moment i have a document for each hotel : > > > http:///yyy > 1 > Hotel Opera > 4 stars > . > > > > I would need to add my dates/price values like this but it is forbidden in > Solr indexing: > > > > Otherwise i could define a default field (being an integer) and have as many > fields as dates, like this: > 200 > 150 > indexing would accept it but i think i will not be able to search or sort by > date for simple dates like that, why not make use of dynamic fields ? define , for example, bydate_* as dynamic fields, then you can do : so, from your example : 200 150 > The only solution i found at the moment is to create a document for each > date/price > > > http:///yyy > 1 > Hotel Opera > 30/01/2008 > 200 > > > http:///yyy > 1 > Hotel Opera > 31/01/2008 > 150 > > If the field 'id' is your schemas ID, then this wouldn't work , but sure, the approach would be valid though a bit wasteful wrt to storing the metadata about the hotel There was a thread some time ago in this list (a month or 2 ago) about clever uses of the field defined as ID in the schema. > then i'll have many documents for 1 hotel > and in order to search by date range i would need more documents > like this : > 28/01/2008 to 31/01/2008 > 29/01/2008 to 31/01/2008 > 30/01/2008 to 31/01/2008 > > Since i need to index many other informations about an hotel (address, > telephone, amenities etc...) i wouldn' like to duplicate too much > information, and i think it would not be scalable to search first in a dates > index then in hotels index to retrieve hotel information. > > Any idea? It strikes me you'd probably want a relational DB for this kind of thing B _ {Beto|Norberto|Numard} Meijome Unix is user friendly. However, it isn't idiot friendly. I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
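To make the dynamic-field suggestion in the reply above concrete, a sketch of what the schema and a hotel document could look like follows. The field names, the sint type, and the yyyyMMdd encoding of the date in the field name are all assumptions for illustration, not something taken from Phil's actual schema.

```xml
<!-- in schema.xml: one dynamic field pattern covers every date -->
<dynamicField name="price_*" type="sint" indexed="true" stored="true"/>

<!-- one document per hotel, with one price field per date -->
<add>
  <doc>
    <field name="id">1</field>
    <field name="name">Hotel Opera</field>
    <field name="category">4 stars</field>
    <field name="price_20080130">200</field>
    <field name="price_20080131">150</field>
  </doc>
</add>
```

A query for hotels under a given price on a given date then becomes a range query on the corresponding field, e.g. price_20080130:[* TO 180], while all the other hotel metadata stays in the single document.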
Re: Solr schema filters
: For this exact example, use the WordDelimiterFilter exactly as : configured in the "text" fieldType in the example schema that ships : with solr. The trick is to then use some slop when querying. : : FT-50-43 will be indexed as FT, 50, 43 / 5043 (the last two tokens : are in the same position). : Now when querying, "FT-5043" won't match without slop because there is : a "50" token in the middle of the indexed terms... so try "FT-5043"~1 FYI: this was the motivation for the "qs" param on dismax ... http://localhost:8983/solr/select?debugQuery=true&qt=dismax&pf=&qf=text&q=FT-5043&qs=3 -Hoss
Re: DisMax Syntax
: I may be mistaken, but this is not equivalent to my query. In my query i have : matches for x1, matches for x2 without slop and/or boosting, and then a match : to "x1 x2" (exact match) with slop (~) a and boost (b) in order to have : results with an exact match score better. : The total score is the sum of all the above. : Your query seems diff the structure of the query will look different in debugging, and the scores won't be exactly the same, but the concept is the same. -Hoss
Re: Fuzziness with DisMaxRequestHandler
: Is there any way to make the DisMaxRequestHandler a bit more forgiving with : user queries, I'm only getting results when the user enters a close to : perfect match. I'd like to allow near matches if possible, but I'm not sure : how to add something like this when special query syntax isn't allowed. the principal goal of dismax was to leave "query string" syntax as simple as possible, and move the mechanisms for controlling the "query structure" into other parameters. the idea of making Queries Fuzzy is an interesting one ... it's something i don't remember anyone ever asking about before, and i'd never really considered it (from a UI perspective i find "did you mean" style spellchecking to be a better approach than making a user's query implicitly fuzzy) but it seems like it would be pretty easy to add support for something ... one approach would be to add a numeric "fuzz" parameter, that if set would make the DisMaxQueryParser return FuzzyQueries in place of TermQueries ... an alternate approach would be to allow per-field fuzziness by tweaking the "qf" syntax so instead of just fieldA^4 where 4 is the boost value, you could have fieldA^4~0.8 where 4 is the boost value and 0.8 is the fuzziness factor I haven't thought about it hard enough to have an opinion about which would make more sense ... but the overall idea certainly seems like it could be a useful feature if someone wants to submit a patch. -Hoss
Re: Newbie question: facets and filter query?
: The problem is that when I use the 'cd' request handler, the facet count for : 'dvd' provided in the response is 0 because of the filter query used to only : show the 'cd' facet results. How do I retrieve facet counts for both : categories while only retrieving the results for one category? the "simple facet" params provided by solr have no way to do this ... the facet counts are always relative to the total available docs after applying the "q" and "fq" params .. if they weren't, then you wouldn't be able to get facet counts while "drilling down" into specific facets. you could however easily implement something like this in a custom request handler by using the internal SimpleFacets API and giving it a DocSet you generate just from the q param. -Hoss
Re: Transactions and Solr Was: Re: Delte by multiple id problem
: Does anyone have more experience doing this kind of stuff and wants to share? My advice: don't. I work with (or work with people who work with) about two dozen Solr indexes -- we don't attempt to update a single one of them in any sort of transactional way. Some of them are updated "real time" (ie: as soon as the authoritative DB is updated by some code, the same code updates the Solr index); Some of them are updated in batch (ie: once every N minutes code checks a log of all logical objects modified/deleted from the DB and sends the adds/deletes to Solr); And some are only ever rebuilt from scratch every N hours (because the data in them isn't very time sensitive and rebuilding from scratch is easier than dealing with incremental or batch updates). But as i said: we never attempt to be transactional about it, for a few reasons: 1) why should it be part of the transaction? a Solr index is a denormalized/inverted index of data .. why should a tool (or any other process) be prevented from writing to an authoritative data store just because a non-authoritative copy of that data can't be updated? ... if you used MySQL with replication, would you really want to block all writes to the master just because there's a glitch in replicating to a slave? 2) why worry about it? It's really a non-issue. If an add or delete fails it's usually either developer error (ie: the code generating your add statements thinks there's a field that doesn't exist), a transient timeout (maybe because of a commit in progress) or network glitch (have the client retry once or twice), or in very rare instances the whole Solr index was completely jacked (either from disk failure, or OOM due to a huge spike in load) and we want to revert to a backup of the index in the short term and rebuild the index from scratch to play it safe. 3) why limit yourself? you're going to want the ability to trigger arbitrary indexing of your data objects at any time -- if for no other reason than so that when you decide to add a field to your index you can reindex them all -- so why make your index updating code inherently tied to your DB updating code? As for your specific question along the lines of "why can't we do a mix of <delete>s and <add>s all as part of one update message?" the answer is "because no one ever wrote any code to parse messages like that." BUT! ... that's not the question you really want to ask. the question you really want to ask is: "*IF* someone wrote code to allow a mix of <delete>s and <add>s all as part of one update message, would it solve my problem of wanting to be able to modify my solr index transactionally?" and the answer is "No." Even if Solr accepted update messages that looked like this... <delete><id>42</id></delete> <add><doc>... 7bb ...</doc></add> <add><doc>... 666 ...</doc></add> ...the low level lucene calls that it would be doing internally still aren't transactional, so the first "delete" and "add" might succeed, but if there was then some kind of internal error, or a timeout because the first add took a while (maybe it triggered a segment merge) and the second add didn't happen -- the first two commands would have still been executed, and there would be no way to "rollback". In a nutshell: you would be no better off than if your client code had sent all three as separate update messages. -Hoss
Re: Restrict values in a multivalued field
: In my schema I have a multivalued field, and the values of that field are : "stored" and "indexed" in the index. I wanted to know if it's possible to : restrict the number of multiple values being returned from that field, on a : search? And how? Because, let's say, if I have thousands of values in that : multivalued field, returning all of them would be a lot of load on the : system. So, I want to restrict it to send me only, say, 50 values out of the : thousands. How would Solr pick which 50 to return? Why not index all thousand (so you can search on them) in an unstored field, and only store the 50 you want returned in a separate (unindexed) field. the index size will be exactly the same -- admittedly you'll have to send a bit more data over the wire for each doc you index, but that's probably a trivial amount (assuming the 50 values you want to store are representative of the thousands you index, you are talking about at most a 5% increase in the amount of data you send solr on each add) -Hoss
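A sketch of the two-field layout Hoss suggests, with invented field names and types (the indexing client would populate both fields at add time: all values into the first, only the ~50 display values into the second):

```xml
<!-- searchable but not stored: gets all thousand values -->
<field name="tags_all" type="text" indexed="true" stored="false" multiValued="true"/>
<!-- stored but not indexed: gets only the values you want returned -->
<field name="tags_display" type="string" indexed="false" stored="true" multiValued="true"/>
```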
Re: Wildcard on last char
: i have encountered a problem concerning the wildcard. When i search for : field:testword i get 50 results. That's ok but when I search for : field:testwor* i get just 3 hits! I get only words returned without a : whitespace after the char like "testwordtest" but i won't find any single : "testword" i've searched before. Why can't I replace a single char at the end : of a word with the wildcard? What does the fieldtype for "field" look like in your schema? what exactly is the URL you are using to query solr? i notice in particular you say "I get only words returned without a whitespace after the char" ... which suggests that you expect it to match documents that have whitespace in the field value (ie: "testword more words") ... but if you are using an analyzer that splits on whitespace, those are now separate terms -- wildcard and prefix searches match on indexed terms -- not the entire field value (if you want that, you need to use something like StrField or the KeywordTokenizer .. but i doubt that's really what you want)
Re: Fwd: Solr "Text" field
: searches. That is fine by me. But I'm still at the first question: : How do I conduct a wildcard search for ARIZONA on a solr.textField? I tried as i said: it really depends on what kind of "index" analyzer you have configured for the field -- the query analyzer isn't used at all when dealing with wildcard and prefix queries, so what you type in before the "*" must match the prefix of an actually indexed term that makes it into your index as a result of the index analyzer. If you add the debugQuery=true param to your queries, and compare the differences you see in the parsedquery_toString value between searching for field:AR* and field:Arizona and field:ARIZONA you'll see what i mean. if you take a look at the Luke request handler which will show you the actual raw terms in your index (or the top N anyway), you can see what's really in there -- or -- if you use the analysis.jsp interface, it will show you what Terms your analyzer will actually produce if you index the raw string "ARIZONA" ... whatever you see there is what you need to be searching for when you do your prefix queries. -Hoss
Re: 2D Facet
: : Hello, is this possible to do in one query: I have a query which returns : 1000 documents with names and addresses. I can run facet on state field : and see how many addresses I have in each state. But also I need to see : how many families live in each state. So as a result I need a matrix of : states on top and Last Names on right. After my first query, knowing : which states I have I can run queries on each state using facet field : Last_Name. But I guess this is not an efficient way. Is this possible to : get in one query? Or maybe some other way? if you set rows=0 on all of those queries it won't be horribly inefficient ... the DocSets for each state and lastname should wind up in the filterCache, so most of the queries will just be simple DocSet intersections with only the HTTP overhead (which if you use persistent connections should be fairly minor) The idea of generic multidimensional faceting is actually pretty interesting ... it could be done fairly simply -- imagine if for every facet.field=foo param, solr checked for a f.foo.facet.matrix param, and once the top facet.limit terms were found for field "foo" it then computed the top facet counts for each f.foo.facet.matrix field with an implicit fq=foo:term. that would be pretty cool. -Hoss
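A sketch of the rows=0 approach from the first paragraph, assuming the 1.3-dev SolrJ client and the field names from Gene's description (state, Last_Name): the first query discovers which states are present, then one cheap facet query per state fills in the matrix.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;

public class FacetMatrix {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
        String userQuery = "..."; // placeholder: the query that returned the 1000 docs

        // First pass: which states are present?
        SolrQuery byState = new SolrQuery(userQuery);
        byState.setRows(0);
        byState.setFacet(true);
        byState.addFacetField("state");
        FacetField states = server.query(byState).getFacetField("state");

        // Second pass: facet Last_Name inside each state (fq hits the filterCache).
        for (FacetField.Count state : states.getValues()) {
            SolrQuery q = new SolrQuery(userQuery);
            q.addFilterQuery("state:" + state.getName());
            q.setRows(0);
            q.setFacet(true);
            q.addFacetField("Last_Name");
            FacetField names = server.query(q).getFacetField("Last_Name");
            System.out.println(state.getName() + " -> " + names.getValues());
        }
    }
}
```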
Re: batch indexing takes more time than shown on SOLR output --> something to do with IO?
: INFO: {add=[10485, 10488, 10489, 10490, 10491, 10495, 10497, 10498, ...(42 : more) : ]} 0 875 : : However, when timing this instruction on the client-side (I use SolrJ --> : req.process(server)) I get totally different numbers (in the beginning the : client-side measured time is about 2 seconds on average but after some time : this time goes up to about 30-40 seconds, although the solr-outputted time : stays between 0.8-1.3 seconds)? as Otis mentioned, that time is the raw processing of the request, not counting any network IO between the client and the server, or any time spent by the "ResponseWriter" formatting the response. you can get more accurate numbers about exactly how long the server spent doing all of these things from the access log of your servlet container (which should be recording the time only after every last byte is written back to the client). that said: there's really no reason for as big a discrepancy as you are describing particularly on updates where the ResponseWriter has almost nothing to do (30-40 seconds per update?!?!?!) I'm not very familiar with SolrJ, but are you by any chance using it in a way that sends a commit after every update command? (commits can get successively longer as your index gets bigger.) : Does this have anything to do with costly IO-activity that is accounted for : in the SOLR output? If this is true, what tool do you recommend using to : monitor IO-activity? Which IO-activity are you talking about? -Hoss
Re: Cache size and Heap size
: > I know this is a lot and I'm going to decrease it, I was just experimenting, : > but I need some guidelines of how to calculate the right size of the cache. : : Each filter that matches more than ~3000 documents will occupy maxDocs/8 bytes : of memory. Certain kinds of faceting require one entry per unique value in a FWIW: the magic number 3000 is the example value of the config option ... it can be tweaked if you think you can tune it to a value that makes more sense given the nature of the types of DocSets you are dealing with, but i wouldn't bother (there are probably a lot of better ways you can spend your time to tweak performance) -Hoss
Re: FunctionQuery in a custom request handler
: How do I access the ValueSource for my DateField? I'd like to use a : ReciprocalFloatFunction from inside the code, adding it alongside others in the : main BooleanQuery. The FieldType API provides a getValueSource method (so every FieldType picks its own best ValueSource implementation). -Hoss
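A rough sketch of what that can look like from inside a custom request handler is below. The signatures shown (FieldType.getValueSource(SchemaField) and ReciprocalFloatFunction(source, m, a, b), i.e. a/(m*x+b)) are assumptions based on the 1.2/1.3-era code base and should be checked against the Solr version in use; the field name and constants are arbitrary.

```java
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.search.function.FunctionQuery;
import org.apache.solr.search.function.ReciprocalFloatFunction;
import org.apache.solr.search.function.ValueSource;

public class RecencyBoost {
    // Adds a reciprocal-of-date clause alongside the other clauses in the main query.
    public static void addRecencyClause(SolrQueryRequest req, BooleanQuery main) {
        SchemaField f = req.getSchema().getField("lastModified"); // hypothetical date field
        ValueSource source = f.getType().getValueSource(f);
        // a / (m * x + b): documents with larger values score closer to a/b
        FunctionQuery boost = new FunctionQuery(
                new ReciprocalFloatFunction(source, 0.0001f, 1000f, 1000f));
        main.add(boost, BooleanClause.Occur.SHOULD);
    }
}
```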
Some sort of join in SOLR?
Hello, I have two sources of data for the same "things" to search. It is book data in a library. First there is the usual bibliographic data (author, title...) and then I have scanned and OCRed table of contents data about the same books. Both are updated independently. Now I don't know how to best index and search this data. - One option would be to save the data in different records. That would make updates easy because I don't have to worry about the fields from the other source. But searching would be more difficult: I have to do an additional search for every hit in the "contents" data to get the bibliographic data. - The other option would be to save everything in one record but then updates would be difficult. Before I can update a record I must first look if there is any data from the other source, merge it into the record and only then update it. This option sounds very time consuming for a complete reindex. The best solution would be some sort of join: Have two records in the index but always give both in the result no matter where the hit was. Any ideas on how to best organize this kind of data? -Michael