RE: Solr replication
Hi Bill,

I have some questions regarding Solr collection distribution.

1) Is it possible to add index operations on the slave server using Solr collection distribution and still have the master server updated with these changes?

2) I have a requirement of having more than one Solr instance (with a corresponding data directory for each Solr core). Is it possible to maintain different Solr cores and still achieve Solr collection distribution for each of these cores independently? If yes, then how?

Regards,
Dilip

-----Original Message-----
From: Bill Au [mailto:[EMAIL PROTECTED]
Sent: Monday, January 14, 2008 9:40 PM
To: [EMAIL PROTECTED]
Subject: Re: Solr replication

Yes, you need the same changes in scripts.conf on the slave server, but you don't need the post-commit hook enabled on the slave server. The post-commit hook is used to create snapshots: you will see a new snapshot in the data directory every time you do a commit on the master server. There is no need to create snapshots on the slave server, as the slave server copies the snapshots from the master server.

The scripts are designed to run under Unix/Linux. They use symbolic links and Unix/Linux commands such as scp, ssh, rsync, and cp. I don't know much about Windows, so I don't know for sure whether all the Unix/Linux tools used by the scripts are available on Windows or not.

Bill

On 1/14/08, Dilip.TS <[EMAIL PROTECTED]> wrote:

Hi Bill,
I am trying to use the Solr collection distribution and have made the following changes:

1) Changes made on the master server on Linux, in the scripts.conf file:

user=
solr_hostname=localhost
solr_port=8983
rsyncd_port=18983
data_dir=/usr/solr/data/data_tenantID_1
webapp_name=solr
master_host=192.168.168.50
master_data_dir=/usr/solr/data/data_tenantID_1
master_status_dir=/usr/solr/logs

2) Enabled the postCommit listener in solrconfig.xml:

<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">/usr/solr/bin/snapshooter</str>
  <str name="dir">/usr/solr/bin</str>
  <bool name="wait">true</bool>
</listener>

I ran the embedded Solr folder and added a document to it, and did a search for a word on the same server. I found the following observations in the console:

INFO: query parser default operator is OR
Jan 14, 2008 3:37:38 PM org.apache.solr.schema.IndexSchema readSchema
INFO: unique key field: id
Jan 14, 2008 3:37:38 PM org.apache.solr.core.SolrCore
INFO: Opening new SolrCore at //usr//solr/, dataDir=//usr//solr//data//data_tenantID_1
Jan 14, 2008 3:37:38 PM org.apache.solr.core.SolrCore parseListener
INFO: Searching for listeners: //[EMAIL PROTECTED]"firstSearcher"]
Jan 14, 2008 3:37:38 PM org.apache.solr.core.SolrCore parseListener
INFO: Searching for listeners: //[EMAIL PROTECTED]"newSearcher"]
Jan 14, 2008 3:37:39 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created xslt: org.apache.solr.request.XSLTResponseWriter
Jan 14, 2008 3:37:39 PM org.apache.solr.request.XSLTResponseWriter init
INFO: xsltCacheLifetimeSeconds=5
Jan 14, 2008 3:37:39 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created standard: org.apache.solr.handler.StandardRequestHandler
.
.
.
.
INFO: Opening [EMAIL PROTECTED] main
Jan 14, 2008 3:37:39 PM org.apache.solr.core.SolrCore registerSearcher
INFO: Registered new searcher [EMAIL PROTECTED] main
Jan 14, 2008 3:37:39 PM org.apache.solr.update.UpdateHandler parseEventListeners
INFO: added SolrEventListener for postCommit: org.apache.solr.core.RunExecutableListener{exe=/usr/solr/bin/snapshooter,dir=/usr/solr/bin,wait=true,env=[]}
Jan 14, 2008 3:37:39 PM org.apache.solr.update.DirectUpdateHandler2$CommitTracker
INFO: AutoCommit: disabled

In the console output above I can see the postCommit listener, org.apache.solr.core.RunExecutableListener{exe=/usr/solr/bin/snapshooter,dir=/usr/solr/bin,wait=true,env=[]}, being invoked after a commit. This is the scenario for an add/search done on the same master server on Linux.

1) I would like to know whether we need similar entries in scripts.conf, and the postCommit hook enabled in solrconfig.xml, on the slave server too. If yes, should these entries on the slave server be identical to those on the master, or different?

2) Also, can the Linux machine act as the master server while the slave runs on a Windows machine?

Thanks in advance.

Regards,
Dilip

-----Original Message-----
From: Bill Au [mailto:[EMAIL PROTECTED]]
Sent: Saturday, December 15, 2007 1:08 AM
To: solr-user@lucene.apache.org; [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: Solr replication

On Dec 14, 2007 7:00 AM, Dilip.TS <[EMAIL PROTECTED]> wrote:
> Hi,
> I have the following requirement for SOLR Collection Distribution using
> Embedded Solr with the Jetty server:
>
Problem with dismax handler when searching Solr along with field
When I search with a query such as

http://localhost:8983/solr/select/?q=category&qt=dismax

it gives results, but when I want to search on the basis of a field name, like

http://localhost:8983/solr/select/?q=maincategory:Cars&qt=dismax

it does not give results. However,

http://localhost:8983/solr/select/?q=maincategory:Cars

returns results for cars from the field "maincategory".

--
View this message in context: http://www.nabble.com/Problem-with-dismax-handler-when-searching-Solr-along-with-field-tp14878239p14878239.html
Sent from the Solr - User mailing list archive at Nabble.com.
Indexing two sets of details
Hi,

In the web application we are developing we have two sets of details: the personal details and the resume details. We allow 5 different resumes per user, but we want the personal details to remain the same across all 5 resumes. The problem is that when the personal details change, we have to update all 5 resumes.

I was thinking that if we indexed the personal-details fields separately, we would only have to change/update those fields. But the problem is searching for users using fields from both the personal details and the resume: then I have to manually combine both searches, and what if one search gives more results than the other?

I would really appreciate any suggestions on how to tackle this problem.

Thanks,
--
Gavin Selvaratnam, Project Leader
hSenid Mobile Solutions
Phone: +94-11-2446623/4
Fax: +94-11-2307579
Web: http://www.hSenidMobile.com
Make it happen
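One way to handle this, sketched below, is to denormalize: copy the personal-details fields into each of the user's five resume documents and simply re-post those five documents whenever the personal details change. Re-indexing five small documents per change is usually cheaper than running two separate searches and merging them by hand. This is only an illustration, not something proposed elsewhere in this thread; the field names, the id convention (userId_resumeNo), and the plain XML-update approach are all assumptions.

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: build one <add> message containing all 5 resume documents for a user,
// with the personal-details fields copied into each document. Field names
// (full_name, city, resume_text, ...) are hypothetical.
public class ResumeDocBuilder {

    public static String buildAddXml(String userId, Map<String, String> personal,
                                     String[] resumeTexts) {
        StringBuilder xml = new StringBuilder("<add>\n");
        for (int i = 0; i < resumeTexts.length; i++) {
            xml.append("  <doc>\n");
            // Unique key combines the user and the resume slot.
            xml.append(field("id", userId + "_" + i));
            // Personal details are duplicated into every resume document.
            for (Map.Entry<String, String> e : personal.entrySet()) {
                xml.append(field(e.getKey(), e.getValue()));
            }
            xml.append(field("resume_text", resumeTexts[i]));
            xml.append("  </doc>\n");
        }
        return xml.append("</add>").toString();
    }

    private static String field(String name, String value) {
        // Minimal XML escaping, enough for the example.
        String escaped = value.replace("&", "&amp;").replace("<", "&lt;");
        return "    <field name=\"" + name + "\">" + escaped + "</field>\n";
    }

    public static void main(String[] args) {
        Map<String, String> personal = new LinkedHashMap<String, String>();
        personal.put("full_name", "Jane Doe");
        personal.put("city", "Colombo");
        String[] resumes = { "Java developer ...", "Project lead ...", "QA ...", "Support ...", "Trainer ..." };
        System.out.println(buildAddXml("user42", personal, resumes));
    }
}

The generated message is what you would POST to /solr/update (followed by a commit) whenever the personal details change.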
Re: Solr in a distributed multi-machine high-performance environment
Look at http://issues.apache.org/jira/browse/SOLR-303

Please note that it is still a work in progress, so you may not be able to use it immediately.

On Jan 16, 2008 10:53 AM, Srikant Jakilinki <[EMAIL PROTECTED]> wrote:

> Hi All,
>
> There is a requirement in our group of indexing and searching several millions of documents (TREC) in real-time and with millisecond responses. For the moment we are preferring scale-out (throw more commodity machines) approaches rather than scale-up (faster disks, more RAM). This is in turn inspired by the "Scale-out vs. Scale-up" paper (mail me if you want a copy), in which it was proven that this kind of distribution scales better and is more resilient.
>
> So, are there any resources available (wiki, tutorials, slides, README etc.) that throw light and guide newbies on how to run Solr in a multi-machine scenario? I have gone through the mailing lists and site but could not really find any answers or hands-on material. An ad-hoc guideline to get things working with 2 machines might just be enough, but for the sake of thinking out loud and soliciting responses from the list, here are my questions:
>
> 1) Solr that has to handle a fairly large index which has to be split up on multiple disks (using multicore?)
>    - Space is not a problem since we can use NFS, but that is not recommended as we would only exploit 1 processor
> 2) Solr that has to handle a large collective index which has to be split up on multiple machines
>    - The index is ever increasing (TB scale) and dynamic, and all of it has to be searched at any point
> 3) Solr that has to exploit multi-machines because we have plenty of them in a tightly coupled P2P scenario
>    - Machines are not a problem, but will they be if they are of varied configurations (PIII to Core2; Linux to Vista; 32-bit to 64-bit; J2SE 1.1 to 1.6)?
> 4) Solr that has to distribute load on several machines
>    - The index(es) could be common, though, like say using a distributed filesystem (Hadoop?)
>
> In each of the above cases (we might use all of these strategies for various use cases) the application should use Solr as a strict backend and named service (IP or host:port) so that we can expose this application (and the service) to the web or an intranet. Machine failures should be tolerated too. Also, does Solr manage load balancing out of the box if it is indeed configured to work with multiple machines?
>
> Maybe it is superfluous, but are Solr and/or Nutch the only way to use Lucene in a multi-machine environment? Or is there some hidden document/project somewhere that makes it possible by exposing a regular Lucene process over the network using RMI or something? It is my understanding (could be wrong) that Nutch and, to some extent, Solr do not perform well when there is a lot of indexing activity in parallel with search. Batch processing is also there, and perhaps we can use Nutch/Solr there. Even so, we need multi-machine directions.
>
> I am sure that multi-machines make possible a lot of other approaches which might solve the goal better and that others have practical experience with. So, any advice and tips are also very welcome. We intend to document things and do some benchmarking along the way in the open spirit.
>
> Really sorry for the length, but I hope some answers are forthcoming.
>
> Cheers,
> Srikant

--
Regards,
Shalin Shekhar Mangar.
Re: Solr replication
My answers inline...

On Jan 16, 2008 3:51 AM, Dilip.TS <[EMAIL PROTECTED]> wrote:

> Hi Bill,
> I have some questions regarding the SOLR collection distribution.
> 1) Is it possible to add the index operations on the slave server using
> SOLR collection distribution and still have the master server updated with
> these changes?

No. The replication process is only one way, from the master to the slaves. The idea behind it is that the slave servers are for query only, and the number of slaves can be increased or decreased according to traffic load.

> 2) I have a requirement of having more than one solr instance (with a
> corresponding data directory for each solr core). Is it possible to maintain
> different solr cores and still achieve SOLR collection distribution for all
> of these cores independently? If yes, then how?

Does each Solr instance have its own solr home? If so, you can use replication within each instance by simply adjusting the parameters in scripts.conf for each instance. Even if they all share a single solr home, the replication-related scripts all have command-line options to override the values set in scripts.conf:

http://wiki.apache.org/solr/SolrCollectionDistributionScripts

So you can invoke the scripts for each instance by setting the data directory on the command line.

> Regards,
> Dilip
>
> -----Original Message-----
> From: Bill Au [mailto:[EMAIL PROTECTED]
> Sent: Monday, January 14, 2008 9:40 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Solr replication
>
> Yes, you need the same changes in scripts.conf on the slave server but you
> don't need the post commit hook enabled on the slave server.
> The post commit hook is used to create snapshots. You will see a new
> snapshot in the data directory every time you do a commit on the master
> server. There is no need to create snapshots on the slave server as the
> slave server copies the snapshots from the master server.
>
> The scripts are designed to run under Unix/Linux. They use symbolic links
> and Unix/Linux commands like scp, ssh, rsync, cp. I don't know much about
> Windows so I don't know for sure if all the Unix/Linux stuff used by the
> scripts is available in Windows or not.
>
> Bill
>
> On 1/14/08, Dilip.TS <[EMAIL PROTECTED]> wrote:
> Hi Bill,
> I am trying to use the solr collection distribution
> and have made the following changes:
>
> 1) Changes done on the master server on Linux.
> In the scripts.conf file:
>
> user=
> solr_hostname=localhost
> solr_port=8983
> rsyncd_port=18983
> data_dir=/usr/solr/data/data_tenantID_1
> webapp_name=solr
> master_host=192.168.168.50
> master_data_dir=/usr/solr/data/data_tenantID_1
> master_status_dir=/usr/solr/logs
>
> 2) Enabled the postCommit listener in solrconfig.xml:
>
> <listener event="postCommit" class="solr.RunExecutableListener">
>   <str name="exe">/usr/solr/bin/snapshooter</str>
>   <str name="dir">/usr/solr/bin</str>
>   <bool name="wait">true</bool>
> </listener>
>
> I run the embedded Solr folder and added a document to it,
> and did a search for a word on the same server.
>I found the following observations in the console: > >INFO: query parser default operator is OR >Jan 14, 2008 3:37:38 PM org.apache.solr.schema.IndexSchema readSchema >INFO: unique key field: id >Jan 14, 2008 3:37:38 PM org.apache.solr.core.SolrCore >INFO: Opening new SolrCore at //usr//solr/, >dataDir=//usr//solr//data//data_tenantID_1 >Jan 14, 2008 3:37:38 PM org.apache.solr.core.SolrCore parseListener >INFO: Searching for listeners: //[EMAIL PROTECTED]"firstSearcher"] >Jan 14, 2008 3:37:38 PM org.apache.solr.core.SolrCore parseListener >INFO: Searching for listeners: //[EMAIL PROTECTED]"newSearcher"] >Jan 14, 2008 3:37:39 PM > org.apache.solr.util.plugin.AbstractPluginLoader >load >INFO: created xslt: org.apache.solr.request.XSLTResponseWriter >Jan 14, 2008 3:37:39 PM org.apache.solr.request.XSLTResponseWriter init >INFO: xsltCacheLifetimeSeconds=5 >Jan 14, 2008 3:37:39 PM > org.apache.solr.util.plugin.AbstractPluginLoader >load >INFO: created standard: org.apache.solr.handler.StandardRequestHandler >. >. >. >. >INFO: Opening [EMAIL PROTECTED] main >Jan 14, 2008 3:37:39 PM org.apache.solr.core.SolrCore registerSearcher >INFO: Registered new searcher [EMAIL PROTECTED] main >Jan 14, 2008 3:37:39 PM org.apache.solr.update.UpdateHandler >parseEventListeners >INFO: added SolrEventListener for postCommit: > > org.apache.solr.core.RunExecutableListener{exe=/usr/solr/bin/snapshooter > ,dir >=/usr/solr/bin,wait=true,env=[]} >Jan 14, 2008 3:37:39 PM >org.apache.solr.update.DirectUpdateHandler2$CommitTracker >INFO: AutoCommit: disabled > > >In the above console i find "postCommit: > > org.apache.solr.core.RunExecutableListener{exe=/usr/solr/bin/snapshooter > ,dir >=/usr/solr/bin,wait=true,env=[]}" >command being called after doing a commit. >This is a scenario for the add/search do
Re: Indexing very large files.
All,
I just found a thread about this on the mailing list archives because I'm troubleshooting the same problem. The kicker is that it doesn't take such large files to kill the StringBuilder. I have discovered the following:

By using a text file made up of 3,443,464 bytes or less, I get no error.

At 3,443,465 bytes:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.String.<init>(String.java:208)
    at java.lang.StringBuilder.toString(StringBuilder.java:431)
    at org.junit.Assert.format(Assert.java:321)
    at org.junit.ComparisonFailure$ComparisonCompactor.compact(ComparisonFailure.java:80)
    at org.junit.ComparisonFailure.getMessage(ComparisonFailure.java:37)
    at java.lang.Throwable.getLocalizedMessage(Throwable.java:267)
    at java.lang.Throwable.toString(Throwable.java:344)
    at java.lang.String.valueOf(String.java:2615)
    at java.io.PrintWriter.print(PrintWriter.java:546)
    at java.io.PrintWriter.println(PrintWriter.java:683)
    at java.lang.Throwable.printStackTrace(Throwable.java:510)
    at org.apache.tools.ant.util.StringUtils.getStackTrace(StringUtils.java:96)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.getFilteredTrace(JUnitTestRunner.java:856)
    at org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.formatError(XMLJUnitResultFormatter.java:280)
    at org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.addError(XMLJUnitResultFormatter.java:255)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner$4.addError(JUnitTestRunner.java:988)
    at junit.framework.TestResult.addError(TestResult.java:38)
    at junit.framework.JUnit4TestAdapterCache$1.testFailure(JUnit4TestAdapterCache.java:51)
    at org.junit.runner.notification.RunNotifier$4.notifyListener(RunNotifier.java:96)
    at org.junit.runner.notification.RunNotifier$SafeNotifier.run(RunNotifier.java:37)
    at org.junit.runner.notification.RunNotifier.fireTestFailure(RunNotifier.java:93)
    at org.junit.internal.runners.TestMethodRunner.addFailure(TestMethodRunner.java:104)
    at org.junit.internal.runners.TestMethodRunner.runUnprotected(TestMethodRunner.java:87)
    at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
    at org.junit.internal.runners.TestMethodRunner.runMethod(TestMethodRunner.java:75)
    at org.junit.internal.runners.TestMethodRunner.run(TestMethodRunner.java:45)
    at org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod(TestClassMethodsRunner.java:71)
    at org.junit.internal.runners.TestClassMethodsRunner.run(TestClassMethodsRunner.java:35)
    at org.junit.internal.runners.TestClassRunner$1.runUnprotected(TestClassRunner.java:42)
    at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
    at org.junit.internal.runners.TestClassRunner.run(TestClassRunner.java:52)
    at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:32)

At 3,443,466 bytes (or more):

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:99)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:393)
    at java.lang.StringBuilder.append(StringBuilder.java:120)
    at org.junit.Assert.format(Assert.java:321)
    at org.junit.ComparisonFailure$ComparisonCompactor.compact(ComparisonFailure.java:80)
    at org.junit.ComparisonFailure.getMessage(ComparisonFailure.java:37)
    at java.lang.Throwable.getLocalizedMessage(Throwable.java:267)
    at java.lang.Throwable.toString(Throwable.java:344)
    at java.lang.String.valueOf(String.java:2615)
    at java.io.PrintWriter.print(PrintWriter.java:546)
    at java.io.PrintWriter.println(PrintWriter.java:683)
    at java.lang.Throwable.printStackTrace(Throwable.java:510)
    at org.apache.tools.ant.util.StringUtils.getStackTrace(StringUtils.java:96)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.getFilteredTrace(JUnitTestRunner.java:856)
    at org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.formatError(XMLJUnitResultFormatter.java:280)
    at org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.addError(XMLJUnitResultFormatter.java:255)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner$4.addError(JUnitTestRunner.java:988)
    at junit.framework.TestResult.addError(TestResult.java:38)
    at junit.framework.JUnit4TestAdapterCache$1.testFailure(JUnit4TestAdapterCache.java:51)
    at org.junit.runner.notification.RunNotifier$4.notifyListener(RunNotifier.java:96)
    at org.junit.runner.notificatio
Cache size and Heap size
Hello,

I have relatively large RAM (10 GB) on my server, which is running Solr. I increased the cache settings and started to see OutOfMemory exceptions, especially on facet search.

Does anybody have suggestions on how the cache settings relate to memory consumption? What are the optimal settings? How can they be calculated?

Thank you for any advice,
Gene
Re: Cache size and Heap size
Hi Gene,

Have you set your app server / servlet container to allocate some of this memory? You can define the maximum and minimum heap size by adding/replacing some parameters on the app server initialization:

-Xmx1536m -Xms1536m

Which app server / servlet container are you using?

Regards,
Daniel Alheiros

On 16/1/08 15:23, "Evgeniy Strokin" <[EMAIL PROTECTED]> wrote:

> Hello,
> I have relatively large RAM (10Gb) on my server which is running Solr. I
> increased Cache settings and start to see OutOfMemory exceptions, specially on
> facet search.
> Is anybody has some suggestions how Cache settings related to Memory
> consumptions? What are optimal settings? How they could be calculated?
>
> Thank you for any advise,
> Gene
conceptual issues with solr
Hi there,

It seems that Lucene accepts any kind of XML document, but Solr accepts only flat name/value pairs inside a document to be indexed. You'll find below what I'd like to do. Thanks for help of any kind!

Phil

I need to index products (hotels) which have a price per date, then search them by date or date range and price range. Is there a way to do that with Solr?

At the moment I have a document for each hotel, with values such as: http:///yyy, 1, Hotel Opera, 4 stars.

I would need to add my date/price values as nested values inside that document, but that is forbidden in Solr indexing. Otherwise I could define a default field (being an integer) and have as many fields as dates, holding values like 200 and 150; indexing would accept it, but I think I would not be able to search or sort by date.

The only solution I have found so far is to create a document for each date/price pair, e.g. one document with http:///yyy, 1, Hotel Opera, 30/01/2008, 200 and another with http:///yyy, 1, Hotel Opera, 31/01/2008, 150. Then I'll have many documents for one hotel, and in order to search by date range I would need even more documents, such as: 28/01/2008 to 31/01/2008, 29/01/2008 to 31/01/2008, 30/01/2008 to 31/01/2008.

Since I need to index a lot of other information about a hotel (address, telephone, amenities, etc.), I wouldn't like to duplicate too much information, and I think it would not be scalable to search first in a dates index and then in a hotels index to retrieve the hotel information.

Any idea?
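For the "as many fields as dates" idea mentioned above, a common trick is to encode the date into the field name and declare a matching dynamic field in the schema. The sketch below only illustrates that idea: the price_YYYYMMDD naming, and the assumption that the schema declares something like <dynamicField name="price_*" type="sint" indexed="true" stored="true"/> so that range queries behave numerically, are both hypothetical.

import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch of a "one integer field per date" convention, e.g. price_20080130.
public class PriceFieldExample {

    private static final SimpleDateFormat FMT = new SimpleDateFormat("yyyyMMdd");

    // Field that holds the price for one specific night.
    static String priceField(Date night) {
        return "price_" + FMT.format(night);
    }

    // Range query for hotels priced between min and max on that night,
    // e.g. price_20080130:[100 TO 250]
    static String priceRangeQuery(Date night, int min, int max) {
        return priceField(night) + ":[" + min + " TO " + max + "]";
    }

    public static void main(String[] args) {
        Date night = new Date();
        System.out.println(priceField(night));
        System.out.println(priceRangeQuery(night, 100, 250));
    }
}

The trade-off is one extra field per calendar date on each hotel document, which keeps a single document per hotel; whether that beats one document per hotel/date depends on how many dates need to be searchable at once.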
Re: Indexing very large files.
I don't think this is a StringBuilder limitation, but rather your Java JVM doesn't start with enough memory. i.e. -Xmx. In raw Lucene, I've indexed 240M files Best Erick On Jan 16, 2008 10:12 AM, David Thibault <[EMAIL PROTECTED]> wrote: > All, > I just found a thread about this on the mailing list archives because I'm > troubleshooting the same problem. The kicker is that it doesn't take such > large files to kill the StringBuilder. I have discovered the following: > > By using a text file made up of 3,443,464 bytes or less, I get no error. > > AT 3,443,465 bytes: > > > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space > >at java.lang.String.(String.java:208) > >at java.lang.StringBuilder.toString(StringBuilder.java:431) > >at org.junit.Assert.format(Assert.java:321) > >at org.junit.ComparisonFailure$ComparisonCompactor.compact( > ComparisonFailure.java:80) > >at org.junit.ComparisonFailure.getMessage(ComparisonFailure.java > :37) > >at java.lang.Throwable.getLocalizedMessage(Throwable.java:267) > >at java.lang.Throwable.toString(Throwable.java:344) > >at java.lang.String.valueOf(String.java:2615) > >at java.io.PrintWriter.print(PrintWriter.java:546) > >at java.io.PrintWriter.println(PrintWriter.java:683) > >at java.lang.Throwable.printStackTrace(Throwable.java:510) > >at org.apache.tools.ant.util.StringUtils.getStackTrace( > StringUtils.java:96) > >at > > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.getFilteredTrace > (JUnitTestRunner.java:856) > >at > > org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.formatError > (XMLJUnitResultFormatter.java:280) > >at > > org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.addError > (XMLJUnitResultFormatter.java:255) > >at > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner$4.addError( > JUnitTestRunner.java:988) > >at junit.framework.TestResult.addError(TestResult.java:38) > >at junit.framework.JUnit4TestAdapterCache$1.testFailure( > JUnit4TestAdapterCache.java:51) > >at org.junit.runner.notification.RunNotifier$4.notifyListener( > RunNotifier.java:96) > >at org.junit.runner.notification.RunNotifier$SafeNotifier.run( > RunNotifier.java:37) > >at org.junit.runner.notification.RunNotifier.fireTestFailure( > RunNotifier.java:93) > >at org.junit.internal.runners.TestMethodRunner.addFailure( > TestMethodRunner.java:104) > >at org.junit.internal.runners.TestMethodRunner.runUnprotected( > TestMethodRunner.java:87) > >at org.junit.internal.runners.BeforeAndAfterRunner.runProtected( > BeforeAndAfterRunner.java:34) > >at org.junit.internal.runners.TestMethodRunner.runMethod( > TestMethodRunner.java:75) > >at org.junit.internal.runners.TestMethodRunner.run( > TestMethodRunner.java:45) > >at > org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod( > TestClassMethodsRunner.java:71) > >at org.junit.internal.runners.TestClassMethodsRunner.run( > TestClassMethodsRunner.java:35) > >at org.junit.internal.runners.TestClassRunner$1.runUnprotected( > TestClassRunner.java:42) > >at org.junit.internal.runners.BeforeAndAfterRunner.runProtected( > BeforeAndAfterRunner.java:34) > >at org.junit.internal.runners.TestClassRunner.run( > TestClassRunner.java:52) > >at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:32) > > > > AT 3,443,466 byes (or more) : > > > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space > >at java.lang.AbstractStringBuilder.expandCapacity( > AbstractStringBuilder.java:99) > >at java.lang.AbstractStringBuilder.append( > 
AbstractStringBuilder.java > :393) > >at java.lang.StringBuilder.append(StringBuilder.java:120) > >at org.junit.Assert.format(Assert.java:321) > >at org.junit.ComparisonFailure$ComparisonCompactor.compact( > ComparisonFailure.java:80) > >at org.junit.ComparisonFailure.getMessage(ComparisonFailure.java > :37) > >at java.lang.Throwable.getLocalizedMessage(Throwable.java:267) > >at java.lang.Throwable.toString(Throwable.java:344) > >at java.lang.String.valueOf(String.java:2615) > >at java.io.PrintWriter.print(PrintWriter.java:546) > >at java.io.PrintWriter.println(PrintWriter.java:683) > >at java.lang.Throwable.printStackTrace(Throwable.java:510) > >at org.apache.tools.ant.util.StringUtils.getStackTrace( > StringUtils.java:96) > >at > > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.getFilteredTrace > (JUnitTestRunner.java:856) > >at > > org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.formatError > (XMLJUnitResultFormatter.java:280) > >at > > org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFo
Re: Indexing very large files.
P.S. Lucene by default limits the maximum field length to 10K tokens, so you have to bump that for large files. Erick On Jan 16, 2008 11:04 AM, Erick Erickson <[EMAIL PROTECTED]> wrote: > I don't think this is a StringBuilder limitation, but rather your Java > JVM doesn't start with enough memory. i.e. -Xmx. > > In raw Lucene, I've indexed 240M files > > Best > Erick > > > On Jan 16, 2008 10:12 AM, David Thibault <[EMAIL PROTECTED]> > wrote: > > > All, > > I just found a thread about this on the mailing list archives because > > I'm > > troubleshooting the same problem. The kicker is that it doesn't take > > such > > large files to kill the StringBuilder. I have discovered the following: > > > > > > By using a text file made up of 3,443,464 bytes or less, I get no > > error. > > > > AT 3,443,465 bytes: > > > > > > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space > > > >at java.lang.String .(String.java:208) > > > >at java.lang.StringBuilder.toString(StringBuilder.java:431) > > > >at org.junit.Assert.format(Assert.java:321) > > > >at org.junit.ComparisonFailure$ComparisonCompactor.compact ( > > ComparisonFailure.java:80) > > > >at org.junit.ComparisonFailure.getMessage(ComparisonFailure.java > > :37) > > > >at java.lang.Throwable.getLocalizedMessage(Throwable.java:267) > > > >at java.lang.Throwable.toString (Throwable.java:344) > > > >at java.lang.String.valueOf(String.java:2615) > > > >at java.io.PrintWriter.print(PrintWriter.java:546) > > > >at java.io.PrintWriter.println(PrintWriter.java:683) > > > >at java.lang.Throwable.printStackTrace(Throwable.java:510) > > > >at org.apache.tools.ant.util.StringUtils.getStackTrace( > > StringUtils.java:96) > > > >at > > > > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.getFilteredTrace > > (JUnitTestRunner.java:856) > > > >at > > > > org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.formatError > > (XMLJUnitResultFormatter.java:280) > > > >at > > > > org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.addError > > (XMLJUnitResultFormatter.java:255) > > > >at > > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner$4.addError( > > JUnitTestRunner.java:988) > > > >at junit.framework.TestResult.addError(TestResult.java :38) > > > >at junit.framework.JUnit4TestAdapterCache$1.testFailure( > > JUnit4TestAdapterCache.java:51) > > > >at org.junit.runner.notification.RunNotifier$4.notifyListener( > > RunNotifier.java:96) > > > >at org.junit.runner.notification.RunNotifier$SafeNotifier.run( > > RunNotifier.java:37) > > > >at org.junit.runner.notification.RunNotifier.fireTestFailure( > > RunNotifier.java:93) > > > >at org.junit.internal.runners.TestMethodRunner.addFailure ( > > TestMethodRunner.java:104) > > > >at org.junit.internal.runners.TestMethodRunner.runUnprotected( > > TestMethodRunner.java:87) > > > >at org.junit.internal.runners.BeforeAndAfterRunner.runProtected( > > BeforeAndAfterRunner.java:34) > > > >at org.junit.internal.runners.TestMethodRunner.runMethod( > > TestMethodRunner.java:75) > > > >at org.junit.internal.runners.TestMethodRunner.run( > > TestMethodRunner.java :45) > > > >at > > org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod( > > TestClassMethodsRunner.java:71) > > > >at org.junit.internal.runners.TestClassMethodsRunner.run( > > TestClassMethodsRunner.java :35) > > > >at org.junit.internal.runners.TestClassRunner$1.runUnprotected( > > TestClassRunner.java:42) > > > >at org.junit.internal.runners.BeforeAndAfterRunner.runProtected( > > 
BeforeAndAfterRunner.java:34) > > > >at org.junit.internal.runners.TestClassRunner.run( > > TestClassRunner.java:52) > > > >at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java > > :32) > > > > > > > > AT 3,443,466 byes (or more) : > > > > > > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space > > > >at java.lang.AbstractStringBuilder.expandCapacity( > > AbstractStringBuilder.java:99) > > > >at java.lang.AbstractStringBuilder.append ( > > AbstractStringBuilder.java > > :393) > > > >at java.lang.StringBuilder.append(StringBuilder.java:120) > > > >at org.junit.Assert.format(Assert.java:321) > > > >at org.junit.ComparisonFailure$ComparisonCompactor.compact ( > > ComparisonFailure.java:80) > > > >at org.junit.ComparisonFailure.getMessage(ComparisonFailure.java > > :37) > > > >at java.lang.Throwable.getLocalizedMessage(Throwable.java:267) > > > >at java.lang.Throwable.toString (Throwable.java:344) > > > >at java.lang.String.valueOf(String.java:2615) > > > >at java.io.PrintWriter.print(PrintWriter.java:546) > > > >at java.io.PrintWriter.println(PrintW
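To make Erick's PS above concrete for the raw-Lucene case: the 10,000-token limit can be raised on the IndexWriter. This is only a sketch against the Lucene 2.x API of that era, with a made-up index path; in Solr the equivalent knob is the maxFieldLength setting in solrconfig.xml.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

// Raise Lucene's per-field token limit before indexing large documents.
// Anything past the limit is silently dropped, so this only matters if the
// whole file needs to be searchable.
public class RaiseFieldLimit {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("/tmp/test-index"),   // hypothetical path
                new StandardAnalyzer(),
                true /* create */);
        writer.setMaxFieldLength(Integer.MAX_VALUE);           // default is 10,000 tokens
        // ... addDocument() calls ...
        writer.close();
    }
}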
Re: Indexing very large files.
I think your PS might do the trick. My JVM doesn't seem to be the issue, because I've set it to -Xmx512m -Xms256m. I will track down the solr config parameter you mentioned and try that. Thanks for the quick response! Dave On 1/16/08, Erick Erickson <[EMAIL PROTECTED]> wrote: > > P.S. Lucene by default limits the maximum field length > to 10K tokens, so you have to bump that for large files. > > Erick > > On Jan 16, 2008 11:04 AM, Erick Erickson <[EMAIL PROTECTED]> wrote: > > > I don't think this is a StringBuilder limitation, but rather your Java > > JVM doesn't start with enough memory. i.e. -Xmx. > > > > In raw Lucene, I've indexed 240M files > > > > Best > > Erick > > > > > > On Jan 16, 2008 10:12 AM, David Thibault <[EMAIL PROTECTED]> > > wrote: > > > > > All, > > > I just found a thread about this on the mailing list archives because > > > I'm > > > troubleshooting the same problem. The kicker is that it doesn't take > > > such > > > large files to kill the StringBuilder. I have discovered the > following: > > > > > > > > > By using a text file made up of 3,443,464 bytes or less, I get no > > > error. > > > > > > AT 3,443,465 bytes: > > > > > > > > > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space > > > > > >at java.lang.String .(String.java:208) > > > > > >at java.lang.StringBuilder.toString(StringBuilder.java:431) > > > > > >at org.junit.Assert.format(Assert.java:321) > > > > > >at org.junit.ComparisonFailure$ComparisonCompactor.compact ( > > > ComparisonFailure.java:80) > > > > > >at org.junit.ComparisonFailure.getMessage( > ComparisonFailure.java > > > :37) > > > > > >at java.lang.Throwable.getLocalizedMessage(Throwable.java:267) > > > > > >at java.lang.Throwable.toString (Throwable.java:344) > > > > > >at java.lang.String.valueOf(String.java:2615) > > > > > >at java.io.PrintWriter.print(PrintWriter.java:546) > > > > > >at java.io.PrintWriter.println(PrintWriter.java:683) > > > > > >at java.lang.Throwable.printStackTrace(Throwable.java:510) > > > > > >at org.apache.tools.ant.util.StringUtils.getStackTrace( > > > StringUtils.java:96) > > > > > >at > > > > > > > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.getFilteredTrace > > > (JUnitTestRunner.java:856) > > > > > >at > > > > > > > org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.formatError > > > (XMLJUnitResultFormatter.java:280) > > > > > >at > > > > > > > org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.addError > > > (XMLJUnitResultFormatter.java:255) > > > > > >at > > > > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner$4.addError( > > > JUnitTestRunner.java:988) > > > > > >at junit.framework.TestResult.addError(TestResult.java :38) > > > > > >at junit.framework.JUnit4TestAdapterCache$1.testFailure( > > > JUnit4TestAdapterCache.java:51) > > > > > >at org.junit.runner.notification.RunNotifier$4.notifyListener( > > > RunNotifier.java:96) > > > > > >at org.junit.runner.notification.RunNotifier$SafeNotifier.run( > > > RunNotifier.java:37) > > > > > >at org.junit.runner.notification.RunNotifier.fireTestFailure( > > > RunNotifier.java:93) > > > > > >at org.junit.internal.runners.TestMethodRunner.addFailure ( > > > TestMethodRunner.java:104) > > > > > >at org.junit.internal.runners.TestMethodRunner.runUnprotected( > > > TestMethodRunner.java:87) > > > > > >at org.junit.internal.runners.BeforeAndAfterRunner.runProtected > ( > > > BeforeAndAfterRunner.java:34) > > > > > >at org.junit.internal.runners.TestMethodRunner.runMethod( > > > 
TestMethodRunner.java:75) > > > > > >at org.junit.internal.runners.TestMethodRunner.run( > > > TestMethodRunner.java :45) > > > > > >at > > > org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod( > > > TestClassMethodsRunner.java:71) > > > > > >at org.junit.internal.runners.TestClassMethodsRunner.run( > > > TestClassMethodsRunner.java :35) > > > > > >at org.junit.internal.runners.TestClassRunner$1.runUnprotected( > > > TestClassRunner.java:42) > > > > > >at org.junit.internal.runners.BeforeAndAfterRunner.runProtected > ( > > > BeforeAndAfterRunner.java:34) > > > > > >at org.junit.internal.runners.TestClassRunner.run( > > > TestClassRunner.java:52) > > > > > >at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java > > > :32) > > > > > > > > > > > > AT 3,443,466 byes (or more) : > > > > > > > > > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space > > > > > >at java.lang.AbstractStringBuilder.expandCapacity( > > > AbstractStringBuilder.java:99) > > > > > >at java.lang.AbstractStringBuilder.append ( > > > AbstractStringBuilder.java > > > :393) > > > > > >at java.lang.StringBuilder.append(StringBuilder.jav
Re: Indexing very large files.
I tried raising the 10000 maxFieldLength limit under indexDefaults as well as mainIndex, and still no luck. I'm trying to upload a text file that is about 8 MB in size. I think the following stack trace still points to some sort of overflowed String issue. Thoughts?

Solr returned an error: Java heap space

java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232)
    at java.lang.StringCoding.encode(StringCoding.java:272)
    at java.lang.String.getBytes(String.java:947)
    at org.apache.lucene.index.FieldsWriter.addDocument(FieldsWriter.java:98)
    at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:107)
    at org.apache.lucene.index.IndexWriter.buildSingleDocSegment(IndexWriter.java:977)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:965)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:947)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:270)
    at org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:166)
    at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
    at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
    at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
    at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
    at java.lang.Thread.run(Thread.java:619)

java.io.IOException: Server returned HTTP response code: 500 for URL: http://solr:8080/solr/update
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1170)
    at com.itstrategypartners.sents.solrUpload.SimplePostTool.postData(SimplePostTool.java:134)
    at com.itstrategypartners.sents.solrUpload.SimplePostTool.postFile(SimplePostTool.java:87)
    at com.itstrategypartners.sents.solrUpload.Uploader.uploadFile(Uploader.java:97)
    at com.itstrategypartners.sents.solrUpload.UploaderTest.uploadFile(UploaderTest.java:95)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.junit.internal.runners.TestMethodRunner.executeMethodBody(TestMethodRunner.java:99)
    at org.junit.internal.runners.TestMethodRunner.runUnprotected(TestMethodRunner.java:81)
    at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
    at org.junit.internal.runners.TestMethodRunner.runMethod(TestMethodRunner.java:75)
    at org.junit.internal.runners.TestMethodRunner.run(TestMethodRunner.java:45)
    at org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod(TestClassMethodsRunner.java:71)
    at org.junit.internal.runners.TestClassMethodsRunner.run(TestClassMethodsRunner.java:35)
    at org.junit.internal.runners.TestClassRunner$1.runUnprotected(TestClassRunner.java:42)
    at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
    at org.junit.internal.runners.TestClassRunner.run(TestClassRunner.java:52)
    at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:32)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:421)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:912)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:766)

On 1/16/08, David Thibault <[EMAIL PROT
Re: Indexing very large files.
The PS really wasn't related to your OOM, and raising that shouldn't have changed the behavior. All that happens if you go beyond 10,000 tokens is that the rest gets thrown away. But we're beyond my real knowledge level about SOLR, so I'll defer to others. A very quick-n-dirty test as to whether you're actually allocation more memory to the process you *think* you are would be to bump it ridiculously higher. I'm completely unclear about what process gets the increased memory relative to the server. [EMAIL PROTECTED] On Jan 16, 2008 11:33 AM, David Thibault <[EMAIL PROTECTED]> wrote: > I tried raising the 1 under > as well as and still no luck. I'm trying to > upload a text file that is about 8 MB in size. I think the following > stack > trace still points to some sort of overflowed String issue. Thoughts? > Solr returned an error: Java heap space java.lang.OutOfMemoryError: Java > heap space > at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232) > at java.lang.StringCoding.encode(StringCoding.java:272) > at java.lang.String.getBytes(String.java:947) > at org.apache.lucene.index.FieldsWriter.addDocument(FieldsWriter.java:98) > at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java > :107) > > at org.apache.lucene.index.IndexWriter.buildSingleDocSegment( > IndexWriter.java:977) > at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:965) > at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:947) > at org.apache.solr.update.DirectUpdateHandler2.addDoc( > DirectUpdateHandler2.java:270) > at org.apache.solr.handler.XmlUpdateRequestHandler.update( > XmlUpdateRequestHandler.java:166) > at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody( > XmlUpdateRequestHandler.java:84) > at org.apache.solr.handler.RequestHandlerBase.handleRequest( > RequestHandlerBase.java:77) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:658) at > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java > :191) > > at org.apache.solr.servlet.SolrDispatchFilter.doFilter( > SolrDispatchFilter.java:159) > at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter( > ApplicationFilterChain.java:215) > at org.apache.catalina.core.ApplicationFilterChain.doFilter( > ApplicationFilterChain.java:188) > at org.apache.catalina.core.StandardWrapperValve.invoke( > StandardWrapperValve.java:213) > at org.apache.catalina.core.StandardContextValve.invoke( > StandardContextValve.java:174) > at org.apache.catalina.core.StandardHostValve.invoke( > StandardHostValve.java > :127) > at org.apache.catalina.valves.ErrorReportValve.invoke( > ErrorReportValve.java > :117) > at org.apache.catalina.core.StandardEngineValve.invoke( > StandardEngineValve.java:108) > at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java > :151) > > at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java > :874) > > at > > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection > (Http11BaseProtocol.java:665) > at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket( > PoolTcpEndpoint.java:528) > at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt( > LeaderFollowerWorkerThread.java:81) > at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run( > ThreadPool.java:689) > at java.lang.Thread.run(Thread.java:619) > > java.io.IOException: Server returned HTTP response code: 500 for URL: > http://solr:8080/solr/update >at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream( > HttpURLConnection.java:1170) >at com.itstrategypartners.sents.solrUpload.SimplePostTool.postData( > SimplePostTool.java:134) >at com.itstrategypartners.sents.solrUpload.SimplePostTool.postFile( > SimplePostTool.java:87) >at com.itstrategypartners.sents.solrUpload.Uploader.uploadFile( > Uploader.java:97) >at com.itstrategypartners.sents.solrUpload.UploaderTest.uploadFile( > UploaderTest.java:95) >at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >at sun.reflect.NativeMethodAccessorImpl.invoke( > NativeMethodAccessorImpl.java:39) >at sun.reflect.DelegatingMethodAccessorImpl.invoke( > DelegatingMethodAccessorImpl.java:25) >at java.lang.reflect.Method.invoke(Method.java:585) >at org.junit.internal.runners.TestMethodRunner.executeMethodBody( > TestMethodRunner.java:99) >at org.junit.internal.runners.TestMethodRunner.runUnprotected( > TestMethodRunner.java:81) >at org.junit.internal.runners.BeforeAndAfterRunner.runProtected( > BeforeAndAfterRunner.java:34) >at org.junit.internal.runners.TestMethodRunner.runMethod( > TestMethodRunner.java:75) >at org.junit.internal.runners.TestMethodRunner.run( > TestMethodRunner.java:45) >at > org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod( > TestClassMethodsRunner.java:71) >at org.junit.internal.runne
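A minimal version of the "are you actually giving the memory to the process you think you are" check Erick suggests above: have the JVM that runs Solr report its own heap limit (for example from a small JSP or servlet inside the webapp) and compare it with the -Xmx you believe was passed to Tomcat. The class below is just a sketch of the two calls involved.

// Report how much heap the current JVM can and does use.
public class HeapCheck {
    public static void main(String[] args) {
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        long totalMb = Runtime.getRuntime().totalMemory() / (1024 * 1024);
        System.out.println("Max heap this JVM will use: " + maxMb + " MB");
        System.out.println("Heap currently allocated:   " + totalMb + " MB");
    }
}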
Re: Indexing very large files.
This error means that the JVM has run out of heap space. Increase the heap space. That is an option on the "java" command. I set my heap to 200 Meg and do it this way with Tomcat 6: JAVA_OPTS="-Xmx600M" tomcat/bin/startup.sh wunder On 1/16/08 8:33 AM, "David Thibault" <[EMAIL PROTECTED]> wrote: > java.lang.OutOfMemoryError: Java heap space
Re: Indexing very large files.
Nice signature...=) On 1/16/08, Erick Erickson <[EMAIL PROTECTED]> wrote: > > The PS really wasn't related to your OOM, and raising that shouldn't > have changed the behavior. All that happens if you go beyond 10,000 > tokens is that the rest gets thrown away. > > But we're beyond my real knowledge level about SOLR, so I'll defer > to others. A very quick-n-dirty test as to whether you're actually > allocation more memory to the process you *think* you are would be > to bump it ridiculously higher. I'm completely unclear about what > process gets the increased memory relative to the server. > > [EMAIL PROTECTED] > > > On Jan 16, 2008 11:33 AM, David Thibault <[EMAIL PROTECTED]> > wrote: > > > I tried raising the 1 under > > as well as and still no luck. I'm trying to > > upload a text file that is about 8 MB in size. I think the following > > stack > > trace still points to some sort of overflowed String issue. Thoughts? > > Solr returned an error: Java heap space java.lang.OutOfMemoryError: > Java > > heap space > > at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232) > > at java.lang.StringCoding.encode(StringCoding.java:272) > > at java.lang.String.getBytes(String.java:947) > > at org.apache.lucene.index.FieldsWriter.addDocument(FieldsWriter.java > :98) > > at org.apache.lucene.index.DocumentWriter.addDocument( > DocumentWriter.java > > :107) > > > > at org.apache.lucene.index.IndexWriter.buildSingleDocSegment( > > IndexWriter.java:977) > > at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:965) > > at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:947) > > at org.apache.solr.update.DirectUpdateHandler2.addDoc( > > DirectUpdateHandler2.java:270) > > at org.apache.solr.handler.XmlUpdateRequestHandler.update( > > XmlUpdateRequestHandler.java:166) > > at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody( > > XmlUpdateRequestHandler.java:84) > > at org.apache.solr.handler.RequestHandlerBase.handleRequest( > > RequestHandlerBase.java:77) > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:658) at > > org.apache.solr.servlet.SolrDispatchFilter.execute( > SolrDispatchFilter.java > > :191) > > > > at org.apache.solr.servlet.SolrDispatchFilter.doFilter( > > SolrDispatchFilter.java:159) > > at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter( > > ApplicationFilterChain.java:215) > > at org.apache.catalina.core.ApplicationFilterChain.doFilter( > > ApplicationFilterChain.java:188) > > at org.apache.catalina.core.StandardWrapperValve.invoke( > > StandardWrapperValve.java:213) > > at org.apache.catalina.core.StandardContextValve.invoke( > > StandardContextValve.java:174) > > at org.apache.catalina.core.StandardHostValve.invoke( > > StandardHostValve.java > > :127) > > at org.apache.catalina.valves.ErrorReportValve.invoke( > > ErrorReportValve.java > > :117) > > at org.apache.catalina.core.StandardEngineValve.invoke( > > StandardEngineValve.java:108) > > at org.apache.catalina.connector.CoyoteAdapter.service( > CoyoteAdapter.java > > :151) > > > > at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java > > :874) > > > > at > > > > > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection > > (Http11BaseProtocol.java:665) > > at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket( > > PoolTcpEndpoint.java:528) > > at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt( > > LeaderFollowerWorkerThread.java:81) > > at 
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run( > > ThreadPool.java:689) > > at java.lang.Thread.run(Thread.java:619) > > > > java.io.IOException: Server returned HTTP response code: 500 for URL: > > http://solr:8080/solr/update > >at sun.net.www.protocol.http.HttpURLConnection.getInputStream( > > HttpURLConnection.java:1170) > >at > com.itstrategypartners.sents.solrUpload.SimplePostTool.postData( > > SimplePostTool.java:134) > >at > com.itstrategypartners.sents.solrUpload.SimplePostTool.postFile( > > SimplePostTool.java:87) > >at com.itstrategypartners.sents.solrUpload.Uploader.uploadFile( > > Uploader.java:97) > >at > com.itstrategypartners.sents.solrUpload.UploaderTest.uploadFile( > > UploaderTest.java:95) > >at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > >at sun.reflect.NativeMethodAccessorImpl.invoke( > > NativeMethodAccessorImpl.java:39) > >at sun.reflect.DelegatingMethodAccessorImpl.invoke( > > DelegatingMethodAccessorImpl.java:25) > >at java.lang.reflect.Method.invoke(Method.java:585) > >at org.junit.internal.runners.TestMethodRunner.executeMethodBody( > > TestMethodRunner.java:99) > >at org.junit.internal.runners.TestMethodRunner.runUnprotected( > > TestMethodRunner.java:81) > >at org.junit.internal.runners.BeforeAndAfterRunner.runProtected( > > BeforeAndAfterRunner.java:34) > >at org.junit.intern
Re: Indexing very large files.
Walter and all,

I had been bumping up the heap for my Java app (running outside of Tomcat), but I hadn't yet tried bumping up my Tomcat heap. That seems to have helped me upload the 8 MB file, but it's crashing while uploading a 32 MB file now. I just bumped Tomcat to 1024 MB of heap, so I'm not sure what the problem is now. I suspect Walter was on to something, since it sort of fixed my problem. I will keep troubleshooting the Tomcat memory and go from there.

Best,
Dave

On 1/16/08, Walter Underwood <[EMAIL PROTECTED]> wrote:
>
> This error means that the JVM has run out of heap space. Increase the
> heap space. That is an option on the "java" command. I set my heap to
> 200 Meg and do it this way with Tomcat 6:
>
> JAVA_OPTS="-Xmx600M" tomcat/bin/startup.sh
>
> wunder
>
> On 1/16/08 8:33 AM, "David Thibault" <[EMAIL PROTECTED]> wrote:
>
> > java.lang.OutOfMemoryError: Java heap space
>
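A rough back-of-the-envelope on why the file size matters so much (the figures are estimates, not measurements from this thread): the server-side trace earlier dies in String.getBytes() inside FieldsWriter.addDocument(), i.e. while the whole document body is held as one Java String. A 32 MB text file is roughly 32 million characters, so about 64 MB of char data, and during the update the server may also hold the raw POST body, the XML parser's copy of the field, and the byte[] produced by getBytes(), so peak per-request usage can easily reach several times the file size. If a 1024 MB heap still fails on a 32 MB upload, it is worth confirming (for example with the Runtime.maxMemory() check sketched above) that the new -Xmx is actually reaching the Tomcat JVM, and how much of that heap the caches and searchers are already using.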
Re: Solr in a distributed multi-machine high-performance environment
Thanks for that Shalin. Looks like I have to wait and keep track of developments.

Forgetting about indexes that cannot fit on a single machine (distributed search), are there any links on running Solr in a 2-machine environment? I want to measure how much improvement there will be in performance with the addition of machines for computation (space later), and I need a 2-machine setup for that.

Thanks,
Srikant

Shalin Shekhar Mangar wrote:

> Look at http://issues.apache.org/jira/browse/SOLR-303
>
> Please note that it is still work in progress. So you may not be able to use
> it immediately.
Re: Cache size and Heap size
I'm using Tomcat. I set Max Size = 5 GB and I checked in a profiler that it actually uses the whole memory. There is no significant memory use by other applications. The whole change was that I increased the size of the cache to:

LRU Cache(maxSize=1048576, initialSize=1048576, autowarmCount=524288, [EMAIL PROTECTED])

I know this is a lot and I'm going to decrease it; I was just experimenting. But I need some guidelines on how to calculate the right size of the cache.

Thank you,
Gene

----- Original Message ----
From: Daniel Alheiros <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, January 16, 2008 10:48:50 AM
Subject: Re: Cache size and Heap size

Hi Gene.

Have you set your app server / servlet container to allocate some of this memory to be used? You can define the maximum and minimum heap size adding/replacing some parameters on the app server initialization:

-Xmx1536m -Xms1536m

Which app server / servlet container are you using?

Regards,
Daniel Alheiros

On 16/1/08 15:23, "Evgeniy Strokin" <[EMAIL PROTECTED]> wrote:

> Hello,
> I have relatively large RAM (10Gb) on my server which is running Solr. I
> increased Cache settings and start to see OutOfMemory exceptions, specially on
> facet search.
> Is anybody has some suggestions how Cache settings related to Memory
> consumptions? What are optimal settings? How they could be calculated?
>
> Thank you for any advise,
> Gene
Re: Solr in a distributed multi-machine high-performance environment
Solr provides a few scripts to create a multiple-machine deployment. One box is setup as the master (used primarily for writes) and others as slaves. Slaves are added as per application requirements. The index is transferred using rsync. Look at http://wiki.apache.org/solr/CollectionDistribution for details. You can put the slaves behind a load balancer or share the slaves among your front-end servers to measure performance. On Jan 17, 2008 12:39 AM, Srikant Jakilinki <[EMAIL PROTECTED]> wrote: > Thanks for that Shalin. Looks like I have to wait and keep track of > developments. > > Forgetting about indexes that cannot be fit on a single machine > (distributed search), any links to have Solr running in a 2-machine > environment? I want to measure how much improvement there will be in > performance with the addition of machines for computation (space later) > and I need a 2-machine setup for that. > > Thanks > Srikant > > Shalin Shekhar Mangar wrote: > > Look at http://issues.apache.org/jira/browse/SOLR-303 > > > > Please note that it is still work in progress. So you may not be able to > use > > it immeadiately. > > > > -- > Find out how you can get spam free email. > http://www.bluebottle.com/tag/3 > > -- Regards, Shalin Shekhar Mangar.
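A minimal sketch of the "share the slaves among your front-end servers" option mentioned above (the slave host names are made up, and a real load balancer with health checking would replace this in production): each front end simply picks a query slave round-robin.

import java.util.concurrent.atomic.AtomicInteger;

// Round-robin selection of a query slave on the front end.
public class SlavePicker {
    private final String[] slaves = {
        "http://slave1:8983/solr/select",   // hypothetical hosts
        "http://slave2:8983/solr/select"
    };
    private final AtomicInteger next = new AtomicInteger(0);

    public String pick() {
        int i = Math.abs(next.getAndIncrement() % slaves.length);
        return slaves[i];
    }
}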
Re: Solr in a distributed multi-machine high-performance environment
On 16-Jan-08, at 11:09 AM, Srikant Jakilinki wrote:

> Thanks for that Shalin. Looks like I have to wait and keep track of developments.
>
> Forgetting about indexes that cannot be fit on a single machine (distributed search), any links to have Solr running in a 2-machine environment? I want to measure how much improvement there will be in performance with the addition of machines for computation (space later) and I need a 2-machine setup for that.

If you are looking for automatic replication and load-balancing across multiple machines, Solr does not provide that. The typical strategy is as follows: index half the documents on one machine and half on another. Execute both queries simultaneously (using threads, f.i.), and combine the results. You should observe a speed up.

-Mike
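A bare-bones sketch of "execute both queries simultaneously and combine the results" (the host names and the query are made up, and the merging/re-sorting of the two result sets is left to the application; nothing in this setup does it for you):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Send the same query to two Solr instances in parallel and collect the raw responses.
public class TwoMachineQuery {

    static String fetch(String url) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), "UTF-8"));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            sb.append(line).append('\n');
        }
        in.close();
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        final String params = "?q=solr&rows=10";                 // hypothetical query
        String[] shards = { "http://box1:8983/solr/select",      // half the docs
                            "http://box2:8983/solr/select" };    // the other half

        ExecutorService pool = Executors.newFixedThreadPool(shards.length);
        List<Future<String>> responses = new ArrayList<Future<String>>();
        for (final String shard : shards) {
            responses.add(pool.submit(new Callable<String>() {
                public String call() throws Exception {
                    return fetch(shard + params);
                }
            }));
        }
        for (Future<String> f : responses) {
            System.out.println(f.get());   // combine/re-rank the two result sets here
        }
        pool.shutdown();
    }
}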
Re: Solr in a distributed multi-machine high-performance environment
On 15-Jan-08, at 9:23 PM, Srikant Jakilinki wrote: 2) Solr that has to handle a large collective index which has to be split up on multi-machines - The index is ever increasing (TB scale) and dynamic and all of it has to be searched at any point This will require significant development on your part. Nutch may be able to provide more of what you need OOB. 3) Solr that has to exploit multi-machines because we have plenty of them in a tightly coupled P2P scenario - Machines are not a problem but will they be if they are of varied configurations (PIII to Core2; Linux to Vista; 32-bit to 64-bit; J2SE 1.1 to 1.6) Solr requires java 1.5, lucene requires java 1.4. Also, there is certainly no point mixing PIII's and modern cpus: trying to achieve the appropriate balance between machines of such disparate capability will take much more effort than will be gained out of using them. -Mike
Re: Cache size and Heap size
On 16-Jan-08, at 11:15 AM, [EMAIL PROTECTED] wrote: I'm using Tomcat. I set Max Size = 5Gb and I checked in profiler that it's actually uses whole memory. There is no significant memory use by other applications. Whole change was I increased the size of cache to: LRU Cache(maxSize=1048576, initialSize=1048576, autowarmCount=524288, [EMAIL PROTECTED]) autowarmcount > maxSize certainly doesn't make sense. I know this is a lot and I'm going to decrease it, I was just experimenting, but I need some guidelines of how to calculate the right size of the cache. Each filter that matches more than ~3000 documents will occupy maxDocs/8 bytes of memory. Certain kinds of faceting require one entry per unique value in a field. The best way to tune this is to monitor your cache hit/expunge statistics for the filter cache (on the solr admin statistics screen). -Mike
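To make the maxDocs/8 rule concrete, here is a back-of-the-envelope estimate for the filterCache Gene configured, using a hypothetical 10-million-document index (the real maxDocs isn't given in the thread):

```java
// Rough filterCache sizing based on the maxDocs/8 rule above.
// The 10M-document index size is a made-up example; use the maxDocs
// reported on your own statistics page.
public class FilterCacheEstimate {
    public static void main(String[] args) {
        long maxDocs = 10000000L;          // hypothetical index size
        long cacheEntries = 1048576L;      // the maxSize Gene configured
        long bytesPerFilter = maxDocs / 8; // bitset cost of one cached filter
        long worstCaseBytes = cacheEntries * bytesPerFilter;
        System.out.println("per filter: " + (bytesPerFilter / 1024) + " KB");
        System.out.println("worst case: " + (worstCaseBytes / (1024L * 1024 * 1024)) + " GB");
        // ~1.2 MB per filter, and over a thousand gigabytes if the cache ever
        // filled up (ignoring the small-set optimization for filters matching
        // fewer than ~3000 docs).
    }
}
```

Even as a crude upper bound, this shows why a maxSize of 1048576 can't be justified by available RAM; the hit/expunge statistics Mike mentions are the right way to pick the size.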
Re: Problem with dismax handler when searching Solr along with field
On 16-Jan-08, at 3:15 AM, farhanali wrote: when I search a query, for example http://localhost:8983/solr/select/?q=category&qt=dismax it gives results, but when I want to search on the basis of a field name like http://localhost:8983/solr/select/?q=maincategory:Cars&qt=dismax it does not give results; however http://localhost:8983/solr/select/?q=maincategory:Cars returns results of cars from the field named "maincategory" Anyone have some idea??? The dismax handler does not allow you to use lucene query syntax. The qf parameter must be used to select the fields to query (alternatively, you can provide a lucene-style query in an fq filter). See the documentation here: http://wiki.apache.org/solr/DisMaxRequestHandler -Mike
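For illustration, the two fixes Mike describes could look like the requests below (passing maincategory via qf, or via an fq filter, is an assumption built on farhanali's example URLs, not something taken from his configuration):

```
http://localhost:8983/solr/select/?q=Cars&qt=dismax&qf=maincategory
http://localhost:8983/solr/select/?q=category&qt=dismax&fq=maincategory:Cars
```

The first searches the maincategory field through dismax's qf; the second keeps the dismax query as-is and applies the lucene-syntax restriction as a cached filter.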
Spell checker index rebuild
Having another weird spell checker index issue. Starting off from a clean index and spell check index, I'll index everything in example/exampledocs. After the first rebuild of the spellchecker index using the query below, the word 'blackjack' exists in the spellchecker index. Great, no problems. Rebuild it again and the word 'blackjack' does not exist any more. http://localhost:8983/solr/core0/select?q=blackjack&qt=spellchecker&cmd=rebuild Any ideas? This is with a Solr trunk build from yesterday. doug
IOException: read past EOF during optimize phase
I am using the embedded Solr API for my indexing process. I created a brand new index with my application without any problem. I then ran my indexer in incremental mode. This process copies the working index to a temporary Solr location, adds/updates any records, optimizes the index, and then copies it back to the working location. There are currently not any instances of Solr reading this index. Also, I commit after every 10 rows. The schema.xml and solrconfig.xml files have not changed. Here is my function call. protected void optimizeProducts() throws IOException { UpdateHandler updateHandler = m_SolrCore.getUpdateHandler(); CommitUpdateCommand commitCmd = new CommitUpdateCommand(true); commitCmd.optimize = true; updateHandler.commit(commitCmd); log.info("Optimized index"); } So, during the optimize phase, I get the following stack trace: java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:89) at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34) at org.apache.lucene.store.IndexInput.readChars(IndexInput.java:107) at org.apache.lucene.store.IndexInput.readString(IndexInput.java:93) at org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:211) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:119) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:323) at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:206) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1835) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1195) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:508) at ... There are no exceptions or anything else that appears to be incorrect during the adds or commits. After this, the index files are still non-optimized. I know there is not a whole lot to go on here. Anything in particular that I should look at?
Re: Big number of conditions of the search
I see,.. but I really need to run it on Solr. We have already indexed everything. I don't really want to construct a query with 1K OR conditions, and send it to Solr to parse first and run after. Maybe there is a way to go directly to Lucene, or Solr, and run such a query from Java, passing an array of IDs, or something like this? Could anybody give me some advice on how to do this in a better way? Thank you Gene - Original Message From: Otis Gospodnetic <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Friday, January 11, 2008 12:26:14 AM Subject: Re: Big number of conditions of the search Evgeniy - sounds like a problem best suited for an RDBMS, really. You can run such an OR query, but you'll have to manually increase the max number of clauses allowed (in one of the configs) and make sure the JVM has plenty of memory. But again, this is best done in an RDBMS with some count(*) and GROUP BY selects. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Evgeniy Strokin <[EMAIL PROTECTED]> To: Solr User Sent: Thursday, January 10, 2008 4:39:44 PM Subject: Big number of conditions of the search Hello, I don't know how to formulate this right, so I'll give an example: I have 20 million documents with a unique ID indexed. I have a list of IDs stored somewhere. I need to run a query which will take documents with IDs from my list and give me some statistics. For example: my documents are addresses with a unique ID. I have a list which contains 10 thousand IDs of some addresses. I need to find how many addresses from my list are in NJ. Or another scenario: give me all states my addresses are from and how many addresses are in each state (only addresses from my list)? So I was thinking I could run a facet search by the field "State", but my query would be like this: ID:123 OR ID:23987 OR ID:294343 10K such OR conditions in a row, which is ridiculous and not even possible, I think. Could somebody suggest some solution for this? Thank you Gene
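As a rough illustration of Otis's suggestion (run the big OR query after raising the clause limit), the sketch below sends the ID list as a single filter query and facets on the State field with rows=0. It assumes the 1.3-dev SolrJ client, Gene's field names (ID, State), and that maxBooleanClauses in solrconfig.xml has been raised above the number of IDs; loadIdsFromSomewhere() is a placeholder for wherever the list really lives.

```java
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetByIdList {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

        List<String> ids = loadIdsFromSomewhere(); // ~10K IDs from the external list

        // Build one big field-grouped OR clause: ID:(123 OR 23987 OR ...)
        StringBuilder fq = new StringBuilder("ID:(");
        for (int i = 0; i < ids.size(); i++) {
            if (i > 0) fq.append(" OR ");
            fq.append(ids.get(i));
        }
        fq.append(")");

        SolrQuery q = new SolrQuery("*:*");
        q.addFilterQuery(fq.toString()); // ends up in the filterCache after first use
        q.setRows(0);                    // only the facet counts are needed
        q.setFacet(true);
        q.addFacetField("State");

        QueryResponse rsp = server.query(q);
        for (FacetField.Count c : rsp.getFacetField("State").getValues()) {
            System.out.println(c.getName() + ": " + c.getCount());
        }
    }

    private static List<String> loadIdsFromSomewhere() {
        throw new UnsupportedOperationException("placeholder for the real ID source");
    }
}
```

Because the fq is cached, repeating the query with different facet fields (State, then Last_Name, etc.) only pays the cost of the huge boolean query once.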
Re: Indexing very large files.
OK, I have now bumped my tomcat JVM up to 1024MB min and 1500MB max. For some reason Walter's suggestion helped me get past the 8MB file upload to Solr but it's still choking on a 32MB file. Is there a way to set per-webapp JVM settings in tomcat, or is the overall tomcat JVM sufficient to set? I can't see anything in the tomcat manager to suggest that there are smaller memory limitations for solr or any other webapp (all the demo webapps that tomcat comes with are still there right now). Here's the trace I get when I try to upload the 32MB file: java.lang.OutOfMemoryError: Java heap space at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java :95) at sun.net.www.http.PosterOutputStream.write(PosterOutputStream.java :61) at sun.nio.cs.StreamEncoder$CharsetSE.writeBytes(StreamEncoder.java :336) at sun.nio.cs.StreamEncoder$CharsetSE.implWrite(StreamEncoder.java :395) at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:136) at java.io.OutputStreamWriter.write(OutputStreamWriter.java:191) at com.itstrategypartners.sents.solrUpload.SimplePostTool.pipe( SimplePostTool.java:167) at com.itstrategypartners.sents.solrUpload.SimplePostTool.postData( SimplePostTool.java:125) at com.itstrategypartners.sents.solrUpload.SimplePostTool.postFile( SimplePostTool.java:87) at com.itstrategypartners.sents.solrUpload.Uploader.uploadFile( Uploader.java:97) at com.itstrategypartners.sents.solrUpload.UploaderTest.uploadFile( UploaderTest.java:95) Any more thoughts on possible causes? Best, Dave On 1/16/08, David Thibault <[EMAIL PROTECTED]> wrote: > > Walter and all, > > I had been bumping up the heap for my Java app (running outside of Tomcat) > but I hadn't yet tried bumping up my Tomcat heap. That seems to have helped > me upload the 8MB file, but it's crashing while uploading a 32MB file now. I > Just bumped tomcat to 1024MB of heap, so I'm not sure what the problem is > now. I suspect Walter was on to something, since it sort of fixed my > problem. I will keep troubleshooting the Tomcat memory and go from there.. > > > Best, > Dave > > On 1/16/08, Walter Underwood < [EMAIL PROTECTED]> wrote: > > > > This error means that the JVM has run out of heap space. Increase the > > heap space. That is an option on the "java" command. I set my heap to > > 200 Meg and do it this way with Tomcat 6: > > > > JAVA_OPTS="-Xmx600M" tomcat/bin/startup.sh > > > > wunder > > > > On 1/16/08 8:33 AM, "David Thibault" < [EMAIL PROTECTED]> > > wrote: > > > > > java.lang.OutOfMemoryError: Java heap space > > > > > >
RE: Indexing very large files.
I think you should try isolating the problem. It may turn out that the problem isn't really to do with Solr, but file uploading. I'm no expert, but that's what I'd try out in such situation. Cheers, Timothy Wonil Lee http://timundergod.blogspot.com/ http://www.google.com/reader/shared/16849249410805339619 -Original Message- From: David Thibault [mailto:[EMAIL PROTECTED] Sent: Thursday, 17 January 2008 8:30 AM To: solr-user@lucene.apache.org Subject: Re: Indexing very large files.
Re: IOException: read past EOF during optimize phase
Kevin, Don't have the answer to EOF but I'm wondering why the index moving. You don't need to do that as far as Solr is concerned. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Kevin Osborn <[EMAIL PROTECTED]> To: Solr Sent: Wednesday, January 16, 2008 3:07:23 PM Subject: IOException: read past EOF during optimize phase I am using the embedded Solr API for my indexing process. I created a brand new index with my application without any problem. I then ran my indexer in incremental mode. This process copies the working index to a temporary Solr location, adds/updates any records, optimizes the index, and then copies it back to the working location. There are currently not any instances of Solr reading this index. Also, I commit after every 10 rows. The schema.xml and solrconfig.xml files have not changed. Here is my function call. protected void optimizeProducts() throws IOException { UpdateHandler updateHandler = m_SolrCore.getUpdateHandler(); CommitUpdateCommand commitCmd = new CommitUpdateCommand(true); commitCmd.optimize = true; updateHandler.commit(commitCmd); log.info("Optimized index"); } So, during the optimize phase, I get the following stack trace: java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:89) at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34) at org.apache.lucene.store.IndexInput.readChars(IndexInput.java:107) at org.apache.lucene.store.IndexInput.readString(IndexInput.java:93) at org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:211) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:119) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:323) at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:206) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1835) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1195) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:508) at ... There are no exceptions or anything else that appears to be incorrect during the adds or commits. After this, the index files are still non-optimized. I know there is not a whole lot to go on here. Anything in particular that I should look at?
Re: Spell checker index rebuild
Do you trust the spellchecker 100% (not looking at its source now). I'd peek at the index with Luke (Luke I trust :)) and see if that term is really there first. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Doug Steigerwald <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 2:56:35 PM Subject: Spell checker index rebuild Having another weird spell checker index issue. Starting off from a clean index and spell check index, I'll index everything in example/exampledocs. On the first rebuild of the spellchecker index using the query below says the word 'blackjack' exists in the spellchecker index. Great, no problems. Rebuild it again and the word 'blackjack' does not exist any more. http://localhost:8983/solr/core0/select?q=blackjack&qt=spellchecker&cmd=rebuild Any ideas? This is with a Solr trunk build from yesterday. doug
Re: IOException: read past EOF during optimize phase
It is more of a file structure thing for our application. We build in one place and do our index syncing in a different place. I doubt it is relevant to this issue, but figured I would include this information anyway. - Original Message From: Otis Gospodnetic <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 2:21:31 PM Subject: Re: IOException: read past EOF during optimize phase Kevin, Don't have the answer to EOF but I'm wondering why the index moving. You don't need to do that as far as Solr is concerned. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Kevin Osborn <[EMAIL PROTECTED]> To: Solr Sent: Wednesday, January 16, 2008 3:07:23 PM Subject: IOException: read past EOF during optimize phase I am using the embedded Solr API for my indexing process. I created a brand new index with my application without any problem. I then ran my indexer in incremental mode. This process copies the working index to a temporary Solr location, adds/updates any records, optimizes the index, and then copies it back to the working location. There are currently not any instances of Solr reading this index. Also, I commit after every 10 rows. The schema.xml and solrconfig.xml files have not changed. Here is my function call. protected void optimizeProducts() throws IOException { UpdateHandler updateHandler = m_SolrCore.getUpdateHandler(); CommitUpdateCommand commitCmd = new CommitUpdateCommand(true); commitCmd.optimize = true; updateHandler.commit(commitCmd); log.info("Optimized index"); } So, during the optimize phase, I get the following stack trace: java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:89) at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34) at org.apache.lucene.store.IndexInput.readChars(IndexInput.java:107) at org.apache.lucene.store.IndexInput.readString(IndexInput.java:93) at org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:211) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:119) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:323) at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:206) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1835) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1195) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:508) at ... There are no exceptions or anything else that appears to be incorrect during the adds or commits. After this, the index files are still non-optimized. I know there is not a whole lot to go on here. Anything in particular that I should look at?
Re: Indexing very large files.
>From your stack trace, it looks like it's your client running out of memory, right? SimplePostTool was meant as a command-line replacement to curl to remove that dependency, not as a recommended way to talk to Solr. -Yonik On Jan 16, 2008 4:29 PM, David Thibault <[EMAIL PROTECTED]> wrote: > OK, I have now bumped my tomcat JVM up to 1024MB min and 1500MB max. For > some reason Walter's suggestion helped me get past the 8MB file upload to > Solr but it's still choking on a 32MB file. Is there a way to set > per-webapp JVM settings in tomcat, or is the overall tomcat JVM sufficient > to set? I can't see anything in the tomcat manager to suggest that there > are smaller memory limitations for solr or any other webapp (all the demo > webapps that tomcat comes with are still there right now). > Here's the trace I get when I try to upload the 32MB file: > > > java.lang.OutOfMemoryError: Java heap space > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java > :95) > at sun.net.www.http.PosterOutputStream.write(PosterOutputStream.java > :61) > at sun.nio.cs.StreamEncoder$CharsetSE.writeBytes(StreamEncoder.java > :336) > at sun.nio.cs.StreamEncoder$CharsetSE.implWrite(StreamEncoder.java > :395) > at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:136) > at java.io.OutputStreamWriter.write(OutputStreamWriter.java:191) > at com.itstrategypartners.sents.solrUpload.SimplePostTool.pipe( > SimplePostTool.java:167) > at com.itstrategypartners.sents.solrUpload.SimplePostTool.postData( > SimplePostTool.java:125) > at com.itstrategypartners.sents.solrUpload.SimplePostTool.postFile( > SimplePostTool.java:87) > at com.itstrategypartners.sents.solrUpload.Uploader.uploadFile( > Uploader.java:97) > at com.itstrategypartners.sents.solrUpload.UploaderTest.uploadFile( > UploaderTest.java:95) > > Any more thoughts on possible causes? > > Best, > Dave
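Since the trace shows the client buffering the whole POST body (PosterOutputStream writing into a ByteArrayOutputStream), one way to keep client heap flat regardless of file size is to stream the upload with chunked transfer encoding. The sketch below is only an assumption about how the custom SimplePostTool derivative could be reworked; the update URL and content type are placeholders, not values taken from this thread.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class StreamingPost {
    public static void post(File file, String solrUpdateUrl) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(solrUpdateUrl).openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        // Stream the body instead of buffering it: heap use stays at buffer size.
        con.setChunkedStreamingMode(8192);

        InputStream in = new FileInputStream(file);
        OutputStream out = con.getOutputStream();
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            out.flush();
        } finally {
            in.close();
            out.close();
        }
        if (con.getResponseCode() != HttpURLConnection.HTTP_OK) {
            throw new IOException("Solr returned HTTP " + con.getResponseCode());
        }
    }
}
```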
Re: IOException: read past EOF during optimize phase
Kevin, Perhaps you want to look at how Solr can be used in a master-slave setup. This will separate your indexing from searching. Don't have the URL, but it's on zee Wiki. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Kevin Osborn <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 5:25:34 PM Subject: Re: IOException: read past EOF during optimize phase It is more of a file structure thing for our application. We build in one place and do our index syncing in a different place. I doubt it is relevant to this issue, but figured I would include this information anyway. - Original Message From: Otis Gospodnetic <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 2:21:31 PM Subject: Re: IOException: read past EOF during optimize phase Kevin, Don't have the answer to EOF but I'm wondering why the index moving. You don't need to do that as far as Solr is concerned. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Kevin Osborn <[EMAIL PROTECTED]> To: Solr Sent: Wednesday, January 16, 2008 3:07:23 PM Subject: IOException: read past EOF during optimize phase I am using the embedded Solr API for my indexing process. I created a brand new index with my application without any problem. I then ran my indexer in incremental mode. This process copies the working index to a temporary Solr location, adds/updates any records, optimizes the index, and then copies it back to the working location. There are currently not any instances of Solr reading this index. Also, I commit after every 10 rows. The schema.xml and solrconfig.xml files have not changed. Here is my function call. protected void optimizeProducts() throws IOException { UpdateHandler updateHandler = m_SolrCore.getUpdateHandler(); CommitUpdateCommand commitCmd = new CommitUpdateCommand(true); commitCmd.optimize = true; updateHandler.commit(commitCmd); log.info("Optimized index"); } So, during the optimize phase, I get the following stack trace: java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:89) at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34) at org.apache.lucene.store.IndexInput.readChars(IndexInput.java:107) at org.apache.lucene.store.IndexInput.readString(IndexInput.java:93) at org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:211) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:119) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:323) at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:206) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1835) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1195) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:508) at ... There are no exceptions or anything else that appears to be incorrect during the adds or commits. After this, the index files are still non-optimized. I know there is not a whole lot to go on here. Anything in particular that I should look at?
Re: Indexing very large files.
David, I bet you can quickly identify the source using YourKit or another Java profiler jmap command line tool might also give you some direction. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: David Thibault <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 1:31:23 PM Subject: Re: Indexing very large files. Walter and all, I had been bumping up the heap for my Java app (running outside of Tomcat) but I hadn't yet tried bumping up my Tomcat heap. That seems to have helped me upload the 8MB file, but it's crashing while uploading a 32MB file now. I Just bumped tomcat to 1024MB of heap, so I'm not sure what the problem is now. I suspect Walter was on to something, since it sort of fixed my problem. I will keep troubleshooting the Tomcat memory and go from there.. Best, Dave On 1/16/08, Walter Underwood <[EMAIL PROTECTED]> wrote: > > This error means that the JVM has run out of heap space. Increase the > heap space. That is an option on the "java" command. I set my heap to > 200 Meg and do it this way with Tomcat 6: > > JAVA_OPTS="-Xmx600M" tomcat/bin/startup.sh > > wunder > > On 1/16/08 8:33 AM, "David Thibault" <[EMAIL PROTECTED]> wrote: > > > java.lang.OutOfMemoryError: Java heap space > >
Re: IOException: read past EOF during optimize phase
Our basic setup is master/slave. We just want to make sure that we are not syncing against an index that is in the middle of a large rebuild. But, I think these issues are still separate from what I am experiencing. I also tried this same scenario in a different development environment. No problems there. - Original Message From: Otis Gospodnetic <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 2:33:03 PM Subject: Re: IOException: read past EOF during optimize phase Kevin, Perhaps you want to look at how Solr can be used in a master-slave setup. This will separate your indexing from searching. Don't have the URL, but it's on zee Wiki. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Kevin Osborn <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 5:25:34 PM Subject: Re: IOException: read past EOF during optimize phase It is more of a file structure thing for our application. We build in one place and do our index syncing in a different place. I doubt it is relevant to this issue, but figured I would include this information anyway. - Original Message From: Otis Gospodnetic <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 2:21:31 PM Subject: Re: IOException: read past EOF during optimize phase Kevin, Don't have the answer to EOF but I'm wondering why the index moving. You don't need to do that as far as Solr is concerned. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Kevin Osborn <[EMAIL PROTECTED]> To: Solr Sent: Wednesday, January 16, 2008 3:07:23 PM Subject: IOException: read past EOF during optimize phase I am using the embedded Solr API for my indexing process. I created a brand new index with my application without any problem. I then ran my indexer in incremental mode. This process copies the working index to a temporary Solr location, adds/updates any records, optimizes the index, and then copies it back to the working location. There are currently not any instances of Solr reading this index. Also, I commit after every 10 rows. The schema.xml and solrconfig.xml files have not changed. Here is my function call. protected void optimizeProducts() throws IOException { UpdateHandler updateHandler = m_SolrCore.getUpdateHandler(); CommitUpdateCommand commitCmd = new CommitUpdateCommand(true); commitCmd.optimize = true; updateHandler.commit(commitCmd); log.info("Optimized index"); } So, during the optimize phase, I get the following stack trace: java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:89) at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34) at org.apache.lucene.store.IndexInput.readChars(IndexInput.java:107) at org.apache.lucene.store.IndexInput.readString(IndexInput.java:93) at org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:211) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:119) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:323) at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:206) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1835) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1195) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:508) at ... 
There are no exceptions or anything else that appears to be incorrect during the adds or commits. After this, the index files are still non-optimized. I know there is not a whole lot to go on here. Anything in particular that I should look at?
dojo and solr
has anyone done any work integrating dojo based applications with solr? I am pretty new to both but I wondered if anyone had developed an xsl for solr that returns solr queries in dojo data store format - json, but a specific format of json. I am not even sure if this is sensible/possible. Sean
Re: IOException: read past EOF during optimize phase
This may be a Lucene bug... IIRC, I saw at least one other lucene user with a similar stack trace. I think the latest lucene version (2.3 dev) should fix it if that's the case. -Yonik On Jan 16, 2008 3:07 PM, Kevin Osborn <[EMAIL PROTECTED]> wrote: > I am using the embedded Solr API for my indexing process. I created a brand > new index with my application without any problem. I then ran my indexer in > incremental mode. This process copies the working index to a temporary Solr > location, adds/updates any records, optimizes the index, and then copies it > back to the working location. There are currently not any instances of Solr > reading this index. Also, I commit after every 10 rows. The schema.xml > and solrconfig.xml files have not changed. > > Here is my function call. > protected void optimizeProducts() throws IOException { > UpdateHandler updateHandler = m_SolrCore.getUpdateHandler(); > CommitUpdateCommand commitCmd = new CommitUpdateCommand(true); > commitCmd.optimize = true; > > updateHandler.commit(commitCmd); > > log.info("Optimized index"); > } > > So, during the optimize phase, I get the following stack trace: > java.io.IOException: read past EOF > at > org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:89) > at > org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34) > at org.apache.lucene.store.IndexInput.readChars(IndexInput.java:107) > at org.apache.lucene.store.IndexInput.readString(IndexInput.java:93) > at > org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:211) > at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:119) > at > org.apache.lucene.index.SegmentReader.document(SegmentReader.java:323) > at > org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:206) > at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96) > at > org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1835) > at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1195) > at > org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:508) > at ... > > There are no exceptions or anything else that appears to be incorrect during > the adds or commits. After this, the index files are still non-optimized. > > I know there is not a whole lot to go on here. Anything in particular that I > should look at? > >
Re: Indexing very large files.
Yonik, I pulled SimplePostTool apart, pulled out the main() and the postFiles() and just use it directly in Java via postFile() -> postData(). It seems to work OK. Maybe I should upgrade to v1.3 and try doing things directly through Solrj. Is 1.3 stable yet? Might that be a better plan altogether? Dave On 1/16/08, Yonik Seeley <[EMAIL PROTECTED]> wrote: > > From your stack trace, it looks like it's your client running out of > memory, right? > > SimplePostTool was meant as a command-line replacement to curl to > remove that dependency, not as a recommended way to talk to Solr. > > -Yonik > > On Jan 16, 2008 4:29 PM, David Thibault <[EMAIL PROTECTED]> > wrote: > > OK, I have now bumped my tomcat JVM up to 1024MB min and 1500MB > max. For > > some reason Walter's suggestion helped me get past the 8MB file upload > to > > Solr but it's still choking on a 32MB file. Is there a way to set > > per-webapp JVM settings in tomcat, or is the overall tomcat JVM > sufficient > > to set? I can't see anything in the tomcat manager to suggest that > there > > are smaller memory limitations for solr or any other webapp (all the > demo > > webapps that tomcat comes with are still there right now). > > Here's the trace I get when I try to upload the 32MB file: > > > > > > java.lang.OutOfMemoryError: Java heap space > > at java.io.ByteArrayOutputStream.write( > ByteArrayOutputStream.java > > :95) > > at sun.net.www.http.PosterOutputStream.write( > PosterOutputStream.java > > :61) > > at sun.nio.cs.StreamEncoder$CharsetSE.writeBytes( > StreamEncoder.java > > :336) > > at sun.nio.cs.StreamEncoder$CharsetSE.implWrite( > StreamEncoder.java > > :395) > > at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:136) > > at java.io.OutputStreamWriter.write(OutputStreamWriter.java:191) > > at com.itstrategypartners.sents.solrUpload.SimplePostTool.pipe( > > SimplePostTool.java:167) > > at > com.itstrategypartners.sents.solrUpload.SimplePostTool.postData( > > SimplePostTool.java:125) > > at > com.itstrategypartners.sents.solrUpload.SimplePostTool.postFile( > > SimplePostTool.java:87) > > at com.itstrategypartners.sents.solrUpload.Uploader.uploadFile( > > Uploader.java:97) > > at > com.itstrategypartners.sents.solrUpload.UploaderTest.uploadFile( > > UploaderTest.java:95) > > > > Any more thoughts on possible causes? > > > > Best, > > Dave >
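If the move to SolrJ happens (it ships with the 1.3-dev trunk builds), the basic add/commit path looks roughly like the sketch below. The field names are invented for illustration, and the extracted text still has to fit in the client heap as a String, so this by itself does not remove the large-file concern.

```java
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrjUpload {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "file-001");
        doc.addField("filename", "bigfile.txt");
        doc.addField("body", readFileAsString("bigfile.txt")); // extracted text

        server.add(doc);
        server.commit(); // or commit once after a whole batch of adds
    }

    private static String readFileAsString(String path) {
        // placeholder -- read and return the file contents
        throw new UnsupportedOperationException("not implemented in this sketch");
    }
}
```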
Re: Indexing very large files.
Thanks, Otis. I will take a look at those profiling tools. Best, Dave On 1/16/08, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > > David, > I bet you can quickly identify the source using YourKit or another Java > profiler jmap command line tool might also give you some direction. > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > - Original Message > From: David Thibault <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Wednesday, January 16, 2008 1:31:23 PM > Subject: Re: Indexing very large files. > > Walter and all, > I had been bumping up the heap for my Java app (running outside of > Tomcat) > but I hadn't yet tried bumping up my Tomcat heap. That seems to have > helped > me upload the 8MB file, but it's crashing while uploading a 32MB file > now. I > Just bumped tomcat to 1024MB of heap, so I'm not sure what the problem > is > now. I suspect Walter was on to something, since it sort of fixed my > problem. I will keep troubleshooting the Tomcat memory and go from > there.. > > Best, > Dave > > On 1/16/08, Walter Underwood <[EMAIL PROTECTED]> wrote: > > > > This error means that the JVM has run out of heap space. Increase the > > heap space. That is an option on the "java" command. I set my heap to > > 200 Meg and do it this way with Tomcat 6: > > > > JAVA_OPTS="-Xmx600M" tomcat/bin/startup.sh > > > > wunder > > > > On 1/16/08 8:33 AM, "David Thibault" <[EMAIL PROTECTED]> > wrote: > > > > > java.lang.OutOfMemoryError: Java heap space > > > > > > > >
Re: IOException: read past EOF during optimize phase
I did see that bug, which made me suspect Lucene. In my case, I tracked down the problem. It was my own application. I was using Java's FileChannel.transferTo functions to copy my index from one location to another. One of the files is bigger than 2^31-1 bytes. So, one of my files was corrupted during the copy because I was just doing one pass. I now loop the copy function until the entire file is copied and everything works fine. DOH! - Original Message From: Yonik Seeley <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 4:57:08 PM Subject: Re: IOException: read past EOF during optimize phase This may be a Lucene bug... IIRC, I saw at least one other lucene user with a similar stack trace. I think the latest lucene version (2.3 dev) should fix it if that's the case. -Yonik On Jan 16, 2008 3:07 PM, Kevin Osborn <[EMAIL PROTECTED]> wrote: > I am using the embedded Solr API for my indexing process. I created a brand new index with my application without any problem. I then ran my indexer in incremental mode. This process copies the working index to a temporary Solr location, adds/updates any records, optimizes the index, and then copies it back to the working location. There are currently not any instances of Solr reading this index. Also, I commit after every 10 rows. The schema.xml and solrconfig.xml files have not changed. > > Here is my function call. > protected void optimizeProducts() throws IOException { > UpdateHandler updateHandler = m_SolrCore.getUpdateHandler(); > CommitUpdateCommand commitCmd = new CommitUpdateCommand(true); > commitCmd.optimize = true; > > updateHandler.commit(commitCmd); > > log.info("Optimized index"); > } > > So, during the optimize phase, I get the following stack trace: > java.io.IOException: read past EOF > at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:89) > at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34) > at org.apache.lucene.store.IndexInput.readChars(IndexInput.java:107) > at org.apache.lucene.store.IndexInput.readString(IndexInput.java:93) > at org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:211) > at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:119) > at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:323) > at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:206) > at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96) > at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1835) > at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1195) > at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:508) > at ... > > There are no exceptions or anything else that appears to be incorrect during the adds or commits. After this, the index files are still non-optimized. > > I know there is not a whole lot to go on here. Anything in particular that I should look at? > >
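For reference, the looping copy Kevin describes might look like the sketch below: FileChannel.transferTo may move fewer bytes than requested (and a single pass silently truncated the >2GB segment file), so the call is repeated until the whole file has been transferred. This is a generic illustration, not Kevin's actual code.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

public class SafeCopy {
    public static void copy(File src, File dst) throws IOException {
        FileChannel in = new FileInputStream(src).getChannel();
        FileChannel out = new FileOutputStream(dst).getChannel();
        try {
            long size = in.size();
            long position = 0;
            while (position < size) {
                // transferTo returns the number of bytes actually moved this pass
                position += in.transferTo(position, size - position, out);
            }
        } finally {
            in.close();
            out.close();
        }
    }
}
```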
Logging in Solr
All, I'm new to Solr and Tomcat and I'm trying to track down some odd errors. How do I set up Tomcat to do fine-grained Solr-specific logging? I have looked around enough to know that it should be possible to do per-webapp logging in Tomcat 5.5, but the details are hard to follow for a newbie. Any suggestions would be greatly appreciated. Best, Dave
Re: conceptual issues with solr
On Wed, 16 Jan 2008 16:54:56 +0100 "Philippe Guillard" <[EMAIL PROTECTED]> wrote: > Hi here, > > It seems that Lucene accepts any kind of XML document but Solr accepts only > flat name/value pairs inside a document to be indexed. > You'll find below what I'd like to do, Thanks for help of any kind ! > > Phil > Hey Phil, > > I need to index products (hotels) which have a price by date, then search > them by date or date range and price range. > Is there a way to do that with Solr? yes - look at the data types definition (somewhere in the wiki of the sample schema.xml) about data-types for indexing dates and integers,etc There are some caveats about using date data type fields (too much resolution, can slow down too much..) > > At the moment i have a document for each hotel : > > > http:///yyy > 1 > Hotel Opera > 4 stars > . > > > > I would need to add my dates/price values like this but it is forbidden in > Solr indexing: > > > > Otherwise i could define a default field (being an integer) and have as many > fields as dates, like this: > 200 > 150 > indexing would accept it but i think i will not be able to search or sort by > date for simple dates like that, why not make use of dynamic fields ? define , for example, bydate_* as dynamic fields, then you can do : so, from your example : 200 150 > The only solution i found at the moment is to create a document for each > date/price > > > http:///yyy > 1 > Hotel Opera > 30/01/2008 > 200 > > > http:///yyy > 1 > Hotel Opera > 31/01/2008 > 150 > > If the field 'id' is your schemas ID, then this wouldn't work , but sure, the approach would be valid though a bit wasteful wrt to storing the metadata about the hotel There was a thread some time ago in this list (a month or 2 ago) about clever uses of the field defined as ID in the schema. > then i'll have many documents for 1 hotel > and in order to search by date range i would need more documents > like this : > 28/01/2008 to 31/01/2008 > 29/01/2008 to 31/01/2008 > 30/01/2008 to 31/01/2008 > > Since i need to index many other informations about an hotel (address, > telephone, amenities etc...) i wouldn' like to duplicate too much > information, and i think it would not be scalable to search first in a dates > index then in hotels index to retrieve hotel information. > > Any idea? It strikes me you'd probably want a relational DB for this kind of thing B _ {Beto|Norberto|Numard} Meijome Unix is user friendly. However, it isn't idiot friendly. I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
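To make the dynamic-field suggestion in the reply above concrete, a sketch of what the schema and a hotel document could look like follows. The field names, the sint type, and the yyyyMMdd encoding of the date in the field name are all assumptions for illustration, not something taken from Phil's actual schema.

```xml
<!-- in schema.xml: one dynamic field pattern covers every date -->
<dynamicField name="price_*" type="sint" indexed="true" stored="true"/>

<!-- one document per hotel, with one price field per date -->
<add>
  <doc>
    <field name="id">1</field>
    <field name="name">Hotel Opera</field>
    <field name="category">4 stars</field>
    <field name="price_20080130">200</field>
    <field name="price_20080131">150</field>
  </doc>
</add>
```

A query for hotels under a given price on a given date then becomes a range query on the corresponding field, e.g. price_20080130:[* TO 180], while all the other hotel metadata stays in the single document.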
Re: Solr schema filters
: For this exact example, use the WordDelimiterFilter exactly as : configured in the "text" fieldType in the example schema that ships : with solr. The trick is to then use some slop when querying. : : FT-50-43 will be indexed as FT, 50, 43 / 5043 (the last two tokens : are in the same position). : Now when querying, "FT-5043" won't match without slop because there is : a "50" token in the middle of the indexed terms... so try "FT-5043"~1 FYI: this was the motivation for the "qs" param on dismax ... http://localhost:8983/solr/select?debugQuery=true&qt=dismax&pf=&qf=text&q=FT-5043&qs=3 -Hoss
Re: DisMax Syntax
: I may be mistaken, but this is not equivalent to my query. In my query i have : matches for x1, matches for x2 without slop and/or boosting, and then a match : to "x1 x2" (exact match) with slop (~) a and boost (b) in order to have : results with an exact match score better. : The total score is the sum of all the above. : Your query seems diff the structure of the query will look different in debugging, and the scores won't be exactly the same, but the concept is the same. -Hoss
Re: Fuzziness with DisMaxRequestHandler
: Is there any way to make the DisMaxRequestHandler a bit more forgiving with : user queries, I'm only getting results when the user enters a close to : perfect match. I'd like to allow near matches if possible, but I'm not sure : how to add something like this when special query syntax isn't allowed. the principal goal of dismax was to leave "query string" syntax as simple as possible, and move the mechanisms for controlling the "query structure" into other parameters. the idea of making Queries Fuzzy is an interesting one ... it's something i don't remember anyone ever asking about before, and i'd never really considered it (from a UI perspective i find "did you mean" style spellchecking to be a better approach than making a user's query implicitly fuzzy) but it seems like it would be pretty easy to add support for something ... one approach would be to add a numeric "fuzz" parameter, that if set would make the DisMaxQueryParser return FuzzyQueries in place of TermQueries ... an alternate approach would be to allow per-field fuzziness by tweaking the "qf" syntax so instead of just fieldA^4 where 4 is the boost value, you could have fieldA^4~0.8 where 4 is the boost value and 0.8 is the fuzziness factor I haven't thought about it hard enough to have an opinion about which would make more sense ... but the overall idea certainly seems like it could be a useful feature if someone wants to submit a patch. -Hoss
Re: Newbie question: facets and filter query?
: The problem is that when I use the 'cd' request handler, the facet count for : 'dvd' provided in the response is 0 because of the filter query used to only : show the 'cd' facet results. How do I retrieve facet counts for both : categories while only retrieving the results for one category? the "simple facet" params provided by solr have no way to do this ... the facet counts are always relative to the total available docs after applying the "q" and "fq" params .. if they weren't, then you wouldn't be able to get facet counts while "drilling down" into specific facets. you could however easily implement something like this in a custom request handler by using the internal SimpleFacets API and giving it a DocSet you generate just from the q param. -Hoss
Re: Transactions and Solr Was: Re: Delte by multiple id problem
: Does anyone have more experience doing this kind of stuff and wants to share? My advice: don't. I work with (or work with people who work with) about two dozen Solr indexes -- we don't attempt to update a single one of them in any sort of transactional way. Some of them are updated "real time" (ie: as soon as the authoritative DB is updated by some code, the same code updates the Solr index); Some of them are updated in batch (ie: once every N minutes code checks a log of all logical objects modified/deleted from the DB and sends the adds/deletes to Solr); And some are only ever rebuilt from scratch every N hours (because the data in them isn't very time sensitive and rebuilding from scratch is easier than dealing with incremental or batch updates). But as i said: we never attempt to be transactional about it, for a few reasons: 1) why should it be part of the transaction? a Solr index is a denormalized/inverted index of data .. why should a tool (or any other process) be prevented from writing to an authoritative data store just because a non-authoritative copy of that data can't be updated? ... if you used MySQL with replication, would you really want to block all writes to the master just because there's a glitch in replicating to a slave? 2) why worry about it? It's really a non-issue. If an add or delete fails it's usually either developer error (ie: the code generating your add statements thinks there's a field that doesn't exist), a transient timeout (maybe because of a commit in progress) or network glitch (have the client retry once or twice), or in very rare instances the whole Solr index was completely jacked (either from disk failure, or OOM due to a huge spike in load) and we want to revert to a backup of the index in the short term and rebuild the index from scratch to play it safe. 3) why limit yourself? you're going to want the ability to trigger arbitrary indexing of your data objects at any time -- if for no other reason than so that when you decide to add a field to your index you can reindex them all -- so why make your index updating code inherently tied to your DB updating code? As for your specific question along the lines of "why can't we do a mix of <delete>s and <add>s all as part of one update message?" the answer is "because no one ever wrote any code to parse messages like that." BUT! ... that's not the question you really want to ask. the question you really want to ask is: "*IF* someone wrote code to allow a mix of <delete>s and <add>s all as part of one update message, would it solve my problem of wanting to be able to modify my solr index transactionally?" and the answer is "No." Even if Solr accepted update messages that looked like this... <delete><id>42</id></delete> <add><doc>... 7bb ...</doc></add> <add><doc>... 666 ...</doc></add> ...the low level lucene calls that it would be doing internally still aren't transactional, so the first "delete" and "add" might succeed, but if there was then some kind of internal error, or a timeout because the first add took a while (maybe it triggered a segment merge) and the second add didn't happen -- the first two commands would have still been executed, and there would be no way to "rollback". In a nutshell: you would be no better off than if your client code had sent all three as separate update messages. -Hoss
Re: Restrict values in a multivalued field
: In my schema I have a multivalued field, and the values of that field are : "stored" and "indexed" in the index. I wanted to know if it's possible to : restrict the number of multiple values being returned from that field, on a : search? And how? Because, let's say, if I have thousands of values in that : multivalued field, returning all of them would be a lot of load on the : system. So, I want to restrict it to send me only, say, 50 values out of the : thousands. How would Solr pick which 50 to return? Why not index all thousand (so you can search on them) in an unstored field, and only store the 50 you want returned in a separate (unindexed) field. the index size will be exactly the same -- admittedly you'll have to send a bit more data over the wire for each doc you index, but that's probably a trivial amount (assuming the 50 values you want to store are representative of the thousands you index, you are talking about at most a 5% increase in the amount of data you send solr on each add) -Hoss
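A sketch of the two-field layout Hoss suggests, with invented field names and types (the indexing client would populate both fields at add time: all values into the first, only the ~50 display values into the second):

```xml
<!-- searchable but not stored: gets all thousand values -->
<field name="tags_all" type="text" indexed="true" stored="false" multiValued="true"/>
<!-- stored but not indexed: gets only the values you want returned -->
<field name="tags_display" type="string" indexed="false" stored="true" multiValued="true"/>
```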
Re: Wildcard on last char
: i have encountered a problem concerning the wildcard. When i search for : field:testword i get 50 results. That's ok but when I search for : field:testwor* i get just 3 hits! I get only words returned without a : whitespace after the char like "testwordtest" but i won't find any single : "testword" i've searched before. Why can't I replace a single char at the end : of a word with the wildcard? What does the fieldtype for "field" look like in your schema? what exactly is the URL you are using to query solr? i notice in particular you say "I get only words returned without a whitespace after the char" ... which suggests that you expect it to match documents that have whitespace in the field value (ie: "testword more words") ... but if you are using an analyzer that splits on whitespace, those are now separate terms -- wildcard and prefix searches match on indexed terms -- not the entire field value (if you want that, you need to use something like StrField or the KeywordTokenizer .. but i doubt that's really what you want)
Re: Fwd: Solr "Text" field
: searches. That is fine by me. But I'm still at the first question: : How do I conduct a wildcard search for ARIZONA on a solr.textField? I tried as i said: it really depends on what kind of "index" analyzer you have configured for the field -- the query analyzer isn't used at all when dealing with wildcard and prefix queries, so what you type in before the "*" must match the prefix of an actually indexed term that makes it into your index as a result of the index analyzer. If you add the debugQuery=true param to your queries, and compare the differences you see in the parsedquery_toString value between searching for field:AR* and field:Arizona and field:ARIZONA you'll see what i mean. if you take a look at the Luke request handler which will show you the actual raw terms in your index (or the top N anyway), you can see what's really in there -- or -- if you use the analysis.jsp interface, it will show you what Terms your analyzer will actually produce if you index the raw string "ARIZONA" ... whatever you see there is what you need to be searching for when you do your prefix queries. -Hoss
Re: 2D Facet
: : Hello, is this possible to do in one query: I have a query which returns : 1000 documents with names and addresses. I can run facet on state field : and see how many addresses I have in each state. But also I need to see : how many families live in each state. So as a result I need a matrix of : states on top and Last Names on right. After my first query, knowing : which states I have I can run queries on each state using facet field : Last_Name. But I guess this is not an efficient way. Is this possible to : get in one query? Or maybe some other way? if you set rows=0 on all of those queries it won't be horribly inefficient ... the DocSets for each state and lastname should wind up in the filterCache, so most of the queries will just be simple DocSet intersections with only the HTTP overhead (which if you use persistent connections should be fairly minor) The idea of generic multidimensional faceting is actually pretty interesting ... it could be done fairly simply -- imagine if for every facet.field=foo param, solr checked for a f.foo.facet.matrix param, and once the top facet.limit terms were found for field "foo" it then computed the top facet counts for each f.foo.facet.matrix field with an implicit fq=foo:term. that would be pretty cool. -Hoss
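A sketch of the rows=0 approach from the first paragraph, assuming the 1.3-dev SolrJ client and the field names from Gene's description (state, Last_Name): the first query discovers which states are present, then one cheap facet query per state fills in the matrix.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;

public class FacetMatrix {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
        String userQuery = "..."; // placeholder: the query that returned the 1000 docs

        // First pass: which states are present?
        SolrQuery byState = new SolrQuery(userQuery);
        byState.setRows(0);
        byState.setFacet(true);
        byState.addFacetField("state");
        FacetField states = server.query(byState).getFacetField("state");

        // Second pass: facet Last_Name inside each state (fq hits the filterCache).
        for (FacetField.Count state : states.getValues()) {
            SolrQuery q = new SolrQuery(userQuery);
            q.addFilterQuery("state:" + state.getName());
            q.setRows(0);
            q.setFacet(true);
            q.addFacetField("Last_Name");
            FacetField names = server.query(q).getFacetField("Last_Name");
            System.out.println(state.getName() + " -> " + names.getValues());
        }
    }
}
```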
Re: batch indexing takes more time than shown on SOLR output --> something to do with IO?
: INFO: {add=[10485, 10488, 10489, 10490, 10491, 10495, 10497, 10498, ...(42 : more) : ]} 0 875 : : However, when timing this instruction on the client-side (I use SolrJ --> : req.process(server)) I get totally different numbers (in the beginning the : client-side measured time is about 2 seconds on average but after some time : this time goes up to about 30-40 seconds, although the solr-outputted time : stays between 0.8-1.3 seconds)? as Otis mentioned, that time is the raw processing of the request, not counting any network IO between the client and the server, or any time spent by the "ResponseWriter" formatting the response. you can get more accurate numbers about exactly how long the server spent doing all of these things from the access log of your servlet container (which should be recording the time only after every last byte is written back to the client). that said: there's really no reason for as big a discrepancy as you are describing particularly on updates where the ResponseWriter has almost nothing to do (30-40 seconds per update?!?!?!) I'm not very familiar with SolrJ, but are you by any chance using it in a way that sends a commit after every update command? (commits can get successively longer as your index gets bigger.) : Does this have anything to do with costly IO-activity that is accounted for : in the SOLR output? If this is true, what tool do you recommend using to : monitor IO-activity? Which IO-activity are you talking about? -Hoss
Re: Cache size and Heap size
: > I know this is a lot and I'm going to decrease it, I was just experimenting, : > but I need some guidelines of how to calculate the right size of the cache. : : Each filter that matches more than ~3000 documents will occupy maxDocs/8 bytes : of memory. Certain kinds of faceting require one entry per unique value in a FWIW: the magic number 3000 is the example value of the config option ... it can be tweaked if you think you can tune it to a value that makes more sense given the nature of the types of DocSets you are dealing with, but i wouldn't bother (there are probably a lot of better ways you can spend your time to tweak performance) -Hoss
Re: FunctionQuery in a custom request handler
: How do I access the ValueSource for my DateField? I'd like to use a : ReciprocalFloatFunction from inside the code, adding it alongside others in the : main BooleanQuery. The FieldType API provides a getValueSource method (so every FieldType picks its own best ValueSource implementation). -Hoss
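A rough sketch of what that can look like from inside a custom request handler is below. The signatures shown (FieldType.getValueSource(SchemaField) and ReciprocalFloatFunction(source, m, a, b), i.e. a/(m*x+b)) are assumptions based on the 1.2/1.3-era code base and should be checked against the Solr version in use; the field name and constants are arbitrary.

```java
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.search.function.FunctionQuery;
import org.apache.solr.search.function.ReciprocalFloatFunction;
import org.apache.solr.search.function.ValueSource;

public class RecencyBoost {
    // Adds a reciprocal-of-date clause alongside the other clauses in the main query.
    public static void addRecencyClause(SolrQueryRequest req, BooleanQuery main) {
        SchemaField f = req.getSchema().getField("lastModified"); // hypothetical date field
        ValueSource source = f.getType().getValueSource(f);
        // a / (m * x + b): documents with larger values score closer to a/b
        FunctionQuery boost = new FunctionQuery(
                new ReciprocalFloatFunction(source, 0.0001f, 1000f, 1000f));
        main.add(boost, BooleanClause.Occur.SHOULD);
    }
}
```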
Some sort of join in SOLR?
Hello, I have two sources of data for the same "things" to search. It is book data in a library. First there is the usual bibliographic data (author, title...) and then I have scanned and OCRed table of contents data about the same books. Both are updated independently. Now I don't know how to best index and search this data. - One option would be to save the data in different records. That would make updates easy because I don't have to worry about the fields from the other source. But searching would be more difficult: I have to do an additional search for every hit in the "contents" data to get the bibliographic data. - The other option would be to save everything in one record but then updates would be difficult. Before I can update a record I must first look if there is any data from the other source, merge it into the record and only then update it. This option sounds very time consuming for a complete reindex. The best solution would be some sort of join: Have two records in the index but always give both in the result no matter where the hit was. Any ideas on how to best organize this kind of data? -Michael