Just guessing, but I'd say it has something to do with the dynamic fields...

I ran a similar operation (docs ranged from 1K to 2MB).  For the
initial indexing, I wrote a job to submit about 100,000 documents to
Solr, committing after every 10 docs.  I never sent any optimize
commands.  I also used the example start.jar and didn't specify any
memory constraints.
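
For reference, here is a minimal sketch of that kind of job. It is not the script I actually ran. It assumes the example server's XML update handler is at http://localhost:8983/solr/update (the usual default for the example start.jar, but check your setup). The function names (post_xml, doc_to_add_xml, index_docs) are made up for illustration. Each doc is a list of (field name, value) pairs, a commit goes out after every 10 docs, and no optimize is ever sent.

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr

# Assumption: default update URL of the Solr example server.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def post_xml(body, url=SOLR_UPDATE_URL):
    """POST one XML update message to Solr and return the raw response."""
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def doc_to_add_xml(fields):
    """Build an <add><doc>...</doc></add> message from (name, value) pairs."""
    parts = ["<add><doc>"]
    for name, value in fields:
        # quoteattr/escape take care of &, <, > and quotes in names and values.
        parts.append(
            "<field name=%s>%s</field>" % (quoteattr(name), escape(str(value)))
        )
    parts.append("</doc></add>")
    return "".join(parts)


def index_docs(docs, post=post_xml, commit_every=10):
    """Send each doc, committing after every `commit_every` docs (no optimize)."""
    sent = 0
    for fields in docs:
        post(doc_to_add_xml(fields))
        sent += 1
        if sent % commit_every == 0:
            post("<commit/>")
    if sent % commit_every != 0:
        post("<commit/>")  # commit the final partial batch
    return sent
```

Passing a different callable as `post` (for example a list's append method) lets you dry-run the batching without a running server.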

My job ran for 3 days, and finished without any errors or memory problems.

The only difference I see is that I didn't use any dynamic fields, and
I only stored 2 fields instead of them all.

Just my $0.02
-Reece



On Mon, Mar 3, 2008 at 6:15 PM, Thorsten Scherler <[EMAIL PROTECTED]> wrote:
> On Mon, 2008-03-03 at 21:43 +0200, Justin wrote:
>  > I'm indexing a large number of documents.
>  >
>  > As a server I'm using the /solr/example/start.jar
>  >
>  > No matter how much memory I allocate it fails around 7200 documents.
>
>  How do you allocate the memory?
>
>  Something like:
>  java -Xms512M -Xmx1500M -jar start.jar
>
>  You may have a closer look as well at
>  http://java.sun.com/j2se/1.5.0/docs/guide/vm/gc-ergonomics.html
>
>  HTH
>
>  salu2
>
>
>
>  > I am committing every 100 docs, and optimizing every 300.
>  >
>  > All of my XML files contain one doc, and can range in size from 2k to 700k.
>  >
>  > when I restart the start.jar it again reports out of memory.
>  >
>  >
>  > a sample document looks like this:
>  > <?xml version="1.0" encoding="UTF-8"?>
>  > <add>
>  >  <doc>
>  >   <field name="PK">1851</field>
>  >   <field name="ft:genes.Symbol:1851">TRAJ20</field>
>  >   <field name="ft:external_ids.SourceAccession:15531">12049</field>
>  >   <field name="ft:external_ids.SourceAccession:15532">ENSG00000211869</field>
>  >   <field name="ft:external_ids.SourceAccession:15533">28735</field>
>  >   <field name="ft:external_ids.SourceAccession:15534">HUgn28735</field>
>  >   <field name="ft:external_ids.SourceAccession:15535">TRA_</field>
>  >   <field name="ft:external_ids.SourceAccession:15536">TRAJ20</field>
>  >   <field name="ft:external_ids.SourceAccession:15537">9953837</field>
>  >   <field name="ft:external_ids.SourceAccession:15538">ENSG00000211869</field>
>  >   <field name="ft:aliases_and_descriptions.Value:9775">T cell receptor alpha joining 20</field>
>  >   <field name="ft:cytogenetic_locations.Cytoband:4909">14q11.2</field>
>  >   <field name="ft:cytogenetic_locations.Cytoband:4910">14q11</field>
>  >   <field name="ft:cytogenetic_locations.Cytoband:4911">14q11.2</field>
>  >   <field name="ft:location_extras.ContigRefseq:11806">AE000662.1</field>
>  >   <field name="ft:location_extras.ContigRefseq:11807">M94081.1</field>
>  >   <field name="ft:location_extras.ContigRefseq:11808">CH471078.2</field>
>  >   <field name="ft:location_extras.ContigRefseq:11809">NC_000014.7</field>
>  >   <field name="ft:location_extras.ContigRefseq:11810">NT_026437.11</field>
>  >   <field name="ft:location_extras.ContigRefseq:11811">NG_001332.2</field>
>  >   <field name="ft:articles.SourceAccession:192767">8188290</field>
>  >   <field name="ft:articles.Title:192767">The human T-cell receptor
>  > TCRAC/TCRDC (C alpha/C delta) region: organization,sequence, and evolution
>  > of 97.6 kb of DNA.</field>
>  >   <field name="ft:authors.AuthorName:5909">Koop B.F.</field>
>  >   <field name="ft:authors.AuthorName:6912">Rowen L.</field>
>  >   <field name="ft:authors.AuthorName:6985">Hood L.</field>
>  >   <field name="ft:authors.AuthorName:17109">Wang K.</field>
>  >   <field name="ft:authors.AuthorName:72700">Kuo C.L.</field>
>  >   <field name="ft:authors.AuthorName:84285">Seto D.</field>
>  >   <field name="ft:authors.AuthorName:166156">Lenstra J.A.</field>
>  >   <field name="ft:authors.AuthorName:216734">Howard S.</field>
>  >   <field name="ft:authors.AuthorName:285493">Shan W.</field>
>  >   <field name="ft:authors.AuthorName:346559">Deshpande P.</field>
>  >   <field name="ft:probesets.Name:6773">31311_at</field>
>  >   <field name="ft:probesets.BinaryPattern:6773">000000000000</field>
>  > </doc>
>  > </add>
>  >
>  >
>  > the schema is (in summary):
>  > <fields>
>  >    <field name="PK" type="sint" indexed="true" stored="true" required="true"
>  > multiValued="false" omitNorms="true"/>
>  >    <field name="text" type="text" indexed="true" stored="false"
>  > multiValued="true"  omitNorms="true"/>
>  >
>  >    <dynamicField name="ft:*"  type="string"    indexed="true"
>  > stored="true"  omitNorms="true"/>
>  >    <dynamicField name="st:*"  type="string"  indexed="true"  stored="true"
>  > omitNorms="true"/>
>  > </fields>
>  >
>  >
>  > <uniqueKey>PK</uniqueKey>
>  > <defaultSearchField>text</defaultSearchField>
>  > <solrQueryParser defaultOperator="OR"/>
>  >
>  > <copyField source="ft:*" dest="text"/>
>  > <copyField source="st:*" dest="text"/>
>  >
>  >
>  > and my conf is:
>  >    <useCompoundFile>false</useCompoundFile>
>  >     <mergeFactor>100</mergeFactor>
>  >     <maxBufferedDocs>900</maxBufferedDocs>
>  >     <maxMergeDocs>2147483647</maxMergeDocs>
>  >     <maxFieldLength>10000</maxFieldLength>
>  --
>  Thorsten Scherler                                 thorsten.at.apache.org
>  Open Source Java                      consulting, training and solutions
>
>
