We had the same thought about the toString being the symptom. We have performed heap analysis on four separate 4GB heap dumps, and they all point to a single SolrDocument containing all of the data returned by every sub-entity query. Stepping through the code on a live instance, we can see the cache being "disabled" by the destroy calls after each root doc. This destruction causes EntityProcessorBase to change from "look up the records by cacheKey" to "look up all records".
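To make the flow concrete, here is roughly what we stepped through, paraphrased from the 4.2.0 source (simplified excerpts, not the literal code):

    // DocBuilder.buildDocument, simplified: destroy() runs in a finally
    // block after EVERY root row, not once at the end of the delta-import.
    private void buildDocument(VariableResolver vr, DocWrapper doc,
        Map<String,Object> pk, EntityProcessorWrapper epw, boolean isRoot,
        ContextImpl parentCtx) {
      List<EntityProcessorWrapper> entitiesToDestroy = new ArrayList<EntityProcessorWrapper>();
      try {
        buildDocument(vr, doc, pk, epw, isRoot, parentCtx, entitiesToDestroy);
      } finally {
        for (EntityProcessorWrapper entityWrapper : entitiesToDestroy) {
          entityWrapper.destroy();   // ends up in EntityProcessorBase.destroy()
        }
      }
    }

    // EntityProcessorBase, simplified:
    public void destroy() {
      query = null;
      if (cacheSupport != null) {
        cacheSupport.close();        // tears down the underlying DIHCache
      }
      cacheSupport = null;           // lookups now fall into the no-cache path
    }

    public void init(Context context) {
      this.context = context;
      if (isFirstInit) {             // false from the second root row onward,
        firstInit(context);          // so the cache is never rebuilt
      }
    }

    protected Map<String,Object> getNext() {
      if (cacheSupport == null) {
        // no cache: hand back every row the sub-entity query returns
        if (rowIterator == null) return null;
        if (rowIterator.hasNext()) return rowIterator.next();
        query = null;
        rowIterator = null;
        return null;
      } else {
        // cache: hand back only the rows matching cacheKey/cacheLookup
        return cacheSupport.getCacheData(context, query, rowIterator);
      }
    }

So from the second root row on, every nested-entity row gets attached to the document being built, which matches the giant SolrInputDocument we see in the heap dumps.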
-----Original Message-----
From: Dyer, James [mailto:james.d...@ingramcontent.com]
Sent: Tuesday, May 14, 2013 4:08 PM
To: solr-user@lucene.apache.org
Subject: RE: delta-import and cache (a story in conflict)

The reason it is writing all the input fields for that document is that this particular error message appends "doc" to the end, which is a subclass of SolrInputDocument, whose "toString" shows all the fields. Not sure if this in particular changed, but I suspect this is a symptom, not a cause.

DIH caching was changed substantially in Solr 3.6. Several bugs in 3.6 were seemingly resolved by 4.0. There might still be bugs, but I would suspect your problem with CachedSqlEntityProcessor and OOM has to do with the amount of data you are caching in memory. In-memory caches are the default but obviously cannot scale indefinitely. There are now options to use disk-backed caches, although none are currently included in the distribution. (See https://issues.apache.org/jira/browse/SOLR-2948 for an option that just generates a sidecar Lucene index for a cache.) To use a cache, specify SqlEntityProcessor (or any other entity processor) and set cacheImpl="f.q.class.name" (the fully-qualified class name); the class must implement DIHCache. See the sketch at the end of this message.

A heap dump would help you find out if your OOM is simply due to too much data. Otherwise, you could just do a small import with a tiny subset of the data and see if it works. On the other hand, your config has doubly-nested cached entities, something I do not think the unit tests are checking, so possibly you found another bug here.

There are bugs being reported about variable resolving (in some cases ${...} does not resolve correctly). These are likely caused by SOLR-4086 (version 4.1). If you give details about what is not resolving correctly, someone might be able to propose a workaround.
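As a rough, untested sketch of the cacheImpl syntax (the disk-backed class name below is only a placeholder, not a class shipped with Solr):

    <!-- in-memory cache; this is what CachedSqlEntityProcessor gives you -->
    <entity name="productSize" processor="SqlEntityProcessor"
            cacheImpl="SortedMapBackedCache"
            cacheKey="PRODUCT_ID" cacheLookup="product.ID"
            query="..."/>

    <!-- disk-backed cache; com.example.MyDiskCache stands in for any
         class implementing DIHCache, e.g. one built from SOLR-2948 -->
    <entity name="productSize" processor="SqlEntityProcessor"
            cacheImpl="com.example.MyDiskCache"
            cacheKey="PRODUCT_ID" cacheLookup="product.ID"
            query="..."/>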
James Dyer
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Keith Naas [mailto:keithn...@dswinc.com]
Sent: Tuesday, May 14, 2013 2:24 PM
To: 'solr-user@lucene.apache.org'
Subject: delta-import and cache (a story in conflict)

Thanks for all the great work on Solr. We have used it for over a year and have been very satisfied with it. However, we have noticed that some of the recent changes have affected import caching in a not-so-good way.

We are using Solr 4.2.0. We use full and delta imports. We only use a delta import query on the root entity (our object model does not safely support updates to the nested entities). Here is a snippet of the xml:

<entity name="product" pk="ID" query="..." deltaImportQuery="..."
        deltaQuery="..." deletedPkQuery="...">
  <field column="ID" name="id" />
  <field column="NAME" name="name" />
  ...
  <entity name="productSize" query="..."
          processor="CachedSqlEntityProcessor"
          cacheKey="PRODUCT_ID" cacheLookup="product.ID">
    <entity name="productSizeAttributes" query="..."
            processor="CachedSqlEntityProcessor"
            cacheKey="SIZE_ID" cacheLookup="productSize.SIZE_ID"
            transformer="LogTransformer" logLevel="info"
            logTemplate="The size for product ${product.ID} is ${productSizeAttributes}">
      <field column="SIZE_ID" name="size" />
      <field column="SIZE_NAME" name="sizeName" />
      <field column="SIZE_CODE" name="sizeCode"/>
    </entity>
  </entity>
</entity>

We have noticed that delta imports that used to take 30 seconds now run indefinitely and eventually cause an OutOfMemory condition on a huge multi-GB heap. Here is the stack trace:

java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
        at java.lang.StringBuilder.append(StringBuilder.java:119)
        at java.lang.StringBuilder.append(StringBuilder.java:115)
        at java.util.AbstractCollection.toString(AbstractCollection.java:422)
        at java.lang.String.valueOf(String.java:2826)
        at java.lang.StringBuilder.append(StringBuilder.java:115)
        at org.apache.solr.common.SolrInputField.toString(SolrInputField.java:215)
        at java.lang.String.valueOf(String.java:2826)
        at java.lang.StringBuilder.append(StringBuilder.java:115)
        at java.util.AbstractCollection.toString(AbstractCollection.java:422)
        at java.lang.String.valueOf(String.java:2826)
        at java.lang.StringBuilder.append(StringBuilder.java:115)
        at org.apache.solr.common.SolrInputDocument.toString(SolrInputDocument.java:192)
        at java.lang.String.valueOf(String.java:2826)
        at java.lang.StringBuilder.append(StringBuilder.java:115)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:524)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
        at org.apache.solr.handler.dataimport.DocBuilder.doDelta(DocBuilder.java:353)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:219)
        at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:451)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:489)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)

DocBuilder.buildDocument line 354 in Solr 4.2.0:

SolrException.log(LOG, "Exception while processing: " + epw.getEntity().getName() + " document : " + doc, e);

The doc.toString is appending all SolrInputFields to the string. Why are the SolrInputFields so big? It is hard to say, because the original exception is not logged.

After debugging for a few days, it appears that during a delta-import the cache is destroyed prematurely. DocBuilder.buildDocument is called for each row returned by the deltaQuery. In the finally block of buildDocument, it calls destroy on all EntityProcessorWrappers. This eventually calls destroy on EntityProcessorBase, which, after destroying the cacheSupport, sets cacheSupport to null. For all subsequent buildDocument calls, EntityProcessorBase.init() is eventually executed. It looks at the isFirstInit flag (which is now false) and skips re-initializing the cache (which likely should never have been destroyed except on the last row returned by the deltaQuery). Finally, when the rows for the nested entities are fetched, it skips the cache behavior, re-executes the SQL, and loads every single row from the nested entities as new fields in each document. Thus, if a query returned 100,000 productSize records, every product after the first would end up with all 100,000 productSizes attached to it.

This behavior makes delta-imports unusable whenever caching is used, in any release after this functionality was changed. We have also noticed that caching does not seem to be honored when the SQL statement contains resolvable tokens (${...}). We can work around those 2 queries by disabling caching, but I cannot disable caching on the other 20 queries; imports would take hours.

Has anyone else seen this?

Keith Naas