I think the problem is that DIH catches Exception but not Error, so a StackOverflowError will slip past it. Normally, the SolrDispatchFilter will log such errors, but the import is performed in a new thread, so the error is not logged anywhere. However, DIH will not commit documents in this case (and there is no mention of a commit in your DIH status).
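To illustrate the Exception vs. Error point, here is a minimal standalone sketch. It is not the actual DIH code: the class name, the runImport method and the recursive stand-in for the failing regex are all made up for the example. It runs a task in a new thread, once with catch (Exception) and once with catch (Throwable), and reports what each handler recorded.

```java
import java.util.concurrent.atomic.AtomicReference;

public class CatchThrowableDemo {

    // Hypothetical stand-in for a transformer that recurses too deeply,
    // as a backtracking regex can on a large input.
    static int recurse(int depth) {
        return recurse(depth + 1) + 1;
    }

    // Runs the failing task in a new thread and returns what the
    // catch clause recorded ("nothing recorded" if it never fired).
    static String runImport(boolean catchThrowable) {
        AtomicReference<String> recorded = new AtomicReference<>("nothing recorded");

        // Narrow handler: StackOverflowError is an Error, not an Exception,
        // so this catch clause is skipped and the thread simply dies.
        Runnable narrow = () -> {
            try {
                recurse(0);
            } catch (Exception e) {
                recorded.set("caught " + e.getClass().getSimpleName());
            }
        };

        // Broad handler: Throwable covers both Exception and Error.
        Runnable broad = () -> {
            try {
                recurse(0);
            } catch (Throwable e) {
                recorded.set("caught " + e.getClass().getSimpleName());
            }
        };

        Thread worker = new Thread(catchThrowable ? broad : narrow);
        // Suppress the JVM's default stack trace on stderr to keep the demo output short.
        worker.setUncaughtExceptionHandler((thread, err) -> { });
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return recorded.get();
    }

    public static void main(String[] args) {
        System.out.println("catch (Exception): " + runImport(false)); // nothing recorded
        System.out.println("catch (Throwable): " + runImport(true));  // caught StackOverflowError
    }
}
```

With the narrow handler the worker thread ends without recording anything, which matches the silent exit seen in the status output. With the broad handler the error can be logged and reported.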
We should change the catch clause to catch Throwable so that this does not happen again. I'll open an issue and attach a patch.

Btw, Ahmed, Solr has a tokenizer which is much better at stripping HTML -- HTMLStripWhitespaceTokenizerFactory -- which you can use for such tasks.

On Sun, Nov 16, 2008 at 12:30 AM, Ahmed Hammad <[EMAIL PROTECTED]> wrote:

> I had a similar problem like Giri. I have 17,000 record in one table and DIH
> can import only 12464.
>
> After some investigation, I found my problem.
>
> I have a regular expression to strip off html tags form input text, as
> following:
>
> <field sourceColName="content" column="content" regex="<(.|\n)*?>"
> replaceWith=" "/>
>
> The DIH RegEx have stack overflow on the record 17,000 due to error in the
> content and then DIH exit without any error in the log on in the status
> command. Here is the status:
>
> <lst name="statusMessages">
> <str name="Time Elapsed">0:0:31.657</str>
> <str name="Total Requests made to DataSource">1</str>
> <str name="Total Rows Fetched">12464</str>
> <str name="Total Documents Processed">12464</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2008-11-15 20:40:58</str>
> </lst>
>
> I found the error in Eclipse Console window while debugging; it was a stack
> overflow in the RegEx library.
>
> The problem is that, DIH does not show any problem in log file on in status
> message.
> What I think is important is to show whatever error happen in the log file.
>
> I noticed also that, in case of no error a log message show completness:
>
> Nov 15, 2008 8:57:34 PM org.apache.solr.handler.dataimport.DocBuilder execute
> INFO: Time taken = 0:0:40.656
>
> In case of RegEx stack overflow error, this log message does not appear.
>
> I am researching on how to catch such error in DIH. Any ideas?
> > > Regards, > ahmd > > On Sat, Nov 15, 2008 at 6:32 AM, Noble Paul നോബിള് नोब्ळ् < > [EMAIL PROTECTED]> wrote: > > > There is no obvious problem > > > > I can be reasonably sure that > > the query > > > > select * from climatedata.ws_record limit 1000000 > > > > would have fetched only 615360 rows. > > This is a very reliable pice of information > > <str name="Total Rows Fetched">615360</str> > > > > On Sat, Nov 15, 2008 at 12:41 AM, Giri <[EMAIL PROTECTED]> wrote: > > > Hi Noble, > > > thanks for the help, here are the details: the field "id" is unique, > when > > I > > > did a select distinct(id), it returned 1 million rows. > > > > > > ------------------------------------------------------------------- > > > db-data-config.xml > > > note: I limit the resultset to 1 million in the select query > > > ------------------------------------------------------------------- > > > <dataConfig> > > > <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" > > > url="jdbc:mysql://localhost:3306/climatedata" user="user" password="pw" > > > batchSize ="-1"/> > > > <document name="climateRecord"> > > > <entity name="observation" query="select * from > > > climatedata.ws_record limit 1000000"> > > > <field column="id" name="id" /> > > > <field column="inst_code" name="inst_code" /> > > > <field column="inst_name" name="inst_name" /> > > > <field column="meas_name" name="meas_name" /> > > > <field column="latitude" name="latitude" /> > > > <field column="longitude" name="longitude" /> > > > <field column="ob_id" name="ob_id" /> > > > <field column="in_id" name="in_id" /> > > > <field column="ob_name" name="ob_name" /> > > > </entity> > > > </document> > > > </dataConfig> > > > > > > ----------------------------------------------------------------- > > > in the solr Schema.xml: > > > ---------------------------------------------------------------- > > > <fields> > > > <field name="id" type="string" indexed="true" stored="true" > > > multiValued="false"/> > > > <field 
name="inst_code" type="text" indexed="true" stored="true" > > > multiValued="true" required="false"/> > > > <field name="inst_name" type="text" indexed="true" stored="true" > > > multiValued="true" required="false"/> > > > <field name="meas_name" type="text" indexed="true" stored="true" > > > multiValued="true" required="false"/> > > > <field name="latitude" type="sfloat" class="solr.FloatField" > > > indexed="true" stored="true" required="false"/> > > > <field name="longitude" type="sfloat" class="solr.FloatField" > > > indexed="true" stored="true" required="false"/> > > > <field name="ob_id" type="string" indexed="true" stored="true" > > > multiValued="true"/> > > > <field name="in_id" type="string" indexed="true" stored="true" > > > multiValued="true"/> > > > <field name="ob_name" type="text" indexed="true" stored="true" > > > multiValued="true"/> > > > > > > <!-- catchall field, containing all other searchable text fields > > > (implemented > > > via copyField further on in this schema --> > > > <field name="text" type="text" indexed="true" stored="false" > > > multiValued="true" required="false"/> > > > > > > <!-- non-tokenized version of manufacturer to make it easier to sort > or > > > group > > > results by manufacturer. copied from "manu" via copyField --> > > > <field name="manu_exact" type="string" indexed="true" stored="false" > > > required="false"/> > > > > > > > > > <!-- Dynamic field definitions. If a field name is not found, > > > dynamicFields > > > will be used if the name matches any of the patterns. > > > RESTRICTION: the glob-like pattern in the name attribute must > have > > > a "*" only at the start or the end. > > > EXAMPLE: name="*_i" will match any field ending in _i (like > > myid_i, > > > z_i) > > > Longer patterns will be matched first. if equal size patterns > > > both match, the first appearing in the schema will be used. 
--> > > > <dynamicField name="*_i" type="sint" indexed="true" > > stored="true"/> > > > <dynamicField name="*_s" type="string" indexed="true" > > stored="true"/> > > > <dynamicField name="*_l" type="slong" indexed="true" > > stored="true"/> > > > <dynamicField name="*_t" type="text" indexed="true" > > stored="true"/> > > > <dynamicField name="*_b" type="boolean" indexed="true" > > stored="true"/> > > > <dynamicField name="*_f" type="sfloat" indexed="true" > > stored="true"/> > > > <dynamicField name="*_d" type="sdouble" indexed="true" > > stored="true"/> > > > <dynamicField name="*_dt" type="date" indexed="true" > > stored="true"/> > > > </fields> > > > > > > ---------------------------------------------------- > > > I run the index via firefox browser using > > > http://localhost:8080/solr/dataimport?command=full-import > > > I checked the status using > > > http://localhost:8080/solr/dataimport?command=status > > > initially the status increased steadily, but after reaching 613071, the > > > status stayed for a while (as below), and then it displayed the > completed > > > message : > > > ---------------------------------------------------- > > > <response> > > > - > > > <lst name="responseHeader"> > > > <int name="status">0</int> > > > <int name="QTime">1</int> > > > </lst> > > > - > > > <lst name="initArgs"> > > > - > > > <lst name="defaults"> > > > <str name="config">db-data-config.xml</str> > > > </lst> > > > </lst> > > > <str name="command">status</str> > > > <str name="status">busy</str> > > > <str name="importResponse">A command is still running...</str> > > > - > > > <lst name="statusMessages"> > > > <str name="Time Elapsed">0:3:24.266</str> > > > <str name="Total Requests made to DataSource">1</str> > > > <str name="Total Rows Fetched">613071</str> > > > <str name="Total Documents Processed">613070</str> > > > <str name="Total Documents Skipped">0</str> > > > <str name="Full Dump Started">2008-11-14 12:12:16</str> > > > </lst> > > > - > > > <str 
name="WARNING"> > > > This response format is experimental. It is likely to change in the > > future. > > > </str> > > > </response> > > > > > > ----------------------------------------------------------- > > > > > >>>NOTE: this is the status result after it completed > > > ----------------------------------------------------------- > > > > > > <response> > > > - > > > <lst name="responseHeader"> > > > <int name="status">0</int> > > > <int name="QTime">1</int> > > > </lst> > > > - > > > <lst name="initArgs"> > > > - > > > <lst name="defaults"> > > > <str name="config">db-data-config.xml</str> > > > </lst> > > > </lst> > > > <str name="command">status</str> > > > <str name="status">idle</str> > > > <str name="importResponse"/> > > > - > > > <lst name="statusMessages"> > > > <str name="Total Requests made to DataSource">1</str> > > > <str name="Total Rows Fetched">615360</str> > > > <str name="Total Documents Skipped">0</str> > > > <str name="Full Dump Started">2008-11-14 12:12:16</str> > > > - > > > <str name=""> > > > Indexing completed. Added/Updated: 615360 documents. Deleted 0 > documents. > > > </str> > > > <str name="Committed">2008-11-14 12:16:32</str> > > > <str name="Optimized">2008-11-14 12:16:32</str> > > > <str name="Time taken ">0:4:16.154</str> > > > </lst> > > > - > > > <str name="WARNING"> > > > This response format is experimental. It is likely to change in the > > future. > > > </str> > > > </response> > > > > > > ----------------------------------------------------- > > > > > > here is the full solr scehma.xml content: > > > ---------------------------------------------------- > > > <?xml version="1.0" ?> > > > <!-- The Solr schema file. This file should be named "schema.xml" and > > > should be in the conf directory under the solr home > > > (i.e. ./solr/conf/schema.xml by default) > > > or located where the classloader for the Solr webapp can find it. > > > > > > For more information, on how to customize this file, please see... 
> > > http://wiki.apache.org/solr/SchemaXml > > > --> > > > > > > <schema name="example" version="1.1"> > > > <types> > > > <!-- field type definitions. The "name" attribute is > > > just a label to be used by field definitions. The "class" > > > attribute and any other attributes determine the real > > > behavior of the fieldtype. --> > > > > > > <!-- The StringField type is not analyzed, but indexed/stored > verbatim > > > --> > > > <fieldtype name="string" class="solr.StrField" > > sortMissingLast="true"/> > > > > > > <!-- boolean type: "true" or "false" --> > > > <fieldtype name="boolean" class="solr.BoolField" > > > sortMissingLast="true"/> > > > > > > <!-- The optional sortMissingLast and sortMissingFirst attributes > are > > > currently supported on types that are sorted internally as a > > > strings. > > > - If sortMissingLast="true" then a sort on this field will cause > > > documents > > > without the field to come after documents with the field, > > > regardless of the requested sort order (asc or desc). > > > - If sortMissingFirst="true" then a sort on this field will cause > > > documents > > > without the field to come before documents with the field, > > > regardless of the requested sort order. > > > - If sortMissingLast="false" and sortMissingFirst="false" (the > > > default), > > > then default lucene sorting will be used which places docs > without > > > the field > > > first in an ascending sort and last in a descending sort. 
> > > --> > > > > > > <!-- numeric field types that store and index the text > > > value verbatim (and hence don't support range queries since the > > > lexicographic ordering isn't equal to the numeric ordering) --> > > > <fieldtype name="integer" class="solr.IntField"/> > > > <fieldtype name="long" class="solr.LongField"/> > > > <fieldtype name="float" class="solr.FloatField"/> > > > <fieldtype name="double" class="solr.DoubleField"/> > > > > > > > > > <!-- Numeric field types that manipulate the value into > > > a string value that isn't human readable in it's internal form, > > > but with a lexicographic ordering the same as the numeric > > ordering > > > so that range queries correctly work. --> > > > <fieldtype name="sint" class="solr.SortableIntField" > > > sortMissingLast="true"/> > > > <fieldtype name="slong" class="solr.SortableLongField" > > > sortMissingLast="true"/> > > > <fieldtype name="sfloat" class="solr.SortableFloatField" > > > sortMissingLast="true"/> > > > <fieldtype name="sdouble" class="solr.SortableDoubleField" > > > sortMissingLast="true"/> > > > > > > > > > <!-- The format for this date field is of the form > > 1995-12-31T23:59:59Z, > > > and > > > is a more restricted form of the canonical representation of > > > dateTime > > > http://www.w3.org/TR/xmlschema-2/#dateTime > > > The trailing "Z" designates UTC time and is mandatory. > > > Optional fractional seconds are allowed: > 1995-12-31T23:59:59.999Z > > > All other components are mandatory. --> > > > <fieldtype name="date" class="solr.DateField" > sortMissingLast="true"/> > > > > > > <!-- solr.TextField allows the specification of custom text > analyzers > > > specified as a tokenizer and a list of token filters. Different > > > analyzers may be specified for indexing and querying. > > > > > > The optional positionIncrementGap puts space between multiple > > > fields of > > > this type on the same document, with the purpose of preventing > > > false phrase > > > matching across fields. 
> > > > > > For more info on customizing your analyzer chain, please see... > > > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters > > > > > > --> > > > > > > <!-- Standard analyzer commonly used by Lucene developers > > > --> > > > <!-- Standard analyzer commonly used by Lucene developers --> > > > <fieldtype name="text_lu" class="solr.TextField" > > > positionIncrementGap="100"> > > > <analyzer> > > > <tokenizer class="solr.StandardTokenizerFactory"/> > > > <filter class="solr.StandardFilterFactory"/> > > > <filter class="solr.LowerCaseFilterFactory"/> > > > <filter class="solr.StopFilterFactory"/> > > > <filter class="solr.EnglishPorterFilterFactory"/> > > > </analyzer> > > > </fieldtype> > > > <!-- One could also specify an existing Analyzer implementation in > > Java > > > via the class attribute on the analyzer element: > > > <fieldtype name="text_lu" class="solr.TextField"> > > > <analyzer > > > class="org.apache.lucene.analysis.snowball.SnowballAnalyzer"/> > > > </fieldType> > > > --> > > > > > > <!-- A text field that only splits on whitespace for more exact > > matching > > > --> > > > <fieldtype name="text_ws" class="solr.TextField" > > > positionIncrementGap="100"> > > > <analyzer> > > > <tokenizer class="solr.WhitespaceTokenizerFactory"/> > > > </analyzer> > > > </fieldtype> > > > > > > <!-- A text field that uses WordDelimiterFilter to enable splitting > > and > > > matching of > > > words on case-change, alpha numeric boundaries, and > > non-alphanumeric > > > chars > > > so that a query of "wifi" or "wi fi" could match a document > > > containing "Wi-Fi". 
> > > Synonyms and stopwords are customized by external files, and > > > stemming is enabled --> > > > <fieldtype name="text" class="solr.TextField" > > > positionIncrementGap="100"> > > > <analyzer type="index"> > > > <tokenizer class="solr.WhitespaceTokenizerFactory"/> > > > <!-- in this example, we will only use synonyms at query time > > > <filter class="solr.SynonymFilterFactory" > > > synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/> > > > --> > > > <!--<filter class="solr.WordDelimiterFilterFactory" > > > generateWordParts="1"/>--> > > > <filter class="solr.StopFilterFactory" ignoreCase="true"/> > > > <filter class="solr.LowerCaseFilterFactory"/> > > > </analyzer> > > > <analyzer type="query"> > > > <tokenizer class="solr.WhitespaceTokenizerFactory"/> > > > <filter class="solr.StopFilterFactory" ignoreCase="true"/> > > > <filter class="solr.LowerCaseFilterFactory"/> > > > </analyzer> > > > </fieldtype> > > > > > > <!-- Less flexible matching, but less false matches. Probably not > > ideal > > > for product names > > > but may be good for SKUs. Can insert dashes in the wrong place > > and > > > still match. 
--> > > > <fieldtype name="textTight" class="solr.TextField" > > > positionIncrementGap="100" > > > > <analyzer> > > > <tokenizer class="solr.WhitespaceTokenizerFactory"/> > > > <filter class="solr.SynonymFilterFactory" > synonyms="synonyms.txt" > > > ignoreCase="true" expand="false"/> > > > <filter class="solr.StopFilterFactory" ignoreCase="true"/> > > > <filter class="solr.WordDelimiterFilterFactory" > > > generateWordParts="0" generateNumberParts="0" catenateWords="1" > > > catenateNumbers="1" catenateAll="0"/> > > > <filter class="solr.LowerCaseFilterFactory"/> > > > <filter class="solr.EnglishPorterFilterFactory" > > > protected="protwords.txt"/> > > > </analyzer> > > > </fieldtype> > > > </types> > > > <fields> > > > <!-- Valid attributes for fields: > > > name: mandatory - the name for the field > > > type: mandatory - the name of a previously defined type from the > > > <types> section > > > indexed: true if this field should be indexed (searchable) > > > stored: true if this field should be retrievable > > > multiValued: true if this field may contain multiple values per > > > document > > > omitNorms: (expert) set to true to omit the norms associated with > > > this field > > > (this disables length normalization and index-time > > > boosting for the field) > > > --> > > > <field name="id" type="string" indexed="true" stored="true" > > > multiValued="false"/> > > > <field name="inst_code" type="text" indexed="true" stored="true" > > > multiValued="true" required="false"/> > > > <field name="inst_name" type="text" indexed="true" stored="true" > > > multiValued="true" required="false"/> > > > <field name="meas_name" type="text" indexed="true" stored="true" > > > multiValued="true" required="false"/> > > > <field name="latitude" type="sfloat" class="solr.FloatField" > > > indexed="true" stored="true" required="false"/> > > > <field name="longitude" type="sfloat" class="solr.FloatField" > > > indexed="true" stored="true" required="false"/> > > > <field 
name="ob_id" type="string" indexed="true" stored="true" > > > multiValued="true"/> > > > <field name="in_id" type="string" indexed="true" stored="true" > > > multiValued="true"/> > > > <field name="ob_name" type="text" indexed="true" stored="true" > > > multiValued="true"/> > > > > > > <!-- catchall field, containing all other searchable text fields > > > (implemented > > > via copyField further on in this schema --> > > > <field name="text" type="text" indexed="true" stored="false" > > > multiValued="true" required="false"/> > > > > > > > > > <!-- non-tokenized version of manufacturer to make it easier to sort > or > > > group > > > results by manufacturer. copied from "manu" via copyField --> > > > <field name="manu_exact" type="string" indexed="true" stored="false" > > > required="false"/> > > > > > > > > > <!-- Dynamic field definitions. If a field name is not found, > > > dynamicFields > > > will be used if the name matches any of the patterns. > > > RESTRICTION: the glob-like pattern in the name attribute must > have > > > a "*" only at the start or the end. > > > EXAMPLE: name="*_i" will match any field ending in _i (like > > myid_i, > > > z_i) > > > Longer patterns will be matched first. if equal size patterns > > > both match, the first appearing in the schema will be used. 
--> > > > <dynamicField name="*_i" type="sint" indexed="true" > > stored="true"/> > > > <dynamicField name="*_s" type="string" indexed="true" > > stored="true"/> > > > <dynamicField name="*_l" type="slong" indexed="true" > > stored="true"/> > > > <dynamicField name="*_t" type="text" indexed="true" > > stored="true"/> > > > <dynamicField name="*_b" type="boolean" indexed="true" > > stored="true"/> > > > <dynamicField name="*_f" type="sfloat" indexed="true" > > stored="true"/> > > > <dynamicField name="*_d" type="sdouble" indexed="true" > > stored="true"/> > > > <dynamicField name="*_dt" type="date" indexed="true" > > stored="true"/> > > > </fields> > > > > > > <!-- field to use to determine and enforce document uniqueness. --> > > > <uniqueKey>id</uniqueKey> > > > > > > <!-- field for the QueryParser to use when an explicit fieldname is > > absent > > > --> > > > <defaultSearchField>text</defaultSearchField> > > > > > > <!-- SolrQueryParser configuration: defaultOperator="AND|OR" --> > > > <solrQueryParser defaultOperator="AND"/> > > > > > > <!-- copyField commands copy one field to another at the time a > document > > > is added to the index. It's used either to index the same field > > > different > > > ways, or to add multiple fields to the same field for > > easier/faster > > > searching. --> > > > > > > > > > > > > <!-- Similarity is the scoring routine for each document vs a query. > > > A custom similarity may be specified here, but the default is fine > > > for most applications. 
--> > > > <!-- <similarity class="org.apache.lucene.search.DefaultSimilarity"/> > > --> > > > > > > </schema> > > > > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------- > > > > > > > > > On Wed, Nov 12, 2008 at 11:01 PM, Noble Paul നോബിള് नोब्ळ् < > > > [EMAIL PROTECTED]> wrote: > > > > > >> the fact that it got committed in the end suggests there was no error > in > > >> between > > >> > > >> look at the status url and see the no:of rows returned etc. > > >> > > >> It gives a clue as to what would have really happened. or you can > > >> paste your dataconfig and status xmls and we may be able to suggest > > >> something > > >> > > >> On Thu, Nov 13, 2008 at 9:26 AM, Giri <[EMAIL PROTECTED]> wrote: > > >> > Hi Noble, > > >> > > > >> > thanks for reply, my comments are below > > >> > > > >> >>>why is the id field multivalued? > > >> > I was just trying various options, yes, this ID is unique, and I > check > > >> for > > >> > duplicates, when I did a distinct (id) query to the MySQL database, > it > > >> > returned almost 2 million. > > >> > > > >> >>> look at the status host:post/dataimport gives you the status > > >> > I constantly checked the status using the dataimport URL, the > > status > > >> was > > >> > increased upto 600K records, then it stopped increasing, then took > few > > >> > minutes to commit the indexed data. > > >> > > > >> > > > >> > On Tue, Nov 11, 2008 at 11:35 PM, Noble Paul നോബിള് नोब्ळ् < > > >> > [EMAIL PROTECTED]> wrote: > > >> > > > >> >> why is the id field multivalued? is there a uniqueKey in the schema > ? > > >> >> Are you sure there are no duplicates? 
> > >> >> > > >> >> look at the status host:post/dataimport gives you the status > > >> >> it can give you some clue > > >> >> > > >> >> --Noble > > >> >> > > >> >> > > >> >> On Wed, Nov 12, 2008 at 4:53 AM, Giri <[EMAIL PROTECTED]> > wrote: > > >> >> > Hi, > > >> >> > > > >> >> > I have about ~ 2 million records in a mySQL database table (about > 9 > > >> >> fields > > >> >> > from a single table), and I am trying to load it to the solr > using > > >> >> > DataImportHandler using the command=full-import option. it only > > >> indexed > > >> >> > about 615360 records out of 2 millions. > > >> >> > > > >> >> > here is my db-data-config.xml > > >> >> > <dataConfig> > > >> >> > <dataSource type="JdbcDataSource" > driver="com.mysql.jdbc.Driver" > > >> >> > url="jdbc:mysql://localhost:3306/mydb" user="ua" password="pw" > > >> batchSize > > >> >> > ="-1"/> > > >> >> > <document name="climate"> > > >> >> > <entity name="occurence" query="select * from > mylargetable"> > > >> >> > <field column="id" name="id" /> > > >> >> > <field column="title" name="title" /> > > >> >> > <field column="url" name="url" /> > > >> >> > </entity> > > >> >> > </document> > > >> >> > </dataConfig> > > >> >> > > > >> >> > and in my solr schema.xml, i define these fields as: > > >> >> > > > >> >> > <field name="id" type="string" indexed="true" stored="true" > > >> >> > multiValued="true"/> > > >> >> > <field name="title" type="text" indexed="true" stored="true" > > >> >> > multiValued="true" required="false"/> > > >> >> > <field name="url" type="text" indexed="true" stored="true" > > >> >> > multiValued="true" required="false"/> > > >> >> > > > >> >> > > > >> >> > If I try to index just one field (id), then it indexes about > 960000 > > >> >> records, > > >> >> > but if I try to index all the above three fields, it indexes only > > >> 615360 > > >> >> > records. > > >> >> > > > >> >> > Any help will be appreciated. > > >> >> > > > >> >> > thanks! 
> > >> >> > > > >> >> > > >> >> > > >> >> > > >> >> -- > > >> >> --Noble Paul > > >> >> > > >> > > > >> > > >> > > >> > > >> -- > > >> --Noble Paul > > >> > > > > > > > > > > > -- > > --Noble Paul > > > -- Regards, Shalin Shekhar Mangar.
