Re: problems indexing web content

Markus Jelsma Mon, 28 Mar 2011 10:37:32 -0700

The order of the elements inside the analyzer doesn't really matter: char 
filters are always executed first, regardless of their position in the 
analyzer, followed by the tokenizer and then the token filters. Multiple 
filters of the same type, however, are applied in the order they are listed. 
Also, your error is not caused by a faulty analyzer; there is something wrong 
in your XML.
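
To illustrate the point about ordering, here is a minimal sketch of an index 
analyzer written in the order in which the stages run. It only reuses factories 
already present in your config and is not meant as a recommended analysis 
chain. It uses the camelCase spellings fieldType and charFilter from the Solr 
example schema; XML element names are case-sensitive, so it may be worth 
checking whether the lowercase charfilter in your config is picked up at all.

```xml
<!-- Sketch only: element order here is the conventional one, not a requirement -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- 1. char filters run on the raw character stream -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- 2. the tokenizer splits the stripped text into tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- 3. token filters run in the order listed -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Swapping the two filter lines changes the result of the chain; moving the 
tokenizer line below them should not.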


Anyway, according to your error, check row 1591, column 90 of your XML input. 
The parser found a space directly after a '<' there, which usually means a 
literal '<' in the text was not escaped.
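
As a rough sketch of what well-formed field content could look like in the 
update XML: the field names below are taken from your file, but the guid and 
the text values are made up for illustration. A literal '<' or '&' in the 
content has to be written as an entity, or the whole value has to be wrapped 
in a CDATA section. An unescaped '&' is also what typically produces the 
"expecting semicolon" kind of error you recall for &What.

```xml
<add>
  <doc>
    <!-- hypothetical id, for illustration only -->
    <field name="guid">example-1</field>
    <!-- Escaped: a literal "<" is written as &lt; and a literal "&" as &amp; -->
    <field name="postBody">I &lt; 3 this, &amp;What a day</field>
    <!-- Alternative: CDATA keeps raw markup away from the XML parser -->
    <field name="postTitle"><![CDATA[<b>E X I T</b> & more]]></field>
  </doc>
</add>
```

Whichever variant you pick, the escaping has to happen in the code that 
converts your feed, before the file is posted.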

> Jan,
> 
> thank you for such a quick reply. I have a feed coming in that I convert to
> an <add><doc></doc><doc></doc></add> document. Here is the field type for
> text, including the index and query analyzers, with the changes suggested.
> 
> 
>         <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>             <analyzer type="index">
>                 <charfilter class="solr.HTMLStripCharFilterFactory"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>                 <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>                 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>             </analyzer>
>             <analyzer type="query">
>                 <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>                 <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>                 <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>                 <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>                 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>             </analyzer>
>         </fieldtype>
> 
> 
> Here is a snippet of the file I generate.
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <add>
> <doc>
> <field name="guid">http://twitter.com/uswautis/statuses/51997364122165249</field>
> <field name="title">E X I T</field>
> <field name="authorName">uswautis (Hasanah Uswa)</field>
> <field name="authorEmail"></field>
> <field name="authorLinkMimeType"></field>
> <field name="authorLink">http://twitter.com/uswautis</field>
> <field name="lang">U</field>
> <field name="publishDate">2011-03-27T13:21:52Z</field>
> <field name="aquiDate">2011-03-27T13:22:13Z</field>
> <field name="source"></field>
> <field name="feedURL">http://twitter.com/uswautis/statuses/51997364122165249</field>
> <field name="feedContentMimeType">text/html</field>
> <field name="feedContentEncoding"></field>
> <field name="feedContent">null</field>
> <field name="inboundLinks">0</field>
> <field name="publisherType">MICROBLOG</field>
> <field name="postTitle">E X I T</field>
> <field name="postBodyMimeType">text/html</field>
> <field name="postBodyEncoding">zlib</field>
> <field name="postBody">mime_type: "text/html"
> data: ""
> </field>
> <field name="tags">[]</field>
> </doc>
> 
> <doc>
> <field name="guid">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
> <field name="title">I want the sweater i saw in mango sooooo bad.</field>
> <field name="authorName">imsuperangelica (angelica marie)</field>
> <field name="authorEmail"></field>
> <field name="authorLinkMimeType"></field>
> <field name="authorLink">http://twitter.com/imsuperangelica</field>
> <field name="lang">en</field>
> <field name="publishDate">2011-03-27T13:21:52Z</field>
> <field name="aquiDate">2011-03-27T13:22:13Z</field>
> <field name="source"></field>
> <field name="feedURL">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
> <field name="feedContentMimeType">text/html</field>
> <field name="feedContentEncoding"></field>
> <field name="feedContent">null</field>
> <field name="inboundLinks">0</field>
> <field name="publisherType">MICROBLOG</field>
> <field name="postTitle">I want the sweater i saw in mango sooooo bad.</field>
> <field name="postBodyMimeType">text/html</field>
> <field name="postBodyEncoding">zlib</field>
> <field name="postBody">mime_type: "text/html"
> data: ""
> </field>
> <field name="tags">[]</field>
> </doc>
> 
> </add>
> 
> On Mar 28, 2011, at 1:02 PM, Jan Høydahl wrote:
> > Hi,
> > 
> > I assume you try to post HTML files from post.jar, and use
> > HTMLStripCharFilter to sanitize the HTML.
> > 
> > But you refer to "my file" as if you have multiple docs in one file? XML
> > or HTML? Multiple files? To what UpdateRequestHandler are you posting?
> > /update/xml or /update/extract ? For us to understand what you're trying
> > to achieve, please describe your project in more detail.
> > 
> > 
> > To give some concrete feedback too: First off, your analyzer for "text"
> > is wrong. All charFilters need to be before the tokenizer. You also
> > lack an analyzer with type="query". If I were you I'd try the simplest
> > case first, get rid of mappingCharFilter, StopFilter, WordDelimFilter
> > and Stemmer - just do the most basic stuff you can and go from there.
> > 
> > --
> > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com
> > 
> > On 28 March 2011, at 18:52, Charles Wardell wrote:
> >> Hi Everyone,
> >> 
> >> I set up a server and began to index my data. I have two questions I am
> >> hoping someone can help me with. Many of my files seem to index without
> >> any problems. For others, I get a host of different errors. I am indexing
> >> primarily web-based content and have identified my text field as
> >> follows:
> >> 
> >> <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
> >>     <analyzer type="index">
> >>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>         <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
> >>         <charfilter class="solr.HTMLStripCharFilterFactory"/>
> >>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
> >>         <filter class="solr.LowerCaseFilterFactory"/>
> >>         <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
> >>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>     </analyzer>
> >> </fieldtype>
> >> 
> >> q1) Errors while indexing.
> >> 
> >> * SimplePostTool: WARNING: Unexpected response from Solr:
> >> '<result status="0"></result>' does not contain '<int name="status">0</int>'
> >> 
> >> * SEVERE: Error processing "legacy" update command:
> >> com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character ' '
> >> (code 32) in content after '<' (malformed start element?).
> >> at [row,col {unknown-source}]: [1591,90]
> >> at com.ctc.wstx.sr.StreamScanner.throwUnexpectedChar(StreamScanner.java:648)
> >> 
> >> * Although I can't find the actual error, I recall solr giving me an
> >> error when it came across a string &What - The error was something like
> >> expecting semicolon after "What"
> >> 
> >> 
> >> q2) If my file has 1000 documents and I submit it with post.jar, if it
> >> comes across any of the above errors, will it break the processing of
> >> the whole file, or just the document with the error?
> >> 
> >> 
> >> Thanks in advance.
> >> Your help is very much appreciated.
> >> 
> >> Charlie
