I've seen the JSoup HTML parser library used for this. It worked
really well. The Boilerpipe library may be what you want. Its
schwerpunkt (*) is to separate boilerplate from wanted text in an HTML
page. I don't know what fine-grained control it has.

* raison d'être. There is no English word for this concept.
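
For the XPath idea in the quoted message below, here is a minimal sketch of the extraction step. It assumes the page has already been converted to well-formed XML; that cleanup step (where JSoup or a similar tool would come in) is not shown. It uses only JDK classes. The class name XPathExtract, the sample page, and the "content" id are made up for illustration and are not part of Solr or any library.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathExtract {

    // Evaluates an XPath expression against a well-formed XML string and
    // returns the string value of the first matching node, trimmed.
    // Hypothetical helper for illustration only.
    static String extract(String xml, String expression)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        InputSource source = new InputSource(new StringReader(xml));
        return xpath.evaluate(expression, source).trim();
    }

    public static void main(String[] args) throws Exception {
        // Invented sample page: one boilerplate block, one content block.
        String page = "<html><body>"
                + "<div id=\"nav\">Home | About</div>"
                + "<div id=\"content\"><p>Wanted text.</p></div>"
                + "</body></html>";

        // Select only the content block and ignore the navigation block.
        System.out.println(extract(page, "//div[@id='content']"));
    }
}
```

Real-world HTML is rarely well-formed, so the parse would fail on most crawled pages without the cleanup step first.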

On Tue, Dec 6, 2011 at 1:39 PM, Tommaso Teofili
<tommaso.teof...@gmail.com> wrote:
> Hello Michael,
>
> I can help you with using the UIMA UpdateRequestProcessor [1]; the current
> implementation uses in-memory execution of UIMA pipelines but since I was
> planning to add the support for higher scalability (with UIMA-AS [2]) that
> may help you as well.
>
> Tommaso
>
> [1] :
> http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/uima/src/java/org/apache/solr/uima/processor/UIMAUpdateRequestProcessor.java
> [2] : http://uima.apache.org/doc-uimaas-what.html
>
> 2011/12/5 Michael Kelleher <mj.kelle...@gmail.com>
>
>> Hello Erik,
>>
>> I will take a look at both:
>>
>> org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
>>
>> and
>>
>> org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor
>>
>>
>> and figure out what I need to extend to handle processing in the way I am
>> looking for.  I am assuming that "component" configuration is handled in a
>> standard way such that I can configure my new UpdateProcessor in the same
>> way I would configure any other UpdateProcessor "component"?
>>
>> Thanks for the suggestion.
>>
>>
>> 1 more question:  given that I am probably going to convert the HTML to
>> XML so I can use XPath expressions to "extract" my content, do you think
>> that this kind of processing will overload Solr?  This Solr instance will
>> be used solely for indexing, and will only ever have a single ManifoldCF
>> crawling job feeding it documents at one time.
>>
>> --mike
>>



-- 
Lance Norskog
goks...@gmail.com
