August 29, 2012 7:37:37 PM
Subject: Re: Document Processing
I've seen the JSoup HTML parser library used for this. It worked
really well. The Boilerpipe library may be what you want. Its
schwerpunkt (*) is to separate boilerplate from wanted text in an HTML
page. I don't know what fine-grained control it has.
* raison d'être. There is no English word for t
Hello Michael,
I can help you with using the UIMA UpdateRequestProcessor [1]; the current
implementation uses in-memory execution of UIMA pipelines but since I was
planning to add the support for higher scalability (with UIMA-AS [2]) that
may help you as well.
Tommaso
[1] :
http://svn.apache.org
As for XML "overloading" Solr... it will certainly add processing time as well
as additional memory requirements. At worst it would require more RAM and slow
things down, but whether that is prohibitive depends on the ingestion rate and
the size of the documents.
Erik
Hello Erik,
I will take a look at both:
org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
and
org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor
and figure out what I need to extend to handle processing in the way I
am looking for. I am assumi
On 12/05/2011 01:52 PM, Michael Kelleher wrote:
I am crawling a bunch of HTML pages within a site (using ManifoldCF),
that will be sent to Solr for indexing. I want to extract some
content out of the pages, each piece of content to be stored as its
own field BEFORE indexing in Solr.
My guess
Michael -
I was following your discussion on the MCF list as well.
What kind of information do you want to extract from the HTML pages? The UIMA
thing would be fairly heavyweight. The simplest thing on the Solr side of the
equation would be to write an UpdateProcessor(Factory) and creat
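To make the "extract fields before indexing" idea concrete, here is a minimal plain-Java sketch of such a processing step. It is only an analogy for what an update processor would do and does not use Solr's API. The class name HtmlFieldProcessor and the field names raw_html, title and heading are hypothetical, and the regex matching is a naive stand-in for a real HTML parser such as JSoup.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical pre-indexing step: derives extra fields from raw HTML. Not Solr's API. */
class HtmlFieldProcessor {
    // Naive patterns for illustration only; a real HTML parser is far more robust.
    private static final Pattern TITLE =
        Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern H1 =
        Pattern.compile("<h1[^>]*>(.*?)</h1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");

    /** Returns a copy of the document with the extracted fields added. */
    static Map<String, String> process(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>(doc);
        String html = doc.getOrDefault("raw_html", ""); // assumed input field name
        putIfFound(out, "title", TITLE, html);
        putIfFound(out, "heading", H1, html);
        return out;
    }

    private static void putIfFound(Map<String, String> doc, String field,
                                   Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        if (m.find()) {
            // Drop nested markup, then collapse runs of whitespace.
            String text = TAGS.matcher(m.group(1)).replaceAll(" ")
                              .replaceAll("\\s+", " ").trim();
            doc.put(field, text);
        }
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "page-1");
        doc.put("raw_html", "<html><head><title>Sample Page</title></head>"
                + "<body><h1>Hello <b>World</b></h1></body></html>");
        System.out.println(process(doc));
    }
}
```

In a real setup the same logic would presumably live inside a custom update processor, with the document's fields read and written through Solr's own document type rather than a Map.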