Re: Document Processing

2012-09-06 Thread Tanguy Moal
(Quoted message of August 29, 2012 7:37:37 PM, Subject: Re: Document Processing) I've seen the JSoup HTML parser library used for this. It worked really well. The Boilerpipe library may be what you want. Its schwerpunkt (*) is to separate boilerplate from wanted text in an

Re: Document Processing

2012-09-05 Thread Lance Norskog
(Quoted message, Subject: Re: Document Processing) I've seen the JSoup HTML parser library used for this. It worked really well. The Boilerpipe library may be what you want. Its schwerpunkt (*) is to separate boilerplate from wanted text in an HTML page. I don't know what fine-gr

Re: Document Processing

2012-08-29 Thread Lance Norskog
I've seen the JSoup HTML parser library used for this. It worked really well. The Boilerpipe library may be what you want. Its schwerpunkt (*) is to separate boilerplate from wanted text in an HTML page. I don't know what fine-grained control it has. (*) raison d'être. There is no English word for t

Re: Document Processing

2011-12-06 Thread Tommaso Teofili
Hello Michael, I can help you with using the UIMA UpdateRequestProcessor [1]. The current implementation executes UIMA pipelines in memory, but I was planning to add support for higher scalability (with UIMA-AS [2]), and that may help you as well. Tommaso [1] : http://svn.apache.org

Re: Document Processing

2011-12-06 Thread Erik Hatcher
As for XML "overloading" Solr... it will certainly add processing time as well as additional memory requirements. At worst it'd require more RAM and slow things down, but whether it'd be prohibitive depends on the ingestion rate and the size of the documents. Erik

Re: Document Processing

2011-12-05 Thread Michael Kelleher
Hello Erik, I will take a look at both: org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor and org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor and figure out what I need to extend to handle processing in the way I am looking for. I am assumi

Re: Document Processing

2011-12-05 Thread Michael Kelleher
On 12/05/2011 01:52 PM, Michael Kelleher wrote: I am crawling a bunch of HTML pages within a site (using ManifoldCF) that will be sent to Solr for indexing. I want to extract some content out of the pages, with each piece of content stored as its own field BEFORE indexing in Solr. My guess
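
To make the request above concrete, here is a minimal sketch of pulling pieces of an HTML page into separate named fields before a document is indexed. It is not the JSoup, Boilerpipe, or UpdateProcessor approach discussed in this thread; it uses Python's standard-library html.parser as a stand-in to illustrate the idea. The tag-to-field mapping and the field names (title, heading, body) are assumptions made up for the example, and sending the result to Solr is left out.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects the text of selected tags into named fields."""

    # Hypothetical mapping from HTML tag to target field name.
    TAG_TO_FIELD = {"title": "title", "h1": "heading", "p": "body"}

    def __init__(self):
        super().__init__()
        self.fields = {}      # field name -> list of text values
        self._current = None  # field currently being filled, if any

    def handle_starttag(self, tag, attrs):
        # Start a new value whenever a mapped tag opens.
        if tag in self.TAG_TO_FIELD:
            self._current = self.TAG_TO_FIELD[tag]
            self.fields.setdefault(self._current, []).append("")

    def handle_endtag(self, tag):
        # Stop collecting when the mapped tag that opened the value closes.
        if tag in self.TAG_TO_FIELD and self.TAG_TO_FIELD[tag] == self._current:
            self._current = None

    def handle_data(self, data):
        # Text inside nested tags (e.g. a link in a paragraph) is kept too.
        if self._current is not None:
            self.fields[self._current][-1] += data


def extract_fields(html):
    """Return a dict of field name -> list of non-empty, stripped values."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return {
        name: [value.strip() for value in values if value.strip()]
        for name, values in parser.fields.items()
    }
```

Calling extract_fields on a crawled page yields a dictionary that could be mapped onto multi-valued fields of an index document; text outside the mapped tags is ignored.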

Re: Document Processing

2011-12-05 Thread Erik Hatcher
Michael - I was following your discussion on the MCF list as well. What kind of information do you want to extract from the HTML pages? The UIMA thing would be fairly heavyweight. The simplest thing on the Solr side of the equation would be to write an UpdateProcessor(Factory) and creat