Re: boilerpipe solr tika howto please

2011-01-17 Thread arnaud gaudinat
Thanks Ken, this is what I wanted to know. I'm not very familiar with this kind of modification. However, I will try to do it and ask you for more information if needed. Regards, Arno On 14.01.2011 18:04, Ken Krugler wrote: Hi Arno, On Jan 14, 2011, at 3:57am, arnaud gaudinat wrote: Hel

Re: boilerpipe solr tika howto please

2011-01-14 Thread Ken Krugler
Hi Arno, On Jan 14, 2011, at 3:57am, arnaud gaudinat wrote: Hello, I would like to use BoilerPipe (a very good program which cleans surplus "clutter" from HTML content). I saw that BoilerPipe is included in Tika 0.8 and so should be accessible from Solr, am I right? How can I activate Boi

Re: boilerpipe solr tika howto please

2011-01-14 Thread Adam Estrada
There is another way to ingest data using DIH. Check out the HTMLStripTransformer: http://www2c.cdc.gov/podcasts/createrss.asp?t=r&c=19" processor="XPathEntityProcessor" forEach="/rss/channel | /rss/channel/item" transformer="DateFormatTransformer,HTMLStripTransformer

Re: boilerpipe solr tika howto please

2011-01-14 Thread arnaud gaudinat
I just saw TagSoup, and it seems to clean up bad HTML tags to create a well-formed HTML file. What BoilerPipe does is try to eliminate HTML content that is not part of the useful content for a human reader (i.e. navigation, ads, comments...). Take a look here: http://boilerpipe-web.appspot.com/

Re: boilerpipe solr tika howto please

2011-01-14 Thread Adam Estrada
Is there a drastic difference between this and TagSoup which is already included in Solr? On Fri, Jan 14, 2011 at 6:57 AM, arnaud gaudinat wrote: > Hello, > > I would like to use BoilerPipe (a very good program which cleans the html > content from surplus "clutter"). > I saw that BoilerPipe is in

boilerpipe solr tika howto please

2011-01-14 Thread arnaud gaudinat
Hello, I would like to use BoilerPipe (a very good program which cleans surplus "clutter" from HTML content). I saw that BoilerPipe is included in Tika 0.8 and so should be accessible from Solr, am I right? How can I activate BoilerPipe in Solr? Do I need to change solrconfig.xml ( with org.