Thanks Ken,
this is what I wanted to know. I'm not very familiar with this kind of
modification. However, I will try to do it and ask you for more information
if needed.
Regards,
Arno
On 14.01.2011 18:04, Ken Krugler wrote:
Hi Arno,
On Jan 14, 2011, at 3:57am, arnaud gaudinat wrote:
Hello,
I would like to use BoilerPipe (a very good program which cleans the
html content from surplus "clutter").
I saw that BoilerPipe is inside Tika 0.8 and so should be accessible
from Solr, am I right?
How can I activate Boi
There is another way to ingest data using DIH. Check out the
HTMLStripTransformer
http://www2c.cdc.gov/podcasts/createrss.asp?t=r&c=19";
processor="XPathEntityProcessor"
forEach="/rss/channel | /rss/channel/item"
transformer="DateFormatTransformer,HTMLStripTransformer
I just looked at TagSoup, and it seems to clean up malformed HTML tags to
produce a well-formed HTML file.
What BoilerPipe does is different: it tries to eliminate HTML content that is
not part of the useful content for a human reader (e.g. navigation, ads,
comments...).
Take a look here: http://boilerpipe-web.appspot.com/
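
To make the distinction concrete, here is a minimal sketch of plain tag stripping, written in Python with only the standard library. It is not BoilerPipe, TagSoup, or anything from Solr; the sample page and the names `strip_tags` and `SAMPLE` are made up for illustration. It only shows that removing markup alone still leaves the navigation and ad text in the result, which is the part BoilerPipe aims to drop.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def strip_tags(html):
    """Return the visible text of an HTML string with all markup removed."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page: a navigation bar, one article paragraph, one ad block.
SAMPLE = """<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<p>This is the article text a reader cares about.</p>
<div class="ads">Buy now!</div>
</body></html>"""

text = strip_tags(SAMPLE)
print(text)  # the markup is gone, but "Home", "News" and "Buy now!" remain
```

A boilerplate remover would be expected to return only the article paragraph for a page like this, which is why tag stripping or tag repair alone does not replace it.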
Is there a drastic difference between this and TagSoup which is already
included in Solr?
On Fri, Jan 14, 2011 at 6:57 AM, arnaud gaudinat
wrote:
> Hello,
>
> I would like to use BoilerPipe (a very good program which cleans the html
> content from surplus "clutter").
> I saw that BoilerPipe is in
Hello,
I would like to use BoilerPipe (a very good program which cleans the
html content from surplus "clutter").
I saw that BoilerPipe is inside Tika 0.8 and so should be accessible
from Solr, am I right?
How can I activate BoilerPipe in Solr? Do I need to change
solrconfig.xml (with
org.