Given a raw RSS feed URL such as the following:
http://persistent.info/cgi-bin/feed-proxy?url=http%3A%2F%2Fwww.amazon.com%2Frss%2Ftag%2Fblu-ray%2Fnew%2Fref%3Dtag_rsh_hl_ersn

I can easily get to the value of the description for an item like so:
<field column="description" xpath="/rss/item/description" />

But the content of "description" is itself HTML, and it is this HTML chunk
that carries some pretty decent information that I would like to import as
well.
1) For example it has the image for the item:
<img src="http://ecx.images-amazon.com/images/I/51yyAAoYzKL._SL160_SS160_.jpg" ... />
2) It has the price for the item:
<span class="tgProductPrice">$13.99</span>
There are many other useful pieces of data that are not exposed as proper
RSS elements; they are simply thrown together inside the HTML chunk that is
served as the value for xpath="/rss/item/description".

So, how can I configure DIH to import this HTML information as well?
Is Tika the way to go?
Can someone give a brief example of what a config file with both Tika config
and RSS config would/should look like?
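
For reference, here is a rough, untested sketch of the non-Tika route I can
imagine, assuming DIH's RegexTransformer can pull values out of the
"description" column through sourceColName. The entity name, the "image" and
"price" column names, the forEach path and both regular expressions are my
own guesses, not something I have working:

```xml
<dataConfig>
  <dataSource type="URLDataSource" />
  <document>
    <!-- forEach mirrors the xpath I use above; a standard RSS 2.0 feed
         may need /rss/channel/item instead (assumption). -->
    <entity name="item"
            processor="XPathEntityProcessor"
            url="http://persistent.info/cgi-bin/feed-proxy?url=http%3A%2F%2Fwww.amazon.com%2Frss%2Ftag%2Fblu-ray%2Fnew%2Fref%3Dtag_rsh_hl_ersn"
            forEach="/rss/item"
            transformer="RegexTransformer">
      <field column="description" xpath="/rss/item/description" />
      <!-- Hypothetical columns: capture group 1 from the description HTML. -->
      <field column="image" sourceColName="description"
             regex='&lt;img\s+src="\s*([^"]+)"' />
      <field column="price" sourceColName="description"
             regex='&lt;span class="tgProductPrice"&gt;([^&lt;]*)&lt;/span&gt;' />
    </entity>
  </document>
</dataConfig>
```

The "image" and "price" fields would also have to exist in schema.xml, and
regexes against HTML are fragile, which is why I am asking whether Tika is
the better fit.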

Thanks!
- Pulkit
