It's starting to sound like Solr Cell needs a SearchComponent as well, one that can come before the QueryComponent and map its extracted content into the other components. Essentially, take the functionality of the extractOnly option and have it feed the other SearchComponents.
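(A rough sketch of what such a component might look like, written against the current SearchComponent API; the class name, the context key, and the idea of handing the extracted text to downstream components through the request context are assumptions, not existing Solr code. It would be wired in ahead of QueryComponent with a first-components entry in solrconfig.xml.)

  import java.io.IOException;
  import java.io.InputStream;

  import org.apache.solr.common.util.ContentStream;
  import org.apache.solr.handler.component.ResponseBuilder;
  import org.apache.solr.handler.component.SearchComponent;
  import org.apache.tika.Tika;
  import org.apache.tika.exception.TikaException;

  /**
   * Sketch: runs ahead of QueryComponent, strips markup from any incoming
   * content streams with Tika, and stashes the plain text in the request
   * context for later components (e.g. MLT) to pick up.
   */
  public class ExtractTextComponent extends SearchComponent {

    // Made-up context key that downstream components would read.
    public static final String EXTRACTED_TEXT = "extractedText";

    private final Tika tika = new Tika();

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
      Iterable<ContentStream> streams = rb.req.getContentStreams();
      if (streams == null) {
        return; // nothing to extract
      }
      StringBuilder text = new StringBuilder();
      for (ContentStream cs : streams) {
        try (InputStream in = cs.getStream()) {
          // Tika auto-detects HTML and returns the body text without the tags.
          text.append(tika.parseToString(in)).append('\n');
        } catch (TikaException e) {
          throw new IOException("Extraction failed for " + cs.getSourceInfo(), e);
        }
      }
      rb.req.getContext().put(EXTRACTED_TEXT, text.toString());
    }

    @Override
    public void process(ResponseBuilder rb) {
      // no-op: this component only prepares input for the others
    }

    @Override
    public String getDescription() {
      return "Extracts plain text from content streams via Tika";
    }
  }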
On Aug 8, 2009, at 10:42 AM, Ken Krugler wrote:
On Aug 7, 2009, at 5:23pm, Jay Hill wrote:
I'm using the MoreLikeThisHandler with a content stream to get documents from my index that match content from an HTML page, like this:
http://localhost:8080/solr/mlt?stream.url=http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2009/08/06/SP5R194Q13.DTL&mlt.fl=body&rows=4&debugQuery=true
But, not surprisingly, the query generated is meaningless because a lot of the markup is picked out as terms:
<str name="parsedquery_toString">
body:li body:href body:div body:class body:a body:script body:type
body:js
body:ul body:text body:javascript body:style body:css body:h body:img
body:var body:articl body:ad body:http body:span body:prop
</str>
Does anyone know a way to transform the HTML so that the content can be parsed out of the content stream and processed without the markup? Or do I need to write my own HTMLParsingMoreLikeThisHandler?
You'd want to parse the HTML to extract only the text first, and use that for your index data.

Both the Nutch and Tika OSS projects have examples of using HTML parsers (based on TagSoup or CyberNeko) to generate content suitable for indexing.
-- Ken
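(To illustrate the approach described above, a minimal extraction sketch using Tika's AutoDetectParser; the class name is made up, and any of the HTML parsers mentioned would work just as well:)

  import java.io.InputStream;
  import java.net.URL;

  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.sax.BodyContentHandler;

  public class ExtractPageText {
    public static void main(String[] args) throws Exception {
      // Fetch the page and let Tika's HTML parser keep only the body text,
      // dropping tags, scripts, and styles.
      try (InputStream in = new URL(args[0]).openStream()) {
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        new AutoDetectParser().parse(in, handler, new Metadata(), new ParseContext());
        System.out.println(handler.toString());
      }
    }
  }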
If I parse the content out to a plain text file and point the stream.url param to file:///parsedfile.txt, it works great.
-Jay
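(That workaround can also be scripted end to end; a quick sketch assuming a recent JDK and the Tika facade, with the host, port, and mlt.fl field taken from the request earlier in the thread:)

  import java.io.InputStream;
  import java.net.URL;
  import java.net.URLEncoder;
  import java.nio.charset.StandardCharsets;
  import java.nio.file.Files;
  import java.nio.file.Path;

  import org.apache.tika.Tika;

  public class MltFromHtml {
    public static void main(String[] args) throws Exception {
      String pageUrl = args[0]; // e.g. the sfgate.com article above

      // 1. Strip the markup locally with Tika.
      String text;
      try (InputStream in = new URL(pageUrl).openStream()) {
        text = new Tika().parseToString(in);
      }

      // 2. Write the plain text where Solr can read it via stream.url=file:///...
      Path parsed = Files.createTempFile("parsed", ".txt");
      Files.writeString(parsed, text, StandardCharsets.UTF_8);

      // 3. Build the MLT request (host, core, and field names as in the thread).
      String mltUrl = "http://localhost:8080/solr/mlt"
          + "?stream.url=" + URLEncoder.encode(parsed.toUri().toString(), StandardCharsets.UTF_8)
          + "&mlt.fl=body&rows=4";
      System.out.println(mltUrl);
    }
  }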
--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search