David,
Solr doesn't index XML files as such. XML is only the wrapper for the
text that gets indexed: the document structure (the fields and their
types) is defined in schema.xml, and the field text to be indexed is
sent to Solr wrapped in an XML request.
Regarding your scenario, you would need to write code that parses the
HTML as desired, taking your exclude rules into account, wraps the text
to be indexed (along with any metadata such as the HTML filename or
URL) in XML, and POSTs it to Solr using the XML structure described
here:
<http://wiki.apache.org/solr/UpdateXmlMessages>
The XML request body is just a carrier of the data in a structured
way, nothing more.
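As a rough sketch of the steps above (not a definitive implementation),
the Python below strips the markup from one HTML page, skips the page if
it carries an exclude marker, and wraps what is left in an <add> message.
The update URL, the field names "id" and "text", and the meta tag name
"search-exclude" are assumptions made for illustration; the field names
would have to match whatever your schema.xml defines.

```python
from html.parser import HTMLParser
import urllib.request
import xml.etree.ElementTree as ET

# Assumptions (not from the original message): the Solr update URL, the
# field names "id" and "text", and the meta tag name "search-exclude".
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"
EXCLUDE_META_NAME = "search-exclude"
SKIPPED_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects visible text and notes whether the exclude meta tag is present."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.excluded = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and dict(attrs).get("name") == EXCLUDE_META_NAME:
            self.excluded = True
        elif tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside <script> and <style>.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_add_message(doc_id, html):
    """Return an <add> message as bytes, or None if the page is excluded."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    if parser.excluded:
        return None
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    ET.SubElement(doc, "field", name="text").text = " ".join(parser.chunks)
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(body):
    """POST an XML update message to Solr (requires a running server)."""
    request = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

You would call html_to_add_message() for each file that rsync delivers
and pass every non-None result to post_to_solr(). Posting a <commit/>
message the same way afterwards makes the added documents searchable.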
Erik
On Apr 26, 2006, at 4:27 AM, David Trattnig wrote:
Hello!
I'd like to set up/develop a search server. I thought I would use
Lucene, then I read about Solr, so I did the Solr tutorial. At first I
was really happy about the features it adds to Lucene, but now I have
noticed that Solr can only index XML files. Or am I completely wrong?
What should I use for the following situation:
1. Copy HTML-files to the Live-Server (via RSync)
2. Index them by the search engine
3. Exclude some "tagged" files (these files for example would have a
specific meta-data-tag)
4. Exclude HTML-tags and other unworthy stuff
How much development work would that be with Lucene or Solr (if it is
possible at all)?
Any help would be appreciated!
Thx in advance,
david