David,

Solr doesn't index XML files; rather, XML is the wrapper around the text that gets indexed. The document structure (the fields) is defined in schema.xml, and the field text to be indexed is sent to Solr wrapped in an XML request.
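
To make the "wrapper" idea concrete, here is a minimal Python sketch that builds such a request body. The field names "id" and "text" are assumptions for illustration only; the real names have to match whatever your schema.xml defines.

```python
import xml.etree.ElementTree as ET


def build_add_message(fields):
    """Wrap a dict of field name -> text in an <add><doc> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        # Each value becomes <field name="...">text</field>; ElementTree
        # escapes characters such as & and < in the text for us.
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


# "id" and "text" are hypothetical field names, not taken from a real schema.
print(build_add_message({"id": "docs/index.html", "text": "Hello & welcome"}))
```

The only XML involved is this envelope; the content inside each field is plain text as far as Solr is concerned.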

For your scenario, you would need to write code that parses the HTML as desired, taking into account any exclude rules, wraps the text to be indexed (along with any metadata such as the HTML filename or URL) in XML, and POSTs it to Solr using the XML structure described here:

        <http://wiki.apache.org/solr/UpdateXmlMessages>

The XML request body is just a carrier of the data in a structured way, nothing more.
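
As a rough sketch of that pipeline, assuming Python and only its standard library: the update URL, the "noindex" meta marker used as the exclude rule, and the field names "id" and "text" are all assumptions for illustration, not anything Solr prescribes.

```python
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SOLR_UPDATE_URL = "http://localhost:8983/solr/update"  # assumed; adjust to your install
EXCLUDE_META = "noindex"  # hypothetical marker: <meta name="noindex">


class TextExtractor(HTMLParser):
    """Collects visible text and records whether the exclude marker is present."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.excluded = False
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "meta" and dict(attrs).get("name") == EXCLUDE_META:
            self.excluded = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def html_to_add_message(doc_id, html):
    """Return an <add> message (bytes) for the page, or None if it is excluded."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    if parser.excluded:
        return None
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    ET.SubElement(doc, "field", name="text").text = " ".join(parser.chunks)
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(body, url=SOLR_UPDATE_URL):
    """POST an XML update message (bytes) to Solr and return the HTTP status."""
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "text/xml; charset=utf-8"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

You would call html_to_add_message() for each file that rsync delivered, skip the ones that come back as None, pass the rest to post_to_solr(), and finish with post_to_solr(b"<commit/>") so the new documents become searchable.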

        Erik


On Apr 26, 2006, at 4:27 AM, David Trattnig wrote:

Hello!

I'd like to set up and develop a search server. I thought I would use Lucene, then I read about Solr, so I worked through the Solr tutorial. At first I was really happy about the features it adds to Lucene's functionality, but now I've noticed that Solr can index only XML files. Or am I completely wrong?

What should I use for the following situation:

1. Copy HTML-files to the Live-Server (via RSync)
2. Index them by the search engine
3. Exclude some "tagged" files (these files would have, for example, a
specific metadata tag)
4. Strip HTML tags and other unworthy stuff

How much development work would that be with Lucene or Solr (if it is
possible at all)?

Any help would be appreciated!

Thx in advance,
david
