Solr is indexing XML only?

2006-04-26 Thread David Trattnig
Hello!

I'd like to set up and develop a search server. I thought I would use Lucene,
then I read about Solr, so I worked through the Solr tutorial. At first I was
really happy about the features it adds on top of Lucene, but now I have
noticed that Solr can index only XML files. Or am I completely wrong?

What should I use for the following situation:

1. Copy HTML-files to the Live-Server (via RSync)
2. Index them by the search engine
3. Exclude some "tagged" files (these files for example would have a
specific meta-data-tag)
4. Exclude HTML-tags and other unworthy stuff

How much development work would that be with Lucene or Solr (if
possible)?

Any help would be appreciated!

Thx in advance,
david


Re: Solr is indexing XML only?

2006-04-27 Thread David Trattnig
Hi Chris,

thank you so much! Could you also explain to me how to use these two
Tokenizers?
And if there is a Tokenizer that throws away HTML markup, shouldn't it also
be possible to extend it to exclude additional content easily?

TIA,
david


> : will need to process that data you want to index (ie exclude certain
> : files and remove HTML tags) and put them into Solr's input format.
>
> minor clarification: Solr does ship with two Tokenizers that do a pretty
> good job of throwing away HTML markup, so you don't have to parse it
> yourself -- but they're still analyzers: all of the tokens they produce
> go into one field, so there's no way to use them to parse an entire HTML
> file and put the  in one field and  in another.
>
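As the quoted reply notes, splitting a page into separate fields has to happen before the data reaches Solr. A minimal sketch of that pre-processing step in Python, using only the standard library, might look like the following. The field names (`id`, `title`, `text`) and the `search-exclude` meta tag are assumptions made for illustration; they would have to match your own schema and whatever marker your pages actually carry.

```python
from html.parser import HTMLParser
from xml.etree import ElementTree as ET


class PageExtractor(HTMLParser):
    """Collects the title, the visible text, and an exclusion flag from one page."""

    SKIPPED = {"script", "style"}  # content of these elements is never indexed

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.text_parts = []
        self.excluded = False
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "meta" and attrs.get("name") == "search-exclude":
            # Hypothetical marker tag, not something Solr or Lucene defines.
            self.excluded = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_title:
            self.title_parts.append(data)
        else:
            self.text_parts.append(data)


def html_to_solr_doc(doc_id, html):
    """Returns an <add><doc>...</doc></add> string, or None for excluded pages."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    if parser.excluded:
        return None
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = {
        "id": doc_id,
        # Collapse runs of whitespace left behind by the removed markup.
        "title": " ".join("".join(parser.title_parts).split()),
        "text": " ".join(" ".join(parser.text_parts).split()),
    }
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")
```

Calling `html_to_solr_doc("about.html", page_source)` for each rsynced file yields either one XML message ready to send to Solr or `None` for a page that should be skipped.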
> : > 1. Copy HTML-files to the Live-Server (via RSync)
> : > 2. Index them by the search engine
> : > 3. Exclude some "tagged" files (these files for example would have a
> : > specific meta-data-tag)
> : > 4. Exclude HTML-tags and other unworthy stuff
> : >
> : > How much development work would that be with Lucene or Solr (if
> : > possible)?
>
> with the exception of item #4 in your list (which i addressed above),
> the amount of work necessary to process your files and extract the text
> you want to index will largely be the same regardless of whether you use
> Lucene or Solr -- what Solr provides is all of the "service" layer stuff,
> for example...
> * an HTTP based api so the file processing and the searching don't have
> to live on the same machine.
> * a schema that allows you to say "this text should be searchable, and
> this number should be sortable" without needing to hardcode those rules
> into your indexer .. you can change your mind later and only modify your
> schema, not your code.
> * a really smart caching system that knows when the data in your index
> has been modified.
>
> ...etc.
>
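To illustrate the first point in the quoted list, here is a rough sketch of how an indexer running on one machine could hand XML update messages to a Solr server on another over HTTP. The URL is an assumption (a local default install), and the helper names `build_update_request` and `send` are made up for this example.

```python
import urllib.request

# Assumption: update URL of a default local install; adjust host, port, and path.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_body, url=SOLR_UPDATE_URL):
    """Wraps an XML update message (<add>...</add>, <commit/>) in an HTTP POST."""
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def send(request, timeout=10):
    """Sends the request and returns the HTTP status code and response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

An indexer would typically call `send(build_update_request(add_xml))` once per document and then `send(build_update_request("<commit/>"))` so the new documents become searchable.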



-Hoss