: will need to process the data you want to index (i.e. exclude certain
: files and remove HTML tags) and put it into Solr's input format.

minor clarification: Solr does ship with two Tokenizers that do a pretty
good job of throwing away HTML markup, so you don't have to parse it
yourself -- but they are still analyzers: all of the tokens they produce
go into one field, and there's no way to use them to parse an entire HTML
file and put the <title> in one field and the <body> in another.
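To make that concrete, here is a minimal sketch of the kind of pre-processing you would do yourself before handing documents to Solr: pull the <title> and <body> out of an HTML file so they can become two separate fields. The class name, the regexes, and the field routing are all illustrative assumptions, not anything Solr ships with -- and regexes are a crude stand-in for a real HTML parser, fine for well-formed pages and fragile otherwise.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * Hypothetical pre-processor: extracts the <title> and <body> of an
     * HTML page so they can be indexed as separate Solr fields.
     */
    public class HtmlFieldExtractor {

        private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>",
                            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        private static final Pattern BODY =
            Pattern.compile("<body[^>]*>(.*?)</body>",
                            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        private static final Pattern TAG = Pattern.compile("<[^>]+>");

        /** Returns the first match of p in html, or "" if none. */
        static String extract(Pattern p, String html) {
            Matcher m = p.matcher(html);
            return m.find() ? m.group(1) : "";
        }

        /** Strips any remaining markup from an HTML fragment. */
        static String stripTags(String fragment) {
            return TAG.matcher(fragment).replaceAll(" ").trim();
        }

        public static void main(String[] args) {
            String html = "<html><head><title>My Page</title></head>"
                        + "<body><h1>Hello</h1><p>Some text.</p></body></html>";
            String title = stripTags(extract(TITLE, html));
            String body  = stripTags(extract(BODY, html));
            // these two strings would become the "title" and "body"
            // fields of the document you send to Solr
            System.out.println("title: " + title);
            System.out.println("body:  " + body);
        }
    }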

: > 1. Copy HTML-files to the Live-Server (via RSync)
: > 2. Index them by the search engine
: > 3. Exclude some "tagged" files (these files for example would have a
: > specific meta-data-tag)
: > 4. Exclude HTML-tags and other unworthy stuff
: >
: > How much work of development would that be with Lucene or Solr (If
: > possible)?

with the exception of item #4 in your list (which I addressed above),
the amount of work necessary to process your files and extract the text
you want to index will largely be the same regardless of whether you use
Lucene or Solr -- what Solr provides is all of the "service" layer stuff,
for example...
  * an HTTP-based API, so the file processing and the searching don't
have to live on the same machine (see the sketch after this list).
  * a schema that allows you to say "this text should be searchable, and
this number should be sortable" without needing to hardcode those rules
into your indexer ... you can change your mind later and only modify your
schema, not your code.
  * a really smart caching system that knows when the data in your index
has been modified.

...etc.
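As promised above, here is a minimal sketch of what using that HTTP layer looks like: the indexer POSTs an XML document to Solr's update handler and can therefore run on a completely different machine. It assumes a default Solr instance at localhost:8983 and a schema defining id/title/body fields -- adjust both to your own setup.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    /**
     * Hypothetical remote indexer: POSTs one document to Solr's HTTP
     * update handler.  URL and field names are placeholders for
     * whatever your own schema defines.
     */
    public class SolrHttpIndexer {

        public static void main(String[] args) throws Exception {
            String doc =
                "<add><doc>"
              + "<field name=\"id\">page-001</field>"
              + "<field name=\"title\">My Page</field>"
              + "<field name=\"body\">Some text.</field>"
              + "</doc></add>";

            URL url = new URL("http://localhost:8983/solr/update");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(doc.getBytes("UTF-8"));
            }
            // 200 means Solr accepted the add; a separate <commit/> POST
            // makes the document visible to searches
            System.out.println("HTTP status: " + conn.getResponseCode());
        }
    }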



-Hoss
