: will need to process that data you want to index (ie exclude certain
: files and remove HTML tags) and put them into Solr's input format.
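(for reference, "Solr's input format" there means an XML update message
POSTed to Solr's update URL over plain HTTP -- a minimal sketch, with
hypothetical field names that would depend entirely on your schema...)

  <add>
    <doc>
      <field name="id">page-001</field>
      <field name="title">Example Page</field>
      <field name="body">the extracted text of the page goes here...</field>
    </doc>
  </add>

(after adding docs you send a <commit/> message the same way to make
them searchable)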
minor clarification: Solr does ship with two Tokenizers that do a pretty
good job of throwing away HTML markup, so you don't have to parse it
yourself -- but they're still analyzers: all of the tokens they produce
go into one field, so there's no way to use them to parse an entire HTML
file and put the <title> in one field and the <body> in another. (see
the P.S. below for a sketch)

: > 1. Copy HTML-files to the Live-Server (via RSync)
: > 2. Index them by the search engine
: > 3. Exclude some "tagged" files (these files for example would have a
: > specific meta-data-tag)
: > 4. Exclude HTML-tags and other unworthy stuff
: >
: > How much work of development would that be with Lucene or Solr (If
: > possible)?

With the exception of item #4 in your list (which I addressed above),
the amount of work necessary to process your files and extract the text
you want to index will largely be the same regardless of whether you use
Lucene or Solr -- what Solr provides is all of the "service" layer
stuff, for example...

 * an HTTP based API, so the file processing and the searching don't
   have to live on the same machine.
 * a schema that allows you to say "this text should be searchable, and
   this number should be sortable" without needing to hardcode those
   rules into your indexer ... you can change your mind later and only
   modify your schema, not your code. (again, see the P.S. below)
 * a really smart caching system that knows when the data in your index
   has been modified.

...etc.

-Hoss
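P.S. to make the schema point (and the HTML-stripping point) concrete,
here's a rough sketch of the relevant bits of a schema.xml -- the field
names are made up, and the "sint" type assumes the sortable-int
definition from the example schema that ships with Solr, so check that
example for the exact names...

  <!-- an analyzer that strips HTML markup while tokenizing,
       so you don't have to remove tags before indexing -->
  <fieldType name="html_text" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- "this text should be searchable, this number should be
       sortable" lives here, not hardcoded in your indexer -->
  <field name="body" type="html_text" indexed="true" stored="true"/>
  <field name="rank" type="sint" indexed="true" stored="false"/>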