Hi Chris,

thank you so much! Could you also explain how to use these two Tokenizers?
And if there is a Tokenizer that throws away HTML markup, should it also be
possible to extend it so it excludes additional content easily?
TIA,
david

> : will need to process that data you want to index (ie exclude certain
> : files and remove HTML tags) and put them into Solr's input format.
>
> minor clarification: Solr does ship with two Tokenizers that do a pretty
> good job of throwing away HTML markup, so you don't have to parse it
> yourself -- but they're still analyzers, all of the tokens they produce
> go into one field, there's no way to use them to parse an entire HTML
> file and put the <title> in one field and <body> in another.
>
> : > 1. Copy HTML-files to the Live-Server (via RSync)
> : > 2. Index them by the search engine
> : > 3. Exclude some "tagged" files (these files for example would have a
> : >    specific meta-data-tag)
> : > 4. Exclude HTML-tags and other unworthy stuff
> : >
> : > How much work of development would that be with Lucene or Solr (If
> : > possible)?
>
> with the exception of item #4 in your list (which i addressed above),
> the amount of work necessary to process your files and extract the text
> you want to index will largely be the same regardless of whether you use
> Lucene or Solr -- what Solr provides is all of the "service" layer stuff,
> for example...
>
> * an HTTP based api so the file processing and the searching don't have
>   to live on the same machine.
> * a schema that allows you to say "this text should be searchable, and
>   this number should be sortable" without needing to hardcode those rules
>   into your indexer .. you can change your mind later and only modify your
>   schema, not your code.
> * a really smart caching system that knows when the data in your index
>   has been modified.
>
> ...etc.
>
> -Hoss
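
For reference, wiring one of those HTML-stripping Tokenizers into schema.xml
looks roughly like the sketch below. This is a minimal, untested sketch: the
factory class name is the one from the example schema that ships with Solr,
and the type/field names (html_text, title, body) are made up for
illustration.

  <!-- schema.xml: a field type whose analyzer strips HTML markup at
       analysis time (the stored value keeps the raw markup) -->
  <fieldType name="html_text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- fields using that type; splitting <title> and <body> into separate
       fields still has to happen in your own code before posting, as Hoss
       points out above -->
  <field name="title" type="html_text" indexed="true" stored="true"/>
  <field name="body"  type="html_text" indexed="true" stored="true"/>

Indexing then goes over the HTTP api Hoss mentions, by POSTing Solr's XML
add format to the /update handler; any HTML inside a field value has to be
XML-escaped or wrapped in CDATA, e.g.:

  <add>
    <doc>
      <field name="title">Some page title</field>
      <field name="body"><![CDATA[<p>raw html, stripped at analysis time</p>]]></field>
    </doc>
  </add>

followed by a <commit/> to make the new documents searchable.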