Hi Chris,

Thank you so much! Could you also explain how to use these two
Tokenizers?
And if there is a Tokenizer that throws away HTML markup, should it also be
possible to extend it so that additional content can easily be excluded?

TIA,
david


> : will need to process the data you want to index (i.e. exclude certain
> : files and remove HTML tags) and put them into Solr's input format.
>
> minor clarification: Solr does ship with two Tokenizers that do a pretty
> good job of throwing away HTML markup, so you don't have to parse it
> yourself -- but they're still analyzers: all of the tokens they produce
> go into one field. There's no way to use them to parse an entire HTML
> file and put the <title> in one field and <body> in another.
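
For reference, here is my untested guess at how one of those two would be
wired up in schema.xml, with an extra filter chained on as an attempt at
excluding additional content (the fieldType name "html" and the
stopwords.txt file are just my own examples):

    <fieldType name="html" class="solr.TextField">
      <analyzer>
        <!-- strip the HTML markup, then split the remaining text on whitespace -->
        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
        <!-- lowercase tokens so searching is case-insensitive -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- drop any extra terms I list in stopwords.txt -->
        <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
      </analyzer>
    </fieldType>

Is that roughly the right idea?
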
>
> : > 1. Copy HTML-files to the Live-Server (via RSync)
> : > 2. Index them by the search engine
> : > 3. Exclude some "tagged" files (these files for example would have a
> : > specific meta-data-tag)
> : > 4. Exclude HTML-tags and other unworthy stuff
> : >
> : > How much work of development would that be with Lucene or Solr (If
> : > possible)?
>
> With the exception of item #4 in your list (which I addressed above),
> the amount of work necessary to process your files and extract the text
> you want to index will largely be the same regardless of whether you use
> Lucene or Solr -- what Solr provides is all of the "service" layer stuff,
> for example...
> * an HTTP-based API, so the file processing and the searching don't have
> to live on the same machine.
> * a schema that allows you to say "this text should be searchable, and
> this number should be sortable" without needing to hardcode those rules
> into your indexer ... you can change your mind later and only modify your
> schema, not your code.
> * a really smart caching system that knows when the data in your index
> has been modified.
>
> ...etc.
>
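
Just to make sure I follow the HTTP API and schema points: I would declare
my fields in schema.xml, something like this (again untested, and using the
"html" fieldType from my sketch above):

    <field name="title" type="html" indexed="true" stored="true"/>
    <field name="body"  type="html" indexed="true" stored="true"/>

...and then my file-processing code, which could live on a different
machine, would POST an XML message like this to Solr's /update handler:

    <add>
      <doc>
        <field name="title">Example page title</field>
        <field name="body">text extracted from the page body ...</field>
      </doc>
    </add>
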



-Hoss
