On Dec 12, 2007, at 2:50 AM, Nuno Leitao wrote:
FAST uses two pipelines - an ingestion pipeline (for document
feeding) and a query pipeline, both of which are fully programmable
(i.e., you can customize them fully). At ingestion time you typically prepare
documents for indexing (tokenize, character normalize, lemmatize,
clean up text, perform entity extraction for facets, perform static
boosting for certain documents, etc.), while at query time you can
expand synonyms, and do other general query side tasks (not unlike
Solr).
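To make the two-pipeline idea concrete, here is a minimal sketch in Python. It is not FAST's actual API: the stage functions, the dict-based document shape and the SYNONYMS table are all hypothetical, chosen only to show how an ingestion pipeline and a query pipeline can share stages while the query side adds its own (synonym expansion here).

```python
import re
import unicodedata

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"laptop": ["notebook"]}

def normalize(item):
    # Character normalization: decompose, drop combining marks, lowercase.
    text = unicodedata.normalize("NFKD", item["text"])
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return {**item, "text": text.lower()}

def tokenize(item):
    # Very naive tokenizer: runs of word characters.
    return {**item, "tokens": re.findall(r"\w+", item["text"])}

def expand_synonyms(item):
    # Query-side stage: append known synonyms after each term.
    terms = []
    for token in item["tokens"]:
        terms.append(token)
        terms.extend(SYNONYMS.get(token, []))
    return {**item, "tokens": terms}

def run_pipeline(stages, item):
    # A pipeline is just an ordered list of stages applied in turn.
    for stage in stages:
        item = stage(item)
    return item

# Assumed stage ordering; a real deployment would have many more stages
# (lemmatization, entity extraction, static boosting, ...).
INGESTION_PIPELINE = [normalize, tokenize]
QUERY_PIPELINE = [normalize, tokenize, expand_synonyms]

doc = run_pipeline(INGESTION_PIPELINE, {"text": "Café Laptop"})
query = run_pipeline(QUERY_PIPELINE, {"text": "Laptop"})
```

Customizing a pipeline in this sketch means inserting, removing or reordering entries in the stage lists.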
Horizontal scalability means the ability to cluster your search
engine across a large number of servers, so you can scale up on the
number of documents, queries, crawls, etc.
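As a rough illustration of what clustering across servers involves, the sketch below partitions documents across nodes by hashing the document id, then merges per-node results at query time. This is one common approach and an assumption on my part, not a description of how FAST does it; shard_for and merge_top_k are hypothetical names.

```python
import heapq
import zlib

def shard_for(doc_id, num_shards):
    # Stable hash so the same document always lands on the same node.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def merge_top_k(per_shard_hits, k):
    # Each shard returns its own (score, doc_id) hits; the front end
    # merges them and keeps the k best overall, highest score first.
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    return heapq.nlargest(k, all_hits)

# Assumed example data: two shards answering the same query.
shard_results = [
    [(0.9, "doc-a"), (0.2, "doc-b")],
    [(0.5, "doc-c")],
]
top = merge_top_k(shard_results, 2)
```

Adding nodes raises the number of documents the cluster can hold; replicating each shard is the usual way to raise query throughput.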
There are FAST deployments out there which run on dozens, in some
cases hundreds, of nodes, serving multi-terabyte indexes and
achieving hundreds of queries per second.
Then again, if your requirements are relatively simple then Lucene
might do the job just fine.
Hope this helps.
With Fast, you will also get things like:
- categorization
- clustering
- more flexible collapsing / grouping
- more scalable facets (navigators) - at least for multivalued fields
- gigabytes of poorly documented software
- operations from hell
- a huge number of bugs
- high bills, both for software and hardware.
As for linguistic features (named entity extraction, dictionary-based
lemmatization and so on) and things like categorization and clustering,
these should not be expected to work too well unless you put a
huge amount of work into them, and some of the features are really
primitive.
To sum up, if Solr meets your needs I would highly recommend Solr. If
you need some additional features and have the knowledge, integrate
other products with Solr. If you really need the scalability, go for
Fast or some other commercial software.
As for document preprocessing and connectors for Solr, if you need it,
you could have a look at OpenPipe, http://openpipe.berlios.de/ (not
yet announced).
Svein