Re: Solr is indexing XML only?

2006-04-27 Thread David Trattnig
Hi Chris,

thank you so much! Could you also explain how to use these two
Tokenizers?
But if there is a Tokenizer that throws away HTML markup, it should also be
possible to extend it to exclude additional content easily?

TIA,
david


> : will need to process that data you want to index (ie exclude certain
> : files and remove HTML tags) and put them into Solr's input format.
>
> minor clarification: Solr does ship with two Tokenizers that do a pretty
> good job of throwing away HTML markup, so you don't have to parse it
> yourself -- but they're still analyzers: all of the tokens they produce
> go into one field, so there's no way to use them to parse an entire HTML
> file and put the <title> in one field and the <body> in another.
>
> : > 1. Copy HTML-files to the Live-Server (via RSync)
> : > 2. Index them by the search engine
> : > 3. Exclude some "tagged" files (these files for example would have a
> : > specific meta-data-tag)
> : > 4. Exclude HTML-tags and other unworthy stuff
> : >
> : > How much work of development would that be with Lucene or Solr (If
> : > possible)?
>
> with the exception of item #4 in your list (which i addressed above),
> the amount of work necessary to process your files and extract the text
> you want to index will largely be the same regardless of whether you use
> Lucene or Solr -- what Solr provides is all of the "service" layer stuff,
> for example...
> * an HTTP based api so the file processing and the searching don't have
> to live on the same machine.
> * a schema that allows you to say "this text should be searchable, and
> this number should be sortable" without needing to hardcode those rules
> into your indexer .. you can change your mind later and only modify your
> schema, not your code.
> * a really smart caching system that knows when the data in your index
> has been modified.
>
> ...etc.
>



-Hoss


Re: Solr is indexing XML only?

2006-04-27 Thread Yonik Seeley
On 4/27/06, David Trattnig <[EMAIL PROTECTED]> wrote:
> thank you so much! Could you also explain how to use these two
> Tokenizers?

Here's the HTMLStrip tokenizer description:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e

Read through the Solr example schema.xml and it should hopefully be
apparent how to use it.
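
For example, a field type along these lines (just a sketch in the style of
the example schema; the "text_html" name and the "body" field are only
illustrative):

  <!-- in the <types> section: strip HTML markup at index time -->
  <fieldtype name="text_html" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldtype>

  <!-- in the <fields> section: index HTML content with that type -->
  <field name="body" type="text_html" indexed="true" stored="false"/>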

> But if there is a Tokenizer that throws away HTML markup, it should also be
> possible to extend it to exclude additional content easily?

If the additional content has nothing to do with HTML, it should be
developed as a separate TokenFilter.  Filters are meant to be chained
together to gain more configuration flexibility.
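
The skeleton of such a filter is small.  Here's an untested sketch against
the Lucene TokenFilter API (the filter itself -- dropping tokens over a
maximum length -- is made up purely for illustration):

  import java.io.IOException;

  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  /** Example filter: drops tokens longer than a given length. */
  public class MaxLengthFilter extends TokenFilter {
    private final int maxLen;

    public MaxLengthFilter(TokenStream in, int maxLen) {
      super(in);
      this.maxLen = maxLen;
    }

    public Token next() throws IOException {
      // pull tokens from upstream, skipping the ones we want to exclude
      for (Token t = input.next(); t != null; t = input.next()) {
        if (t.termText().length() <= maxLen) {
          return t;
        }
      }
      return null; // end of stream
    }
  }

Chained after the HTMLStrip tokenizer in your analyzer, the HTML handling
and your custom exclusions stay independent of each other.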

-Yonik


Re: distributing indexes via solr

2006-04-27 Thread Chris Hostetter

: Suppose I want the xml input submitted to solr to be distributed among a
: fixed set of partitions; basically, something like round-robin among each of
: them, so that each directory has a relatively equal size in terms of # of
: segments.  Is there an easy way to do this?  I took a quick look at the solr

I'm not sure if i'm understanding your question:  What would the
motivation be for doing something like this? ... what would the usage be
like from a search perspective once you had built up these directories?


-Hoss



Re: distributing indexes via solr

2006-04-27 Thread Johnny Monsod
So the thinking here was to divide the total indexed data among N partitions
since the amount of data will be massive.  Each partition would probably use
a separate physical disk (or disks), and then for searching I could use
ParallelMultiSearcher to dispatch searches to each of these partitions as a
separate Searchable.  I know the Lucene docs mention that there is really
not much gain in using ParallelMultiSearcher versus MultiSearcher (which
searches a bunch of Searchables sequentially) against a single disk, so if
we had separate physical disks, the parallel version might be of more
tangible benefit.
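
Concretely, I was picturing something like the following (index paths and
the query itself are hypothetical):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.ParallelMultiSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.Searchable;

  public class PartitionedSearch {
    public static void main(String[] args) throws Exception {
      // one searcher per partition, each index directory on its own disk
      Searchable[] partitions = new Searchable[] {
          new IndexSearcher("/disk1/index"),
          new IndexSearcher("/disk2/index"),
          new IndexSearcher("/disk3/index"),
      };
      // searches all partitions, using a separate thread per partition
      ParallelMultiSearcher searcher = new ParallelMultiSearcher(partitions);

      Query q = new QueryParser("body", new StandardAnalyzer())
          .parse("quarterly report");
      Hits hits = searcher.search(q);
      System.out.println(hits.length() + " matches across all partitions");
    }
  }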

-John

On 4/27/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
>
> : Suppose I want the xml input submitted to solr to be distributed among a
> : fixed set of partitions; basically, something like round-robin among each of
> : them, so that each directory has a relatively equal size in terms of # of
> : segments.  Is there an easy way to do this?  I took a quick look at the solr
>
> I'm not sure if i'm understanding your question:  What would the
> motivation be for doing something like this? ... what would the usage be
> like from a search perspective once you had built up these directories?
>
>
> -Hoss
>
>


Re: distributing indexes via solr

2006-04-27 Thread Yonik Seeley
If you are after faster disks, it might just be easier to use RAID.
If you want real scalability with a single-index view, you want
multiple machines (which Solr doesn't support yet).

If you can partition your data such that queries can be run against
single partitions, then use separate Solr servers and put different
parts of the collection on each server.  Then make a smart front-end
that queries the correct collection based on something in the data.
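
The front-end itself can be trivial.  A sketch (the host naming scheme and
the per-year partitioning here are purely hypothetical):

  import java.net.URLEncoder;

  /** Hypothetical front-end: each year's data lives on its own Solr server. */
  public class PartitionRouter {
    // assumed host layout -- adjust to however the partitions are deployed
    private static String hostFor(int year) {
      return "http://solr-" + year + ".example.com:8983/solr/select";
    }

    public static String queryUrl(int year, String query) throws Exception {
      return hostFor(year) + "?q=" + URLEncoder.encode(query, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
      // a query against 2006 data goes to the 2006 server
      System.out.println(queryUrl(2006, "subject:meeting"));
    }
  }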

> So the thinking here was to divide the total indexed data among N partitions
> since the amount of data will be massive.

How much data?  (number of docs, number of indexed fields per doc,
size of all indexed fields, etc)

-Yonik

On 4/27/06, Johnny Monsod <[EMAIL PROTECTED]> wrote:
> So the thinking here was to divide the total indexed data among N partitions
> since the amount of data will be massive.  Each partition would probably use
> a separate physical disk (or disks), and then for searching I could use
> ParallelMultiSearcher to dispatch searches to each of these partitions as a
> separate Searchable.  I know the Lucene docs mention that there is really
> not much gain in using ParallelMultiSearcher versus MultiSearcher (which
> searches a bunch of Searchables sequentially) against a single disk, so if
> we had separate physical disks, the parallel version might be of more
> tangible benefit.
>
> -John
>
> On 4/27/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> >
> >
> > : Suppose I want the xml input submitted to solr to be distributed among a
> > : fixed set of partitions; basically, something like round-robin among each of
> > : them, so that each directory has a relatively equal size in terms of # of
> > : segments.  Is there an easy way to do this?  I took a quick look at the solr
> >
> > I'm not sure if i'm understanding your question:  What would the
> > motivation be for doing something like this? ... what would the usage be
> > like from a search perspective once you had built up these directories?
> >
> >
> > -Hoss


Re: distributing indexes via solr

2006-04-27 Thread Johnny Monsod
Each indexed document will represent an email, consisting of the typical
fields to/from/subject/cc/bcc/body/attachment/mailheaders, where the body and
attachment texts will be indexed and tokenized but not stored.  It's
difficult to give an estimate of the # of such documents, other than to say
it would be similar to what a small to midsize corp would generate.
The system would have to cover all the email generated over a certain date
range in the past (to start out), then add incremental updates on a daily
basis moving forward.
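
In schema.xml terms I picture the fields looking roughly like this (the
exact names and types are just illustrative):

  <field name="from"       type="string" indexed="true" stored="true"/>
  <field name="to"         type="string" indexed="true" stored="true"/>
  <field name="subject"    type="text"   indexed="true" stored="true"/>
  <field name="body"       type="text"   indexed="true" stored="false"/>
  <field name="attachment" type="text"   indexed="true" stored="false"/>

where the "text" type is tokenized and "string" is indexed verbatim.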

-John

On 4/27/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
> If you are after faster disks, it might just be easier to use RAID.
> If you want real scalability with a single-index view, you want
> multiple machines (which Solr doesn't support yet).
>
> If you can partition your data such that queries can be run against
> single partitions, then use separate Solr servers and put different
> parts of the collection on each server.  Then make a smart front-end
> that queries the correct collection based on something in the data.
>
> > So the thinking here was to divide the total indexed data among N partitions
> > since the amount of data will be massive.
>
> How much data?  (number of docs, number of indexed fields per doc,
> size of all indexed fields, etc)
>
> -Yonik
>
> On 4/27/06, Johnny Monsod <[EMAIL PROTECTED]> wrote:
> > So the thinking here was to divide the total indexed data among N partitions
> > since the amount of data will be massive.  Each partition would probably use
> > a separate physical disk (or disks), and then for searching I could use
> > ParallelMultiSearcher to dispatch searches to each of these partitions as a
> > separate Searchable.  I know the Lucene docs mention that there is really
> > not much gain in using ParallelMultiSearcher versus MultiSearcher (which
> > searches a bunch of Searchables sequentially) against a single disk, so if
> > we had separate physical disks, the parallel version might be of more
> > tangible benefit.
> >
> > -John
> >
> > On 4/27/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> > >
> > >
> > > : Suppose I want the xml input submitted to solr to be distributed among a
> > > : fixed set of partitions; basically, something like round-robin among each of
> > > : them, so that each directory has a relatively equal size in terms of # of
> > > : segments.  Is there an easy way to do this?  I took a quick look at the solr
> > >
> > > I'm not sure if i'm understanding your question:  What would the
> > > motivation be for doing something like this? ... what would the usage be
> > > like from a search perspective once you had built up these directories?
> > >
> > >
> > > -Hoss
>