Yonik -

> So the number of filters is equal to the number of sites?  
> How many sites are there?

Today: When new customers join, we generally don't do anything
special. We currently have roughly 400 customers, most of whom have
one site each, though a few have as many as 50 sites. In total, we
probably filter data in about 500 unique ways before we actually
search on the query string entered by the user. Of those 500 filters,
there are approximately 50 for which we would prefer to use a
dedicated index. I don't have 100% accurate numbers, but these should
be in the ballpark.

Future: We are planning to expand on the concepts we've developed to
integrate Lucene and hopefully SOLR into other applications. One in
particular:

  * Provides a core data set of 100K records

  * Allows each of 1,000 customers to create their own view of that
    data

  * In theory, our overall dataset may contain up to 100K * 1,000
    records (100M), but we know that at any given time, only 100K
    records should be made available to any one customer.

We did rough tests and found that creating multiple indexes performed
better at run time, especially as the logic to determine what results
should be presented to which customer became more complex.


> Support for indexing from CSV files as well as simple pulling from a
> database is on our "todo" list: http://wiki.apache.org/solr/TaskList

I had seen this on the TODO list. I'm offering to contribute this
piece once we have an idea of the overall fit...
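
To give a concrete idea of the contribution, here is a rough sketch
of the kind of CSV loader I have in mind. It uses plain Lucene
(IndexWriter/Document) rather than anything SOLR-specific, and the
column layout and field names below are just placeholders:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  // Rough sketch: bulk-load a simple CSV file into a Lucene index.
  public class CsvIndexer {
      public static void main(String[] args) throws Exception {
          // args[0]: CSV file with lines of the form
          //          user_id,email_address,site_id (no embedded commas)
          // args[1]: directory in which to build the index
          IndexWriter writer =
              new IndexWriter(args[1], new StandardAnalyzer(), true);
          BufferedReader in = new BufferedReader(new FileReader(args[0]));
          String line;
          while ((line = in.readLine()) != null) {
              String[] cols = line.split(",");
              Document doc = new Document();
              doc.add(new Field("user_id", cols[0],
                      Field.Store.YES, Field.Index.UN_TOKENIZED));
              doc.add(new Field("email", cols[1],
                      Field.Store.YES, Field.Index.TOKENIZED));
              doc.add(new Field("site_id", cols[2],
                      Field.Store.YES, Field.Index.UN_TOKENIZED));
              writer.addDocument(doc);
          }
          in.close();
          writer.optimize();
          writer.close();
      }
  }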


> How would one identify what index (or SolrCore) an update is
> targeted to?

This is a good question. I think the query interface itself would
have to be extended. That is, a new parameter would be introduced
that identifies the objectType you would like to search or update. If
omitted, the default object type would be used. In our current
system, we set the objectType to the name of the database table and
can thus issue queries like:

  search.jsp?tableName=users&queryString=email:michael.bryzek
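
If SOLR went this route, the server side might keep something like a
small registry keyed by objectType. The sketch below is purely
hypothetical (the class and parameter names are mine, not existing
SOLR code), but it shows the dispatch I have in mind:

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.lucene.search.IndexSearcher;

  // Hypothetical registry: maps an objectType name to the searcher
  // for its physical index, falling back to a default type when the
  // request does not specify one.
  public class ObjectTypeRegistry {
      private final Map<String, IndexSearcher> searchers =
          new HashMap<String, IndexSearcher>();
      private final String defaultType;

      public ObjectTypeRegistry(String defaultType) {
          this.defaultType = defaultType;
      }

      public void register(String objectType, IndexSearcher searcher) {
          searchers.put(objectType, searcher);
      }

      public IndexSearcher lookup(String objectType) {
          String key = (objectType == null) ? defaultType : objectType;
          return searchers.get(key);
      }
  }

An update would carry the same parameter, so the handler could route
the document to the matching index (or reject unknown object types).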


> What is the relationship between the multiple indices... do queries
> ever go across multiple indices, or would there be an "objectType"
> parameter passed in as part of the query?

In our case, there is no relationship between the multiple indices,
but I do see value here (more on this below). Specifically, we have a
one-to-one mapping between a database table and a Lucene index and
have not needed to search across tables.

I think the value of the objectType is this true independence: if you
are indexing similar data, distinguish it with a field on your data;
if your data sets are truly different, use a different object type.
 

> What is the purpose of multiple indices... is it so search results
> are always restricted to a single site, but it's not practical to
> have that many Solr instances?  It looks like the indices are
> partitioned along the lines of object type, and not site-id though.

Your questions and comments are good. Thinking about them has helped
me clarify exactly what we're trying to accomplish. I think it boils
down to these goals:

  a) Minimize the number of instances of SOLR. If I have 3 web
     applications, each with 12 database tables to index, I don't want
     to run 36 JVMs. I think introducing an objectType would address
     this.

  b) Optimize retrieval when I have some knowledge that I can use to
     define partitions of data. This may actually be more appropriate
     for Lucene itself, but I see SOLR pretty well positioned to
     address it. One approach is to introduce a "partitionField" that
     SOLR would use to figure out whether a new index is required. For
     each unique value of the partitionField, we create a separate
     physical index. If the query does NOT contain a term for the
     partitionField, we use a MultiReader to search across all
     indexes. If the query DOES contain such a term, we search only
     the matching partitions (see the sketch below).

     We have tried using cached bitsets to implement this sort of
     approach, but have found that when one large document set is
     partitioned into much smaller sets (e.g. 1-10% of the total
     document space), creating separate indexes gives us a much
     bigger performance boost.
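
To make (b) concrete, here is a rough sketch of the partition logic
in plain Lucene. The directory layout, the choice of site_id as the
partitionField, and the class names are all placeholders, not a
proposal for how SOLR should lay things out:

  import java.io.File;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.MultiReader;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;

  public class PartitionedSearch {
      // One physical index per partitionField value, e.g.
      // /indexes/users/site_42, /indexes/users/site_43, ...
      private static final String ROOT = "/indexes/users";

      // If the caller knows the partition (a site_id), open only that
      // index; otherwise fold every partition into one MultiReader.
      static IndexSearcher searcherFor(String siteId) throws Exception {
          if (siteId != null) {
              return new IndexSearcher(
                  IndexReader.open(ROOT + "/site_" + siteId));
          }
          File[] dirs = new File(ROOT).listFiles();
          IndexReader[] readers = new IndexReader[dirs.length];
          for (int i = 0; i < dirs.length; i++) {
              readers[i] = IndexReader.open(dirs[i].getPath());
          }
          return new IndexSearcher(new MultiReader(readers));
      }

      public static void main(String[] args) throws Exception {
          IndexSearcher searcher = searcherFor("42");  // known partition
          Query q = new QueryParser("email", new StandardAnalyzer())
                        .parse("michael.bryzek");
          Hits hits = searcher.search(q);
          System.out.println(hits.length() + " matches");
      }
  }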

-Mike


-----Original Message-----
From:   Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent:   Wed 4/12/06 11:54 AM
To:     solr-user@lucene.apache.org
Cc:     
Subject:        Re: Interest in Extending SOLR

Welcome Michael,

On 4/12/06, Bryzek.Michael <[EMAIL PROTECTED]> wrote:
>   * Integrated support for partitioning - database tables can be
>     partitioned for scalability reasons. The most common scenario for
>     us is to partition off data for our largest customers. For
>     example, imagine a users table:
>
>      * user_id
>      * email_address
>      * site_id
>
>     where site_id refers to the customer to whom the user
>     belongs. Some sites aggregate data... i.e. one of our customers
>     may have 100 sites. When indexing, we create a separate index to
>     store only data for a given site. This precomputes one of our more
>     expensive computations for search - a filter for all users that
>     belong to a given site.

So the number of filters is equal to the number of sites?  How many
sites are there?

>   * Decoupled infrastructure - we wanted the ability to fully scale
>     our search application independent of our database application

That makes total sense... we do the same thing.

>   * High speed indexing - we initially moved data from the database to
>     Lucene via XML documents. We found that to index even 100k
>     documents, it was much faster to move the data in CSV files
>     (smaller files, less intensive processing).

Support for indexing from CSV files as well as simple pulling from a
database is on our "todo" list: http://wiki.apache.org/solr/TaskList

> IDEAS:
>
> Looking through SOLR, I've identified the following main categories of
> change. I would love to hear comments and feedback from this group.

It would be nice to make any changes as general as possible, while
still solving your particular problem.

I think I understand many of the internal changes you outlined, but
I'm not sure yet exactly what problem you are trying to solve, and how
the multiple indices will be used.
- How would one identify what index (or SolrCore) an update is targeted to?
- What is the relationship between the multiple indices... do queries
ever go across multiple indices, or would there be an "objectType"
parameter passed in as part of the query?
- What is the purpose of multiple indices... is it so search results
are always restricted to a single site, but it's not practical to have
that many Solr instances?  It looks like the indices are partitioned
along the lines of object type, and not site-id though.

-Yonik

