Mike, I am currently evaluating different search engine technologies (especially open-source ones), and this is very interesting to me for the following reasons:
Our data is much like yours in that we have different types of data (abstracts, fulltext, music, etc.), which eventually fall under different "databases" in our subscription/offering model. So, the ability to have different indexes (at the database level and at the type level) would be the ideal solution. The only difference, compared to your needs, is that it would be a requirement to be able to search across different indexes (searching between "databases"), while also being able to search only within types. That is, with your proposal, objectType could be "type" or "database." The point here isn't that a second parameter would be nice to have; being able to search across indexes would be a necessity. I am truly interested in how this all works out, and hope to get myself involved in Solr technology.

On 4/12/06, Bryzek.Michael <[EMAIL PROTECTED]> wrote:
>
> Yonik -
>
> > So the number of filters is equal to the number of sites?
> > How many sites are there?
>
> Today: When new customers join, we generally don't do anything special.
> Currently we have roughly 400 customers, most of which have one site
> each. Note that a few customers have as many as 50 sites. In total, we
> probably filter data in 500 unique ways before we actually search on the
> query string entered by the user. Of the 500 unique ways in which we
> filter data, there are approximately 50 for which we would prefer to use
> a unique index. I don't have 100% accurate numbers, but these should be
> in the ballpark.
>
> Future: We are planning to expand on the concepts we've developed to
> integrate Lucene and hopefully SOLR into other applications. One in
> particular:
>
> * Provides a core data set of 100K records
>
> * Allows each of 1,000 customers to create their own view of that data
>
> * In theory, our overall dataset may contain up to 100K * 1,000 records
>   (100M), but we know that at any given time, only 100K records should
>   be made available.
>
> We did rough tests and found that creating multiple indexes performed
> better at run time, especially as the logic to determine which results
> should be presented to which customer became more complex.
>
> > Support for indexing from CSV files as well as simple pulling from a
> > database is on our "todo" list: http://wiki.apache.org/solr/TaskList
>
> I had seen this on the TODO list. I'm offering to contribute this piece
> when we've got an idea of overall fit...
>
> > How would one identify what index (or SolrCore) an update is
> > targeted to?
>
> This is a good question. I think the query interface itself would have
> to be extended. That is, a new parameter would have to be introduced
> which identifies the objectType you would like to search/update. If
> omitted, the default object type would be used. In our current system,
> we set the objectType to the name of the database table and thus can
> issue queries like:
>
> search.jsp?tableName=users&queryString=email:michael.bryzek
>
> > What is the relationship between the multiple indices... do queries
> > ever go across multiple indices, or would there be an "objectType"
> > parameter passed in as part of the query?
>
> In our case, there is no relationship between the multiple indices, but
> I do see value here (more on this below). In our specific case, we have
> a one-to-one mapping between a database table and a Lucene index and
> have not needed to search across tables.
>
> I think the value of the objectType is this true independence. If you
> are indexing similar data, use a field on your data. If your data sets
> are truly different, use a different object type.
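To make sure I understand the objectType idea (and how it might cover my "database" case as well), here is a rough sketch of how I picture it sitting on top of plain Lucene. None of this is existing Solr code; the class, the directory layout, and the method names are invented for illustration, written against the older IndexSearcher/MultiSearcher APIs:

    // One physical Lucene index per objectType ("users", "orders", ...),
    // plus a cross-index search when no objectType is supplied.
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searchable;
    import org.apache.lucene.search.Searcher;

    public class ObjectTypeSearcher {
        // e.g. "users" -> searcher over /indexes/users
        private final Map searchers = new HashMap();

        public void register(String objectType, String indexDir) throws IOException {
            searchers.put(objectType, new IndexSearcher(indexDir));
        }

        /** Search one objectType's index, or all of them when objectType is null. */
        public Hits search(String objectType, String queryString) throws Exception {
            Searcher searcher;
            if (objectType != null) {
                // "within a type": hit only that index
                searcher = (Searcher) searchers.get(objectType);
            } else {
                // "between databases": fan out across every registered index
                Searchable[] all = (Searchable[])
                    searchers.values().toArray(new Searchable[searchers.size()]);
                searcher = new MultiSearcher(all);
            }
            Query q = new QueryParser("text", new StandardAnalyzer()).parse(queryString);
            return searcher.search(q);
        }
    }

The point being that a single parameter could let one instance route to a single index or fan out across all of them, which is exactly the "search between indexes" behavior I need.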
> > What is the purpose of multiple indices... is it so search results
> > are always restricted to a single site, but it's not practical to
> > have that many Solr instances? It looks like the indices are
> > partitioned along the lines of object type, and not site-id though.
>
> Your questions and comments are good. Thinking about it has helped me to
> clarify what exactly we're trying to accomplish. I think it boils down
> to these goals:
>
> a) Minimize the number of instances of SOLR. If I have 3 web
>    applications, each with 12 database tables to index, I don't want to
>    run 36 JVMs. I think introducing an objectType would address this.
>
> b) Optimize retrieval when I have some knowledge that I can use to
>    define partitions of data. This may actually be more appropriate for
>    Lucene itself, but I see SOLR pretty well positioned to address it.
>    One approach is to introduce a "partitionField" that SOLR would use
>    to figure out whether a new index is required. For each unique value
>    of the partitionField, we create a separate physical index. If the
>    query does NOT contain a term for the partitionField, we use a multi
>    reader to search across all indexes. If the query DOES contain the
>    term, we only search across those partitions.
>
> We have tried using cached bitsets to implement this sort of approach,
> but have found that when we have one large document set partitioned into
> much smaller sets (e.g. 1-10% of the total document space), creating
> separate indexes gives us a much higher boost in performance.
>
> -Mike
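The partitionField idea in (b) is the part I would lean on most heavily. Just to check my reading of it, here is a minimal sketch of the routing; the directory layout and helper names are invented, and nothing here is Solr-specific:

    // One physical index per distinct partition value; a MultiReader across
    // all of them when the query carries no partition term.
    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;

    public class PartitionedIndex {
        private final File root;   // e.g. /indexes/users, one subdirectory per partition

        public PartitionedIndex(File root) {
            this.root = root;
        }

        /** Query names a partition (e.g. site_id:42): open only that index. */
        public IndexSearcher partitionSearcher(String partitionValue) throws IOException {
            return new IndexSearcher(new File(root, partitionValue).getPath());
        }

        /** Query has no partition term: search every partition at once. */
        public IndexSearcher allPartitionsSearcher() throws IOException {
            File[] dirs = root.listFiles();
            IndexReader[] readers = new IndexReader[dirs.length];
            for (int i = 0; i < dirs.length; i++) {
                readers[i] = IndexReader.open(dirs[i].getPath());
            }
            return new IndexSearcher(new MultiReader(readers));
        }
    }

If that is roughly what you have in mind, then in my case the partition value would simply be the "database" a document belongs to.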
> -----Original Message-----
> From: Yonik Seeley [mailto:[EMAIL PROTECTED]
> Sent: Wed 4/12/06 11:54 AM
> To: solr-user@lucene.apache.org
> Cc:
> Subject: Re: Interest in Extending SOLR
>
> Welcome Michael,
>
> On 4/12/06, Bryzek.Michael <[EMAIL PROTECTED]> wrote:
> > * Integrated support for partitioning - database tables can be
> >   partitioned for scalability reasons. The most common scenario for us
> >   is to partition off data for our largest customers. For example,
> >   imagine a users table:
> >
> >   * user_id
> >   * email_address
> >   * site_id
> >
> >   where site_id refers to the customer to whom the user belongs. Some
> >   sites aggregate data... i.e. one of our customers may have 100
> >   sites. When indexing, we create a separate index to store only data
> >   for a given site. This precomputes one of our more expensive
> >   computations for search - a filter for all users that belong to a
> >   given site.
>
> So the number of filters is equal to the number of sites? How many
> sites are there?
>
> > * Decoupled infrastructure - we wanted the ability to fully scale our
> >   search application independent of our database application
>
> That makes total sense... we do the same thing.
>
> > * High speed indexing - we initially moved data from the database to
> >   Lucene via XML documents. We found that to index even 100k
> >   documents, it was much faster to move the data in CSV files
> >   (smaller files, less intensive processing).
>
> Support for indexing from CSV files as well as simple pulling from a
> database is on our "todo" list: http://wiki.apache.org/solr/TaskList
>
> > IDEAS:
> >
> > Looking through SOLR, I've identified the following main categories of
> > change. I would love to hear comments and feedback from this group.
>
> It would be nice to make any changes as general as possible, while still
> solving your particular problem.
>
> I think I understand many of the internal changes you outlined, but I'm
> not sure yet exactly what problem you are trying to solve, and how the
> multiple indices will be used.
> - How would one identify what index (or SolrCore) an update is targeted
>   to?
> - What is the relationship between the multiple indices... do queries
>   ever go across multiple indices, or would there be an "objectType"
>   parameter passed in as part of the query?
> - What is the purpose of multiple indices... is it so search results are
>   always restricted to a single site, but it's not practical to have
>   that many Solr instances? It looks like the indices are partitioned
>   along the lines of object type, and not site-id though.
>
> -Yonik
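One last note, for comparison: my understanding of the cached-bitset approach Mike says they tried (and found slower once each site only sees a small slice of the documents) is roughly the following. Again this is just an illustrative sketch against plain Lucene; the cache class is invented, and the site_id field name is taken from the users example above.

    // One shared index; results are restricted per site with a cached filter
    // instead of a separate physical index per site.
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryFilter;
    import org.apache.lucene.search.TermQuery;

    public class SiteFilterCache {
        private final Map filters = new HashMap();   // site_id -> cached Filter

        /** Filter restricting results to a single site; built once, then reused. */
        public synchronized Filter forSite(String siteId) {
            Filter f = (Filter) filters.get(siteId);
            if (f == null) {
                // QueryFilter caches its bitset per index reader after first use
                f = new QueryFilter(new TermQuery(new Term("site_id", siteId)));
                filters.put(siteId, f);
            }
            return f;
        }

        public Hits search(IndexSearcher searcher, Query userQuery, String siteId)
                throws Exception {
            return searcher.search(userQuery, forSite(siteId));
        }
    }

Given Mike's numbers, it is easy to see why the separate-index route wins when each partition is only 1-10% of the total document space, so I am very interested in seeing the partitionField idea land in Solr.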