Re: Interest in Extending SOLR

2006-04-13 Thread Vish D.
Mike,

I am currently evaluating different search engine technologies (esp., open
source ones), and this is very interesting to me, for the following reasons:

Our data is much like yours in that we have different types of data
(abstracts, fulltext, music, etc.), which eventually fall under different
"databases" in our subscription/offering model. So, the ability to have
different indexes (at the database level, and at the type level) would be
the ideal solution. The only difference, compared to your needs, is that it
would be a requirement to be able to search between different indexes
(searching between "databases"), but also to be able to search only within
types. That is, with your proposal, objectType could be "type" or
"database." The point here isn't that it would be nice to have a second
parameter, but that it would be a necessity to be able to search between
indexes.

I am truly interested in how this all works out, and hope to get myself
involved in Solr technology.





On 4/12/06, Bryzek.Michael <[EMAIL PROTECTED]> wrote:
>
> Yonik -
>
> > So the number of filters is equal to the number of sites?
> > How many sites are there?
>
> Today: When new customers join, we generally don't do anything
> special. Currently we have roughly 400 customers, most of which have
> one site each. Note that a few customers have as many as 50 sites. In
> total, we probably filter data in 500 unique ways, before we actually
> search on the query string entered by the user. Of the 500 unique ways
> in which we filter data, there are approximately 50 for which we would
> prefer to use a unique index. I don't have 100% accurate numbers, but
> these should be in the ballpark.
>
> Future: We are planning to expand on the concepts we've developed to
> integrate Lucene and hopefully SOLR into other applications. One in
> particular:
>
>   * Provides a core data set of 100K records
>
>   * Allows each of 1,000 customers to create their own view of that
> data
>
>   * In theory, our overall dataset may contain up to 100K * 1,000
> records (100M), but we know that at any given time, only 100K
> records should be made available.
>
> We did rough tests and found that creating multiple indexes performed
> better at run time, especially as the logic to determine what results
> should be presented to which customer became more complex.
>
>
> > Support for indexing from CSV files as well as simple pulling from a
> > database is on our "todo" list: http://wiki.apache.org/solr/TaskList
>
> I had seen this on the TODO list. I'm offering to contribute this
> piece when we've got an idea of overall fit...
>
>
> > How would one identify what index (or SolrCore) an update is
> > targeted to?
>
> This is a good question. I think the query interface itself would have
> to be extended. That is, a new parameter would have to be introduced
> which identified the objectType you would like to search/update. If
> omitted,
> the default object type would be used. In our current system, we set
> the objectType to the name of the database table and thus can issue
> queries like:
>
>   search.jsp?tableName=users&queryString=email:michael.bryzek
>
>
> > What is the relationship between the multiple indices... do queries
> > ever go across multiple indices, or would there be an "objectType"
> > parameter passed in as part of the query?
>
> In our case, there is no relationship between the multiple indices,
> but I do see value here (more on this below). In our specific case, we
> have a one to one mapping between a database table and a Lucene index
> and have not needed to search across tables.
>
> I think the value of the objectType is this true independence. If you
> are indexing similar data, use a field on your data. If your data sets
> are truly different, use a different object type.
>
>
> > What is the purpose of multiple indices... is it so search results
> > are always restricted to a single site, but it's not practical to
> > have that many Solr instances?  It looks like the indices are
> > partitioned along the lines of object type, and not site-id though.
>
> Your questions and comments are good. Thinking about it has helped me
> to clarify what exactly we're trying to accomplish. I think it boils
> down to these goals:
>
>   a) Minimize the number of instances of SOLR. If I have 3 web
>  applications, each with 12 database tables to index, I don't want
>  to run 36 JVMs. I think introducing an objectType would address
>  this.
>
>   b) Optimize retrieval when I have some knowledge that I can use to
>  define partitions of data. This may actually be more appropriate
>  for Lucene itself, but I see SOLR pretty well positioned to
>  address. One approach is to introduce a "partitionField" that
>  SOLR would use to figure out if a new index is required. For each
>  unique value of the partitionField, we create a separate physical
>  index. If the query does NOT contain a term for the
> >  partitionField, we use a multi reader to search across all
> >  indexes. If the query DOES contain the term, we only search
> >  across those partitions.

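Michael's partitionField idea can be sketched in miniature. Everything below is hypothetical: `PartitionRouter` is an invented name, and plain in-memory lists stand in for the separate physical Lucene indexes he describes.

```java
import java.util.*;

// Illustrative sketch of the proposed "partitionField" behavior, not an
// existing Solr feature. Real partitions would be separate Lucene indexes;
// in-memory document lists stand in for them here.
public class PartitionRouter {
    // One "physical index" (a list of documents) per partition value.
    private final Map<String, List<Map<String, String>>> partitions = new HashMap<>();
    private final String partitionField;

    public PartitionRouter(String partitionField) {
        this.partitionField = partitionField;
    }

    // Index time: route each document to the index for its partition value,
    // creating a new index on first sight of a value.
    public void add(Map<String, String> doc) {
        String key = doc.get(partitionField);
        partitions.computeIfAbsent(key, k -> new ArrayList<>()).add(doc);
    }

    // Query time: a term on the partition field narrows the search to one
    // index; otherwise search all indexes (the MultiReader case).
    public List<Map<String, String>> search(String field, String value) {
        Collection<List<Map<String, String>>> scope =
            field.equals(partitionField)
                ? Collections.singleton(
                      partitions.getOrDefault(value, Collections.emptyList()))
                : partitions.values();
        List<Map<String, String>> hits = new ArrayList<>();
        for (List<Map<String, String>> index : scope) {
            for (Map<String, String> doc : index) {
                if (value.equals(doc.get(field))) hits.add(doc);
            }
        }
        return hits;
    }
}
```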
RE: Interest in Extending SOLR

2006-04-13 Thread Chris Hostetter


The crux of the issue seems to be supporting multiple indexes within a
single JVM. This has come up before, and personally I'm still in favor of
implementing this via multiple webapps in the same servlet container,
rather than a single webapp with many separate configs/schemas/cores that
it chooses between...

http://www.nabble.com/Re%3A-Multiple-indices-p3540026.html

you mentioned...

: I think the value of the objectType is this true independence. If you
: are indexing similar data, use a field on your data. If your data sets
: are truly different, use a different object type.

...there is still some dependency there, unless you add a similar
objectType sectioning to the solrconfig -- there might be some query
handlers that I only want for one index but not others, or
newSearcher/firstSearcher listeners I only want for one index, etc.
Having truly separate webapps gives you all the benefits of independence,
plus the benefit of a single shared JVM with a shared memory pool.


I'm still confused, however, about why you want multiple indexes in your
specific use case. You mention...

: In our case, there is no relationship between the multiple indices,
: but I do see value here (more on this below). In our specific case, we
: have a one to one mapping between a database table and a Lucene index
: and have not needed to search across tables.

but then you also said..

:  address. One approach is to introduce a "partitionField" that
:  SOLR would use to figure out if a new index is required. For each
:  unique value of the partitionField, we create a separate physical
:  index. If the query does NOT contain a term for the
:  partitionField, we use a multi reader to search across all
:  indexes. If the query DOES contain the term, we only search
:  across those partitions.

...which confuses me.  I also don't see how the partitionField idea could
work cleanly, because I can't think of any way you could safely use a
MultiReader (or even a MultiSearcher) across indexes that had different
schemas ... even if field names were the same, they might have been
analyzed completely differently, or use field types that encode the terms
in non-uniform ways.


Are we discussing two different use cases?

  1) support for multiple indexes with heterogeneous schemas in the same
     JVM for better memory management (but with no interaction at query
     time)
  2) support for multiple indexes sharing the same schema in the same JVM
     for better partitioning, with query-time access to either a
     multireader across all, or a single reader of an individual partition.

...if so, perhaps the two use cases should be solved in different ways:
multiple webapps for #1, and some new config/schema options and a new
SolrQueryRequest.getSearcher(String partition) method for #2.


-Hoss



Re: Interest in Extending SOLR

2006-04-13 Thread Yonik Seeley
Michael,

I'm not sure that objectType should be tied to which index something
is stored in.
If Solr does evolve multiple index support, one use case would be
partitioning data based on factors other than objectType
(documentType).

It would seem more flexible for clients (the direct updater or querier
of Solr) to identify which index should be used.  Of course each index
could have its own schema, but it shouldn't be mandatory... it seems
like a new index should be able to be created on the fly somehow,
perhaps using an existing index as a template.

On 4/12/06, Bryzek.Michael <[EMAIL PROTECTED]> wrote:
> We did rough tests and found that creating multiple indexes performed
> better at run time, especially as the logic to determine what results
> should be presented to which customer became more complex.

I would expect searching a small index would be somewhat faster than
searching a large index with the small one embedded in it.  How much
faster though?  Is it really worth the effort to separate things out? 
When you did the benchmarks, did you make sure to discount the first
queries (because of first-use norm and FieldCache loading)?  All that
can be done in the background...
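Yonik's warm-up caveat is the standard benchmarking discipline: discard the cold runs that pay one-time costs (norms and FieldCache loading, in Lucene's case) before timing anything. A minimal sketch, with an invented `WarmBench` helper and a `Runnable` standing in for a real search call:

```java
// Benchmark pattern: discard the first (cold) runs so one-time costs
// don't skew the comparison, then report the median of the timed runs.
public class WarmBench {
    public static double medianMillis(Runnable query, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) query.run();   // cold runs, not timed
        long[] times = new long[runs];
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            query.run();
            times[i] = System.nanoTime() - t0;
        }
        java.util.Arrays.sort(times);
        return times[runs / 2] / 1e6;                    // median, in ms
    }
}
```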

I'm not arguing against extending Solr to support multiple indices,
but wondering if you could start using it as-is until such support is
well hashed out.  Seems so, since it seems to be an issue of
performance (an optimization) and not functionality, right?

Another easy optimization you might be able to make external to Solr
is to segment your site data into different Solr collections (on
different boxes).  This assumes that search traffic is naturally
partitioned by siteId (but I may be misunderstanding).

>   a) Minimize the number of instances of SOLR. If I have 3 web
>  applications, each with 12 database tables to index, I don't want
>  to run 36 JVMs. I think introducing an objectType would address
>  this.

Another possible option is to run multiple Solr instances (webapps)
per appserver... I recall someone else going after this solution.

>   b) Optimize retrieval when I have some knowledge that I can use to
>  define partitions of data. This may actually be more appropriate
>  for Lucene itself, but I see SOLR pretty well positioned to
>  address. One approach is to introduce a "partitionField" that
>  SOLR would use to figure out if a new index is required. For each
>  unique value of the partitionField, we create a separate physical
>  index. If the query does NOT contain a term for the
>  partitionField, we use a multi reader to search across all
>  indexes. If the query DOES contain the term, we only search
>  across those partitions.

While that approach might be better w/o caching, it might be worse
with caching... it really depends on the nature of the index and the
queries.
It would really complicate Solr's caching, though, since a cache item
would only be valid for certain combinations of sub-indices.

>  We have tried using cached bitsets to implement this sort of
>  approach, but have found that when we have one large document set
>  partitioned into much smaller sets (e.g. 1-10% of the total
>  document space), creating separate indexes gives us a much higher
>  boost in performance.

I assume this was with Lucene and not Solr?
Solr has better/faster filter representations... (and if I ever get
around to finishing it, a faster BitSet implementation too).

-Yonik


Re: Interest in Extending SOLR

2006-04-13 Thread Yonik Seeley
On 4/13/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> This has come up before, and personally I'm still in favor of
> implementing this via multiple webapps in the same servlet container

That would certainly be easier (and you bring up a good point about
probably wanting different solrconfigs for truly separate indices
also).
I think we are 99% of the way to multiple webapps now... just need to
set the default directory to look for config based on the webapp name.

-Yonik


RE: Interest in Extending SOLR

2006-04-13 Thread Brian Lucas
Hello there.  I moved from Lucene to Solr and am thus far impressed with the
speed, even with the added XML transport necessary to send and receive data.


However, one thing I did like about the Lucene implementation was the
ability to specify indices for each invocation.

I agree with Michael Bryzek's comments about extending Solr to include
multiple indices.  This would be an extremely useful feature for me, as I
tend to stratify data to keep seek times and data sizes low.  I'm currently
looking to run multiple instances of Solr if necessary to handle this.

Another possible way to implement multiple indices (in addition to Michael's
suggestion through objectTypes) could be simply appending the index name to
the request, both when indexing and when querying, similar to the following:

via Query:
http://localhost/solr/select?index=products&q=...

And either schema.xml broken into per-index sections, or a
schema-<indexname>.xml file per index, could handle that particular schema.
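Brian's proposed per-request index selection could be exercised from client code along these lines. This is a sketch of the proposal only: `index` is not an existing Solr parameter here, and `IndexedQuery` is an invented helper.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds a select URL carrying the *proposed* "index" parameter.
// Everything here is illustrative of the idea under discussion.
public class IndexedQuery {
    public static String selectUrl(String host, String index, String q) {
        return "http://" + host + "/solr/select"
             + "?index=" + URLEncoder.encode(index, StandardCharsets.UTF_8)
             + "&q=" + URLEncoder.encode(q, StandardCharsets.UTF_8);
    }
}
```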

Regardless how it gets done, it would be extremely valuable to offer this
functionality and afford a better deployment "strategy" for those of us
dealing with different types of data.

Thanks Yonik, Chris, and others for releasing this - it seems to work great
for my needs.

Brian

---
Michael Bryzek wrote:
  a) Minimize the number of instances of SOLR. If I have 3 web
 applications, each with 12 database tables to index, I don't want
 to run 36 JVMs. I think introducing an objectType would address
 this.




Sorting non-indexed fields

2006-04-13 Thread Ken Krugler

Hi all,

Just checking - it seems that only indexed fields can be specified for 
sorting purposes in the query string, or at least that's what I'm 
seeing for a date field.


Is this correct? Did I miss this info on the wiki someplace?

And if it is the case, then it would be handy for 
StandardRequestHandler to return an error if an invalid request such 
as this (sorting by a non-indexed field) is made.


Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"


Re: Sorting non-indexed fields

2006-04-13 Thread Chris Hostetter
: Just checking - it seems that only indexed fields can be specified for
: sorting purposes in the query string, or at least that's what I'm
: seeing for a date field.

correct -- Solr uses Lucene's built-in sorting, and Lucene requires that
the field be indexed and un-tokenized.
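In schema.xml terms, a sortable field is one that is indexed and whose type yields a single term per document. A sketch, using type names from Solr's example schema (treat the exact names and fields as illustrative):

```xml
<!-- Sortable: indexed, and the type yields one term per document. -->
<field name="created" type="date"   indexed="true" stored="true"/>
<field name="title_s" type="string" indexed="true" stored="true"/>

<!-- Not sortable: a tokenized text field produces many terms. -->
<field name="body"    type="text"   indexed="true" stored="true"/>
```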

: Is this correct? Did I miss this info on the wiki someplace?

I'll add it.

: And if it is the case, then it would be handy for
: StandardRequestHandler to return an error if an invalid request such
: as this (sorting by a non-indexed field) is made.

Could you do me a favor and file that in Jira as a feature request?  I
remember looking at doing that a while back and thinking it required
more than just a quick change to the RequestHandler.





-Hoss



RE: Interest in Extending SOLR

2006-04-13 Thread Bryzek.Michael
I definitely like the idea of support for multiple indexes based on
partitioning data that is NOT tied to a predefined element named
objectType. If we combine this with Chris' mention of completing the
work to support multiple schemas via multiple webapps in the same
servlet container, then I no longer see an immediate need to have more
than one schema per webapp. The concept would be:

 

* One schema per webapp, Multiple webapps per JVM

* Partitioning of data into multiple indexes in each webapp
based on logic that you provide

 

For our own applications, my preference is to migrate away from our
homegrown solution to SOLR, prior to investing further in what we
currently have built. I will plan on testing performance a bit more
formally to see if SOLR out of the box would work for us. Note that in
our present environment, our performance improved significantly (a factor
of ~10) when we partitioned data into multiple indexes, though our tests
were very rough.

 

I would be very happy to contribute time to expand SOLR to provide
initial support for the partitioning concept as I believe this will
prove critical when we evaluate how our database structure maps to a
query index.

 

One last note: last night, I did spend a bit of time looking into what
exactly it would mean to add support for object types in SOLR. I
modified the code base to support the object type tag in the schema,
providing a working proof of concept (I'm happy to send a sample schema
if anybody is interested). The main changes:

 

* Modify IndexSchema to keep an object type

* Provide a factory in SolrCore that returns the correct
instance of SolrCore based on object type

* Modify loading of schema to load one copy per object type

 

I really do like where this conversation has gone. If the community
does choose to support multiple object types, on the surface (to a
newcomer) it appears highly doable.

 

-Mike

 

-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 13, 2006 2:53 PM
To: solr-user@lucene.apache.org
Subject: Re: Interest in Extending SOLR

 


Parsing/indexing XML data

2006-04-13 Thread Ken Krugler

Hi all,

I've got some fields that will contain embedded XML. Two questions 
relating to that:


1. It appears as though I'll need to XML-escape the field data, as 
otherwise Solr complains about finding a start tag (one of the embedded 
tags) before it finds the end tag for a field.


Is this an expected constraint?

And is XML-escaping the data the best way to handle it? This is kind 
of related to question #2...

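The escaping in question #1 is ordinary XML entity escaping of the field value before it goes inside a <field> element; the update handler then parses it back as character data. A minimal sketch (XmlEscape is an invented helper, not a Solr class):

```java
// Escape a field value so embedded XML survives inside a Solr <field>
// element: the XML parser on the update side sees the markup as character
// data and yields the original string back on parse.
public class XmlEscape {
    public static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```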

2. What would be the easiest way to ignore XML tag data while 
indexing these types of XML-containing fields? It seems like I could 
define a new field type (e.g. text_xml) and set the associated 
tokenizer class to something new that I create. Though I'd have to 
un-escape the data (ick) before parsing it to skip tags.


Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"