Re: replication test problems

2006-11-02 Thread Bill Au

I have created a bug to track this:

https://issues.apache.org/jira/browse/SOLR-63

I will attach a patch to the bug shortly.

Bill

On 11/1/06, Yu-Hui Jin <[EMAIL PROTECTED]> wrote:


Yep, Bill.

The backslash-escaping one works for my zsh as well.  And I'm sure you've
checked that it works for the other major shells.

So I would say backslash seems to be a good solution since you don't
have to worry about doubled single quotes.

Thanks!


regards,
-Hui



On 11/1/06, Bill Au <[EMAIL PROTECTED]> wrote:
>
> I did some testing and backslash-escaping also works:
>
> find /home/yjin/apps/solr-nightly/example/solr/data/ -name snapshot.\* -print
>
> Hui, can you verify that?
>
> I am already using single quotes in the snappuller script to specify
> the find command as an argument to ssh.  I could change that to double
> quotes and then use single quotes for snapshot.*, or backslash-escape
> the *.
>
> I am fine with either way.  Does anyone have a strong preference?
> If not, I will just randomly choose one.
>
> Bill
>
> On 10/31/06, Mike Klaas <[EMAIL PROTECTED]> wrote:
> >
> > On 10/31/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> >
> > > Bill: what do you think about explicitly putting in the single
> > > quotes as Hui suggested?  that should still work under bash and sh
> > > right?
> >
> > That should work in bash, at least.  Backslash-escaping is also an
> > option.
> >
> > The semantics of file globbing in bash are irritating.
> >
> > -Mike
> >
>
>


--
Regards,

-Hui
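The quoting concern in the thread above can be sketched in Python (purely illustrative; the host and paths are made up): the point is that the pattern must reach the remote find unexpanded, so the local shell must not glob it.

```python
import shlex

# The glob pattern must reach the remote find unexpanded, so the local
# shell must not glob it first.  Quoting the pattern (equivalent in
# effect to the backslash-escaping discussed above) achieves that.
pattern = "snapshot.*"
remote_cmd = "find /path/to/solr/data -name " + shlex.quote(pattern) + " -print"

# Passing the command as a single argv element to ssh avoids any further
# local expansion (host and paths here are made up).
ssh_argv = ["ssh", "somehost", remote_cmd]
```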




Hierarchical Facets?

2006-11-02 Thread David Legg
I've just discovered Solr and realized its potential!  My main interest 
is in Solr's fledgling support for faceted search.


Are there any ideas on how to support hierarchical facets?

Take a facet like 'Location' for example.  This can be thought of as 
hierarchical because you could let the user select a path through the 
hierarchy like:


 United States > Massachusetts > Boston


I suppose you could use the CNET approach of storing a Meta document for 
each facet and then define three facets called Country, State and City 
and set things up so that if the Country facet is selected the State 
facet is displayed.  This sounds workable (but long-winded) and you'd
have the extra problem that someone selecting 'London' from the City
facet may get documents from both Canada and England, etc.


David Legg



Null as match sort of deal

2006-11-02 Thread Corey Tisdale
I'm scoping out the simple faceting for a project we have coming up,
and I am wondering what the best way to achieve the following would
be (if possible). We have some facets where the presence of some
value always constitutes a hit (where facet=x or facet=null); in our
current non-Solr prototype we use null for this, so we call the setup
"null as match" :)  Other facets constitute a match only when facet=x.
Can out-of-the-box Solr faceting support something like this, or can
it be done with a light massaging?


Thanks!
Corey


Re: Re: Recommended Update Batch Size?

2006-11-02 Thread Yonik Seeley

On 11/1/06, Mike Klaas <[EMAIL PROTECTED]> wrote:

DUH2.doDeletions() would also highly benefit from sorting the id terms
before looking them up in these types of cases (as it would trigger
optimizations in lucene as well as being kinder to the os' read-ahead
buffers).


Hmmm, good point.  I wonder how simply using a TreeMap instead of a
HashMap would work.


> If you have a multi-CPU server, you could increase indexing
> performance by using a multithreaded client to keep all the CPUs on
> the server busy.

I thought so, too, but it turns out that there isn't a huge amount of
concurrent updating that can occur, if I am reading the code
correctly.  DUH2.addDoc() calls exactly one of addConditionally,
overwriteBoth, or allowDups, each of which add the document in a
synchronized(this) block.


Good catch.
And with the way that deletes are deferred, moving the add outside of
the sync block should work OK I think... then the analysis of
documents can be done in parallel.

Hmmm, but it may not work well in a mixed-overwriting environment.
Thread 1 overwrites doc 100, Thread 2 adds doc 100 (allowing duplicates).
With add synchronization the index has two possible states:
  Index contains doc_from_thread1  OR index contains both docs
Without sync around the adds, an additional possible state is added:
 Index contains doc_from_thread2

Even though synchronized behavior != unsynchronized behavior, this is
only a problem if someone actually desires to mix overwriting &
non-overwriting on the same document ids, and is OK with the two
possible states in the synchronized case.

I'm tempted to say "mixing overwriting & non-overwriting adds for the
same documents has undefined behavior".  Thoughts?

-Yonik
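The TreeMap idea floated above amounts to: keep the pending delete ids in sorted order so term lookups walk the index in one direction. A toy sketch (Python standing in for the Java collections; the ids are invented):

```python
# A HashMap-like dict returns ids in arbitrary order; a TreeMap hands
# them back sorted, so delete-term lookups walk the term dictionary in
# one direction (kinder to Lucene and to the OS read-ahead buffers).
pending_deletes = {"doc42": True, "doc07": True, "doc99": True}

# TreeMap-style iteration, emulated by sorting the keys up front.
ordered_ids = sorted(pending_deletes)
```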


Re: Recommended Update Batch Size?

2006-11-02 Thread Walter Underwood
A quick update on my experiments with update rate:

* 20 docs/sec using one wget call per POST
* 170 docs/sec using single doc POST over a persistent HTTP connection
* 250 docs/sec using 20 doc batches over persistent HTTP
* 250 docs/sec using 100 doc batches over persistent HTTP

The latter three used a commit every 2000 docs (not batches)
and an optimize every 10,000 docs.

Each submitted document is between 200 and 700 bytes, pretty small.

I didn't try parallel connections, since this speed is just
fine.

This is using the default settings for merge factor, max buffered docs,
and so on.

wunder
-- 
Walter Underwood
Search Guru, Netflix
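Walter's batched setup can be sketched roughly as follows (hedged: the host, the /solr/update endpoint path, and the document fields are placeholders, not a tested client):

```python
import http.client

def batch_body(ids):
    # One <add> carrying many <doc>s per POST; batching amortizes the
    # per-request overhead.
    return ("<add>"
            + "".join('<doc><field name="id">%s</field></doc>' % i for i in ids)
            + "</add>")

def post_batches(host, ids, batch_size=20):
    # One persistent connection reused across every batch -- the step
    # that moves throughput out of the one-wget-per-POST regime.
    conn = http.client.HTTPConnection(host)
    for start in range(0, len(ids), batch_size):
        body = batch_body(ids[start:start + batch_size])
        conn.request("POST", "/solr/update", body,
                     {"Content-Type": "text/xml; charset=utf-8"})
        conn.getresponse().read()  # drain before reusing the connection
    conn.close()
```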





Re: Hierarchical Facets?

2006-11-02 Thread Yonik Seeley

On 11/2/06, David Legg <[EMAIL PROTECTED]> wrote:

I've just discovered Solr and realized its potential!


Welcome!


My main interest
is in Solr's fledgling support for faceted search.

Are there any ideas on how to support hierarchical facets?


I've thought of some optimizations and a general approach (not yet
implemented) for when it's a strict hierarchy.

Store the full tag path as a single value w/ a special separator:
"United States/Massachusetts/Boston"

That allows one to use the FieldCache (which maps a document back to a
single field value... can only be used for single-valued non-tokenized
fields).

The existing facet.field functionality could be used to return counts
of unique values, and the Solr client (front-end) could sum everything
under Massachusetts to get a count for Massachusetts itself, etc.

An alternative would be to add the concept of a strict facet hierarchy
into Solr, and it could do the summing itself (useful if there are too
many leaves to return them all to the client).

For narrowing search results (specifying filter queries via fq), one
could use prefix queries for higher level categories:
fq=location:United States/Massachusetts/*


So in short, I think it's already doable with what's there, but could
be made more efficient.

-Yonik
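The path-encoding idea is easy to prototype on the client side; a sketch (Python; the counts and the summing rule are illustrative):

```python
from collections import Counter

# Per-leaf counts as facet.field would return them for a path-encoded
# single-valued field (numbers invented for the sketch).
leaf_counts = Counter({
    "United States/Massachusetts/Boston": 103,
    "United States/Massachusetts/Cambridge": 92,
    "United States/Colorado": 77,
})

def rollup(counts):
    # Sum every leaf into each of its ancestors, yielding counts for
    # "United States" and "United States/Massachusetts" for free.
    totals = Counter()
    for path, n in counts.items():
        parts = path.split("/")
        for depth in range(1, len(parts) + 1):
            totals["/".join(parts[:depth])] += n
    return totals

totals = rollup(leaf_counts)
```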


Re: Re: Re: Recommended Update Batch Size?

2006-11-02 Thread Mike Klaas

On 11/2/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 11/1/06, Mike Klaas <[EMAIL PROTECTED]> wrote:
> DUH2.doDeletions() would also highly benefit from sorting the id terms
> before looking them up in these types of cases (as it would trigger
> optimizations in lucene as well as being kinder to the os' read-ahead
> buffers).

Hmmm, good point.  I wonder how simply using a TreeMap instead of a
HashMap would work.


Definitely.


> I thought so, too, but it turns out that there isn't a huge amount of
> concurrent updating that can occur, if I am reading the code
> correctly.  DUH2.addDoc() calls exactly one of addConditionally,
> overwriteBoth, or allowDups, each of which add the document in a
> synchronized(this) block.

Good catch.
And with the way that deletes are deferred, moving the add outside of
the sync block should work OK I think... then the analysis of
documents can be done in parallel.


The one thing I'm worried about is closing the writer while documents
are being added to it. IndexWriter is nominally thread-safe, but I'm
not sure what happens to documents that are being added at the time.
Looking at IndexWriter.java, it seems like if addDocument() is entered
but hasn't reached the synchronized block, then close() is called, the
document could be lost or an exception raised.


I'm tempted to say "mixing overwriting & non-overwriting adds for the
same documents has undefined behavior".  Thoughts?


I believe that is reasonable.

I was going to try to put in some basic autoCommit logic while I was
mucking about here.  One question: did you intend for maxCommitTime to
trigger deterministically (regardless of any events occurring or not)?
I had in mind checking these constraints only when documents are
added, but this could result in maxCommitTime elapsing without a
commit.

regards,
-Mike


Re: Null as match sort of deal

2006-11-02 Thread Chris Hostetter

: be (if possible). We have some facets where the presence of some
: value (in our current prototype on non-solr we use null, so we call
: the setup null as match :) constitutes a hit all of the time (where
: facet=x or facet=null) and on some facets constitutes a not match
: (facet=x only). Can out of the box solr faceting support something
: like this, or can it be done with a light massaging?

it would certainly be doable in a custom request handler ... you would
want to generate a DocSet of all documents that match your sentinel value
(ie: "null") and then union that with each of the DocSets from your
constraints before intersecting with the DocSet from your main search to
get the counts.

alternately, if you know that your fields are "single value" you could
probably just find the intersection of your main DocSet with the DocSet of
things matching the sentinel value, and then add that number to each of
your constraint intersections ... that should be the same number,
correct?


-Hoss
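Hoss's DocSet arithmetic can be sketched with plain sets (all doc ids here are invented):

```python
# Toy DocSets (doc ids are invented for the sketch).
main = {1, 2, 3, 4, 5}        # docs matching the main query
nulls = {5, 9}                # docs with no value in the facet field
constraint_x = {1, 2, 7}      # docs where facet=x

# Null-as-match: union the "null" set into the constraint before
# intersecting with the main result set.
count_union_first = len(main & (constraint_x | nulls))

# The single-valued shortcut: a doc can't both equal x and be null, so
# the two intersections are disjoint and their counts simply add.
count_shortcut = len(main & constraint_x) + len(main & nulls)
```

Both routes give the same count exactly because, for a single-valued field, the sentinel set and each constraint set are disjoint.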



Re: Hierarchical Facets?

2006-11-02 Thread Chris Hostetter

: > My main interest
: > is in Solr's fledgling support for faceted search.
: >
: > Are there any ideas on how to support hierarchical facets?

there are two different things that fall under the heading of
"hierarchical facets", and they can be very different...

The first usage is when the word "hierarchy" refers largely to the UI of
the facet, and not of the data itself.  An example of this would be a
facet on "date", where you show each decade as a possible constraint
w/count and if/when the user picks a decade, then you show the
individual years as constraints, and if/when they pick a year you show
them months, etc... until you get to the granularity that makes sense.
Another example of this would be a facet on a "Name" field, where you
start by showing them constraints and counts on the first letter of a
name, and once they pick that you show them all of the two-letter combos,
and then the three-letter combos, etc...
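The drill-down-by-granularity UI in that first usage can be sketched as simple prefix bucketing (Python; the dates and counts are invented):

```python
from collections import Counter

# Invented document dates in ISO form.
dates = ["1994-07-02", "1996-01-15", "1996-03-09", "2003-11-30"]

def bucket(values, prefix_len):
    # Facet on a truncated prefix: 3 chars gives decades ("199"),
    # 4 gives years, 7 gives months.  The same trick covers the
    # first-N-letters-of-a-name facet.
    return Counter(v[:prefix_len] for v in values)

decades = bucket(dates, 3)
years = bucket(dates, 4)
```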

The second usage is when the documents themselves can be organized into a
hierarchical taxonomy (or perhaps multiple different taxonomies) and you
want to expose that information as a facet ...  Nabble.com's "Narrow
Search Results" right nav fits this model.


It's not always a clear-cut line though ... you might actually think of
your data being organized in a hierarchy of decades, which have years as
sub-categories, which have months as sub-categories ... but most people
wouldn't do that.  I don't know anyone who would think that it actually
makes sense to categorize people in a taxonomy based on the first N letters
of their name.  Yonik's Location example however, is a good one where the
lines are very blurry.  It might make sense to think of it in the first
usage case, where the UI of the facets is presented in such a way that
we only show the State level constraints once a Country level constraint
is picked, etc...  But other people might prefer a UI driven by the second
usage, where a biz search for "Dunkin Donuts" lists the first five
constraints in the Location field as...
 United States/Massachusetts/Boston (103)
 United States/Massachusetts/Cambridge (92)
 United States/Massachusetts/Brookline (84)
 United States/Colorado (77)
 United States/Massachusetts/Newton (55)

...because there are just so damn many Dunkin Donuts in Massachusetts, we
go down to the city level, but for Colorado we just show at the state
granularity.


The first type of "hierarchical facet" is obviously a lot easier than the
second -- largely because the second can't typically be done using simple
Term comparisons ... and you need some carefully chosen logic for
deciding when to be granular, and when to be general.

: An alternative would be to add the concept of a strict facet hierarchy
: into Solr, and it could do the summing itself (useful if there are too
: many leaves to return them all to the client).

FYI: what we found when building the "Category" Facet in the left nav of
this page...
  http://shopper.search.com/search?q=compactflash
...was that sorting on the strict count wasn't very good from a usability
perspective, because the very general categories in the taxonomy tended to
contain so many products they always sorted to the top, even if they
weren't particularly relevant (ie: even if only half of the Digital
Cameras on the market use compact flash cards, that may still be more than
the total number of products in the entire "flash memory" category).  We
wound up computing a custom score for each category based on a function of
the number of matching results in that category, the total number of items in
that category, and a few other metrics.

It may be hard to roll out a truly "generic" way of sorting hierarchical
counts that would be useful for people.


-Hoss
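The custom category score Hoss describes wasn't published; one plausible shape, purely illustrative, weights the match count by how selective the category is:

```python
def category_score(matching, total):
    # Hypothetical scoring shape -- NOT the formula used at search.com.
    # Weighting the match count by the matched fraction keeps a huge
    # general category from always outranking a small, mostly-matching
    # one.
    if total == 0:
        return 0.0
    return matching * matching / total
```

For example, 50 matches out of a 100-item category outscores 500 matches out of a 100,000-item one, which is exactly the inversion Hoss wanted relative to raw counts.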



Re: Re: Re: Recommended Update Batch Size?

2006-11-02 Thread Yonik Seeley

On 11/2/06, Mike Klaas <[EMAIL PROTECTED]> wrote:

The one thing I'm worried about is closing the writer while documents
are being added to it. IndexWriter is nominally thread-safe, but I'm
not sure what happens to documents that are being added at the time.
Looking at IndexWriter.java, it seems like if addDocument() is entered
but hasn't reached the synchronized block, then close() is called, the
document could be lost or an exception raised.


This seems harder to address in "user code" and still maintain parallelism.
Perhaps a Lucene patch would be more appropriate?

Perhaps IndexWriter should have a close flag, and addDocument should
return a boolean indicating if the document was added or not.  Then we
could move addDocument() outside the sync block, and put a big do
while(!addDocument()) loop around the whole thing.

There is still another case to consider: if a commit happens between
adding the id to the pset and adding the document to the index, and
the add succeeds, the id will no longer be in the pset so we will end
up with a duplicate after the next commit.


I was going to try to put in some basic autoCommit logic while I was
mucking about here.  One question: did you intend for maxCommitTime to
trigger deterministically (regardless of any events occurring or not)?


I hadn't thought through the whole thing, but it seems like it should
only trigger if it would make a difference.


 I had in mind checking these constraints only when documents are
added, but this could result in maxCommitTime elapsing without a
commit.


If there is nothing to commit, that should be fine.
I think the type of guarantee we should make is that if you add a
document, it will be committed within a certain period of time
(leaving out variances for autowarming time, etc).

-Yonik
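The close-flag proposal can be sketched outside Lucene (Python threading standing in for the Java; a boolean-returning add_document is the proposal, not an existing IndexWriter API):

```python
import threading

class ToyWriter:
    # Stand-in for an IndexWriter carrying the proposed "closed" flag:
    # add_document reports False instead of adding once close() has run.
    def __init__(self):
        self._lock = threading.Lock()
        self._closed = False
        self.docs = []

    def add_document(self, doc):
        with self._lock:
            if self._closed:
                return False
            self.docs.append(doc)
            return True

    def close(self):
        with self._lock:
            self._closed = True

def add_with_retry(get_writer, doc):
    # The do { } while (!addDocument()) loop from the proposal: if the
    # writer was closed under us, fetch the reopened writer and retry.
    while not get_writer().add_document(doc):
        pass
```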


Re: Hierarchical Facets?

2006-11-02 Thread David Legg

Hi Yonik,

Are there any ideas on how to support hierarchical facets?

I've thought of some optimizations and a general approach (not yet
implemented) for when it's a strict hierarchy.

Thanks for your comments.

Of the two approaches you mention (storing the full tag path or creating 
a strict facet hierarchy concept in Solr) I think the latter is the 
tidier approach.

So in short, I think it's already doable with what's there, but could
be made more efficient.

I think a powerful facet-supporting framework will be a very sought-after
feature in the next few years.  The question is whether Solr is
the right structure to develop it on.  I'm torn because I believe that 
complex facets will require some kind of ontological support rather than 
an inverted index structure like Lucene.  The problem is most 
ontological code is relatively slow and difficult to update in a 
scalable way.


Thanks for adding the Jira note (SOLR-64).

- David.


Re: Re: Re: Re: Recommended Update Batch Size?

2006-11-02 Thread Mike Klaas

On 11/2/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 11/2/06, Mike Klaas <[EMAIL PROTECTED]> wrote:
> The one thing I'm worried about is closing the writer while documents
> are being added to it. IndexWriter is nominally thread-safe, but I'm
> not sure what happens to documents that are being added at the time.
> Looking at IndexWriter.java, it seems like if addDocument() is entered
> but hasn't reached the synchronized block, then close() is called, the
> document could be lost or an exception raised.

This seems harder to address in "user code" and still maintain parallelism.
Perhaps a Lucene patch would be more appropriate?

Perhaps IndexWriter should have a close flag, and addDocument should
return a boolean indicating if the document was added or not.  Then we
could move addDocument() outside the sync block, and put a big do
while(!addDocument()) loop around the whole thing.

There is still another case to consider: if a commit happens between
adding the id to the pset and adding the document to the index, and
the add succeeds, the id will no longer be in the pset so we will end
up with a duplicate after the next commit.


I think that I've come up with a new locking strategy that circumvents
all these issues... stay tuned.


> I was going to try to put in some basic autoCommit logic while I was
> mucking about here.  One question: did you intend for maxCommitTime to
> trigger deterministically (regardless of any events occurring or not)?

I hadn't thought through the whole thing, but it seems like it should
only trigger if it would make a difference.


Right--I was more concerned with whether it would fire on its own, or
whether it is a condition that only triggers (if true) when checked.


>  I had in mind checking these constraints only when documents are
> added, but this could result in maxCommitTime elapsing without a
> commit.

If there is nothing to commit, that should be fine.
I think the type of guarantee we should make is that if you add a
document, it will be committed within a certain period of time
(leaving out variances for autowarming time, etc).


That's the condition I was wondering about.  I may leave that out of
the patch for the time being.

-Mike


Re: Hierarchical Facets?

2006-11-02 Thread David Legg

Chris,


: > Are there any ideas on how to support hierarchical facets?

It may be hard to roll out a truly "generic" way of sorting hierarchical
counts that would be useful for people.


Thanks for the examples.  I see what you mean about the difficulty in 
creating a generic facet framework.  It would be interesting to know if 
any work has been published which enumerates the design patterns 
involved.  I found Marti Hearst's research paper "Design Recommendations 
for Hierarchical Faceted Search" very useful... 
http://flamenco.berkeley.edu/papers/faceted-workshop06.pdf


I hadn't considered the problem of a simple facet like a date 
potentially being presented as a set of pseudo facets like decade, year, 
month etc.  That would require the query server to further split the 
counts depending on how it was to be presented in the user interface.


Sure, sites like epicurious.com with their fixed number of simple facets 
can be catered for in Solr.  Future sites, based on thousands of facets, 
some hierarchical, some Tag-like and some numerical would struggle I think!


- David