Multiple sorting on text fields

2010-09-13 Thread Stanislaw
Hi all!

I found some strange behavior in Solr. If I sort by 2 text fields in a
chain, I receive some results doubled.
Both text fields are non-multivalued; one of them is a string, the other a
custom type based on a text field and the keyword analyzer.

I do this:

CommonsHttpSolrServer server = SolrServer.getInstance().getServer();
SolrQuery query = new SolrQuery();
query.setQuery(suchstring);
query.addSortField("type", SolrQuery.ORDER.asc);     // string field - it's only one letter
query.addSortField("sortName", SolrQuery.ORDER.asc); // text field, not tokenized
QueryResponse rsp = server.query(query);

After that I extract the results as a list of Entity objects. Most of them
are unique, but some are doubled or even tripled in this list.
(Each object has a unique id and appears only once in the index.)
If I sort by only one text field, I receive "normal" results without
problems.
Where could I have made a mistake, or is it a bug?

Best regards,
Stanislaw


Re: what differents between SolrCloud and Solr+Hadoop

2010-09-13 Thread Marc Sturlese

Well, these are pretty different things. SolrCloud is meant to handle
distributed search in an easier way than "raw" Solr distributed search,
where you have to build the shards on your own.
Solr+Hadoop is a way to build these shards/indexes in parallel.

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/what-differents-between-SolrCloud-and-Solr-Hadoop-tp1463809p1464106.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Multiple sorting on text fields

2010-09-13 Thread Dennis Gearon
My guess is two things are happening:
  1/ Your combination of filters is in parallel, or an OR expression. This I
think for sure & maybe, as seen next.
  2/ To get 3 duplicate results, your custom filter AND the OR expression above
have to be working together, or it's possible that your custom filter is the
WHOLE problem, supplying the duplicates and the triplicates.

A first guess & nothing more :-)
Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Mon, 9/13/10, Stanislaw  wrote:

> From: Stanislaw 
> Subject: Multiple sorting on text fields
> To: solr-user@lucene.apache.org
> Date: Monday, September 13, 2010, 12:12 AM
> [...]


Re: Tuning Solr caches with high commit rates (NRT)

2010-09-13 Thread Peter Sturge
The balanced segment merging is a really cool idea. I'll definitely
have a look at this, thanks!

One thing I forgot to mention in the original post is we use a
mergeFactor of 25. Somewhat on the high side, so that incoming commits
aren't trying to merge new data into large segments.
25 is a good balance for us between number of files and search
performance. This LinkedIn patch could come in very handy for handling
merges.
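For reference, that setting lives in solrconfig.xml; a minimal sketch (the value is the one mentioned above, the surrounding element is standard Solr 1.4 config):

```xml
<!-- solrconfig.xml sketch: a mergeFactor of 25 keeps incoming commits
     from merging new data into large segments too eagerly -->
<indexDefaults>
  <mergeFactor>25</mergeFactor>
</indexDefaults>
```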


On Mon, Sep 13, 2010 at 2:20 AM, Lance Norskog  wrote:
> Bravo!
>
> Other tricks: here is a policy for deciding when to merge segments that
> attempts to balance merging with performance. It was contributed by
> LinkedIn- they also run index&search in the same instance (not Solr, a
> different Lucene app).
>
> lucene/contrib/misc/src/java/org/apache/lucene/index/BalancedSegmentMergePolicy.java
>
> The optimize command now includes a partial optimize option, so you can do
> larger controlled merges.
>
> Peter Sturge wrote:
>>
>> Hi,
>>
>> Below are some notes regarding Solr cache tuning that should prove
>> useful for anyone who uses Solr with frequent commits (e.g.<5min).
>>
>> Environment:
>> Solr 1.4.1 or branch_3x trunk.
>> Note the 4.x trunk has lots of neat new features, so the notes here
>> are likely less relevant to the 4.x environment.
>>
>> Overview:
>> Our Solr environment makes extensive use of faceting, we perform
>> commits every 30secs, and the indexes tend to be on the large-ish side
>> (>20million docs).
>> Note: For our data, when we commit, we are always adding new data,
>> never changing existing data.
>> This type of environment can be tricky to tune, as Solr is more geared
>> toward fast reads than frequent writes.
>>
>> Symptoms:
>> If anyone has used faceting in searches where you are also performing
>> frequent commits, you've likely encountered the dreaded OutOfMemory or
>> GC Overhead Exceeded errors.
>> In high commit rate environments, this is almost always due to
>> multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
>> finish autowarming their caches before the next commit()
>> comes along and invalidates them.
>> Once this starts happening on a regular basis, it is likely your
>> Solr's JVM will run out of memory eventually, as the number of
>> searchers (and their cache arrays) will keep growing until the JVM
>> dies of thirst.
>> To check if your Solr environment is suffering from this, turn on INFO
>> level logging, and look for: 'PERFORMANCE WARNING: Overlapping
>> onDeckSearchers=x'.
>>
>> In tests, we've only ever seen this problem when using faceting, and
>> facet.method=fc.
>>
>> Some solutions to this are:
>>     Reduce the commit rate to allow searchers to fully warm before the
>> next commit
>>     Reduce or eliminate the autowarming in caches
>>     Both of the above
>>
>> The trouble is, if you're doing NRT commits, you likely have a good
>> reason for it, and reducing/eliminating autowarming will very
>> significantly impact search performance in high commit rate
>> environments.
>>
>> Solution:
>> Here are some setup steps we've used that allow lots of faceting (we
>> typically search with at least 20-35 different facet fields, and date
>> faceting/sorting) on large indexes, and still keep decent search
>> performance:
>>
>> 1. Firstly, you should consider using the enum method for facet
>> searches (facet.method=enum) unless you've got A LOT of memory on your
>> machine. In our tests, this method uses a lot less memory and
>> autowarms more quickly than fc. (Note, I've not tried the new
>> segment-based 'fcs' option, as I can't find support for it in
>> branch_3x - looks nice for 4.x though)
>> Admittedly, for our data, enum is not quite as fast for searching as
>> fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile
>> tradeoff.
>> If you do have access to LOTS of memory, AND you can guarantee that
>> the index won't grow beyond the memory capacity (i.e. you have some
>> sort of deletion policy in place), fc can be a lot faster than enum
>> when searching with lots of facets across many terms.
>>
>> 2. Secondly, we've found that LRUCache is faster at autowarming than
>> FastLRUCache - in our tests, about 20% faster. Maybe this is just our
>> environment - your mileage may vary.
>>
>> So, our filterCache section in solrconfig.xml looks like this:
>>     <filterCache
>>       class="solr.LRUCache"
>>       size="3600"
>>       initialSize="1400"
>>       autowarmCount="3600"/>
>>
>> For a 28GB index, running in a quad-core x64 VMWare instance, 30
>> warmed facet fields, Solr is running at ~4GB. Stats filterCache size
>> shows usually in the region of ~2400.
>>
>> 3. It's also a good idea to have some sort of
>> firstSearcher/newSearcher event listener queries to allow new data to
>> populate the caches.
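Such event listeners are configured in solrconfig.xml; a minimal editor's sketch (the query and facet field name here are placeholders, not from the original post):

```xml
<!-- Sketch only: warm the caches when a new searcher starts by firing a
     faceted query. The field name "type" is a placeholder for whatever
     facets your application actually uses. -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="facet">true</str>
      <str name="facet.field">type</str>
    </lst>
  </arr>
</listener>
```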
>> Of course, what you put in these is dependent on the facets you need/use.
>> We've found a good combination is a firstSearcher with as many facets
>> in the search as your environment can handle, then a subset of the
>> most common facets for the newSearcher.

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-13 Thread Peter Sturge
1. You can run multiple Solr instances in separate JVMs, with both
having their solr.xml configured to use the same index folder.
You need to be careful that one and only one of these instances will
ever update the index at a time. The best way to ensure this is to use
one for writing only,
and the other is read-only and never writes to the index. This
read-only instance is the one to use for tuning for high search
performance. Even though the RO instance doesn't write to the index,
it still needs periodic (albeit empty) commits to kick off
autowarming/cache refresh.

Depending on your needs, you might not need to have 2 separate
instances. We need it because the 'write' instance is also doing a lot
of metadata pre-write operations in the same JVM as Solr, and so has
its own memory requirements.

2. We use sharding all the time, and it works just fine with this
scenario, as the RO instance is simply another shard in the pack.


On Sun, Sep 12, 2010 at 8:46 PM, Peter Karich  wrote:
> Peter,
>
> thanks a lot for your in-depth explanations!
> Your findings will be definitely helpful for my next performance
> improvement tests :-)
>
> Two questions:
>
> 1. How would I do that:
>
>> or a local read-only instance that reads the same core as the indexing
>> instance (for the latter, you'll need something that periodically refreshes 
>> - i.e. runs commit()).
>
>
> 2. Did you try sharding with your current setup (e.g. one big,
> nearly-static index and a tiny write+read index)?
>
> Regards,
> Peter.
>
>> [...]

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-13 Thread Peter Sturge
Hi Erik,

I thought this would be good for the wiki, but I've not submitted to
the wiki before, so I thought I'd put this info out there first, then
add it if it was deemed useful.
If you could let me know the procedure for submitting, it probably
would be worth getting it into the wiki (couldn't do it straightaway,
as I have a lot of projects on at the moment). If you're able/willing
to put it on there for me, that would be very kind of you!

Thanks!
Peter


On Sun, Sep 12, 2010 at 5:43 PM, Erick Erickson  wrote:
> Peter:
>
> This kind of information is extremely useful to document, thanks! Do you
> have the time/energy to put it up on the Wiki? Anyone can edit it by
> creating
> a logon. If you don't, would it be OK if someone else did it (with
> attribution,
> of course)? I guess that by bringing it up I'm volunteering :)...
>
> Best
> Erick
>
> On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge wrote:
>
>> [...]

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-13 Thread Peter Sturge
Hi Dennis,

These are the Lucene file segments that hold the index data on the file system.
Have a look at: http://wiki.apache.org/solr/SolrPerformanceFactors

Peter


On Mon, Sep 13, 2010 at 7:02 AM, Dennis Gearon  wrote:
> BTW, what is a segment?
>
> I've only heard about them in the last 2 weeks here on the list.
> Dennis Gearon
>
>
>
> --- On Sun, 9/12/10, Jason Rutherglen  wrote:
>
>> From: Jason Rutherglen 
>> Subject: Re: Tuning Solr caches with high commit rates (NRT)
>> To: solr-user@lucene.apache.org
>> Date: Sunday, September 12, 2010, 7:52 PM
>> Yeah there's no patch... I think Yonik can write it. :-) Yah... The
>> Lucene version shouldn't matter. The distributed faceting
>> theoretically can easily be applied to multiple segments, however the
>> way it's written for me is a challenge to untangle and apply
>> successfully to a working patch. Also I don't have this as an itch to
>> scratch at the moment.
>>
>> On Sun, Sep 12, 2010 at 7:18 PM, Peter Sturge 
>> wrote:
>> > Hi Jason,
>> >
>> > I've tried some limited testing with the 4.x trunk using fcs, and I
>> > must say, I really like the idea of per-segment faceting.
>> > I was hoping to see it in 3.x, but I don't see this option in the
>> > branch_3x trunk. Is your SOLR-1606 patch referred to in SOLR-1617 the
>> > one to use with 3.1?
>> > There seems to be a number of Solr issues tied to this - one of them
>> > being Lucene-1785. Can the per-segment faceting patch work with Lucene
>> > 2.9/branch_3x?
>> >
>> > Thanks,
>> > Peter
>> >
>> >
>> >
>> > On Mon, Sep 13, 2010 at 12:05 AM, Jason Rutherglen
>> > 
>> wrote:
>> >> Peter,
>> >>
>> >> Are you using per-segment faceting, eg, SOLR-1617? That could help
>> >> your situation.
>> >>
>> >> On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge
>> 
>> wrote:
>> >>> [...]

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-13 Thread Simon Willnauer
On Mon, Sep 13, 2010 at 8:02 AM, Dennis Gearon  wrote:
> BTW, what is a segment?

On the Lucene level an index is composed of one or more index
segments. Each segment is an index by itself and consists of several
files like doc stores, proximity data, term dictionaries, etc. During
indexing, Lucene / Solr creates those segments depending on RAM buffer
/ document buffer settings and flushes them to disk (if you index to
disk). Once a segment has been flushed, Lucene will never change it
(well, up to a certain level - let's keep this simple) but instead
writes new segments for newly added documents. Since segments have a
write-once policy, Lucene merges multiple segments into a new segment
from time to time (how and when this happens is a different story) to
get rid of deleted documents and to reduce the overall number of
segments in the index.
Generally, a higher number of segments will also influence your search
performance, since Lucene performs almost all operations on a
per-segment level. If you want to reduce the number of segments to one,
you need to call optimize, and Lucene will merge all existing segments
into one single segment.
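Simon's write-once/merge description can be illustrated with a toy sketch (an editor's illustration only, not Lucene's actual data structures):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SegmentToy {
    public static void main(String[] args) {
        // Toy model of write-once segments: each flush creates a new
        // immutable segment; existing segments are never modified.
        List<List<String>> segments = new ArrayList<>();
        segments.add(List.of("doc1", "doc2")); // first flush
        segments.add(List.of("doc3"));         // second flush: a brand-new segment
        Set<String> deleted = Set.of("doc2");  // deletions are only marked, not removed

        // "Optimize": merge all segments into a single new one,
        // dropping deleted docs along the way.
        List<String> merged = new ArrayList<>();
        for (List<String> seg : segments)
            for (String doc : seg)
                if (!deleted.contains(doc)) merged.add(doc);
        segments = new ArrayList<>(List.of(merged));

        System.out.println(segments); // [[doc1, doc3]]
    }
}
```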

hope that answers your question

simon
> [...]
Re: Sorting not working on a string field

2010-09-13 Thread Jan Høydahl / Cominvent
Hi,

Can you show us the result you actually get? Wouldn't it make more sense to 
choose a numeric fieldtype? To get a proper sort order for numbers in a string 
field, all numbers need to be exactly the same length, since the order will be 
lexicographical, i.e. "10" will come before "2", but after "02".
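The lexicographical behaviour described above is easy to reproduce with plain Java string sorting (a standalone illustration, not Solr-specific):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class LexSortDemo {
    public static void main(String[] args) {
        // Lexicographical (string) order compares character by character,
        // so "10" sorts before "2".
        List<String> raw = new ArrayList<>(Arrays.asList("2", "10", "1"));
        Collections.sort(raw);
        System.out.println(raw); // [1, 10, 2]

        // Zero-padding all values to the same length restores numeric order.
        List<String> padded = new ArrayList<>(Arrays.asList("02", "10", "01"));
        Collections.sort(padded);
        System.out.println(padded); // [01, 02, 10]
    }
}
```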

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 10. sep. 2010, at 19.14, n...@frameweld.com wrote:

> Hello, I seem to be having a problem with sorting. I have a string field 
> (time_code) that I want to order by. When the results come up, it displays 
> the results differently from relevance which I would assume, but the results 
> aren't ordered. The data in time_code came from a numeric decimal with six-digit 
> precision, if that makes a difference (ex: 1.00).
> 
> Here is the query I give it:
> 
> q=ceremony+AND+presentation_id%3A296+AND+type%3Ablob&version=1.3&json.nl=map&rows=10&start=0&wt=json&hl=true&hl.fl=text&hl.simple.pre=&hl.simple.post=<%2Fspan>&hl.fragsize=0&hl.mergeContiguous=false&&sort=time_code+asc
> 
> 
> And here's the field schema:
> 
> 
> 
> 
>  multiValued="true"/>
> 
> 
> 
>  allowDups="true" multiValued="true"/>
> 
>  allowDups="true"/>
> 
> 
> Thanks for any help.
> 



Re: mm=0?

2010-09-13 Thread Jan Høydahl / Cominvent
As Erick points out, you don't want a random doc as response!
What you're looking at is how to avoid the "0 hits" problem.
You could look into one of these:
* Introduce autosuggest to avoid many 0-hits cases
* Introduce spellchecking
* Re-run the failed query with fuzzy turned on (e.g. alpha~)
* Redirect user to some other, broader source (wikipedia, google...) if 
relevant to your domain.
No matter what you do, it is important to communicate it to the user in a very 
clear way.
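The two-search fallback from the quoted question below can be sketched generically; in this editor's sketch the search function stands in for a real server.query(...) call, and the match-all fallback ("*:*") is just one of the options listed above:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class FallbackSearch {
    // 'search' stands in for server.query(...). If the user's query gets
    // zero hits, re-run with a match-all query so at least one result
    // comes back; in the common case only one search is issued.
    static List<String> searchWithFallback(Function<String, List<String>> search,
                                           String userQuery) {
        List<String> hits = search.apply(userQuery);
        return hits.isEmpty() ? search.apply("*:*") : hits;
    }

    public static void main(String[] args) {
        // Toy in-memory "index" so the sketch runs without a Solr server
        List<String> docs = Arrays.asList("alpha report", "beta notes");
        Function<String, List<String>> search = q ->
            q.equals("*:*") ? docs
                            : docs.stream().filter(d -> d.contains(q))
                                  .collect(Collectors.toList());
        System.out.println(searchWithFallback(search, "alpha")); // [alpha report]
        System.out.println(searchWithFallback(search, "gamma")); // [alpha report, beta notes]
    }
}
```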

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 11. sep. 2010, at 19.10, Satish Kumar wrote:

> Hi,
> 
> We have a requirement to show at least one result every time -- i.e., even
> if user entered term is not found in any of the documents. I was hoping
> setting mm to 0 would return results in all cases, but it does not.
> 
> For example, if user entered term "alpha" and it is *not* in any of the
> documents in the index, any document in the index can be returned. If term
> "alpha" is in the document set, documents having the term "alpha" only must
> be returned.
> 
> My idea so far is to perform a search using user entered term. If there are
> any results, return them. If there are no results, perform another search
> without the query term-- this means doing two searches. Any suggestions on
> implementing this requirement using only one search?
> 
> 
> Thanks,
> Satish



Re: Multiple sorting on text fields

2010-09-13 Thread Stanislaw
Hi Dennis,
thanks for reply.
Please explain me what filter do you mean.

I'm searching only on one field with names:
query.setQuery(suchstring);

then I'm adding two sortings on another fields:
query.addSortField("type", SolrQuery.ORDER.asc);
query.addSortField("sortName", SolrQuery.ORDER.asc);

the results should be sorted first by 'type' (only one letter 'A'
or 'B')
and then they should be sorted by names

how can I define 'OR' or 'AND' relations here?

Best regards,
Stanislaw


2010/9/13 Dennis Gearon 

> My guess is two things are happening:
>  1/ Your combination of filters is in parallel,or an OR expression. This I
> think for sure  maybe, seen next.
>  2/ To get 3 duplicate results, your custom filter AND the OR expression
> above have to be working together, or it's possible that your custom filter
> is the WHOLE problem, supplying the duplicates and the triplicates.
>
> A first guess  nothing more :-)
> Dennis Gearon
>
> Signature Warning
> 
> EARTH has a Right To Life,
>  otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
>
>
> --- On Mon, 9/13/10, Stanislaw  wrote:
>
> > From: Stanislaw 
> > Subject: Multiple sorting on text fields
> > To: solr-user@lucene.apache.org
> > Date: Monday, September 13, 2010, 12:12 AM
> > Hi all!
> >
> > i found some strange behavior of solr. If I do sorting by 2
> > text fields in
> > chain, I do receive some results doubled.
> > The both text fields are not multivalued, one of them is
> > string, the other
> > custom type based on text field and keyword analyzer.
> >
> > I do this:
> >
> > *CommonsHttpSolrServer server
> > =
> > SolrServer.getInstance().getServer();
> > SolrQuery query = new
> > SolrQuery();
> > query.setQuery(suchstring);
> > query.addSortField("type",
> > SolrQuery.ORDER.asc);
> > //String field- it's only one letter
> > query.addSortField("sortName",
> > SolrQuery.ORDER.asc); //text
> > field, not tokenized
> >
> > QueryResponse rsp = new
> > QueryResponse();
> > rsp = server.query(query);*
> >
> > after that I extract results as a list of Entity objects, the
> > most of them are
> > unique, but some of them are doubled and even tripled in
> > this list.
> > (Each object has a unique id and exists only one time in
> > the index)
> > If I'm sorting only by one text field, I'm receiving
> > "normal" results w/o
> > problems.
> > Where could I do a mistake, or is it a bug?
> >
> > Best regards,
> > Stanislaw
> >
>


Re: Solr CoreAdmin create ignores dataDir Parameter

2010-09-13 Thread Frank Wesemann

MitchK schrieb:

Frank,

have a look at SOLR-646.

Do you think a workaround for the data-dir-tag in the solrconfig.xml can
help?
I think about something like ${solr./data/corename} for
illustration.

Unfortunately I am not very skilled in working with solr's variables and
therefore I do not know what variables are available. 
  

No, variables are not available at this stage.

If we find a solution, we should provide it as a suggestion at the wiki's
CoreAdmin-page.

Kind regards,
- Mitch
  



--
mit freundlichem Gruß,

Frank Wesemann
Fotofinder GmbH         USt-IdNr. DE812854514
Software Entwicklung    Web: http://www.fotofinder.com/
Potsdamer Str. 96       Tel: +49 30 25 79 28 90
10785 Berlin            Fax: +49 30 25 79 28 999

Sitz: Berlin
Amtsgericht Berlin Charlottenburg (HRB 73099)
Geschäftsführer: Ali Paczensky





Re: Multiple sorting on text fields

2010-09-13 Thread Erick Erickson
A couple of things come to mind:
1> what happens if you remove the sort clauses?
 Because I suspect they're irrelevant and your
 duplicate issue is something different.
2> SOLR admin should let you determine this.
3> Please show us the configurations that
 make you sure that the documents
 are unique (I'm assuming you've defined
  <uniqueKey> in your schema, but please
 show us. And show us the field TYPE
  definition).
4> Assuming the uniqueKey is defined, did you
 perhaps define it after you'd indexed some
 documents? SOLR doesn't apply uniqueness
 retroactively.
5> Your secondary sort looks like it's on a tokenized
 field (again guessing, you haven't provided your
 schema definitions). It should not be. NOTE: this
 is different than multivalued! Again, I doubt this
  has anything to do with your duplicate issue, but
 it'll make your sorting "interesting".

Again, I think the sorting is unrelated to your underlying
duplication issue, so until you're sure your index is in the
state you think it's in, I'd ignore sorting.

Best
Erick
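
For point 5, the usual remedy is to sort on an untokenized copy of the tokenized field via copyField; a schema sketch (field names here are illustrative, not taken from Stanislaw's schema):

```xml
<!-- Sketch: sort on an untokenized string copy of the tokenized name field.
     Field names are illustrative, not from the original schema. -->
<field name="name"      type="text"   indexed="true" stored="true"/>
<field name="name_sort" type="string" indexed="true" stored="false"/>
<copyField source="name" dest="name_sort"/>
```

Queries then sort on name_sort while still searching and displaying name.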

On Mon, Sep 13, 2010 at 5:56 AM, Stanislaw wrote:

> Hi Dennis,
> thanks for reply.
> Please explain which filter you mean.
>
> I'm searching only on one field with names:
> query.setQuery(suchstring);
>
> then I'm adding two sortings on another fields:
> query.addSortField("type", SolrQuery.ORDER.asc);
> query.addSortField("sortName", SolrQuery.ORDER.asc);
>
> the results should be sorted first by 'type' (only one letter 'A'
> or 'B')
> and then they should be sorted by names
>
> how can I define 'OR' or 'AND' relations here?
>
> Best regards,
> Stanislaw


stopwords in AND clauses

2010-09-13 Thread Xavier Noria
Let's suppose we have a regular search field body_t, and an internal
boolean flag flag_t not exposed to the user.

I'd like

body_t:foo AND flag_t:true

to be an intersection, but if "foo" is a stopword I get all documents
for which flag_t is true, as if the first clause was dropped, or if
technically all documents match an empty string.

Is there a way to get 0 results instead?


Re: stopwords in AND clauses

2010-09-13 Thread Simon Willnauer
On Mon, Sep 13, 2010 at 3:27 PM, Xavier Noria  wrote:
> Let's suppose we have a regular search field body_t, and an internal
> boolean flag flag_t not exposed to the user.
>
> I'd like
>
>    body_t:foo AND flag_t:true

this is solr right? why don't you use a filter query for your unexposed
flag_t field: q=body_t:foo&fq=flag_t:true
this might help too: http://wiki.apache.org/solr/CommonQueryParameters#fq

simon
>
> to be an intersection, but if "foo" is a stopword I get all documents
> for which flag_t is true, as if the first clause was dropped, or if
> technically all documents match an empty string.
>
> Is there a way to get 0 results instead?
>
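
Simon's filter-query suggestion amounts to splitting the request parameters like this (a sketch; host and core path are hypothetical, and a real client must still URL-encode the values):

```java
// Sketch of the suggestion above: keep the user's text in q and move the
// internal flag into fq. Filter queries are cached separately and do not
// affect relevance scoring. Host and core path are hypothetical, and the
// values are not URL-encoded here for readability.
public class FilterQueryExample {
    static String buildQueryString(String userQuery) {
        return "q=" + userQuery + "&fq=flag_t:true";
    }

    public static void main(String[] args) {
        System.out.println("http://localhost:8983/solr/select?"
                + buildQueryString("body_t:foo"));
    }
}
```

With the flag in fq, an all-stopwords q yields zero results instead of every flagged document.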


Re: stopwords in AND clauses

2010-09-13 Thread Xavier Noria
On Mon, Sep 13, 2010 at 4:29 PM, Simon Willnauer
 wrote:

> On Mon, Sep 13, 2010 at 3:27 PM, Xavier Noria  wrote:
>> Let's suppose we have a regular search field body_t, and an internal
>> boolean flag flag_t not exposed to the user.
>>
>> I'd like
>>
>>    body_t:foo AND flag_t:true
>
> this is solr right? why don't you use a filter query for your unexposed
> flag_t field: q=body_t:foo&fq=flag_t:true
> this might help too: http://wiki.apache.org/solr/CommonQueryParameters#fq

Sounds good.


Re: mm=0?

2010-09-13 Thread Satish Kumar
Hi Erik,

I completely agree with you that showing a random document for user's query
would be very poor experience. I have raised this in our product review
meetings before. I was told that because of contractual agreement some
sponsored content needs to be returned even if it meant no match. And the
sponsored content drives the ads displayed on the page-- so it is more for
showing some ad on the page when there is no matching result from sponsored
content for user's query.

Note that some other content in addition to sponsored content is displayed
on the page, so user is not seeing just one random result when there is not
a good match.

It looks like I have to do another search to get a random result when there
are no results. In this case I will use RandomSortField to generate random
result (so that a different ad is displayed from set of sponsored ads) for
each no result case.

Thanks for the comments!


Satish
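
The RandomSortField fallback described above can be requested like this (a sketch; the random_* dynamic field is the one defined in Solr's example schema, so treat its presence as an assumption about your schema):

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch: when the sponsored-content query returns nothing, re-run it sorted
// by a random_* dynamic field (solr.RandomSortField in the example schema).
// A different seed in the field name yields a different ordering, so a
// different ad comes back for each no-result case.
public class RandomAdFallback {
    static String randomSortParam(int seed) {
        return "random_" + seed + " asc";
    }

    public static void main(String[] args) {
        int seed = ThreadLocalRandom.current().nextInt(1, 10_000);
        System.out.println("sort=" + randomSortParam(seed));
    }
}
```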



On Sun, Sep 12, 2010 at 10:25 AM, Erick Erickson wrote:

> Could you explain the use-case a bit? Because the very
> first response I would have is "why in the world did
> product management make this a requirement" and try
> to get the requirement changed
>
> As a user, I'm having a hard time imagining being well
> served by getting a document in response to a search that
> had no relation to my search, it was just a random doc
> selected from the corpus.
>
> All that said, I don't think a single query would do the trick.
> You could include a "very special" document with a field
> that no other document had with very special text in it. Say
> field name "bogusmatch", filled with the text "bogustext"
> then, at least the second query would match one and only
> one document and would take minimal time. Or you could
> tack on to each and every query "OR bogusmatch:bogustext^0.001"
> (which would really be inexpensive) and filter it out if there
> was more than one response. By boosting it really low, it should
> always appear at the end of the list which wouldn't be a bad thing.
>
> DisMax might help you here...
>
> But do ask if it is really a requirement or just something nobody's
> objected to before bothering IMO...
>
> Best
> Erick
>
> On Sat, Sep 11, 2010 at 1:10 PM, Satish Kumar <
> satish.kumar.just.d...@gmail.com> wrote:
>
> > Hi,
> >
> > We have a requirement to show at least one result every time -- i.e.,
> even
> > if user entered term is not found in any of the documents. I was hoping
> > setting mm to 0 will return results in all cases, but it is not.
> >
> > For example, if user entered term "alpha" and it is *not* in any of the
> > documents in the index, any document in the index can be returned. If
> term
> > "alpha" is in the document set, documents having the term "alpha" only
> must
> > be returned.
> >
> > My idea so far is to perform a search using user entered term. If there
> are
> > any results, return them. If there are no results, perform another search
> > without the query term-- this means doing two searches. Any suggestions
> on
> > implementing this requirement using only one search?
> >
> >
> > Thanks,
> > Satish
> >
>
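
Erick's "bogus match" fallback above can be sketched as plain query assembly (field and term names are the ones from his reply; the tiny boost pushes the bogus document to the end of the results):

```java
// Sketch of the single-query fallback quoted above: OR in a clause that
// matches exactly one special "bogus" document with a very low boost, then
// discard that document client-side whenever real hits came back.
// bogusmatch/bogustext are the names Erick suggested, not real schema fields.
public class BogusMatchFallback {
    static String withFallback(String userQuery) {
        return "(" + userQuery + ") OR bogusmatch:bogustext^0.001";
    }

    public static void main(String[] args) {
        System.out.println(withFallback("alpha"));
    }
}
```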


Re: Sorting not working on a string field

2010-09-13 Thread noel
You're right, it would be better to just give it a sortable numerical value. 
For now I gave time_code an sdouble type to see if it sorted, and it did.
However, all the trailing 0's are trimmed, but that shouldn't be a problem unless
it were to truncate any values past the hundreds column.

Thanks.
- Noel

-Original Message-
From: "Jan Høydahl / Cominvent" 
Sent: Monday, September 13, 2010 5:31am
To: solr-user@lucene.apache.org
Subject: Re: Sorting not working on a string field

Hi,

Can you show us what result you actually get? Wouldn't it make more sense to
choose a numeric fieldtype? To get proper sort order of numbers in a string
field, all numbers need to be exactly the same length, since the order will be
lexicographical, i.e. "10" will come before "2", but after "02".
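
Jan's point about lexicographic order can be checked in a few lines (plain Java; the six-decimal format matches the time_code values from the thread, and the total width of 12 is an arbitrary choice):

```java
import java.util.Locale;

// Illustration of the padding point: string comparison is lexicographic, so
// unpadded numbers sort wrongly ("10.0" before "2.0") while zero-padded
// ones sort in numeric order.
public class PaddedSortExample {
    static String pad(double timeCode) {
        // e.g. 2.0 -> "00002.000000"; Locale.ROOT keeps '.' as the separator
        return String.format(Locale.ROOT, "%012.6f", timeCode);
    }

    public static void main(String[] args) {
        System.out.println("10.0".compareTo("2.0") < 0);   // lexicographic surprise
        System.out.println(pad(2.0).compareTo(pad(10.0)) < 0); // numeric order restored
    }
}
```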

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 10. sep. 2010, at 19.14, n...@frameweld.com wrote:

> Hello, I seem to be having a problem with sorting. I have a string field 
> (time_code) that I want to order by. When the results come up, it displays 
> the results differently from relevance which I would assume, but the results 
> aren't ordered. The data in time_code came from a numeric decimal with a six 
> digit precision if that makes a difference(ex: 1.00).
> 
> Here is the query I give it:
> 
> q=ceremony+AND+presentation_id%3A296+AND+type%3Ablob&version=1.3&json.nl=map&rows=10&start=0&wt=json&hl=true&hl.fl=text&hl.simple.pre=&hl.simple.post=<%2Fspan>&hl.fragsize=0&hl.mergeContiguous=false&&sort=time_code+asc
> 
> 
> And here's the field schema:
> 
> 
> 
> 
>  multiValued="true"/>
> 
> 
> 
>  allowDups="true" multiValued="true"/>
> 
>  allowDups="true"/>
> 
> 
> Thanks for any help.
> 





Re: Multiple sorting on text fields

2010-09-13 Thread Dennis Gearon
I thought I saw 'custom analyzer', but you wrote 'custom field'.

My mistake.

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php




Re: mm=0?

2010-09-13 Thread Dennis Gearon
This issue is one I hope to head off in my application / on my site. Instead of 
an ad feed, I HOPE to be able to have an ad QUEUE on my site. If necessary, 
I'll convert the feed TO a queue.

The queue will get a first pass done on it by either an employee or a 
compensated user. Either one generates up to 4 keywords/tags for the 
advertisement. THEY determine when the ad gets shown based on relevancy.

Nice idea, hope it'll fly :-)

I actually detest the ads that say 'Lucene instance for sale, lowest prices!',
or the industrial clearing houses that make you wade through 4-6 screens to
find that you need a membership in order to look up the price of some stainless
steel nuts.

And usually, those ads must be paying top dollar, because they are the first
three ads on google's search (that is, until recently). Anyone notice that
there are hardly any ads on google search results anymore?


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php




Re: mm=0?

2010-09-13 Thread Dennis Gearon
I just tried several searches again on google.

I think they've refined the ads placements so that certain kind of searches 
return no ads, the kinds that I've been doing relative to programming being one 
of them.

If OTOH I do some product related search, THEN lots of ads show up, but fairly 
accurate ones.

They've improved the ads placement a LOT!

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php




Re: Tuning Solr caches with high commit rates (NRT)

2010-09-13 Thread Dennis Gearon
Thanks guys for the explanation.

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Mon, 9/13/10, Simon Willnauer  wrote:

> From: Simon Willnauer 
> Subject: Re: Tuning Solr caches with high commit rates (NRT)
> To: solr-user@lucene.apache.org
> Date: Monday, September 13, 2010, 1:33 AM
> On Mon, Sep 13, 2010 at 8:02 AM,
> Dennis Gearon 
> wrote:
> > BTW, what is a segment?
> 
> On the Lucene level an index is composed of one or more
> index
> segments. Each segment is an index by itself and consists
> of several
> files like doc stores, proximity data, term dictionaries
> etc. During
> indexing Lucene / Solr creates those segments depending on
> ram buffer
> / document buffer settings and flushes them to disk (if you
> index to
> disk). Once a segment has been flushed Lucene will never
> change the
> segments (well up to a certain level - lets keep this
> simple) but
> write new ones for new added documents. Since segments have
> a
> write-once policy Lucene merges multiple segments into a
> new segment
> (how and when this happens is a different story) from time to
> time to
> get rid of deleted documents and to reduce the number of
> overall
> segments in the index.
> Generally a higher number of segments will also influence
> your search
> performance since Lucene performs almost all operations on
> a
> per-segment level. If you want to reduce the number of
> segment to one
> you need to call optimize and lucene will merge all
> existing ones into
> one single segment.
> 
> hope that answers your question
> 
> simon
> >
> > I've only heard about them in the last 2 weeks here on
> the list.
> > Dennis Gearon
> >
> > Signature Warning
> > 
> > EARTH has a Right To Life,
> >  otherwise we all die.
> >
> > Read 'Hot, Flat, and Crowded'
> > Laugh at http://www.yert.com/film.php
> >
> >
> > --- On Sun, 9/12/10, Jason Rutherglen 
> wrote:
> >
> >> From: Jason Rutherglen 
> >> Subject: Re: Tuning Solr caches with high commit
> rates (NRT)
> >> To: solr-user@lucene.apache.org
> >> Date: Sunday, September 12, 2010, 7:52 PM
> >> Yeah there's no patch... I think
> >> Yonik can write it. :-)  Yah... The
> >> Lucene version shouldn't matter.  The
> distributed
> >> faceting
> >> theoretically can easily be applied to multiple
> segments,
> >> however the
> >> way it's written for me is a challenge to untangle
> and
> >> apply
> >> successfully to a working patch.  Also I don't
> have
> >> this as an itch to
> >> scratch at the moment.
> >>
> >> On Sun, Sep 12, 2010 at 7:18 PM, Peter Sturge
> 
> >> wrote:
> >> > Hi Jason,
> >> >
> >> > I've tried some limited testing with the 4.x
> trunk
> >> using fcs, and I
> >> > must say, I really like the idea of
> per-segment
> >> faceting.
> >> > I was hoping to see it in 3.x, but I don't
> see this
> >> option in the
> >> > branch_3x trunk. Is your SOLR-1606 patch
> referred to
> >> in SOLR-1617 the
> >> > one to use with 3.1?
> >> > There seems to be a number of Solr issues
> tied to this
> >> - one of them
> >> > being Lucene-1785. Can the per-segment
> faceting patch
> >> work with Lucene
> >> > 2.9/branch_3x?
> >> >
> >> > Thanks,
> >> > Peter
> >> >
> >> >
> >> >
> >> > On Mon, Sep 13, 2010 at 12:05 AM, Jason
> Rutherglen
> >> > 
> >> wrote:
> >> >> Peter,
> >> >>
> >> >> Are you using per-segment faceting, eg,
> SOLR-1617?
> >>  That could help
> >> >> your situation.
> >> >>
> >> >> On Sun, Sep 12, 2010 at 12:26 PM, Peter
> Sturge
> >> 
> >> wrote:
> >> >>> Hi,
> >> >>>
> >> >>> Below are some notes regarding Solr
> cache
> >> tuning that should prove
> >> >>> useful for anyone who uses Solr with
> frequent
> >> commits (e.g. <5min).
> >> >>>
> >> >>> Environment:
> >> >>> Solr 1.4.1 or branch_3x trunk.
> >> >>> Note the 4.x trunk has lots of neat
> new
> >> features, so the notes here
> >> >>> are likely less relevant to the 4.x
> >> environment.
> >> >>>
> >> >>> Overview:
> >> >>> Our Solr environment makes extensive
> use of
> >> faceting, we perform
> >> >>> commits every 30secs, and the indexes
> tend be
> >> on the large-ish side
> >> >>> (>20million docs).
> >> >>> Note: For our data, when we commit,
> we are
> >> always adding new data,
> >> >>> never changing existing data.
> >> >>> This type of environment can be
> tricky to
> >> tune, as Solr is more geared
> >> >>> toward fast reads than frequent
> writes.
> >> >>>
> >> >>> Symptoms:
> >> >>> If anyone has used faceting in
> searches where
> >> you are also performing
> >> >>> frequent commits, you've likely
> encountered
> >> the dreaded OutOfMemory or
> >> >>> GC Overhead Exeeded errors.
> >> >>> In high commit rate environments,
> this is
> >> almost always due to
> >> >>> multiple 'onDeck' searchers and
> autowarming -
> >> i.e. new searchers don't
> >> >>> finish autowarming their caches
> before the
> >> next commit()
> >> >>> comes along and invalidates them.
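
Simon's write-once description can be pictured with a toy model (this only illustrates the policy, not how Lucene actually stores segments):

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of the write-once policy described above: a flushed
// segment is never modified; new documents land in new segments, and
// "optimize" merges everything down to one segment. Not real Lucene code.
public class SegmentModel {
    final List<List<String>> segments = new ArrayList<>();

    void flush(List<String> docs) {
        segments.add(List.copyOf(docs)); // immutable: write-once
    }

    void optimize() {
        List<String> merged = new ArrayList<>();
        segments.forEach(merged::addAll); // merge all segments...
        segments.clear();
        segments.add(List.copyOf(merged)); // ...into a single new one
    }
}
```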

Re: what differents between SolrCloud and Solr+Hadoop

2010-09-13 Thread Lance Norskog
You do not need either addition if you just want to have multiple Solr
instances on different machines, and query them all at once. Look at
this for the simplest way:

http://wiki.apache.org/solr/DistributedSearch
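
As that wiki page shows, a plain distributed query is just the usual request sent to one node with a shards parameter listing every shard; a sketch with hypothetical host names:

```java
import java.util.List;

// Sketch of simple distributed search: one node receives the query and
// fans it out to the cores named in the shards parameter. Host names and
// core paths here are hypothetical.
public class ShardsParamExample {
    static String shardsParam(List<String> shards) {
        return "shards=" + String.join(",", shards);
    }

    public static void main(String[] args) {
        List<String> shards = List.of("host1:8983/solr", "host2:8983/solr");
        System.out.println("http://host1:8983/solr/select?q=*:*&"
                + shardsParam(shards));
    }
}
```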

On Mon, Sep 13, 2010 at 12:52 AM, Marc Sturlese  wrote:
>
> Well these are pretty different things. SolrCloud is meant to handle
> distributed search in an easier way than "raw" solr distributed search.
> You have to build the shards in your own way.
> Solr+hadoop is a way to build these shards/indexes in paralel.
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/what-differents-between-SolrCloud-and-Solr-Hadoop-tp1463809p1464106.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: mm=0?

2010-09-13 Thread Lance Norskog
"Java Swing" no longer gives ads for "swinger's clubs".

On Mon, Sep 13, 2010 at 9:37 AM, Dennis Gearon  wrote:
> I just tried several searches again on google.
>
> I think they've refined the ads placements so that certain kind of searches 
> return no ads, the kinds that I've been doing relative to programming being 
> one of them.
>
> If OTOH I do some product related search, THEN lots of ads show up, but 
> fairly accurate ones.
>
> They've improved the ads placement a LOT!
>
> Dennis Gearon
>
> Signature Warning
> 
> EARTH has a Right To Life,
>  otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
>
>
> --- On Mon, 9/13/10, Satish Kumar  wrote:
>
>> From: Satish Kumar 
>> Subject: Re: mm=0?
>> To: solr-user@lucene.apache.org
>> Date: Monday, September 13, 2010, 7:41 AM
>> Hi Erik,
>>
>> I completely agree with you that showing a random document
>> for user's query
>> would be very poor experience. I have raised this in our
>> product review
>> meetings before. I was told that because of contractual
>> agreement some
>> sponsored content needs to be returned even if it meant no
>> match. And the
>> sponsored content drives the ads displayed on the page-- so
>> it is more for
>> showing some ad on the page when there is no matching
>> result from sponsored
>> content for user's query.
>>
>> Note that some other content in addition to sponsored
>> content is displayed
>> on the page, so user is not seeing just one random result
>> when there is not
>> a good match.
>>
>> It looks like I have to do another search to get a random
>> result when there
>> are no results. In this case I will use RandomSortField to
>> generate random
>> result (so that a different ad is displayed from set of
>> sponsored ads) for
>> each no result case.
>>
>> Thanks for the comments!
>>
>>
>> Satish
>>
>>
>>
>> On Sun, Sep 12, 2010 at 10:25 AM, Erick Erickson 
>> wrote:
>>
>> > Could you explain the use-case a bit? Because the
>> very
>> > first response I would have is "why in the world did
>> > product management make this a requirement" and try
>> > to get the requirement changed
>> >
>> > As a user, I'm having a hard time imagining being
>> well
>> > served by getting a document in response to a search
>> that
>> > had no relation to my search, it was just a random
>> doc
>> > selected from the corpus.
>> >
>> > All that said, I don't think a single query would do
>> the trick.
>> > You could include a "very special" document with a
>> field
>> > that no other document had with very special text in
>> it. Say
>> > field name "bogusmatch", filled with the text
>> "bogustext"
>> > then, at least the second query would match one and
>> only
>> > one document and would take minimal time. Or you
>> could
>> > tack on to each and every query "OR
>> bogusmatch:bogustext^0.001"
>> > (which would really be inexpensive) and filter it out
>> if there
>> > was more than one response. By boosting it really low,
>> it should
>> > always appear at the end of the list which wouldn't be
>> a bad thing.
>> >
>> > DisMax might help you here...
>> >
>> > But do ask if it is really a requirement or just
>> something nobody's
>> > objected to before bothering IMO...
>> >
>> > Best
>> > Erick
>> >
>> > On Sat, Sep 11, 2010 at 1:10 PM, Satish Kumar <
>> > satish.kumar.just.d...@gmail.com>
>> wrote:
>> >
>> > > Hi,
>> > >
>> > > We have a requirement to show at least one result
>> every time -- i.e.,
>> > even
>> > > if user entered term is not found in any of the
>> documents. I was hoping
>> > > setting mm to 0 will return results in all cases,
>> but it is not.
>> > >
>> > > For example, if user entered term "alpha" and it
>> is *not* in any of the
>> > > documents in the index, any document in the index
>> can be returned. If
>> > term
>> > > "alpha" is in the document set, documents having
>> the term "alpha" only
>> > must
>> > > be returned.
>> > >
>> > > My idea so far is to perform a search using user
>> entered term. If there
>> > are
>> > > any results, return them. If there are no
>> results, perform another search
>> > > without the query term-- this means doing two
>> searches. Any suggestions
>> > on
>> > > implementing this requirement using only one
>> search?
>> > >
>> > >
>> > > Thanks,
>> > > Satish
>> > >
>> >
>>
>



-- 
Lance Norskog
goks...@gmail.com
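Erick's "bogus match" idea quoted above can be wrapped in a small helper: append a very-low-boost clause that matches exactly one sentinel document, so every query returns at least one hit, and drop the sentinel whenever real hits exist. A hedged sketch - the `bogusmatch`/`bogustext` names are the hypothetical ones from the thread, not a real schema:

```java
public class FallbackQuery {
    // Append the sentinel clause; with the ^0.001 boost the bogus doc
    // always sorts to the end of the result list.
    static String withFallback(String userQuery) {
        return "(" + userQuery + ") OR bogusmatch:bogustext^0.001";
    }

    // If more than one doc matched, real hits exist and the sentinel
    // should be filtered out of the display.
    static boolean keepSentinel(long numFound) {
        return numFound <= 1;
    }

    public static void main(String[] args) {
        System.out.println(withFallback("alpha"));
        // -> (alpha) OR bogusmatch:bogustext^0.001
    }
}
```

This keeps the "always show something" contract in a single query, at the cost of one dummy document in the index.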


Re: mm=0?

2010-09-13 Thread Simon Willnauer
On Mon, Sep 13, 2010 at 8:07 PM, Lance Norskog  wrote:
> "Java Swing" no longer gives ads for "swinger's clubs".
damned, now I have to explicitly enter it?! - argh!

:)

simon
>
> On Mon, Sep 13, 2010 at 9:37 AM, Dennis Gearon  wrote:
>> I just tried several searches again on google.
>>
>> I think they've refined the ads placements so that certain kind of searches 
>> return no ads, the kinds that I've been doing relative to programming being 
>> one of them.
>>
>> If OTOH I do some product related search, THEN lots of ads show up, but 
>> fairly accurate ones.
>>
>> They've improved the ads placement a LOT!
>>
>> Dennis Gearon
>>
>> Signature Warning
>> 
>> EARTH has a Right To Life,
>>  otherwise we all die.
>>
>> Read 'Hot, Flat, and Crowded'
>> Laugh at http://www.yert.com/film.php
>>
>>
>> --- On Mon, 9/13/10, Satish Kumar  wrote:
>>
>>> From: Satish Kumar 
>>> Subject: Re: mm=0?
>>> To: solr-user@lucene.apache.org
>>> Date: Monday, September 13, 2010, 7:41 AM
>>> Hi Erik,
>>>
>>> I completely agree with you that showing a random document
>>> for user's query
>>> would be very poor experience. I have raised this in our
>>> product review
>>> meetings before. I was told that because of contractual
>>> agreement some
>>> sponsored content needs to be returned even if it meant no
>>> match. And the
>>> sponsored content drives the ads displayed on the page-- so
>>> it is more for
>>> showing some ad on the page when there is no matching
>>> result from sponsored
>>> content for user's query.
>>>
>>> Note that some other content in addition to sponsored
>>> content is displayed
>>> on the page, so user is not seeing just one random result
>>> when there is not
>>> a good match.
>>>
>>> It looks like I have to do another search to get a random
>>> result when there
>>> are no results. In this case I will use RandomSortField to
>>> generate random
>>> result (so that a different ad is displayed from set of
>>> sponsored ads) for
>>> each no result case.
>>>
>>> Thanks for the comments!
>>>
>>>
>>> Satish
>>>
>>>
>>>
>>> On Sun, Sep 12, 2010 at 10:25 AM, Erick Erickson 
>>> wrote:
>>>
>>> > Could you explain the use-case a bit? Because the
>>> very
>>> > first response I would have is "why in the world did
>>> > product management make this a requirement" and try
>>> > to get the requirement changed
>>> >
>>> > As a user, I'm having a hard time imagining being
>>> well
>>> > served by getting a document in response to a search
>>> that
>>> > had no relation to my search, it was just a random
>>> doc
>>> > selected from the corpus.
>>> >
>>> > All that said, I don't think a single query would do
>>> the trick.
>>> > You could include a "very special" document with a
>>> field
>>> > that no other document had with very special text in
>>> it. Say
>>> > field name "bogusmatch", filled with the text
>>> "bogustext"
>>> > then, at least the second query would match one and
>>> only
>>> > one document and would take minimal time. Or you
>>> could
>>> > tack on to each and every query "OR
>>> bogusmatch:bogustext^0.001"
>>> > (which would really be inexpensive) and filter it out
>>> if there
>>> > was more than one response. By boosting it really low,
>>> it should
>>> > always appear at the end of the list which wouldn't be
>>> a bad thing.
>>> >
>>> > DisMax might help you here...
>>> >
>>> > But do ask if it is really a requirement or just
>>> something nobody's
>>> > objected to before bothering IMO...
>>> >
>>> > Best
>>> > Erick
>>> >
>>> > On Sat, Sep 11, 2010 at 1:10 PM, Satish Kumar <
>>> > satish.kumar.just.d...@gmail.com>
>>> wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > > We have a requirement to show at least one result
>>> every time -- i.e.,
>>> > even
>>> > > if user entered term is not found in any of the
>>> documents. I was hoping
>>> > > setting mm to 0 will return results in all cases,
>>> but it is not.
>>> > >
>>> > > For example, if user entered term "alpha" and it
>>> is *not* in any of the
>>> > > documents in the index, any document in the index
>>> can be returned. If
>>> > term
>>> > > "alpha" is in the document set, documents having
>>> the term "alpha" only
>>> > must
>>> > > be returned.
>>> > >
>>> > > My idea so far is to perform a search using user
>>> entered term. If there
>>> > are
>>> > > any results, return them. If there are no
>>> results, perform another search
>>> > > without the query term-- this means doing two
>>> searches. Any suggestions
>>> > on
>>> > > implementing this requirement using only one
>>> search?
>>> > >
>>> > >
>>> > > Thanks,
>>> > > Satish
>>> > >
>>> >
>>>
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: How to Update Value of One Field of a Document in Index?

2010-09-13 Thread Zachary Chang

 Hi Savannah,

if you *only want to boost* documents based on the information you 
calculate from the MoreLikeThis results (i.e. numeric measure), you 
might want to take a look at the ExternalFileField type. This field type 
reads its contents from a file which contains key-value pairs, e.g. the 
document ids and their corresponding measure values.
If some values change you still have to regenerate the whole file 
(instead of the whole index). But of course, this file can be generated 
from a DB, which might be updated incrementally.


For setup and usage e.g. see: 
http://dev.tailsweep.com/solr-external-scoring/


Zachary

On 10.09.2010 19:57, Savannah Beckett wrote:

I want to do MoreLikeThis to find documents that are similar to the document
that I am indexing.  Then I want to calculate the average of one of the fields
of all those documents and input this average into a field of the document that
I am indexing.  From my research, it seems that MoreLikeThis can only be used to
find documents similar to one that is already in the index.  So, I think I need to
index it first, and then use MoreLikeThis to find similar documents in the index
and then reindex that document.  Any better way?  I try not to reindex a
document because it's not efficient.  I don't have to use MoreLikeThis.
Thanks.




From: Jonathan Rochkind
To: "solr-user@lucene.apache.org"
Sent: Fri, September 10, 2010 9:58:20 AM
Subject: RE: How to Update Value of One Field of a Document in Index?

"More like this" is intended to be run at query time. For what reasons are you
thinking you want to (re-)index each document based on the results of
MoreLikeThis?  You're right that that's not what the component is intended for.


Jonathan

From: Savannah Beckett [savannah_becket...@yahoo.com]
Sent: Friday, September 10, 2010 11:18 AM
To: solr-user@lucene.apache.org
Subject: Re: How to Update Value of One Field of a Document in Index?

Thanks.  I am trying to use MoreLikeThis in Solr to find similar documents in
the solr index and use the data from these similar documents to modify a field
in each document that I am indexing.  I found that MoreLikeThis in Solr only
works when the document is in the index, is that true?  If so, I may have to wait
til the indexing is finished, then run my own command to do MoreLikeThis to each
document in the index, and then reindex each document?  It sounds like it's not
efficient.  Is there a better way?
Thanks.





From: Liam O'Boyle
To: solr-user@lucene.apache.org
Cc: u...@nutch.apache.org
Sent: Thu, September 9, 2010 11:06:36 PM
Subject: Re: How to Update Value of One Field of a Document in Index?

Hi Savannah,

You can only reindex the entire document; if you only have the ID,
then do a search to retrieve the rest of the data, then reindex.  This
assumes that all of the fields you need to index are stored (so that
you can retrieve them) and not just indexed.

Liam

On Fri, Sep 10, 2010 at 3:29 PM, Savannah Beckett
  wrote:

I use nutch to crawl and index to Solr.  My code is working.  Now, I want to
update the value of one of the fields of a document in the solr index after

the

document was already indexed, and I have only the document id.  How do I do
that?

Thanks.








__
Do You Yahoo!?
Tired of spam? Yahoo! Mail offers outstanding protection against bulk mail. 
http://mail.yahoo.com 
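An ExternalFileField data file is just lines of `id=value` pairs keyed by the unique-key field, regenerated whenever the measures change. A sketch of producing that format in plain Java (the doc ids and values are illustrative; see the linked article for file naming and schema setup):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExternalScores {
    // Render doc-id -> measure pairs in the one-pair-per-line
    // "id=value" format an ExternalFileField reads.
    static String render(Map<String, Double> scores) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("doc1", 0.85);
        scores.put("doc2", 0.42);
        System.out.print(render(scores));
        // doc1=0.85
        // doc2=0.42
    }
}
```

Writing this string to the external-field file in the index directory (and reloading) updates the boosts without touching the index itself.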


RE: Solr memory use, jmap and TermInfos/tii

2010-09-13 Thread Burton-West, Tom
Thanks Robert and everyone!

I'm working on changing our JVM settings today, since putting Solr 1.4.1 into 
production will take a bit more work and testing.  Hopefully, I'll be able to 
test the setTermIndexDivisor on our test server tomorrow.

Mike, I've started the process to see if we can provide you with our tii/tis 
data.  I'll let you know as soon as I hear anything.  


Tom

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Sunday, September 12, 2010 10:48 AM
To: solr-user@lucene.apache.org; simon.willna...@gmail.com
Subject: Re: Solr memory use, jmap and TermInfos/tii

On Sun, Sep 12, 2010 at 9:57 AM, Simon Willnauer <
simon.willna...@googlemail.com> wrote:

> > To change the divisor in your solrconfig, for example to 4, it looks like
> > you need to do this.
> >
> >  <indexReaderFactory name="IndexReaderFactory"
> >      class="org.apache.solr.core.StandardIndexReaderFactory">
> >    <int name="termIndexDivisor">4</int>
> >  </indexReaderFactory>
>
> Ah, thanks robert! I didn't know about that one either!
>
> simon


actually I'm wrong, for solr 1.4, use "setTermIndexDivisor".

i was looking at 3.1/trunk and there is a bug in the name of this parameter:
https://issues.apache.org/jira/browse/SOLR-2118

-- 
Robert Muir
rcm...@gmail.com


RE: Solr and jvm Garbage Collection tuning

2010-09-13 Thread Burton-West, Tom
Thanks Kent for your info.  

We are not doing any faceting, sorting, or much else.  My guess is that most of 
the memory increase is just the data structures created when parts of the frq 
and prx files get read into memory.  Our frq files are about 77GB  and the prx 
files are about 260GB per shard and we are running 3 shards per machine.   I 
suspect that the document cache and query result cache don't take up that much 
space, but will try a run with those caches set to 0, just to see.

We have dual 4 core processors and 74GB total memory.  We want to leave a 
significant amount of memory free for OS disk caching. 

We tried increasing the memory from 20GB to 28GB and adding the 
-XX:MaxGCPauseMillis=1000 flag, but that seemed to have no effect.  

Currently I'm testing using the ConcurrentMarkSweep and that's looking much 
better although I don't understand why it has sized the Eden space down into 
the 20MB range. However, I am very new to Java memory management.

Anyone know whether, when using ConcurrentMarkSweep, it's better to let the JVM 
size the Eden space or to give it some hints?


Once we get some decent JVM settings we can put into production I'll be testing 
using termIndexInterval with Solr 1.4.1 on our test server.

Tom

-Original Message-
From: Grant Ingersoll [mailto:gsing...@apache.org] 

> What are your current GC settings?  Also, I guess I'd look at ways you can 
> reduce the heap size needed: caching, field type choices, faceting choices.  
> Also could try playing with the termIndexInterval which will load fewer terms 
> into memory at the cost of longer seeks. 
>
> At some point, though, you just may need more shards and the resulting 
> smaller indexes.  How many CPU cores do you have on each machine?


Field names

2010-09-13 Thread Peter A. Kirk
Hi

is it possible to issue a query to solr, to get a list which contains all the 
field names in the index?

What about getting a list of the frequency of individual words in each field?

thanks,
Peter


Re: How to extend IndexSchema and SchemaField

2010-09-13 Thread Chris Hostetter

: Yes, I have thought of that, or even extending field type. But this does not
: work for my use case, since I can have multiple fields of a same type
: (therefore with the same field type, and same analyzer), but each one of them
: needs specific information. Therefore, I think the only "nice" way to achieve
: this is to have the possibility to add attributes to any field definition.

Right, at the moment custom FieldType classes can specify whatever 
attributes they want to use in the <fieldtype> declaration -- but it's 
not possible to specify arbitrary attributes that can be used in the 
<field> declaration.

By all means, please open an issue requesting this as a feature.

I don't know that anyone explicitly set out to impose this limitation, but 
one of the reasons it likely exists is because SchemaField is not 
something that is intended to be customized -- while FieldType 
objects are constructed once at startup, SchemaField objects are 
frequently created on the fly when dealing with dynamicFields, so 
initialization complexity is kept to a minimum.  

That said -- this definitely seems like the type of use case that we 
should try to find *some* solution for -- even if it just means having 
Solr automatically create hidden FieldType instances for you on startup 
based on attributes specified in the <field> that the corresponding 
FieldType class understands.


-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: Need Advice for Finding Freelance Solr Expert

2010-09-13 Thread Chris Hostetter

: References: 
:  <4c881061.60...@jhu.edu>
: 
: In-Reply-To:
: 
: Subject: Need Advice for Finding Freelance Solr Expert

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: Field names

2010-09-13 Thread Ryan McKinley
check:
http://wiki.apache.org/solr/LukeRequestHandler



On Mon, Sep 13, 2010 at 7:00 PM, Peter A. Kirk  wrote:
> Hi
>
> is it possible to issue a query to solr, to get a list which contains all the 
> field names in the index?
>
> What about to get a list of the freqency of individual words in each field?
>
> thanks,
> Peter
>


RE: Field names

2010-09-13 Thread Peter A. Kirk
Fantastic - that is exactly what I was looking for!

But here is one thing I don't understand:

If I call the url:
http://localhost:8983/solr/admin/luke?numTerms=10&fl=name

Some of the result looks like:

<lst name="fields">
  <lst name="name">
    <lst name="topTerms">
      <int name="gb">18</int>

Does this mean that the term "gb" occurs 18 times in the name field?

Because if I issue this search:
http://localhost:8983/solr/select/?q=name:gb

I get results like:

  

So it only finds 9?

What do the above results actually tell me?

Thanks,
Peter


From: Ryan McKinley [ryan...@gmail.com]
Sent: Tuesday, 14 September 2010 11:30
To: solr-user@lucene.apache.org
Subject: Re: Field names

check:
http://wiki.apache.org/solr/LukeRequestHandler



On Mon, Sep 13, 2010 at 7:00 PM, Peter A. Kirk  wrote:
> Hi
>
> is it possible to issue a query to solr, to get a list which contains all the 
> field names in the index?
>
> What about to get a list of the freqency of individual words in each field?
>
> thanks,
> Peter
>

Re: Solr memory use, jmap and TermInfos/tii

2010-09-13 Thread Michael McCandless
On Mon, Sep 13, 2010 at 6:29 PM, Burton-West, Tom  wrote:
> Thanks Robert and everyone!
>
> I'm working on changing our JVM settings today, since putting Solr 1.4.1 into 
> production will take a bit more work and testing.  Hopefully, I'll be able to 
> test the setTermIndexDivisor on our test server tomorrow.
>
> Mike, I've started the process to see if we can provide you with our tii/tis 
> data.  I'll let you know as soon as I hear anything.

Super, thanks Tom!

Mike


Re: Distance sorting with spatial filtering

2010-09-13 Thread Scott K
I tracked down the problem and found a workaround. If there is a
wildcard entry in schema.xml such as the following.

   
   

then sort by function fails and returns Error 400 can not sort on
unindexed field: 

Removing the name="*" entry from schema.xml is a workaround. I noted
this in the Solr-1297 JIRA entry.

Scott

On Fri, Sep 10, 2010 at 01:40, Lance Norskog  wrote:
> Since no one has jumped in to give the right syntax- yeah, it's a bug.
> Please file a JIRA.
>
> On Thu, Sep 9, 2010 at 9:44 PM, Scott K  wrote:
>> On Thu, Sep 9, 2010 at 21:00, Lance Norskog  wrote:
>>> I just checked out the trunk, and branch 3.x This query is accepted on both,
>>> but gives no responses:
>>> http://localhost:8983/solr/select/?q=*:*&sort=dist(2,x_dt,y_dt,0,0)+asc
>>
>> So you are saying when you add the sort parameter you get no results
>> back, but do not get the error I am seeing? Should I open a Jira
>> ticket?
>>
>>> x_dt and y_dt are wildcard fields with the tdouble type. "tdouble"
>>> explicitly says it is stored and indexed. Your 'longitude' and 'latitude'
>>> fields may not be stored?
>>
>> No, they are stored.
>> http://localhost:8983/solr/select?q=*:*&rows=1&wt=xml&indent=true
>> <?xml version="1.0" encoding="UTF-8"?>
>> <response>
>> <lst name="responseHeader">
>>  <int name="status">0</int>
>>  <int name="QTime">9</int>
>> </lst>
>> <result name="response" numFound="..." start="0">
>>  <doc>
>> ...
>>    <double name="latitude">47.6636</double>
>>    <double name="longitude">-122.3054</double>
>>
>>
>>> Also, this is accepted on both branches:
>>> http://localhost:8983/solr/select/?q=*:*&sort=sum(1)+asc
>>>
>>> The documentation for sum() does not mention single-argument calls.
>>
>> This also fails
>> http://localhost:8983/solr/select/?q=*:*&sort=sum(1,2)+asc
>> http://localhost:8983/solr/select/?q=*:*&sort=sum(latitude,longitude)+asc
>>
>>
>>> Scott K wrote:

 According to the documentation, sorting by function has been a feature
 since Solr 1.5. It seems like a major regression if this no longer
 works.
 http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function

 The _val_ trick does not seem to work if used with a query term,
 although I can try some more things to give 0 value to the query term.

 On Wed, Sep 8, 2010 at 22:21, Lance Norskog  wrote:

>
> It says that the field "sum(1)" is not indexed. You don't have a field
> called 'sum(1)'. I know there has been a lot of changes in query parsing,
> and sorting by functions may be on the list. But the _val_ trick is the
> older one and, as you noted, still works. The _val_ trick sets the
> ranking
> value to the output of the function, thus indirectly doing what sort=
> does.
>
> Lance
>
> Scott K wrote:
>
>>
>> I get the error on all functions.
>> GET 'http://localhost:8983/solr/select?q=*:*&sort=sum(1)+asc'
>> Error 400 can not sort on unindexed field: sum(1)
>>
>> I tried another nightly build from today, Sep 7th, with the same
>> results. I attached the schema.xml
>>
>> Thanks for the help!
>> Scott
>>
>> On Wed, Sep 1, 2010 at 18:43, Lance Norskog    wrote:
>>
>>
>>>
>>> Post your schema.
>>>
>>> On Mon, Aug 30, 2010 at 2:04 PM, Scott K    wrote:
>>>
>>>

 The new spatial filtering (SOLR-1586) works great and is much faster
 than fq={!frange. However, I am having problems sorting by distance.
 If I try
 GET

 'http://localhost:8983/solr/select/?q=*:*&sort=dist(2,latitude,longitude,0,0)+asc'
 I get an error:
 Error 400 can not sort on unindexed field:
 dist(2,latitude,longitude,0,0)

 I was able to work around this with
 GET 'http://localhost:8983/solr/select/?q=*:* AND _val_:"recip(dist(2,
 latitude, longitude, 0,0),1,1,1)"&fl=*,score'

 But why isn't sorting by functions working? I get this error with any
 function I try to sort on.This is a nightly trunk build from Aug 25th.
 I see SOLR-1297 was reopened, but that seems to be for edge cases.

 Second question: I am using the LatLonType from the Spatial Filtering
 wiki, http://wiki.apache.org/solr/SpatialSearch
 Are there any distance sorting functions that use this field, or do I
 need to have three indexed fields, store_lat_lon, latitude, and
 longitude, if I want both filtering and sorting by distance.

 Thanks, Scott



>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>>
>>>
>>>
>
>
>>>
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>
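For reference, `dist(2, latitude, longitude, 0, 0)` in the queries above is Solr's generalized distance function with power p=2, i.e. the Euclidean distance between the document's (latitude, longitude) and (0, 0). A plain-Java rendering of what it computes per document:

```java
public class Dist {
    // Minkowski distance with p=2 between (x1,y1) and (x2,y2),
    // matching dist(2, x, y, x2, y2) in a Solr function query.
    static double dist2(double x1, double y1, double x2, double y2) {
        double dx = x1 - x2, dy = y1 - y2;
        return Math.sqrt(dx * dx + dy * dy);
    }

    public static void main(String[] args) {
        System.out.println(dist2(3.0, 4.0, 0.0, 0.0)); // 5.0
    }
}
```

Note this is a flat-plane distance on the raw coordinate values; for true geographic distance you would want a haversine-style (great-circle) function instead.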


Re: Solr and jvm Garbage Collection tuning

2010-09-13 Thread Stephen Green
On Mon, Sep 13, 2010 at 6:45 PM, Burton-West, Tom  wrote:
> Thanks Kent for your info.
>
> We are not doing any faceting, sorting, or much else.  My guess is that most 
> of the memory increase is just the data structures created when parts of the 
> frq and prx files get read into memory.  Our frq files are about 77GB  and 
> the prx files are about 260GB per shard and we are running 3 shards per 
> machine.   I suspect that the document cache and query result cache don't 
> take up that much space, but will try a run with those caches set to 0, just 
> to see.
>
> We have dual 4 core processors and 74GB total memory.  We want to leave a 
> significant amount of memory free for OS disk caching.
>
> We tried increasing the memory from 20GB to 28GB and adding the 
> -XXMaxGCPauseMillis=1000 flag but that seemed to have no effect.
>
> Currently I'm testing using the ConcurrentMarkSweep and that's looking much 
> better although I don't understand why it has sized the Eden space down into 
> the 20MB range. However, I am very new to Java memory management.
>
> Anyone know if when using ConcurrentMarkSweep its better to let the JVM size 
> the Eden space or better to give it some hints?

Really the best thing to do is to run the system for a while with GC
logging on and then look at how often the young generation GC is
occurring.  A set of parameters like:

-verbose:gc -XX:+PrintGCTimeStamps  -XX:+PrintGCDetails

Should give you some indication how often the young gen GC is
occurring.  If it's often, you can try increasing the size of the
young generation.  The option:

-Xloggc:

will dump this information to the specified file rather than sending
it to the standard error.

I've done this a few times with a variety of systems:  some times you
want to make the young gen bigger and some times you don't.

Steve
-- 
Stephen Green
http://thesearchguy.wordpress.com


geographic sharding . . . or not

2010-09-13 Thread Dennis Gearon
Think about THE big one - google.

(First, China is left out of this example because much Chinese data is
ILLEGAL to provide for search outside of China)

If there is data generated by people in Europe, in various languages:
  1/ Is it stored close to where it is generated?
  2/ Are sharding and replication also close to where it is
generated?
  3/ How accessible IS that data to someone from the US who speaks one
of those languages?
  4/ How much is sharding and replication done AWAY from where data is
geographically generated?
  5/ What if a set of linked documents, from a relational database, has half of 
its documents in one language AND related to people/places/or things in one 
country, and half in another country and its language? There's a parent record 
for the two sets, in the country of the user originating the parent/dual sets. 
 A/ Is the parent record replicated in both countries, so that searches 
finding the child records can easily get to the parent record, vs 
transatlantic/pacific fetches?
 B/ Any thoughts about machine translation of said parent record?
 

What are people's thoughts on making sites that cater to people
interested in web pages, etc in other countries? Any examples out
there?

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


Our SOLR instance seems to be single-threading and therefore not taking advantage of its multi-proc host

2010-09-13 Thread David Crane

We are running SOLR 1.4.1 (Lucene 2.9.3) on a 2-CPU Linux host, but it seems
that only 1 CPU is ever being used. It almost seems like something is
single-threading inside the SOLR application. The CPU utilization is very
seldom over 0.9 even under load.

We are running on virtual Linux hosts and our other apps in the same cluster
are multi-threading w/o issue. Some more info on our stack and versions:

  Linux 2.6.16.33-xenU 
  Apache 2.2.3 
  Tomcat 6.0.16 
  Java SE Runtime Environment (build 1.6.0_10-ea-b11)

Has anyone else noticed this problem?

Might there be some SOLR config aspect to enable multi-threading that we're
missing? Any suggestions for troubleshooting?

Judging by SOLR's logs, we do see that multiple requests are processing
simultaneously inside SOLR so we do not believe we're sequentially feeding
requests to SOLR, ie. bottle-necking things outside of SOLR.

Thanks,
David Crane
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Our-SOLR-instance-seems-to-be-single-threading-and-therefore-not-taking-advantage-of-its-multi-proc-t-tp1470282p1470282.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Field names

2010-09-13 Thread Simon Willnauer
On Tue, Sep 14, 2010 at 1:39 AM, Peter A. Kirk  wrote:
> Fantastic - that is exactly what I was looking for!
>
> But here is one thing I don't undertstand:
>
> If I call the url:
> http://localhost:8983/solr/admin/luke?numTerms=10&fl=name
>
> Some of the result looks like:
>
> <lst name="fields">
>  <lst name="name">
>    <lst name="topTerms">
>      <int name="gb">18</int>
>
> Does this mean that the term "gb" occurs 18 times in the name field?
Yes that is the Doc Frequency of the term "gb". Remember that deleted
/ updated documents and their terms contribute to the doc frequency
until they are expunged from the index. That either happens through a
segment merge in the background or due to an explicit call to
optimize.
>
> Because if I issue this search:
> http://localhost:8983/solr/select/?q=name:gb
>
> I get results like:
> <result name="response" numFound="9" start="0">
>
> So it only finds 9?
Since the "gb" term says 18 occurrences throughout the index I suspect
you updated your docs once without optimizing, or indexed a lot of docs
without the segments being merged. Try to call optimize if you can afford it
and see if the doc-freq count goes back to 9.

simon
>
> What do the above results actually tell me?
>
> Thanks,
> Peter
>
> 
> From: Ryan McKinley [ryan...@gmail.com]
> Sent: Tuesday, 14 September 2010 11:30
> To: solr-user@lucene.apache.org
> Subject: Re: Field names
>
> check:
> http://wiki.apache.org/solr/LukeRequestHandler
>
>
>
> On Mon, Sep 13, 2010 at 7:00 PM, Peter A. Kirk  
> wrote:
>> Hi
>>
>> is it possible to issue a query to solr, to get a list which contains all 
>> the field names in the index?
>>
>> What about to get a list of the freqency of individual words in each field?
>>
>> thanks,
>> Peter
>>


Spell checking and keyword tokenizer

2010-09-13 Thread Glen Stampoultzis
Hi,

I'm trying to spell check a whole field using a lowercasing keyword
tokenizer [1].

for example if I query for "furntree gully" I'm hoping to get back
"ferntree gully" as a suggestion.  Unfortunately the spell checker
seems to be recognizing this as two tokens and returning suggestions
for both.  Query [2] and result [3] below.  In this case ferntree
actually does end up with ferntree gully as a suggestion however it
also gives bulla as a suggestion for gully (go figure :-) ).

Any suggestions?

Regards,

Glen


[1] -

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

[2] -

Query

q=locality_lc%3A%22furntree+gully%22&spellcheck=true&spellcheck.build=true&spellcheck.reload=true&spellcheck.accuracy=0.5&spellcheck.dictionary=locality_spellchecker&spellcheck.collate=true&fl=street_name%2Clocality%2Cstate

[3] -

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">379</int>
    <lst name="params">
      <str name="spellcheck">true</str>
      <str name="fl">street_name,locality,state</str>
      <str name="spellcheck.accuracy">0.5</str>
      <str name="q">locality_lc:"furntree gully"</str>
      <str name="spellcheck.dictionary">locality_spellchecker</str>
      <str name="spellcheck.collate">true</str>
      <str name="spellcheck.build">true</str>
      <str name="spellcheck.reload">true</str>
    </lst>
  </lst>
  <str name="command">build</str>
  <result name="response" numFound="0" start="0"/>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="furntree">
        <int name="numFound">1</int>
        <int name="startOffset">13</int>
        <int name="endOffset">21</int>
        <arr name="suggestion">
          <str>ferntree gully</str>
        </arr>
      </lst>
      <lst name="gully">
        <int name="numFound">1</int>
        <int name="startOffset">22</int>
        <int name="endOffset">27</int>
        <arr name="suggestion">
          <str>bulla</str>
        </arr>
      </lst>
      <str name="collation">locality_lc:"ferntree gully bulla"</str>
    </lst>
  </lst>
</response>

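A side note on reading the suggestion offsets: the startOffset/endOffset values (13-21 and 22-27 here) index into the raw query string, so they pick out exactly the two tokens of `locality_lc:"furntree gully"`. A quick check in plain Java:

```java
public class Offsets {
    public static void main(String[] args) {
        String q = "locality_lc:\"furntree gully\"";
        // startOffset/endOffset values taken from the spellcheck response
        System.out.println(q.substring(13, 21)); // furntree
        System.out.println(q.substring(22, 27)); // gully
    }
}
```

This is how a client maps each suggestion block back to the misspelled token it corrects.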

Re: Spell checking and keyword tokenizer

2010-09-13 Thread Glen Stampoultzis
Nevermind this one... With a bit more research I discovered I can use
spellcheck.q to provide the correct suggestion.

On 14 September 2010 16:02, Glen Stampoultzis  wrote:
> Hi,
>
> I'm trying to spell check a whole field using a lowercasing keyword
> tokenizer [1].
>
> for example if I query for "furntree gully" I'm hoping to get back
> "ferntree gully" as a suggestion.  Unfortunately the spell checker
> seems to be recognizing this as two tokens and returning suggestions
> for both.  Query [2] and result [3] below.  In this case ferntree
> actually does end up with ferntree gully as a suggestion however it
> also gives bulla as a suggestion for gully (go figure :-) ).
>
> Any suggestions?
>
> Regards,
>
> Glen
>
>
> [1] -
>
>        <fieldType name="..." class="solr.TextField"
>                positionIncrementGap="100">
>            <analyzer>
>                <tokenizer class="solr.KeywordTokenizerFactory"/>
>                <filter class="solr.LowerCaseFilterFactory"/>
>            </analyzer>
>        </fieldType>
>
> [2] -
>
> Query
>
> q=locality_lc%3A%22furntree+gully%22&spellcheck=true&spellcheck.build=true&spellcheck.reload=true&spellcheck.accuracy=0.5&spellcheck.dictionary=locality_spellchecker&spellcheck.collate=true&fl=street_name%2Clocality%2Cstate
>
> [3] -
>
> <response>
>  <lst name="responseHeader">
>    <int name="status">0</int>
>    <int name="QTime">379</int>
>    <lst name="params">
>      <str name="spellcheck">true</str>
>      <str name="fl">street_name,locality,state</str>
>      <str name="spellcheck.accuracy">0.5</str>
>      <str name="q">locality_lc:"furntree gully"</str>
>      <str name="spellcheck.dictionary">locality_spellchecker</str>
>      <str name="spellcheck.collate">true</str>
>      <str name="spellcheck.build">true</str>
>      <str name="spellcheck.reload">true</str>
>    </lst>
>  </lst>
>  <str name="command">build</str>
>  <result name="response" numFound="0" start="0"/>
>  <lst name="spellcheck">
>    <lst name="suggestions">
>      <lst name="furntree">
>        <int name="numFound">1</int>
>        <int name="startOffset">13</int>
>        <int name="endOffset">21</int>
>        <arr name="suggestion">
>          <str>ferntree gully</str>
>        </arr>
>      </lst>
>      <lst name="gully">
>        <int name="numFound">1</int>
>        <int name="startOffset">22</int>
>        <int name="endOffset">27</int>
>        <arr name="suggestion">
>          <str>bulla</str>
>        </arr>
>      </lst>
>      <str name="collation">locality_lc:"ferntree gully bulla"</str>
>    </lst>
>  </lst>
> </response>
>
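For anyone landing here later: the fix is to send the whole, untokenized field value in `spellcheck.q`, which is analyzed by the spellchecker's own (keyword) analyzer rather than being split on whitespace by the query parser. A hedged sketch of building the corrected request - host and dictionary name reuse the values from this thread:

```java
import java.net.URLEncoder;

public class SpellcheckQ {
    // Put the user's raw field value into spellcheck.q so the
    // keyword-tokenized dictionary sees it as a single token.
    static String buildUrl(String base, String query, String spellcheckQ) throws Exception {
        return base + "/select?q=" + URLEncoder.encode(query, "UTF-8")
                + "&spellcheck=true"
                + "&spellcheck.dictionary=locality_spellchecker"
                + "&spellcheck.q=" + URLEncoder.encode(spellcheckQ, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildUrl("http://localhost:8983/solr",
                "locality_lc:\"furntree gully\"", "furntree gully"));
    }
}
```

With this, "furntree gully" is checked as one term and the whole-field suggestion "ferntree gully" comes back directly.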