Query on Level of Access to lucene in Solr

2009-02-04 Thread Nick
Hello there,

 I'm a Solr newbie, but I've used Lucene for some complex IR projects
before. Can someone please help me understand the extent to which Solr
allows access to Lucene?

To elaborate: say I'm considering Solr for all its wonderful properties like
scaling, distributed search, ease of updates, etc. I have a corpus of data
that I'd like Lucene to index. Further, I'm working on some graph research
where I'd like to disjunctively query keyword terms and use the independent
result sets as entry points into my graph of documents. I have my own data
structures (in Java) that handle efficient graph walks, etc., and eventually
apply a whole bunch of math to re-rank results/result trees. In a more
traditional setting, I can imagine using Lucene as an external jar
dependency, hooking it up with the rest of my code in Java, and shipping it
off into Tomcat.

Is this doable with Solr? I'd appreciate comments on the specific mechanics
of hooking up custom Java application logic with Lucene before integrating
with the rest of the Tomcat ecosystem.
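
To make it concrete, here is roughly the kind of hook I am imagining: a
sketch of a custom SearchComponent (the class and field names are mine, and
the exact set of methods to override varies across Solr versions, so treat
this as a sketch rather than gospel):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.search.SolrIndexSearcher;

public class GraphEntryComponent extends SearchComponent {
  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // nothing to set up for this sketch
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // Solr hands us the same Lucene searcher it uses internally,
    // so custom graph walks / re-ranking can run right here.
    SolrIndexSearcher searcher = rb.req.getSearcher();
    IndexReader reader = searcher.getIndexReader();
    // ... run disjunctive term queries, feed the result sets into
    // the graph structures, re-rank, and attach output to rb.rsp ...
  }

  @Override
  public String getDescription() {
    return "graph entry-point component (sketch)";
  }
}

Such a component would then be registered in solrconfig.xml and added to a
request handler's component list, so it runs inside the same Tomcat
deployment as the rest of Solr.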

Thank you very much.
Nick.


Solr Managed Schema by Default in 5.5

2016-03-11 Thread Nick Vasilyev
Hi,

I started playing around with Solr 5.5 and created a collection using the
following:

./solr create_collection -c test -p 9000 -replicationFactor 2 -d
basic_configs -shards 2

The collection was created fine; however, I see that although I specified
basic_configs, it was deployed in managed-schema mode.

I was able to follow the instructions here to get it back to classic mode:
https://cwiki.apache.org/confluence/display/solr/Managed+Schema+Definition+in+SolrConfig

This required me to modify solrconfig.xml and remove the managed-schema file
from ZooKeeper manually.
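
For anyone else hitting this, the solrconfig.xml side of the switch is just
the schema factory element (per the page above):

<schemaFactory class="ClassicIndexSchemaFactory"/>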

I checked the configuration files for basic_configs in Solr 5.5, and it
looks like the managed schema is now the default there, while Solr 5.4 still
ships with the classic schema as the default.

Is this now the default behavior for basic_configs? I would really like to
keep an easy way to create collections with classic schema settings, without
jumping through all of these hoops.

Thanks


Re: Solr Managed Schema by Default in 5.5

2016-03-11 Thread Nick Vasilyev
Hi Shawn,

Maybe I am missing something; if that is the case, what is the difference
between data_driven_schema_configs and basic_configs? I thought the only
difference was that data_driven_schema_configs comes with the managed schema
and basic_configs comes with the classic one.

Also, I haven't really dug into schemaless mode so far. I know Elasticsearch
uses it, and it has been kind of a turn-off for me. Can you provide some
guidance around best practices for using it?

For example, right now I have all of my configuration files in version
control. If I need to make a change, I upload a new schema to version
control, then the server pulls it down, uploads it to ZooKeeper, and reloads
the collections. This is almost fully automated, and since all configuration
is in a single file, it is easy to review and track previous changes. I like
this process and it works well; if I have to start using managed schemas, I
would like some feedback on how to implement that with minimal disruption.

If I am sending all schema changes via the API, I would still need some file
with the schema configuration; it would just be in a different format. I
would then need code to read it and send the specific items to Solr, right?
When I need to make a change, do I have to make that change individually and
fold it back into the config file, or can I just send the entire schema in
again?

Previously, when I tried to upload the entire schema again, I ran into
problems; for example, if there is already a copy from field1 to field2,
re-sending the config adds another copyField directive, so the copying
happens twice and errors out if the destination field is not multi-valued.
If future changes need to be made atomically and then folded back into this
other config, it just introduces more room for error.

Also, with the classic schema, if I wanted to revert a change or delete a
field, I would simply remove it from the schema and re-upload. Now it looks
like I need to add extra functionality to whatever my new process will be,
to delete fields, copy fields, etc.
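
For concreteness, the kinds of calls I would have to script against the
Schema API look roughly like this (collection and field names made up):

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field":         { "name":"field1", "type":"string", "stored":true },
  "delete-copy-field": { "source":"field1", "dest":"field2" },
  "delete-field":      { "name":"old_field" }
}' http://localhost:8983/solr/test/schema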

I know the point of this is to be able to easily make a UI for these
changes, but UI changes are hard to automate and version control. Please
let me know if I am missing something.

On Fri, Mar 11, 2016 at 10:41 AM, Shawn Heisey  wrote:

> On 3/11/2016 7:01 AM, Nick Vasilyev wrote:
> > Is this now the default behavior for basic_configs? I would really like
> to
> > maintain an option to easily create collection with classic schema
> settings
> > without jumping through all of these hoops.
>
> Starting in 5.5, all examples now use the managed schema.
>
> https://issues.apache.org/jira/browse/SOLR-8131
>
> The classic schema factory still exists, and probably will exist for all
> 6.x versions, so you will not need to migrate any existing setup yet.
>
> I don't mind putting more emphasis on the new factory or using it by
> default.  I expect that eventually the classic factory will get
> deprecated.  When that happens, I would like to see an option to mimic
> the classic version, where making changes via API won't work.  One
> person has already come into the IRC channel and asked how they can
> disable schema editing.
>
> Although I don't have a problem with the managed schema, I still don't
> like schemaless mode, which requires the managed schema.  It looks like
> the basic_configs and sample_techproducts_configs examples have NOT
> enabled that feature.
>
> Thanks,
> Shawn
>
>


Re: Solr Managed Schema by Default in 5.5

2016-03-11 Thread Nick Vasilyev
Got it.

Thank you for clarifying this; I was under the impression that I would only
be able to make changes via the API. I will look into this some more.

On Fri, Mar 11, 2016 at 11:51 AM, Shawn Heisey  wrote:

> On 3/11/2016 9:28 AM, Nick Vasilyev wrote:
> > Maybe I am missing something, if that is the case what is the difference
> > between data_driven_schema_configs and basic_configs? I thought that the
> > only difference was that the data_driven_schema_configs comes with the
> > managed schema and the basic_configs come with regular?
> >
> > Also, I haven't really dived into the schema less mode so far, I know
> > elastic uses it and it has been kind of a turn off for me. Can you
> provide
> > some guidance around best practices on how to use it?
>
> Schemaless mode is implemented with an update processor chain.  If you
> look in the data_driven_schema_configs solrconfig.xml file, you will
> find an updateRequestProcessorChain named
> "add-unknown-fields-to-the-schema".  This update chain is then enabled
> with an initParams config.
>
> I personally would not recommend using it.  It would be fine to use
> during prototyping, but I would definitely turn it off for production.
>
> > For example, now I have all of my configuration files in version control,
> > if I need to make a change, I upload a new schema to version control,
> then
> > the server pulls them down, uploads to zk and reloads collections. This
> is
> > almost fully automated and since all configuration is in a single file it
> > is easy to review and track previous changes. I like this process and it
> > works well; if I have to start using managed schemas; I would like some
> > feedback on how to implement it with minimal disruption to this.
>
> There's no reason you can't continue to use this method, even with the
> managed schema.  Editing the managed-schema is discouraged if you
> actually intend to use the Schema API, but there's nothing in place to
> prevent you from doing it that way.
>
> > If I am sending all schema changes via the API, I would need to have
> still
> > have some file with the schema configuration, it would just be a
> different
> > format. I would then need to have some code to read it and send specific
> > items to Solr, right?  When I need to make a change, do I have to then
> make
> > this change individually and include that configuration as part of the
> > config file? Or should I be able to just send the entire schema in again?
>
> Using the Schema API changes the managed-schema file in place.  You
> wouldn't need to upload anything to zookeeper, the change would already
> be there -- but you'd have to take an extra step (retrieving from
> zookeeper) to make sure it's in version control.
>
> My recommendation is to just keep using version control as you have
> been, which you can do with either the Classic or Managed schema.  The
> filename for the schema would change with the managed version, but
> nothing else.
>
> Thanks,
> Shawn
>
>


Inconsistent Shard Usage for Distributed Queries

2016-03-15 Thread Nick Vasilyev
Hello,

I have a brand new installation of Solr 5.4.1 and I am running into a
strange problem with one of my collections. The *products* collection has 5
shards and a replication factor of two. Both replicas are up and show green
status on the Cloud page in the UI.

When I run a default search on the query page (q=*:*), I always get a
different numFound, although there is no active indexing and everything is
committed. I checked the logs, and it looks like every search is sent to a
different set of shards. Below, search 1 went to shards 5, 2 and 4; search 2
went to shards 5, 3 and 1; and search 3 went to shards 3, 4, 1 and 5.

To confirm this, I ran a &distrib=false query against each shard and got
8,928,379 items for shard 5, 8,917,318 for shard 2, and 9,005,295 for shard
4. The shard 2 distrib=false count did not match the shard 2 numbers that
appeared in the distributed queries (from the logs); the query returned
8,917,318. Here is the log entry for that query.

214467874 INFO  (qtp1013423070-21019) [c:products s:shard2 r:core_node7
x:products_shard2_replica2] o.a.s.c.S.Request [products_shard2_replica2]
webapp=/solr path=/select
params={q=*:*&distrib=false&indent=true&wt=json&_=1458056340020}
hits=8917318 status=0 QTime=0
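
(For reference, the per-shard checks above were plain core-level queries
along these lines, matching the log entry:)

curl 'http://192.168.1.211:9000/solr/products_shard2_replica2/select?q=*:*&distrib=false&indent=true&wt=json'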


Here are the logs from other queries.

Search 1 - numFound 18309764

213941984 INFO  (qtp1013423070-21046) [c:products s:shard5 r:core_node4
x:products_shard5_replica2] o.a.s.c.S.Request [products_shard5_replica2]
webapp=/solr path=/select
params={df=text&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=
http://192.168.1.211:9000/solr/products_shard5_replica2/|http://192.168.1.212:9000/solr/products_shard5_replica1/&rows=10&version=2&q=*:*&NOW=1458055805759&isShard=true&wt=javabin&_=1458055814096}
hits=8928379 status=0 QTime=3
213941985 INFO  (qtp1013423070-21028) [c:products s:shard4 r:core_node6
x:products_shard4_replica2] o.a.s.c.S.Request [products_shard4_replica2]
webapp=/solr path=/select
params={df=text&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=
http://192.168.1.212:9000/solr/products_shard4_replica1/|http://192.168.1.211:9000/solr/products_shard4_replica2/&rows=10&version=2&q=*:*&NOW=1458055805759&isShard=true&wt=javabin&_=1458055814096}
hits=9005295 status=0 QTime=3
213942045 INFO  (qtp1013423070-21042) [c:products s:shard2 r:core_node7
x:products_shard2_replica2] o.a.s.c.S.Request [products_shard2_replica2]
webapp=/solr path=/select
params={q=*:*&indent=true&wt=json&_=1458055814096} hits=18309764 status=0
QTime=81


Search 2 - numFound 27072144
213995779 INFO  (qtp1013423070-21046) [c:products s:shard5 r:core_node4
x:products_shard5_replica2] o.a.s.c.S.Request [products_shard5_replica2]
webapp=/solr path=/select
params={df=text&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=
http://192.168.1.211:9000/solr/products_shard5_replica2/|http://192.168.1.212:9000/solr/products_shard5_replica1/&rows=10&version=2&q=*:*&NOW=1458055859563&isShard=true&wt=javabin&_=1458055867894}
hits=8928379 status=0 QTime=1
213995781 INFO  (qtp1013423070-20985) [c:products s:shard3 r:core_node10
x:products_shard3_replica2] o.a.s.c.S.Request [products_shard3_replica2]
webapp=/solr path=/select
params={df=text&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=
http://192.168.1.212:9000/solr/products_shard3_replica1/|http://192.168.1.211:9000/solr/products_shard3_replica2/&rows=10&version=2&q=*:*&NOW=1458055859563&isShard=true&wt=javabin&_=1458055867894}
hits=8980542 status=0 QTime=3
213995785 INFO  (qtp1013423070-21042) [c:products s:shard1 r:core_node9
x:products_shard1_replica2] o.a.s.c.S.Request [products_shard1_replica2]
webapp=/solr path=/select
params={df=text&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=
http://192.168.1.212:9000/solr/products_shard1_replica1/|http://192.168.1.211:9000/solr/products_shard1_replica2/&rows=10&version=2&q=*:*&NOW=1458055859563&isShard=true&wt=javabin&_=1458055867894}
hits=8914801 status=0 QTime=3
213995798 INFO  (qtp1013423070-21028) [c:products s:shard2 r:core_node7
x:products_shard2_replica2] o.a.s.c.S.Request [products_shard2_replica2]
webapp=/solr path=/select
params={q=*:*&indent=true&wt=json&_=1458055867894} hits=27072144 status=0
QTime=30


Search 3 - numFound 35953734

214022457 INFO  (qtp1013423070-21019) [c:products s:shard3 r:core_node10
x:products_shard3_replica2] o.a.s.c.S.Request [products_shard3_replica2]
webapp=/solr path=/select
params={df=text&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=
http://192.168.1.212:9000/solr/products_shard3_replica1/|http://192.168.1.211:9000/solr/products_shard3_replica2/&rows=10&version=2&q=*:*&NOW=1458055886247&isShard=true&wt=javabin&_=1458055894580}
hits=8980542 status=0 QTime=0
214022458 INFO  (qtp1013423070-21036) [c:products s:shard4 r:core_node6
x:products_shard4_replica2] o.a.s.c.S.Request [products_shard4_replica2]
webapp=/solr path=/select
params={df=text&distrib=false&fl=id&fl=score&

Re: Inconsistent Shard Usage for Distributed Queries

2016-03-15 Thread Nick Vasilyev
I reloaded the collection and ran a distrib=false query for several shards
on both replicas. The counts matched exactly.

I then reloaded the second replica (through the UI), and now it seems to be
working fine; I am getting consistent matches.

Not sure what the issue was. In previous versions of Solr, clicking reload
would send a commit to all replicas, right? Is that still the case?



On Tue, Mar 15, 2016 at 11:53 AM, Erick Erickson 
wrote:

> This is very strange. What are the results you get when
> you compare replicas in the _same_ shard? It doesn't really
> mean anything when you say
> "shard1 has X docs, shard2 has Y docs". The only way
> you should be getting different results from
> the match all docs query is if different replicas within the
> _same_ shard have different counts.
>
> And just as a sanity check, issue a commit. It's highly unlikely
> that you have uncommitted changes, but it never hurts to try.
>
> All distributed queries should have a sub query sent to one
> replica of each shard, is that what you're seeing? And I'd ping
> the cores  directly rather than provide shards parameters,
> something like:
>
> blah blah blah/products/query/shard1_core3/query?q=*:*. That
> addresses the specific core rather than rely on any internal query
> routing logic..
>
> Best,
> Erick
>
> On Tue, Mar 15, 2016 at 8:43 AM, Nick Vasilyev 
> wrote:
> > Hello,
> >
> > I have a brand new installation of Solr 5.4.1 and I am running into a
> > strange problem with one of my collections. Collection *products* has 5
> > shards and replication factor of two. Both replicas are up and show green
> > status on the Cloud page in the UI.
> >
> > When I run a default search on the query page (q=*:*) I always get a
> > different numFound although there is no active indexing and everything is
> > committed. I checked the logs and it looks like every time it runs a
> > search, it is sent to different shards. Below, search1 went to shard 5, 2
> > and 4, search2 went to shard 5, 3, 1 and search 3 went to shard 3, 4, 1,
> 5.
> >
> > To confirm this, I ran a &distrib=false query on shard 5 and got
> 8,928,379
> > items, 8,917,318 for shard 2, and 9,005,295 for shard 4. The results from
> > shard 2 distrib=false query did not match the results that were in the
> > distributed query (from the logs). The query returned 8917318. Here is
> the
> > log entry for the query.
> >
> > 214467874 INFO  (qtp1013423070-21019) [c:products s:shard2 r:core_node7
> > x:products_shard2_replica2] o.a.s.c.S.Request [products_shard2_replica2]
> > webapp=/solr path=/select
> > params={q=*:*&distrib=false&indent=true&wt=json&_=1458056340020}
> > hits=8917318 status=0 QTime=0
> >
> >
> > Here are the logs from other queries.
> >
> > Search 1 - numFound 18309764
> >
> > 213941984 INFO  (qtp1013423070-21046) [c:products s:shard5 r:core_node4
> > x:products_shard5_replica2] o.a.s.c.S.Request [products_shard5_replica2]
> > webapp=/solr path=/select
> >
> params={df=text&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=
> >
> http://192.168.1.211:9000/solr/products_shard5_replica2/|http://192.168.1.212:9000/solr/products_shard5_replica1/&rows=10&version=2&q=*:*&NOW=1458055805759&isShard=true&wt=javabin&_=1458055814096
> }
> > hits=8928379 status=0 QTime=3
> > 213941985 INFO  (qtp1013423070-21028) [c:products s:shard4 r:core_node6
> > x:products_shard4_replica2] o.a.s.c.S.Request [products_shard4_replica2]
> > webapp=/solr path=/select
> >
> params={df=text&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=
> >
> http://192.168.1.212:9000/solr/products_shard4_replica1/|http://192.168.1.211:9000/solr/products_shard4_replica2/&rows=10&version=2&q=*:*&NOW=1458055805759&isShard=true&wt=javabin&_=1458055814096
> }
> > hits=9005295 status=0 QTime=3
> > 213942045 INFO  (qtp1013423070-21042) [c:products s:shard2 r:core_node7
> > x:products_shard2_replica2] o.a.s.c.S.Request [products_shard2_replica2]
> > webapp=/solr path=/select
> > params={q=*:*&indent=true&wt=json&_=1458055814096} hits=18309764 status=0
> > QTime=81
> >
> >
> > Search 2 - numFound 27072144
> > 213995779 INFO  (qtp1013423070-21046) [c:products s:shard5 r:core_node4
> > x:products_shard5_replica2] o.a.s.c.S.Request [products_shard5_replica2]
> > webapp=/solr path=/select
> >
> params={df=text&distrib=false&fl=id&fl=score&shards.purpos

Re: Inconsistent Shard Usage for Distributed Queries

2016-03-15 Thread Nick Vasilyev
Yea, the code sends actual commits, but I hate typing, so I usually just
click the reload button unless it's production.
On Mar 15, 2016 12:22 PM, "Erick Erickson"  wrote:

> bq: Not sure what the issue was, in previous versions of Solr, clicking
> reload
> would send a commit to all replicas, right
>
> Reloading doesn't really have anything to do with commits. Reload
> would certainly
> cause a new searcher to be opened and thus would pick up any changes
> that had been hard-committed (openSearcher=false), but that's a complete
> side-effect. Simply issuing a commit on the url to the _collection_ will
> cause
> commits to happen on all replicas, as:
>
> blah/solr/collection/update?commit=true
>
> Best,
> Erick
>
> On Tue, Mar 15, 2016 at 9:11 AM, Nick Vasilyev 
> wrote:
> > I reloaded the collection and ran distrib=false query for several shards
> on
> > both replicas. The counts matched exactly.
> >
> > I then reloaded the second replica (through the UI) and now it seems like
> > it is working fine, I am getting consistent matches.
> >
> > Not sure what the issue was, in previous versions of Solr, clicking
> reload
> > would send a commit to all replicas, right? Is that still the case?
> >
> >
> >
> > On Tue, Mar 15, 2016 at 11:53 AM, Erick Erickson <
> erickerick...@gmail.com>
> > wrote:
> >
> >> This is very strange. What are the results you get when
> >> you compare replicas in the _same_ shard? It doesn't really
> >> mean anything when you say
> >> "shard1 has X docs, shard2 has Y docs". The only way
> >> you should be getting different results from
> >> the match all docs query is if different replicas within the
> >> _same_ shard have different counts.
> >>
> >> And just as a sanity check, issue a commit. It's highly unlikely
> >> that you have uncommitted changes, but it never hurts to try.
> >>
> >> All distributed queries should have a sub query sent to one
> >> replica of each shard, is that what you're seeing? And I'd ping
> >> the cores  directly rather than provide shards parameters,
> >> something like:
> >>
> >> blah blah blah/products/query/shard1_core3/query?q=*:*. That
> >> addresses the specific core rather than rely on any internal query
> >> routing logic..
> >>
> >> Best,
> >> Erick
> >>
> >> On Tue, Mar 15, 2016 at 8:43 AM, Nick Vasilyev <
> nick.vasily...@gmail.com>
> >> wrote:
> >> > Hello,
> >> >
> >> > I have a brand new installation of Solr 5.4.1 and I am running into a
> >> > strange problem with one of my collections. Collection *products* has
> 5
> >> > shards and replication factor of two. Both replicas are up and show
> green
> >> > status on the Cloud page in the UI.
> >> >
> >> > When I run a default search on the query page (q=*:*) I always get a
> >> > different numFound although there is no active indexing and
> everything is
> >> > committed. I checked the logs and it looks like every time it runs a
> >> > search, it is sent to different shards. Below, search1 went to shard
> 5, 2
> >> > and 4, search2 went to shard 5, 3, 1 and search 3 went to shard 3, 4,
> 1,
> >> 5.
> >> >
> >> > To confirm this, I ran a &distrib=false query on shard 5 and got
> >> 8,928,379
> >> > items, 8,917,318 for shard 2, and 9,005,295 for shard 4. The results
> from
> >> > shard 2 distrib=false query did not match the results that were in the
> >> > distributed query (from the logs). The query returned 8917318. Here is
> >> the
> >> > log entry for the query.
> >> >
> >> > 214467874 INFO  (qtp1013423070-21019) [c:products s:shard2
> r:core_node7
> >> > x:products_shard2_replica2] o.a.s.c.S.Request
> [products_shard2_replica2]
> >> > webapp=/solr path=/select
> >> > params={q=*:*&distrib=false&indent=true&wt=json&_=1458056340020}
> >> > hits=8917318 status=0 QTime=0
> >> >
> >> >
> >> > Here are the logs from other queries.
> >> >
> >> > Search 1 - numFound 18309764
> >> >
> >> > 213941984 INFO  (qtp1013423070-21046) [c:products s:shard5
> r:core_node4
> >> > x:products_shard5_replica2] o.a.s.c.S.Request
> [products_shard5_replica2]
> >> > webapp=/solr path=/select
> >> &

Re: Inconsistent Shard Usage for Distributed Queries

2016-03-15 Thread Nick Vasilyev
I had another collection that I was running into this issue with, so I
decided to play around with it. This one had active indexing going on, so I
was able to confirm how the counts get updated. Basically, it looks like
clicking the reload button only sends a commit to that one core; it is not
propagated to the other shards or to the same shard on the other replica. A
full commit, update?commit=true&openSearcher=true, works fine. I know the
reload button was not intended to issue commits, but it's quicker than
typing out the command.
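
(The full command is just a collection-level update call, along the lines
Erick showed below; host and port as in our setup:)

curl 'http://192.168.1.211:9000/solr/products/update?commit=true&openSearcher=true'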

On Tue, Mar 15, 2016 at 12:24 PM, Nick Vasilyev 
wrote:

> Yea, the code sends actual commits, but I hate typing so usually just
> click the reload button unless it's production.
> On Mar 15, 2016 12:22 PM, "Erick Erickson" 
> wrote:
>
>> bq: Not sure what the issue was, in previous versions of Solr, clicking
>> reload
>> would send a commit to all replicas, right
>>
>> Reloading doesn't really have anything to do with commits. Reload
>> would certainly
>> cause a new searcher to be opened and thus would pick up any changes
>> that had been hard-committed (openSearcher=false), but that's a complete
>> side-effect. Simply issuing a commit on the url to the _collection_ will
>> cause
>> commits to happen on all replicas, as:
>>
>> blah/solr/collection/update?commit=true
>>
>> Best,
>> Erick
>>
>> On Tue, Mar 15, 2016 at 9:11 AM, Nick Vasilyev 
>> wrote:
>> > I reloaded the collection and ran distrib=false query for several
>> shards on
>> > both replicas. The counts matched exactly.
>> >
>> > I then reloaded the second replica (through the UI) and now it seems
>> like
>> > it is working fine, I am getting consistent matches.
>> >
>> > Not sure what the issue was, in previous versions of Solr, clicking
>> reload
>> > would send a commit to all replicas, right? Is that still the case?
>> >
>> >
>> >
>> > On Tue, Mar 15, 2016 at 11:53 AM, Erick Erickson <
>> erickerick...@gmail.com>
>> > wrote:
>> >
>> >> This is very strange. What are the results you get when
>> >> you compare replicas in the _same_ shard? It doesn't really
>> >> mean anything when you say
>> >> "shard1 has X docs, shard2 has Y docs". The only way
>> >> you should be getting different results from
>> >> the match all docs query is if different replicas within the
>> >> _same_ shard have different counts.
>> >>
>> >> And just as a sanity check, issue a commit. It's highly unlikely
>> >> that you have uncommitted changes, but it never hurts to try.
>> >>
>> >> All distributed queries should have a sub query sent to one
>> >> replica of each shard, is that what you're seeing? And I'd ping
>> >> the cores  directly rather than provide shards parameters,
>> >> something like:
>> >>
>> >> blah blah blah/products/query/shard1_core3/query?q=*:*. That
>> >> addresses the specific core rather than rely on any internal query
>> >> routing logic..
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Tue, Mar 15, 2016 at 8:43 AM, Nick Vasilyev <
>> nick.vasily...@gmail.com>
>> >> wrote:
>> >> > Hello,
>> >> >
>> >> > I have a brand new installation of Solr 5.4.1 and I am running into a
>> >> > strange problem with one of my collections. Collection *products*
>> has 5
>> >> > shards and replication factor of two. Both replicas are up and show
>> green
>> >> > status on the Cloud page in the UI.
>> >> >
>> >> > When I run a default search on the query page (q=*:*) I always get a
>> >> > different numFound although there is no active indexing and
>> everything is
>> >> > committed. I checked the logs and it looks like every time it runs a
>> >> > search, it is sent to different shards. Below, search1 went to shard
>> 5, 2
>> >> > and 4, search2 went to shard 5, 3, 1 and search 3 went to shard 3,
>> 4, 1,
>> >> 5.
>> >> >
>> >> > To confirm this, I ran a &distrib=false query on shard 5 and got
>> >> 8,928,379
>> >> > items, 8,917,318 for shard 2, and 9,005,295 for shard 4. The results
>> from
>> >> > shard 2 distrib=false query did not match the results that were in
>> the
&

Re: Boosts for relevancy (shopping products)

2016-03-18 Thread Nick Vasilyev
Tie does quite a bit; without it, only the highest-weighted field that
contains the term is included in the relevance score. Tie lets you blend in
the other matching fields as well.
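
A sketch of the relevant edismax parameters, using the qf weights from the
original mail (tie=0.1 is just an illustrative starting value):

defType=edismax
qf=name^5 brand^2 category^3 merchant^2 description^1
tie=0.1
mm=100%
ps=5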
On Mar 18, 2016 10:40 AM, "Robert Brown"  wrote:

> Thanks for the added input.
>
> I'll certainly look into the machine learning aspect, will be good to put
> some basic knowledge I have into practice.
>
> I'd been led to believe the tie parameter didn't actually do a lot. :-/
>
>
>
> On 03/18/2016 12:07 PM, Nick Vasilyev wrote:
>
>> I work with a similar catalog; except our data is especially bad.  We've
>> found that several things helped:
>>
>> - Item level grouping (group same item sold by multiple vendors). Rank
>> items with more vendors a bit higher.
>> - Include a boost function for other attributes, such as an original image
>> of the product
>> - Rank items a bit higher if they have data from an external catalog like
>> IceCat
>> - For relevance and performance, we have several fields that we copy data
>> into. High value fields get copied into a high weighted field, while lower
>> value fields like description get copied into a lower weighted field.
>> These
>> fields are the backbone of our qf parameter, with other fields adding
>> additional boost.
>> - Play around with the tie parameter for edismax, we found that it makes
>> quite a big difference.
>>
>> Hope this helps.
>>
>> On Fri, Mar 18, 2016 at 6:19 AM, Alessandro Benedetti <
>> abenede...@apache.org
>>
>>> wrote:
>>> In a relevancy problem I would repeat what my colleagues already pointed
>>> out: data is key. We need to understand our data first of all, before we
>>> can understand what is relevant and what is not.
>>> Once we specify a ground floor which makes sense (and your basic approach
>>> + proper schema configuration as suggested + a properly configured
>>> request handler seems a good start to me).
>>>
>>> At this point, if you are still not happy with the relevancy (i.e. you
>>> are not happy with the different boosts you assigned), my strongest
>>> suggestion at this time is to move to machine learning.
>>> You need a good amount of data to feed the learner and make it your Super
>>> Business Expert.
>>> I have been recently working with the Learn To Rank Bloomberg Plugin [1].
>>> In my opinion it will be key for all the businesses that have many
>>> features in the game that can help to evaluate a proper ranking.
>>> For that you need to be able to collect and process signals, and you need
>>> to carefully tune the features of your interest.
>>> But the results could be surprising.
>>>
>>> [1] https://issues.apache.org/jira/browse/SOLR-8542
>>> [2] Learning to Rank in Solr <
>>> https://www.youtube.com/watch?v=M7BKwJoh96s>
>>>
>>> Cheers
>>>
>>> On Thu, Mar 17, 2016 at 10:15 AM, Robert Brown 
>>> wrote:
>>>
>>> Thanks Scott and John,
>>>>
>>>> As luck would have it I've got a PhD graduate coming for an interview
>>>> today, who just happened to do her research thesis on information
>>>>
>>> retrieval
>>>
>>>> with quantum theory and machine learning  :)
>>>>
>>>> John, it sounds like you're describing my system!  Shopping products
>>>> from
>>>> multiple sources.  (De-duplication is going to be fun soon).
>>>>
>>>> I already copy fields like merchant, brand, category, to string fields
>>>> to
>>>> use them as facets/filters.  I was contemplating removing the
>>>> description
>>>> due to the spammy issue you mentioned, I didn't know about the
>>>> RemoveDuplicatesTokenFilterFactory, so I'm sure that's going to be a
>>>> huge
>>>> help.
>>>>
>>>> Thanks a lot,
>>>> Rob
>>>>
>>>>
>>>>
>>>> On 03/17/2016 10:01 AM, John Smith wrote:
>>>>
>>>> Hi,
>>>>>
>>>>> For once I might be of some help: I've had a similar configuration
>>>>> (large set of products from various sources). It's very difficult to
>>>>> find the right balance between all parameters and requires a lot of
>>>>> tweaking, most often in the dark unfortunately.
>>>>>
>>>

Re: Boosts for relevancy (shopping products)

2016-03-19 Thread Nick Vasilyev
I work with a similar catalog; except our data is especially bad.  We've
found that several things helped:

- Item level grouping (group same item sold by multiple vendors). Rank
items with more vendors a bit higher.
- Include a boost function for other attributes, such as an original image
of the product
- Rank items a bit higher if they have data from an external catalog like
IceCat
- For relevance and performance, we have several fields that we copy data
into. High-value fields get copied into a highly weighted field, while
lower-value fields like description get copied into a lower-weighted field
(see the sketch after this list). These fields are the backbone of our qf
parameter, with other fields adding additional boost.
- Play around with the tie parameter for edismax, we found that it makes
quite a big difference.

Hope this helps.
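
To make the field-copying point concrete, the schema side is just copyField
directives into differently weighted catch-all fields (the field and type
names here are made up):

<field name="text_high" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="text_low"  type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="name"        dest="text_high"/>
<copyField source="brand"       dest="text_high"/>
<copyField source="description" dest="text_low"/>

The qf parameter then leans on those two fields, e.g. qf=text_high^10
text_low^1, with the original fields adding smaller extra boosts.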

On Fri, Mar 18, 2016 at 6:19 AM, Alessandro Benedetti  wrote:

> In a relevancy problem I would repeat what my colleagues already pointed
> out: data is key. We need to understand our data first of all, before we
> can understand what is relevant and what is not.
> Once we specify a ground floor which makes sense (and your basic approach +
> proper schema configuration as suggested + a properly configured request
> handler seems a good start to me).
>
> At this point, if you are still not happy with the relevancy (i.e. you are
> not happy with the different boosts you assigned), my strongest suggestion
> at this time is to move to machine learning.
> You need a good amount of data to feed the learner and make it your Super
> Business Expert.
> I have been recently working with the Learn To Rank Bloomberg Plugin [1].
> In my opinion it will be key for all the businesses that have many features
> in the game that can help to evaluate a proper ranking.
> For that you need to be able to collect and process signals, and you need
> to carefully tune the features of your interest.
> But the results could be surprising.
>
> [1] https://issues.apache.org/jira/browse/SOLR-8542
> [2] Learning to Rank in Solr 
>
> Cheers
>
> On Thu, Mar 17, 2016 at 10:15 AM, Robert Brown 
> wrote:
>
> > Thanks Scott and John,
> >
> > As luck would have it I've got a PhD graduate coming for an interview
> > today, who just happened to do her research thesis on information
> retrieval
> > with quantum theory and machine learning  :)
> >
> > John, it sounds like you're describing my system!  Shopping products from
> > multiple sources.  (De-duplication is going to be fun soon).
> >
> > I already copy fields like merchant, brand, category, to string fields to
> > use them as facets/filters.  I was contemplating removing the description
> > due to the spammy issue you mentioned, I didn't know about the
> > RemoveDuplicatesTokenFilterFactory, so I'm sure that's going to be a huge
> > help.
> >
> > Thanks a lot,
> > Rob
> >
> >
> >
> > On 03/17/2016 10:01 AM, John Smith wrote:
> >
> >> Hi,
> >>
> >> For once I might be of some help: I've had a similar configuration
> >> (large set of products from various sources). It's very difficult to
> >> find the right balance between all parameters and requires a lot of
> >> tweaking, most often in the dark unfortunately.
> >>
> >> What I've found is that omitNorms=true is a real breakthrough: without
> >> it results tend to favor small texts, which is not what's wanted for
> >> product names. I also added a RemoveDuplicatesTokenFilterFactory for the
> >> name as it's a common practice for spammers to repeat some key words in
> >> order to be better placed in results. Stemming and custom stop words
> >> (e.g. "cheap", "sale", ...) are other potential ideas.
> >>
> >> I've also ended up in removing the description field as it's often too
> >> broad, and name is now the only field left: brand, category and merchant
> >> (as well as other fields) are offered as additional filters using
> >> facets. Note that you'd have to re-index them as plain strings.
> >>
> >> It's more difficult to achieve but popularity boost can also be useful:
> >> you can measure it by sales or by number of clicks. I use a combination
> >> of both, and store those values using partial updates.
> >>
> >> Hope it helps,
> >> John
> >>
> >>
> >> On 17/03/16 09:36, Robert Brown wrote:
> >>
> >>> Hi,
> >>>
> >>> I currently have an index of ~50m docs representing shopping products:
> >>> name, description, brand, category, etc.
> >>>
> >>> Our "qf" is currently setup as:
> >>>
> >>> name^5
> >>> brand^2
> >>> category^3
> >>> merchant^2
> >>> description^1
> >>>
> >>> mm: 100%
> >>> ps: 5
> >>>
> >>> I'm getting complaints from the business concerning relevancy, and was
> >>> hoping to get some constructive ideas/thoughts on whether these boosts
> >>> look semi-sensible or not, I think they were put in place pretty much
> >>> at random.
> >>>
> >>> I know it's going to be a case of rounds upon rounds of testing, but
> >>> maybe there's a good starting point that will save me some time?
>

Re: How fast indexing?

2016-03-20 Thread Nick Vasilyev
There can be a lot of factors; can you provide a bit of additional
information to get started?

- How many items are you indexing per second?
- What does the indexing process look like?
- How large is each item?
- What hardware are you using?
- How is your Solr set up? JVM memory, collection layout, etc...
- What is your current commit frequency?
- What is the query volume while you are indexing?
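
Regarding the autocommit question below: the relevant block lives inside
<updateHandler> in solrconfig.xml and looks roughly like this (the values
are illustrative, not a recommendation):

<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>120000</maxTime>
</autoSoftCommit>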

On Sun, Mar 20, 2016 at 6:25 PM, fabigol  wrote:

> hi,
> I have a Solr project where I index data from a PostgreSQL database.
> The indexing is very slow. How can I speed it up?
> Can I modify autocommit in the solrconfig.xml file?
> Does someone have some ideas? I looked on Google but found little.
> Please help me.
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-fast-indexing-tp4264994.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


JSON Facet Stats Mincount

2016-04-14 Thread Nick Vasilyev
Hello, I am trying to get a list of items that have more than one
manufacturer, using the following JSON facet query. It works fine without
mincount, but errors out as soon as I add it.

Is this possible or am I doing something wrong?

json.facet={
  groupID: {
    type: terms,
    field: groupID,
    facet: {
      y: "unique(mfr)",
      mincount: 2
    }
  }
}

Error:
"error": { "msg": "expected Map but got 2 ,path=facet/groupID", "code": 400
}
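
For comparison, moving mincount up to the terms facet itself parses fine,
but there it filters buckets by document count, not by the unique(mfr)
value, which isn't what I'm after:

json.facet={
  groupID: {
    type: terms,
    field: groupID,
    mincount: 2,
    facet: { y: "unique(mfr)" }
  }
}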

Thanks in advance


Re: block join rollups

2016-04-18 Thread Nick Vasilyev
Hi Yonik,

Well, no one has replied to this yet, so I thought I'd chime in with some of
the use cases that I am working with. Please note that I am lagging a bit
behind the last few releases, so I haven't had time to experiment with Solr
5.3+. I am sure that some of this is included there already, and I am very
excited to play around with the new streaming API, JSON facets, and SQL
interface when I have a bit more time.

I am indexing clickstream data into Solr. Each set of records represents a
user's unique visit to our website. They all share a common session ID, as
well as several session attributes, such as IP and, if they log in, user
attributes. Each record represents an individual action, such as a search, a
product view, or a visit to a particular page; all attributes and data
elements of each request are stored with each record, and session attributes
additionally get copied down to each event item. The current goal of this
system is to give less tech-savvy users easy access to this data in a way
that lets them explore it and drill down on particular elements; we are
using Banana for this.

Currently, I have to copy a lot of session fields onto each event so I can
filter on them, for example, to show all searches by users associated with
organization X. This is super redundant, and I am really looking for a
better way. It would be great if parent document fields could appear as if
they were part of the child documents.
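
I have looked at the block join query parsers for this kind of filtering.
Assuming sessions were indexed as parent documents with events nested as
children, something like the following (field names made up) would return
child events whose parent session matches, without copying org down, but the
parent's fields still would not show up on the returned child docs:

q={!child of="doc_type:session"}org:X
fq=event_type:search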

Additionally, I count various events for each session during processing. For
example, I count the number of searches, product views, add-to-carts, etc.
This information is also indexed in each record, which allows me to pull up
specific events (like product views) where the number of searches in the
session is greater than X. However, again, indexing this information on each
event creates a lot of redundancy.

Finally, a slightly different use case involves running functions on the
items in a group (even if they aren't part of the result set) and returning
the result as part of the document, almost like a dynamically generated
document based on aggregations from child documents. This is currently
somewhat available, but I can't include it in a sort. For example, when
grouping items on a field, I want to get the minimum value of some field per
group and sort the resulting groups on that calculated value.

I am not sure if this helps you at all, but I wanted to share some of my
pain points; hope it helps.

On Sun, Apr 17, 2016 at 6:50 PM, Yonik Seeley  wrote:

> Hey folks, we're at the point of figuring out the API for block join
> child rollups for the JSON Facet API.
> We already have simple block join faceting:
> http://yonik.com/solr-nested-objects/
> So now we need an API to carry over more information from children to
> parents (say rolling up average rating of all the reviews to the
> corresponding parent book objects).
>
> I've gathered some of my notes/thoughts on the API here:
> https://issues.apache.org/jira/browse/SOLR-8998
>
> Feedback welcome, and we can discuss here in this thread rather than
> cluttering the JIRA.
>
> -Yonik
>


Solr 5.2.1 on Java 8 GC

2016-04-28 Thread Nick Vasilyev
Hello,

We recently upgraded to Solr 5.2.1 on jre1.8.0_74 and are seeing long GC
pauses when running jobs that do some hairy faceting. The same jobs worked
fine with our previous Solr 4.6.

The JVM is configured with a 32GB heap and default GC settings; however,
I've been tweaking the GC settings to no avail. The latest iteration had the
following differences from the default config:

XX:ConcGCThreads and XX:ParallelGCThreads increased from 4 to 7

XX:CMSInitiatingOccupancyFraction increased from 50 to 70
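
These were set via the GC_TUNE variable in solr.in.sh, roughly like this
(note that setting GC_TUNE replaces the script's default tuning flags, so
the rest of the stock CMS flag set has to be carried over as well):

GC_TUNE="-XX:+UseConcMarkSweepGC \
-XX:ConcGCThreads=7 \
-XX:ParallelGCThreads=7 \
-XX:CMSInitiatingOccupancyFraction=70 \
-XX:+UseCMSInitiatingOccupancyOnly"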


Here is a sample output from the gc_log

2016-04-28T04:36:47.240-0400: 27905.535: Total time for which application
threads were stopped: 0.1667520 seconds, Stopping threads took: 0.0171900
seconds
{Heap before GC invocations=2051 (full 59):
 par new generation   total 6990528K, used 2626705K [0x2b16c000,
0x2b18c000, 0x2b18c000)
  eden space 5592448K,  44% used [0x2b16c000, 0x2b17571b9948,
0x2b181556)
  from space 1398080K,  10% used [0x2b181556, 0x2b181e8cac28,
0x2b186aab)
  to   space 1398080K,   0% used [0x2b186aab, 0x2b186aab,
0x2b18c000)
 concurrent mark-sweep generation total 25165824K, used 25122205K
[0x2b18c000, 0x2b1ec000, 0x2b1ec000)
 Metaspace   used 41840K, capacity 42284K, committed 42680K, reserved
43008K
2016-04-28T04:36:49.828-0400: 27908.123: [GC (Allocation Failure)
2016-04-28T04:36:49.828-0400: 27908.124: [CMS2016-04-28T04:36:49.912-0400:
27908.207: [CMS-concurr
ent-abortable-preclean: 5.615/5.862 secs] [Times: user=17.70 sys=2.77,
real=5.86 secs]
 (concurrent mode failure): 25122205K->15103706K(25165824K), 8.5567560
secs] 27748910K->15103706K(32156352K), [Metaspace: 41840K->41840K(43008K)],
8.5657830 secs] [
Times: user=8.56 sys=0.01, real=8.57 secs]
Heap after GC invocations=2052 (full 60):
 par new generation   total 6990528K, used 0K [0x2b16c000,
0x2b18c000, 0x2b18c000)
  eden space 5592448K,   0% used [0x2b16c000, 0x2b16c000,
0x2b181556)
  from space 1398080K,   0% used [0x2b181556, 0x2b181556,
0x2b186aab)
  to   space 1398080K,   0% used [0x2b186aab, 0x2b186aab,
0x2b18c000)
 concurrent mark-sweep generation total 25165824K, used 15103706K
[0x2b18c000, 0x2b1ec000, 0x2b1ec000)
 Metaspace   used 41840K, capacity 42284K, committed 42680K, reserved
43008K
}
2016-04-28T04:36:58.395-0400: 27916.690: Total time for which application
threads were stopped: 8.5676090 seconds, Stopping threads took: 0.0003930
seconds

I read the instructions here, https://wiki.apache.org/solr/ShawnHeisey, but
they seem to be specific to Java 7. Are there any updated recommendations
for Java 8?


Re: Solr 5.2.1 on Java 8 GC

2016-04-28 Thread Nick Vasilyev
Hi Yonik,

I forgot to mention that the index is approximately 50 million docs split
across 4 shards (replication factor 2) on 2 Solr nodes.

This particular script filters items by category (anywhere from 10 to
~1,000,000 items in each) and runs facets on the top X terms for particular
fields. The query looks like this:

{
  q => "cat:$code",
  rows => 0,
  facet => 'true',
  'facet.field' => [ 'key_phrases', 'mmfr_exact' ],
  'f.key_phrases.facet.limit' => 100,
  'f.mmfr_exact.facet.limit' => 20,
  'facet.mincount' => 5,
  distrib => 'false',
}

I know it can be reworked some, especially considering there are thousands
of similar requests going out. However, we didn't have this issue before,
and I am worried that it may be a symptom of a larger underlying problem.

On Thu, Apr 28, 2016 at 11:34 AM, Yonik Seeley  wrote:

> On Thu, Apr 28, 2016 at 11:29 AM, Nick Vasilyev
>  wrote:
> > Hello,
> >
> > We recently upgraded to Solr 5.2.1 with jre1.8.0_74 and are seeing long
> GC
> > pauses when running jobs that do some hairy faceting. The same jobs
> worked
> > fine with our previous 4.6 Solr.
>
> What does a typical request look like, and what are the field types
> that faceting is done on?
>
> -Yonik
>
>
> > The JVM is configured with 32GB heap with default GC settings, however
> I've
> > been tweaking the GC settings to no avail. The latest version had the
> > following differences from the default config:
> >
> > XX:ConcGCThreads and XX:ParallelGCThreads are increased from 4 to 7
> >
> > XX:CMSInitiatingOccupancyFraction increased from 50 to 70
> >
> >
> > Here is a sample output from the gc_log
> >
> > 2016-04-28T04:36:47.240-0400: 27905.535: Total time for which application
> > threads were stopped: 0.1667520 seconds, Stopping threads took: 0.0171900
> > seconds
> > {Heap before GC invocations=2051 (full 59):
> >  par new generation   total 6990528K, used 2626705K [0x2b16c000,
> > 0x2b18c000, 0x2b18c000)
> >   eden space 5592448K,  44% used [0x2b16c000, 0x2b17571b9948,
> > 0x2b181556)
> >   from space 1398080K,  10% used [0x2b181556, 0x2b181e8cac28,
> > 0x2b186aab)
> >   to   space 1398080K,   0% used [0x2b186aab, 0x2b186aab,
> > 0x2b18c000)
> >  concurrent mark-sweep generation total 25165824K, used 25122205K
> > [0x2b18c000, 0x2b1ec000, 0x2b1ec000)
> >  Metaspace   used 41840K, capacity 42284K, committed 42680K, reserved
> > 43008K
> > 2016-04-28T04:36:49.828-0400: 27908.123: [GC (Allocation Failure)
> > 2016-04-28T04:36:49.828-0400: 27908.124:
> [CMS2016-04-28T04:36:49.912-0400:
> > 27908.207: [CMS-concurr
> > ent-abortable-preclean: 5.615/5.862 secs] [Times: user=17.70 sys=2.77,
> > real=5.86 secs]
> >  (concurrent mode failure): 25122205K->15103706K(25165824K), 8.5567560
> > secs] 27748910K->15103706K(32156352K), [Metaspace:
> 41840K->41840K(43008K)],
> > 8.5657830 secs] [
> > Times: user=8.56 sys=0.01, real=8.57 secs]
> > Heap after GC invocations=2052 (full 60):
> >  par new generation   total 6990528K, used 0K [0x2b16c000,
> > 0x2b18c000, 0x2b18c000)
> >   eden space 5592448K,   0% used [0x2b16c000, 0x2b16c000,
> > 0x2b181556)
> >   from space 1398080K,   0% used [0x2b181556, 0x2b181556,
> > 0x2b186aab)
> >   to   space 1398080K,   0% used [0x2b186aab, 0x2b186aab,
> > 0x2b18c000)
> >  concurrent mark-sweep generation total 25165824K, used 15103706K
> > [0x2b18c000, 0x2b1ec000, 0x2b1ec000)
> >  Metaspace   used 41840K, capacity 42284K, committed 42680K, reserved
> > 43008K
> > }
> > 2016-04-28T04:36:58.395-0400: 27916.690: Total time for which application
> > threads were stopped: 8.5676090 seconds, Stopping threads took: 0.0003930
> > seconds
> >
> > I read the instructions here, https://wiki.apache.org/solr/ShawnHeisey,
> but
> > they seem to be specific to Java 7. Are there any updated recommendations
> > for Java 8?
>


Re: Solr 5.2.1 on Java 8 GC

2016-04-28 Thread Nick Vasilyev
mmfr_exact is a string field. key_phrases is a multivalued string field.

On Thu, Apr 28, 2016 at 11:47 AM, Yonik Seeley  wrote:

> What about the field types though... are they single valued or multi
> valued, string, text, numeric?
>
> -Yonik
>
>
> On Thu, Apr 28, 2016 at 11:43 AM, Nick Vasilyev
>  wrote:
> > Hi Yonik,
> >
> > I forgot to mention that the index is approximately 50 million docs split
> > across 4 shards (replication factor 2) on 2 solr replicas.
> >
> > This particular script will filter items based on a category
> (10-~1,000,000
> > items in each) and run facets on top X terms for particular fields. Query
> > looks like this:
> >
> > {
> >q => "cat:$code",
> >rows => 0,
> >facet => 'true',
> >'facet.field' => [ 'key_phrases', 'mmfr_exact' ],
> >'f.key_phrases.facet.limit' => 100,
> >'f.mmfr_exact.facet.limit' => 20,
> >'facet.mincount' => 5,
> >distrib => 'false',
> >  }
> >
> > I know it can be re-worked some, especially considering there are
> thousands
> > of similar requests going out. However we didn't have this issue before
> and
> > I am worried that it may be a symptom of a larger underlying problem.
> >
> > On Thu, Apr 28, 2016 at 11:34 AM, Yonik Seeley 
> wrote:
> >
> >> On Thu, Apr 28, 2016 at 11:29 AM, Nick Vasilyev
> >>  wrote:
> >> > Hello,
> >> >
> >> > We recently upgraded to Solr 5.2.1 with jre1.8.0_74 and are seeing
> long
> >> GC
> >> > pauses when running jobs that do some hairy faceting. The same jobs
> >> worked
> >> > fine with our previous 4.6 Solr.
> >>
> >> What does a typical request look like, and what are the field types
> >> that faceting is done on?
> >>
> >> -Yonik
> >>
> >>
> >> > The JVM is configured with 32GB heap with default GC settings, however
> >> I've
> >> > been tweaking the GC settings to no avail. The latest version had the
> >> > following differences from the default config:
> >> >
> >> > XX:ConcGCThreads and XX:ParallelGCThreads are increased from 4 to 7
> >> >
> >> > XX:CMSInitiatingOccupancyFraction increased from 50 to 70
> >> >
> >> >
> >> > Here is a sample output from the gc_log
> >> >
> >> > 2016-04-28T04:36:47.240-0400: 27905.535: Total time for which
> application
> >> > threads were stopped: 0.1667520 seconds, Stopping threads took:
> 0.0171900
> >> > seconds
> >> > {Heap before GC invocations=2051 (full 59):
> >> >  par new generation   total 6990528K, used 2626705K
> [0x2b16c000,
> >> > 0x2b18c000, 0x2b18c000)
> >> >   eden space 5592448K,  44% used [0x2b16c000,
> 0x2b17571b9948,
> >> > 0x2b181556)
> >> >   from space 1398080K,  10% used [0x2b181556,
> 0x2b181e8cac28,
> >> > 0x2b186aab)
> >> >   to   space 1398080K,   0% used [0x2b186aab,
> 0x2b186aab,
> >> > 0x2b18c000)
> >> >  concurrent mark-sweep generation total 25165824K, used 25122205K
> >> > [0x2b18c000, 0x2b1ec000, 0x2b1ec000)
> >> >  Metaspace   used 41840K, capacity 42284K, committed 42680K,
> reserved
> >> > 43008K
> >> > 2016-04-28T04:36:49.828-0400: 27908.123: [GC (Allocation Failure)
> >> > 2016-04-28T04:36:49.828-0400: 27908.124:
> >> [CMS2016-04-28T04:36:49.912-0400:
> >> > 27908.207: [CMS-concurr
> >> > ent-abortable-preclean: 5.615/5.862 secs] [Times: user=17.70 sys=2.77,
> >> > real=5.86 secs]
> >> >  (concurrent mode failure): 25122205K->15103706K(25165824K), 8.5567560
> >> > secs] 27748910K->15103706K(32156352K), [Metaspace:
> >> 41840K->41840K(43008K)],
> >> > 8.5657830 secs] [
> >> > Times: user=8.56 sys=0.01, real=8.57 secs]
> >> > Heap after GC invocations=2052 (full 60):
> >> >  par new generation   total 6990528K, used 0K [0x2b16c000,
> >> > 0x2b18c000, 0x2b18c000)
> >> >   eden space 5592448K,   0% used [0x2b16c000,
> 0x2b16c000,
> >> > 0x2b181556)
> >> >   from space 1398080K,   0% used [0x2b181556,
> 0x2b181556,
> >> > 0x2b186aab)
> >> >   to   space 1398080K,   0% used [0x2b186aab,
> 0x2b186aab,
> >> > 0x2b18c000)
> >> >  concurrent mark-sweep generation total 25165824K, used 15103706K
> >> > [0x2b18c000, 0x2b1ec000, 0x2b1ec000)
> >> >  Metaspace   used 41840K, capacity 42284K, committed 42680K,
> reserved
> >> > 43008K
> >> > }
> >> > 2016-04-28T04:36:58.395-0400: 27916.690: Total time for which
> application
> >> > threads were stopped: 8.5676090 seconds, Stopping threads took:
> 0.0003930
> >> > seconds
> >> >
> >> > I read the instructions here,
> https://wiki.apache.org/solr/ShawnHeisey,
> >> but
> >> > they seem to be specific to Java 7. Are there any updated
> recommendations
> >> > for Java 8?
> >>
>


Re: Solr 5.2.1 on Java 8 GC

2016-04-28 Thread Nick Vasilyev
The working set is larger than the heap. This is our largest collection; all
shards combined would probably be around 60GB in total, and there are also a
few other, much smaller collections.

During normal operation, the JVM memory utilization hovers between 17GB and
22GB if we aren't indexing any data.

Either way, this wasn't a problem before. I suspect it is because we are now
on Java 8, so I wanted to reach out to the community to see if there are any
new best practices around GC tuning, since the current recommendation seems
to be for Java 7.


On Thu, Apr 28, 2016 at 11:54 AM, Walter Underwood 
wrote:

> 32 GB is a pretty big heap. If the working set is really smaller than
> that, the extra heap just makes a full GC take longer.
>
> How much heap is used after a full GC? Take the largest value you see
> there, then add a bit more, maybe 25% more or 2 GB more.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Apr 28, 2016, at 8:50 AM, Nick Vasilyev 
> wrote:
> >
> > mmfr_exact is a string field. key_phrases is a multivalued string field.
> >
> > On Thu, Apr 28, 2016 at 11:47 AM, Yonik Seeley 
> wrote:
> >
> >> What about the field types though... are they single valued or multi
> >> valued, string, text, numeric?
> >>
> >> -Yonik
> >>
> >>
> >> On Thu, Apr 28, 2016 at 11:43 AM, Nick Vasilyev
> >>  wrote:
> >>> Hi Yonik,
> >>>
> >>> I forgot to mention that the index is approximately 50 million docs
> split
> >>> across 4 shards (replication factor 2) on 2 solr replicas.
> >>>
> >>> This particular script will filter items based on a category
> >> (10-~1,000,000
> >>> items in each) and run facets on top X terms for particular fields.
> Query
> >>> looks like this:
> >>>
> >>> {
> >>>   q => "cat:$code",
> >>>   rows => 0,
> >>>   facet => 'true',
> >>>   'facet.field' => [ 'key_phrases', 'mmfr_exact' ],
> >>>   'f.key_phrases.facet.limit' => 100,
> >>>   'f.mmfr_exact.facet.limit' => 20,
> >>>   'facet.mincount' => 5,
> >>>   distrib => 'false',
> >>> }
> >>>
> >>> I know it can be re-worked some, especially considering there are
> >> thousands
> >>> of similar requests going out. However we didn't have this issue before
> >> and
> >>> I am worried that it may be a symptom of a larger underlying problem.
> >>>
> >>> On Thu, Apr 28, 2016 at 11:34 AM, Yonik Seeley 
> >> wrote:
> >>>
> >>>> On Thu, Apr 28, 2016 at 11:29 AM, Nick Vasilyev
> >>>>  wrote:
> >>>>> Hello,
> >>>>>
> >>>>> We recently upgraded to Solr 5.2.1 with jre1.8.0_74 and are seeing
> >> long
> >>>> GC
> >>>>> pauses when running jobs that do some hairy faceting. The same jobs
> >>>> worked
> >>>>> fine with our previous 4.6 Solr.
> >>>>
> >>>> What does a typical request look like, and what are the field types
> >>>> that faceting is done on?
> >>>>
> >>>> -Yonik
> >>>>
> >>>>
> >>>>> The JVM is configured with 32GB heap with default GC settings,
> however
> >>>> I've
> >>>>> been tweaking the GC settings to no avail. The latest version had the
> >>>>> following differences from the default config:
> >>>>>
> >>>>> XX:ConcGCThreads and XX:ParallelGCThreads are increased from 4 to 7
> >>>>>
> >>>>> XX:CMSInitiatingOccupancyFraction increased from 50 to 70
> >>>>>
> >>>>>
> >>>>> Here is a sample output from the gc_log
> >>>>>
> >>>>> 2016-04-28T04:36:47.240-0400: 27905.535: Total time for which
> >> application
> >>>>> threads were stopped: 0.1667520 seconds, Stopping threads took:
> >> 0.0171900
> >>>>> seconds
> >>>>> {Heap before GC invocations=2051 (full 59):
> >>>>> par new generation   total 6990528K, used 2626705K
> >> [0x2b16c000,
> >>>>> 0x2b18c000, 0x2b18c000)
> >>>>>  eden space 5592448K,  

Re: Solr 5.2.1 on Java 8 GC

2016-04-28 Thread Nick Vasilyev
Correction, the key_phrases field is set up as follows:

[field type definition stripped by the mail archive]
On Thu, Apr 28, 2016 at 12:03 PM, Nick Vasilyev 
wrote:

> The working set is larger than the heap. This is our largest collection
> and all shards combined would probably be around 60GB in total, there are
> also a few other much smaller collections.
>
> During normal operations the JVM memory utilization hangs between 17GB and
> 22GB if we aren't indexing any data.
>
> Either way, this wasn't a problem before. I suspect that it is because we
> are now on Java 8 so I wanted to reach out to the community to see if there
> are any new best practices around GC tuning since the current
> recommendation seems to be for Java 7.
>
>
> On Thu, Apr 28, 2016 at 11:54 AM, Walter Underwood 
> wrote:
>
>> 32 GB is a pretty big heap. If the working set is really smaller than
>> that, the extra heap just makes a full GC take longer.
>>
>> How much heap is used after a full GC? Take the largest value you see
>> there, then add a bit more, maybe 25% more or 2 GB more.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>> > On Apr 28, 2016, at 8:50 AM, Nick Vasilyev 
>> wrote:
>> >
>> > mmfr_exact is a string field. key_phrases is a multivalued string field.
>> >
>> > On Thu, Apr 28, 2016 at 11:47 AM, Yonik Seeley 
>> wrote:
>> >
>> >> What about the field types though... are they single valued or multi
>> >> valued, string, text, numeric?
>> >>
>> >> -Yonik
>> >>
>> >>
>> >> On Thu, Apr 28, 2016 at 11:43 AM, Nick Vasilyev
>> >>  wrote:
>> >>> Hi Yonik,
>> >>>
>> >>> I forgot to mention that the index is approximately 50 million docs
>> split
>> >>> across 4 shards (replication factor 2) on 2 solr replicas.
>> >>>
>> >>> This particular script will filter items based on a category
>> >> (10-~1,000,000
>> >>> items in each) and run facets on top X terms for particular fields.
>> Query
>> >>> looks like this:
>> >>>
>> >>> {
>> >>>   q => "cat:$code",
>> >>>   rows => 0,
>> >>>   facet => 'true',
>> >>>   'facet.field' => [ 'key_phrases', 'mmfr_exact' ],
>> >>>   'f.key_phrases.facet.limit' => 100,
>> >>>   'f.mmfr_exact.facet.limit' => 20,
>> >>>   'facet.mincount' => 5,
>> >>>   distrib => 'false',
>> >>> }
>> >>>
>> >>> I know it can be re-worked some, especially considering there are
>> >> thousands
>> >>> of similar requests going out. However we didn't have this issue
>> before
>> >> and
>> >>> I am worried that it may be a symptom of a larger underlying problem.
>> >>>
>> >>> On Thu, Apr 28, 2016 at 11:34 AM, Yonik Seeley 
>> >> wrote:
>> >>>
>> >>>> On Thu, Apr 28, 2016 at 11:29 AM, Nick Vasilyev
>> >>>>  wrote:
>> >>>>> Hello,
>> >>>>>
>> >>>>> We recently upgraded to Solr 5.2.1 with jre1.8.0_74 and are seeing
>> >> long
>> >>>> GC
>> >>>>> pauses when running jobs that do some hairy faceting. The same jobs
>> >>>> worked
>> >>>>> fine with our previous 4.6 Solr.
>> >>>>
>> >>>> What does a typical request look like, and what are the field types
>> >>>> that faceting is done on?
>> >>>>
>> >>>> -Yonik
>> >>>>
>> >>>>
>> >>>>> The JVM is configured with 32GB heap with default GC settings,
>> however
>> >>>> I've
>> >>>>> been tweaking the GC settings to no avail. The latest version had
>> the
>> >>>>> following differences from the default config:
>> >>>>>
>> >>>>> XX:ConcGCThreads and XX:ParallelGCThreads are increased from 4 to 7
>> >>>>>
>> >>>>> XX:CMSInitiatingOccupancyFraction increased from 50 to 70
>> >>>>>
>> >>>>>
>> >>>>> Here is 

Re: Solr 5.2.1 on Java 8 GC

2016-04-28 Thread Nick Vasilyev
Hi Yonik,

There are a lot of logistics involved with re-indexing and, naturally,
upgrading Solr. I was hoping there was an easier alternative, since this
is only a single back-end script that is having problems.

Is there any room for improvement with tweaking GC params?

On Thu, Apr 28, 2016 at 12:06 PM, Yonik Seeley  wrote:

> On Thu, Apr 28, 2016 at 11:50 AM, Nick Vasilyev
>  wrote:
> > mmfr_exact is a string field. key_phrases is a multivalued string field.
>
> One guess is that top-level field caches (and UnInvertedField use)
> were removed in
> https://issues.apache.org/jira/browse/LUCENE-5666
>
> While this is better for NRT (a quickly changing index), it's worse in
> CPU, and can be worse in memory overhead for very static indexes.
>
> Multi-valued string faceting was hit hardest:
> https://issues.apache.org/jira/browse/SOLR-8096
> Although I only measured the CPU impact, and not memory.
>
> The 4.x method of faceting was restored as part of
> https://issues.apache.org/jira/browse/SOLR-8466
>
> If this is the issue, you can:
> - try reindexing with docValues... that should solve memory issues at
> the expense of some speed
> - upgrade to a more recent Solr version and use facet.method=uif for
> your multi-valued string fields
>
> -Yonik
>


Re: Solr 5.2.1 on Java 8 GC

2016-04-29 Thread Nick Vasilyev
topping threads took: 0.0001340
seconds
2016-04-26T04:42:36.033-0400: 245437.715: Total time for which application
threads were stopped: 9.9446430 seconds, Stopping threads took: 0.0007500
seconds
2016-04-26T04:43:02.409-0400: 245464.091: Total time for which application
threads were stopped: 10.4197000 seconds, Stopping threads took: 0.260
seconds
2016-04-26T04:43:29.559-0400: 245491.241: Total time for which application
threads were stopped: 9.6712880 seconds, Stopping threads took: 0.0001080
seconds
2016-04-26T04:43:56.648-0400: 245518.330: Total time for which application
threads were stopped: 9.8339590 seconds, Stopping threads took: 0.0011820
seconds
2016-04-26T04:45:35.358-0400: 245617.040: Total time for which application
threads were stopped: 9.5853210 seconds, Stopping threads took: 0.0001760
seconds
2016-04-26T04:54:58.764-0400: 246180.446: Total time for which application
threads were stopped: 2.9048350 seconds, Stopping threads took: 0.0008180
seconds
2016-04-26T04:55:06.107-0400: 246187.789: Total time for which application
threads were stopped: 1.1189760 seconds, Stopping threads took: 0.0011390
seconds

After:
2016-04-29T04:30:05.758-0400: 29962.077: Total time for which application
threads were stopped: 1.0823960 seconds, Stopping threads took: 0.0005840
seconds
2016-04-29T04:30:11.349-0400: 29967.668: Total time for which application
threads were stopped: 1.4147830 seconds, Stopping threads took: 0.0008980
seconds
2016-04-29T04:30:17.198-0400: 29973.517: Total time for which application
threads were stopped: 1.6294590 seconds, Stopping threads took: 0.0009380
seconds
2016-04-29T04:30:22.350-0400: 29978.669: Total time for which application
threads were stopped: 1.6787880 seconds, Stopping threads took: 0.0012320
seconds
2016-04-29T04:30:28.230-0400: 29984.549: Total time for which application
threads were stopped: 1.6895760 seconds, Stopping threads took: 0.0010270
seconds
2016-04-29T04:30:29.944-0400: 29986.263: Total time for which application
threads were stopped: 1.5271500 seconds, Stopping threads took: 0.0009670
seconds
2016-04-29T04:30:35.282-0400: 29991.601: Total time for which application
threads were stopped: 1.6575670 seconds, Stopping threads took: 0.0006200
seconds
2016-04-29T04:30:51.011-0400: 30007.329: Total time for which application
threads were stopped: 2.0383550 seconds, Stopping threads took: 0.0004640
seconds
2016-04-29T04:31:03.032-0400: 30019.351: Total time for which application
threads were stopped: 2.1963570 seconds, Stopping threads took: 0.0004650
seconds
2016-04-29T04:31:07.679-0400: 30023.998: Total time for which application
threads were stopped: 1.2220760 seconds, Stopping threads took: 0.0004720
seconds

On Thu, Apr 28, 2016 at 1:02 PM, Jeff Wartes  wrote:

>
> Shawn Heisey’s page is the usual reference guide for GC settings:
> https://wiki.apache.org/solr/ShawnHeisey
> Most of the learnings from that are in the Solr 5.x startup scripts
> already, but your heap is bigger, so your mileage may vary.
>
> Some tools I’ve used while doing GC tuning:
>
> * VisualVM - Comes with the jdk. It has a Visual GC plug-in that’s pretty
> nice for visualizing what’s going on in realtime, but you need to connect
> it via jstatd for that to work.
> * GCViewer - Visualizes a GC log. The UI leaves a lot to be desired, but
> it’s the best tool I’ve found for this purpose. Use this fork for jdk 6+ -
> https://github.com/chewiebug/GCViewer
> * Swiss Java Knife has a bunch of useful features -
> https://github.com/aragozin/jvm-tools
> * YourKit - I’ve been using this lately to analyze where garbage comes
> from. It’s not free though.
> * Eclipse Memory Analyzer - I used this to analyze heap dumps before I got
> a YourKit license: http://www.eclipse.org/mat/
>
> Good luck!
>
>
>
>
>
>
> On 4/28/16, 9:27 AM, "Yonik Seeley"  wrote:
>
> >On Thu, Apr 28, 2016 at 12:21 PM, Nick Vasilyev
> > wrote:
> >> Hi Yonik,
> >>
> >> There are a lot of logistics involved with re-indexing and naturally
> >> upgrading Solr. I was hoping that there is an easier alternative since
> this
> >> is only a single back end script that is having problems.
> >>
> >> Is there any room for improvement with tweaking GC params?
> >
> >There always is ;-)  But I'm not a GC tuning expert.  I prefer to
> >attack memory problems more head-on (i.e. with code to use less
> >memory).
> >
> >-Yonik
>


Re: Solr5.5:DocValues/CopyField does not work with Atomic updates

2016-04-30 Thread Nick Vasilyev
I am also running into this problem on Solr 6.

On Sun, Apr 24, 2016 at 6:10 PM, Karthik Ramachandran <
kramachand...@commvault.com> wrote:

> I have opened JIRA
>
> https://issues.apache.org/jira/browse/SOLR-9034
>
> I will upload the patch soon.
>
> With Thanks & Regards
> Karthik Ramachandran
> CommVault
> Direct: (732) 923-2197
>  Please don't print this e-mail unless you really need to
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Friday, April 22, 2016 8:24 PM
> To: solr-user 
> Subject: Re: Solr5.5:DocValues/CopyField does not work with Atomic updates
>
> I think I just added the right person, let us know if you don't have
> access and/or if you need access to the LUCENE JIRA.
>
> Erick
>
> On Fri, Apr 22, 2016 at 5:17 PM, Karthik Ramachandran <
> kramachand...@commvault.com> wrote:
> > Eric
> >   I have created a JIRA id (kramachand...@commvault.com).  Once I get
> > access I will create the JIRA and submit the patch.
> >
> > With Thanks & Regards
> > Karthik Ramachandran
> > CommVault
> > Direct: (732) 923-2197
> > P Please don't print this e-mail unless you really need to
> >
> >
> >
> > On 4/22/16, 8:04 PM, "Erick Erickson"  wrote:
> >
> >>Karthik:
> >>
> >>The Apache mailing list is pretty aggressive about removing
> >>attachments. Could you possibly open a JIRA and attach the file as a
> >>patch? If at all possible a patch file with just the diffs would be
> >>best.
> >>
> >>One problem is that it'll be a two-step process. The JIRAs have been
> >>being hit with spam, so you'll have to request access once you create
> >>a JIRA ID (this list would be fine).
> >>
> >>Best,
> >>Erick
> >>
> >>On Thu, Apr 21, 2016 at 9:09 PM, Karthik Ramachandran
> >> wrote:
> >>> We feel the issue is in
> >>>RealTimeGetComponent.getInputDocument(SolrCore
> >>>core,
> >>> BytesRef idBytes) where solr calls getNonStoredDVs and add the
> >>>fields to the  original document without excluding the copyFields.
> >>>
> >>>
> >>>
> >>> We made changes to send the filteredList to
> >>>searcher.decorateDocValueFields
> >>> and it started working.
> >>>
> >>>
> >>>
> >>> Attached is the modified file.
> >>>
> >>>
> >>>
> >>> With Thanks & Regards
> >>> Karthik Ramachandran
> >>> CommVault
> >>> P Please don't print this e-mail unless you really need to
> >>>
> >>>
> >>>
> >>> -Original Message-
> >>> From: Karthik Ramachandran [mailto:mrk...@gmail.com]
> >>> Sent: Friday, April 22, 2016 12:08 AM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: Solr5.5:DocValues/CopyField does not work with Atomic
> >>>updates
> >>>
> >>>
> >>>
> >>> We are trying to update Field A.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> -Karthik
> >>>
> >>>
> >>>
> >>> On Thu, Apr 21, 2016 at 10:36 PM, John Bickerstaff
> >>> >>>
>  wrote:
> >>>
> >>>
> >>>
>  Which field do you try to atomically update?  A or B or some other?
> >>>
>  On Apr 21, 2016 8:29 PM, "Tirthankar Chatterjee" <
> >>>
>  tchatter...@commvault.com>
> >>>
>  wrote:
> >>>
> 
> >>>
>  > Hi,
> >>>
>  > Here is the scenario for SOLR5.5:
> >>>
>  >
> >>>
>  > FieldA type= stored=true indexed=true
> >>>
>  >
> >>>
>  > FieldB type= stored=false indexed=true docValue=true
> >>>
>  > usedocvalueasstored=false
> >>>
>  >
> >>>
>  > FieldA copyTo FieldB
> >>>
>  >
> >>>
>  > Try an Atomic update and we are getting this error:
> >>>
>  >
> >>>
>  > possible analysis error: DocValuesField "mtmround" appears more
>  > than
> >>>
>  > once in this document (only one value is allowed per field)
> >>>
>  >
> >>>
>  > How do we resolve this.

Re: Solr 5.2.1 on Java 8 GC

2016-05-01 Thread Nick Vasilyev
How do you log GC frequency and pause times to compare one GC configuration
with another?

Also, do you tweak parameters automatically, or is there a set of
configurations that gets tested?

Lastly, I was under the impression that G1 is not recommended because of
some issues with Lucene, so I haven't tried it. Are you guys seeing any
significant performance benefits with it on Java 8? Any issues?
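
For context, the pause numbers I've posted in this thread come from the
stock Java 8 HotSpot GC logging flags, along these lines (the log path is
just an example):

  -Xloggc:/var/solr/logs/solr_gc.log
  -XX:+PrintGCDetails
  -XX:+PrintGCDateStamps
  -XX:+PrintGCTimeStamps
  -XX:+PrintGCApplicationStoppedTime

PrintGCApplicationStoppedTime is what produces the "Total time for which
application threads were stopped" lines.
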
On May 1, 2016 12:57 PM, "Bram Van Dam"  wrote:

> On 30/04/16 17:34, Davis, Daniel (NIH/NLM) [C] wrote:
> > Bram, on the subject of brute force - if your script is "clever" and
> uses binary first search, I'd love to adapt it to my environment.  I am
> trying to build a truly multi-tenant Solr because each of our indexes is
> tiny, but all together they will eventually be big, and so I'll have to
> repeat this experiment, many, many times.
>
> Sorry to disappoint, the script is very dumb, and it doesn't just
> start/stop Solr, it installs our application suite, picks a GC profile
> at random, indexes a boatload of data and then runs a bunch of query tests.
>
> Three pointers I can give you:
>
> 1) beware of JVM versions, especially when using the G1 collector, it
> behaves horribly on older JVMs but rather nicely on newer versions.
>
> 2) At the very least you'll want to test the G1 and CMS collectors.
>
> 3) One large index vs many small indexes: the behaviour is very
> different. Depending on how many indexes you have, it might be worth to
> run each one in a different JVM. Of course that's not practical if you
> have thousands of indexes.
>
>  - Bram
>
>


Re: Solr cloud 6.0.0 with ZooKeeper 3.4.8 Errors

2016-05-04 Thread Nick Vasilyev
It looks like you have too many open files, try increasing the file
descriptor limit.
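
For example, on a typical Linux install (the 65535 value is just an
illustration; pick what fits your setup):

  # check the current limit for the user running Solr
  ulimit -n

  # raise it persistently in /etc/security/limits.conf
  solr  soft  nofile  65535
  solr  hard  nofile  65535

Solr needs to be restarted from a fresh login for the new limit to take
effect.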

On Wed, May 4, 2016 at 3:48 PM, Susheel Kumar  wrote:

> Hello,
>
> I am trying to setup 2 node Solr cloud 6 cluster with ZK 3.4.8 and used the
> install service to setup solr.
>
> After launching Solr Admin Panel on server1, it loses connections in a few
> seconds and then comes back, and the other node server2 is marked as Down
> in the cloud graph. After a few seconds it's losing the connection and
> comes back.
>
> Any idea what may be going wrong? Has anyone used Solr 6 with ZK 3.4.8?
> I have never seen this error before with Solr 5.x with ZK 3.4.6.
>
> Below log from server1 & server2.  The ZK has 3 nodes with chroot enabled.
>
> Thanks,
> Susheel
>
> server1/solr.log
>
> 
>
>
> 2016-05-04 19:20:53.804 INFO  (qtp1989972246-14) [   ]
> o.a.s.c.c.ZkStateReader path=[/collections/collection1]
> [configName]=[collection1] specified config exists in ZooKeeper
>
> 2016-05-04 19:20:53.806 INFO  (qtp1989972246-14) [   ] o.a.s.s.HttpSolrCall
> [admin] webapp=null path=/admin/collections
> params={action=CLUSTERSTATUS&wt=json&_=1462389588125} status=0 QTime=25
>
> 2016-05-04 19:20:53.859 INFO  (qtp1989972246-19) [   ]
> o.a.s.h.a.CollectionsHandler Invoked Collection Action :list with params
> action=LIST&wt=json&_=1462389588125 and sendToOCPQueue=true
>
> 2016-05-04 19:20:53.861 INFO  (qtp1989972246-19) [   ] o.a.s.s.HttpSolrCall
> [admin] webapp=null path=/admin/collections
> params={action=LIST&wt=json&_=1462389588125} status=0 QTime=2
>
> 2016-05-04 19:20:57.520 INFO  (qtp1989972246-13) [   ] o.a.s.s.HttpSolrCall
> [admin] webapp=null path=/admin/cores
> params={indexInfo=false&wt=json&_=1462389588124} status=0 QTime=0
>
> 2016-05-04 19:20:57.546 INFO  (qtp1989972246-15) [   ] o.a.s.s.HttpSolrCall
> [admin] webapp=null path=/admin/info/system
> params={wt=json&_=1462389588126} status=0 QTime=25
>
> 2016-05-04 19:20:57.610 INFO  (qtp1989972246-13) [   ]
> o.a.s.h.a.CollectionsHandler Invoked Collection Action :list with params
> action=LIST&wt=json&_=1462389588125 and sendToOCPQueue=true
>
> 2016-05-04 19:20:57.613 INFO  (qtp1989972246-13) [   ] o.a.s.s.HttpSolrCall
> [admin] webapp=null path=/admin/collections
> params={action=LIST&wt=json&_=1462389588125} status=0 QTime=3
>
> 2016-05-04 19:21:29.139 INFO  (qtp1989972246-5980) [   ]
> o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
> when connecting to {}->http://server2:8983: Too many open files
>
> 2016-05-04 19:21:29.139 INFO  (qtp1989972246-5983) [   ]
> o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
> when connecting to {}->http://server2:8983: Too many open files
>
> 2016-05-04 19:21:29.139 INFO  (qtp1989972246-5984) [   ]
> o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
> when connecting to {}->http://server2:8983: Too many open files
>
> 2016-05-04 19:21:29.141 INFO  (qtp1989972246-5984) [   ]
> o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983
>
> 2016-05-04 19:21:29.141 INFO  (qtp1989972246-5984) [   ]
> o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
> when connecting to {}->http://server2:8983: Too many open files
>
> 2016-05-04 19:21:29.142 INFO  (qtp1989972246-5984) [   ]
> o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983
>
> 2016-05-04 19:21:29.142 INFO  (qtp1989972246-5984) [   ]
> o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
> when connecting to {}->http://server2:8983: Too many open files
>
> 2016-05-04 19:21:29.142 INFO  (qtp1989972246-5984) [   ]
> o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983
>
> 2016-05-04 19:21:29.140 INFO  (qtp1989972246-5983) [   ]
> o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983
>
> 2016-05-04 19:21:29.140 INFO  (qtp1989972246-5980) [   ]
> o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983
>
> 2016-05-04 19:21:29.143 INFO  (qtp1989972246-5983) [   ]
> o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
> when connecting to {}->http://server2:8983: Too many open files
>
> 2016-05-04 19:21:29.144 INFO  (qtp1989972246-5983) [   ]
> o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983
>
> 2016-05-04 19:21:29.144 INFO  (qtp1989972246-5980) [   ]
> o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
> when connecting to {}->http://server2:8983: Too many open files
>
> 2016-05-04 19:21:29.144 INFO  (qtp1989972246-5983) [   ]
> o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
> when connecting to {}->http://server2:8983: Too many open files
>
> 2016-05-04 19:20:53.806 INFO  (qtp1989972246-14) [   ] o.a.s.s.HttpSolrCall
> [admin] webapp=null path=/admin/collections
> params={action=CLUSTERSTATUS&wt=json&_=1462389588125} status=0 QTime=25
>
> 2016-05-04 19:20:53.859 INFO  (qtp1989972246-19) [   ]
> 

Re: Solr cloud 6.0.0 with ZooKeeper 3.4.8 Errors

2016-05-04 Thread Nick Vasilyev
Not sure about your environment, so it's hard to say why you haven't run
into this issue before.

As for the suggested limit, I am not sure, it would depend on your system
and if you really want to limit it. I personally just jack it up to 5.

On Wed, May 4, 2016 at 6:13 PM, Susheel Kumar  wrote:

> Thanks, Nick. Do we know any suggested # for file descriptor limit with
> Solr6?  Also wondering why i haven't seen this problem before with Solr
> 5.x?
>
> On Wed, May 4, 2016 at 4:54 PM, Nick Vasilyev 
> wrote:
>
> > It looks like you have too many open files, try increasing the file
> > descriptor limit.
> >
> > On Wed, May 4, 2016 at 3:48 PM, Susheel Kumar 
> > wrote:
> >
> > > Hello,
> > >
> > > I am trying to setup 2 node Solr cloud 6 cluster with ZK 3.4.8 and used
> > the
> > > install service to setup solr.
> > >
> > > After launching Solr Admin Panel on server1, it loses connections in a
> > > few seconds and then comes back, and the other node server2 is marked
> > > as Down in the cloud graph. After a few seconds it's losing the
> > > connection and comes back.
> > >
> > > Any idea what may be going wrong? Has anyone used Solr 6 with ZK 3.4.8.
> > > Have never seen this error before with solr 5.x with ZK 3.4.6.
> > >
> > > Below log from server1 & server2.  The ZK has 3 nodes with chroot
> > enabled.
> > >
> > > Thanks,
> > > Susheel
> > >
> > > server1/solr.log
> > >
> > > 
> > >
> > >
> > > 2016-05-04 19:20:53.804 INFO  (qtp1989972246-14) [   ]
> > > o.a.s.c.c.ZkStateReader path=[/collections/collection1]
> > > [configName]=[collection1] specified config exists in ZooKeeper
> > >
> > > 2016-05-04 19:20:53.806 INFO  (qtp1989972246-14) [   ]
> > o.a.s.s.HttpSolrCall
> > > [admin] webapp=null path=/admin/collections
> > > params={action=CLUSTERSTATUS&wt=json&_=1462389588125} status=0 QTime=25
> > >
> > > 2016-05-04 19:20:53.859 INFO  (qtp1989972246-19) [   ]
> > > o.a.s.h.a.CollectionsHandler Invoked Collection Action :list with
> params
> > > action=LIST&wt=json&_=1462389588125 and sendToOCPQueue=true
> > >
> > > 2016-05-04 19:20:53.861 INFO  (qtp1989972246-19) [   ]
> > o.a.s.s.HttpSolrCall
> > > [admin] webapp=null path=/admin/collections
> > > params={action=LIST&wt=json&_=1462389588125} status=0 QTime=2
> > >
> > > 2016-05-04 19:20:57.520 INFO  (qtp1989972246-13) [   ]
> > o.a.s.s.HttpSolrCall
> > > [admin] webapp=null path=/admin/cores
> > > params={indexInfo=false&wt=json&_=1462389588124} status=0 QTime=0
> > >
> > > 2016-05-04 19:20:57.546 INFO  (qtp1989972246-15) [   ]
> > o.a.s.s.HttpSolrCall
> > > [admin] webapp=null path=/admin/info/system
> > > params={wt=json&_=1462389588126} status=0 QTime=25
> > >
> > > 2016-05-04 19:20:57.610 INFO  (qtp1989972246-13) [   ]
> > > o.a.s.h.a.CollectionsHandler Invoked Collection Action :list with
> params
> > > action=LIST&wt=json&_=1462389588125 and sendToOCPQueue=true
> > >
> > > 2016-05-04 19:20:57.613 INFO  (qtp1989972246-13) [   ]
> > o.a.s.s.HttpSolrCall
> > > [admin] webapp=null path=/admin/collections
> > > params={action=LIST&wt=json&_=1462389588125} status=0 QTime=3
> > >
> > > 2016-05-04 19:21:29.139 INFO  (qtp1989972246-5980) [   ]
> > > o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException)
> > caught
> > > when connecting to {}->http://server2:8983: Too many open files
> > >
> > > 2016-05-04 19:21:29.139 INFO  (qtp1989972246-5983) [   ]
> > > o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException)
> > caught
> > > when connecting to {}->http://server2:8983: Too many open files
> > >
> > > 2016-05-04 19:21:29.139 INFO  (qtp1989972246-5984) [   ]
> > > o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException)
> > caught
> > > when connecting to {}->http://server2:8983: Too many open files
> > >
> > > 2016-05-04 19:21:29.141 INFO  (qtp1989972246-5984) [   ]
> > > o.a.h.i.c.DefaultHttpClient Retrying connect to {}->
> http://server2:8983
> > >
> > > 2016-05-04 19:21:29.141 INFO  (qtp1989972246-5984) [   ]
> > > o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException)
> > caught
> > > when connecting to {}->http://server2:8983: Too many open

Re: Solr cloud 6.0.0 with ZooKeeper 3.4.8 Errors

2016-05-05 Thread Nick Vasilyev
Just out of curiosity, are you sharing the ZooKeeper ensemble between the
different versions of Solr? If so, are you specifying a ZooKeeper chroot?
On May 5, 2016 2:05 PM, "Susheel Kumar"  wrote:

> Nick, Hoss - Things are back to normal with ZK 3.4.8 and Solr 6.0.0. I
> switched to Solr 5.5.0 with ZK 3.4.8, which worked fine, and then installed
> 6.0.0. I suspect (not 100% sure) I left ZK dataDir / Solr collection
> directory data from the previous ZK/Solr version, which was probably
> leaving Solr 6 in an unstable state.
>
> Thanks,
> Susheel
>
> On Wed, May 4, 2016 at 9:56 PM, Susheel Kumar 
> wrote:
>
> > Thanks, Nick & Hoss.  I am using the exact same machine, have wiped out
> > Solr 5.5.0 and installed Solr 6.0.0 with external ZK 3.4.8. I checked the
> > file descriptor limit for user solr, which was 12000, and increased it to
> > 52000. I don't see the "too many open files" error in the Solr log now,
> > but the Solr connection is still getting lost in the Admin panel.
> >
> > Let me do some more tests and install older version back to confirm and
> > will share the findings.
> >
> > Thanks,
> > Susheel
> >
> > On Wed, May 4, 2016 at 8:11 PM, Chris Hostetter <
> hossman_luc...@fucit.org>
> > wrote:
> >
> >>
> >> : Thanks, Nick. Do we know any suggested # for file descriptor limit
> with
> >> : Solr6?  Also wondering why i haven't seen this problem before with
> Solr
> >> 5.x?
> >>
> >> are you running Solr6 on the exact same host OS that you were running
> >> Solr5 on?
> >>
> >> even if you are using the "same OS version" on a diff machine, that
> could
> >> explain the discrepency if you (or someone else) increased the file
> >> descriptor limit on the "old machine" but that neverh appened on the
> 'new
> >> machine"
> >>
> >>
> >>
> >> : On Wed, May 4, 2016 at 4:54 PM, Nick Vasilyev <
> nick.vasily...@gmail.com
> >> >
> >> : wrote:
> >> :
> >> : > It looks like you have too many open files, try increasing the file
> >> : > descriptor limit.
> >> : >
> >> : > On Wed, May 4, 2016 at 3:48 PM, Susheel Kumar <
> susheel2...@gmail.com>
> >> : > wrote:
> >> : >
> >> : > > Hello,
> >> : > >
> >> : > > I am trying to setup 2 node Solr cloud 6 cluster with ZK 3.4.8 and
> >> used
> >> : > the
> >> : > > install service to setup solr.
> >> : > >
> >> : > > After launching Solr Admin Panel on server1, it loses connections
> >> : > > in a few seconds and then comes back, and the other node server2
> >> : > > is marked as Down in the cloud graph. After a few seconds it's
> >> : > > losing the connection and comes back.
> >> : > >
> >> : > > Any idea what may be going wrong? Has anyone used Solr 6 with ZK
> >> 3.4.8.
> >> : > > Have never seen this error before with solr 5.x with ZK 3.4.6.
> >> : > >
> >> : > > Below log from server1 & server2.  The ZK has 3 nodes with chroot
> >> : > enabled.
> >> : > >
> >> : > > Thanks,
> >> : > > Susheel
> >> : > >
> >> : > > server1/solr.log
> >> : > >
> >> : > > 
> >> : > >
> >> : > >
> >> : > > 2016-05-04 19:20:53.804 INFO  (qtp1989972246-14) [   ]
> >> : > > o.a.s.c.c.ZkStateReader path=[/collections/collection1]
> >> : > > [configName]=[collection1] specified config exists in ZooKeeper
> >> : > >
> >> : > > 2016-05-04 19:20:53.806 INFO  (qtp1989972246-14) [   ]
> >> : > o.a.s.s.HttpSolrCall
> >> : > > [admin] webapp=null path=/admin/collections
> >> : > > params={action=CLUSTERSTATUS&wt=json&_=1462389588125} status=0
> >> QTime=25
> >> : > >
> >> : > > 2016-05-04 19:20:53.859 INFO  (qtp1989972246-19) [   ]
> >> : > > o.a.s.h.a.CollectionsHandler Invoked Collection Action :list with
> >> params
> >> : > > action=LIST&wt=json&_=1462389588125 and sendToOCPQueue=true
> >> : > >
> >> : > > 2016-05-04 19:20:53.861 INFO  (qtp1989972246-19) [   ]
> >> : > o.a.s.s.HttpSolrCall
> >> : > > [admin] webapp=null path=/admin/collections
> >> : > > params={action=LIST

Filtering on nGroups

2016-05-05 Thread Nick Vasilyev
I am grouping documents on a field and would like to retrieve documents
where the number of items in a group matches a specific value or a range.

I haven't been able to experiment with all the new functionality, but I wanted
to see if this is possible without having to calculate the count and add it
at index time as a field.

Does anyone have any ideas?

Thanks in advance


Re: Filtering on nGroups

2016-05-06 Thread Nick Vasilyev
I am on the 6.1 preview; I just need this to gather some one-time metrics,
so performance isn't an issue.
On May 6, 2016 1:13 PM, "Erick Erickson"  wrote:

What version of Solr? Regardless, if you can pre-process
at index time it'll be faster than anything else (probably).

pre-processing isn't very dynamic though so there are lots
of situations where that's just not viable.

Best,
Erick

On Thu, May 5, 2016 at 6:05 PM, Nick Vasilyev 
wrote:
> I am grouping documents on a field and would like to retrieve documents
> where the number of items in a group matches a specific value or a range.
>
> I haven't been able to experiment with all new functionality, but I wanted
> to see if this is possible without having to calculate the count and add
it
> at index time as a field.
>
> Does anyone have any ideas?
>
> Thanks in advance


Re: Filtering on nGroups

2016-05-06 Thread Nick Vasilyev
I guess it would also work if I could facet on the group counts. I just
need to know how many groups of different sizes there are.
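
As a rough, untested sketch of what I mean, in json.facet syntax (the field
name is illustrative):

  json.facet={
    "groups": {
      "type": "terms",
      "field": "group",
      "limit": -1,
      "mincount": 5
    }
  }

mincount would at least restrict it to groups of a minimum size, and the
per-bucket counts could then be histogrammed client side.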

On Fri, May 6, 2016 at 2:10 PM, Nick Vasilyev 
wrote:

> I am on 6.1 preview, I just need this to gather some one time metrics so
> performance isn't an issue.
> On May 6, 2016 1:13 PM, "Erick Erickson"  wrote:
>
> What version of Solr? Regardless, if you can pre-process
> at index time it'll be faster than anything else (probably).
>
> pre-processing isn't very dynamic though so there are lots
> of situations where that's just not viable.
>
> Best,
> Erick
>
> On Thu, May 5, 2016 at 6:05 PM, Nick Vasilyev 
> wrote:
> > I am grouping documents on a field and would like to retrieve documents
> > where the number of items in a group matches a specific value or a range.
> >
> > I haven't been able to experiment with all new functionality, but I
> wanted
> > to see if this is possible without having to calculate the count and add
> it
> > at index time as a field.
> >
> > Does anyone have any ideas?
> >
> > Thanks in advance
>
>


Re: Solr edismax field boosting

2016-05-09 Thread Nick D
You can add the debug flag to the end of the request and see exactly what
the scoring is and why things are happening.

&debug=ALL will show you everything including the scoring.
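
For example (host, collection, and field names are placeholders):

  http://localhost:8983/solr/yourcollection/select?q=foo&defType=edismax&qf=metatag.description^9%20title^8&debug=all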

Showing the result of the debug query (or adding it to your question here)
should help you decipher what is going on with your scoring and how the
boosts are(n't) working.

Nick

On Mon, May 9, 2016 at 7:22 PM, Megha Bhandari 
wrote:

> Hi
>
> We are trying to boost certain fields with relevancy. However we are not
> getting results as per expectation. Below is the configuration in
> solr-config.xml.
> Even though the title field has a lesser boost than metatag.description
> results for title field are coming higher.
>
> We even created test data that have data only in description in
> metatag.description and title. Example , page 1 has foo in description and
> page 2 has foo in title. Solr is still returning page 2 before page 1.
>
> We are using Solr 5.5 and Nutch 1.11 currently.
>
> Following is the configuration we are using. Any ideas on what we are
> missing to enable correct field boosting?
>
> 
> 
>   
> metatag.keywords^10 metatag.description^9 title^8 h1^7 h2^6 h3^5
> h4^4 id _text_^1
>   
>   explicit
>   10
>
>   
>
>   explicit
>   _text_
>   default
>   on
>   false
>   10
>   5
>   5
>   false
>   true
>   10
>   5
> 
>   id title metatag.description itemtype
> lang metatag.hideininternalsearch metatag.topresultthumbnailalt
> metatag.topresultthumbnailurl playerid playerkey
>   on
>   0
>   title metatag.description
>   
>   
> 
> 
>   spellcheck
> elevator
> 
>   
>
> Thanks
> Megha
>


Re: Solr edismax field boosting

2016-05-09 Thread Nick D
One thing to note: you can also tack on wt=ruby&indent=true; it makes the
debug explain data look better for pasting.

But what I am seeing is a score based entirely on the fact that Solr found
the content you were looking for in an unboosted field, i.e. *_text_*, so
your boosts don't look to be having any effect on scoring the way you are
set up currently. Next, if you look at what is creating the score
difference, you can see it's being computed from the tf-norm values.

Pasting in a cleaner version of the debug output would also help, because
getting all the scoring lined up is a bit of a pain. You can get it to look
something like this with wt=ruby&indent=true:

'
10.541302 = (MATCH) sum of:
  10.541302 = (MATCH) max plus 0.01 times others of:
10.518621 = (MATCH) weight(ngram_tags_a:"developer group"~1 in 88)
[DefaultSimilarity], result of:
  10.518621 = score(doc=88,freq=1.0), product of:
0.64834845 = queryWeight, product of:
  16.223717 = idf(), sum of:
9.416559 = idf(docFreq=21, maxDocs=99469)
6.8071575 = idf(docFreq=298, maxDocs=99469)
  0.039963003 = queryNorm
16.223717 = fieldWeight in 88, product of:
  1.0 = tf(freq=1.0), with freq of:
1.0 = phraseFreq=1.0
  16.223717 = idf(), sum of:
9.416559 = idf(docFreq=21, maxDocs=99469)
6.8071575 = idf(docFreq=298, maxDocs=99469)
  1.0 = fieldNorm(doc=88)
0.32740274 = (MATCH) weight(ngram_content:"developer group"~1 in 88)
[DefaultSimilarity], result of:
  0.32740274 = score(doc=88,freq=2.0), product of:
0.38474476 = queryWeight, product of:
  9.627523 = idf(), sum of:
6.387304 = idf(docFreq=454, maxDocs=99469)
3.240219 = idf(docFreq=10586, maxDocs=99469)
  0.039963003 = queryNorm
0.85096085 = fieldWeight in 88, product of:
  1.4142135 = tf(freq=2.0), with freq of:
2.0 = phraseFreq=2.0
  9.627523 = idf(), sum of:
6.387304 = idf(docFreq=454, maxDocs=99469)
3.240219 = idf(docFreq=10586, maxDocs=99469)
  0.0625 = fieldNorm(doc=88)
1.9406005 = (MATCH) weight(ngram_label:"developer group"~1 in 88)
[DefaultSimilarity], result of:
  1.9406005 = score(doc=88,freq=1.0), product of:
0.556964 = queryWeight, product of:
  13.936991 = idf(), sum of:
9.3721075 = idf(docFreq=22, maxDocs=99469)
4.5648837 = idf(docFreq=2814, maxDocs=99469)
  0.039963003 = queryNorm
3.4842477 = fieldWeight in 88, product of:
  1.0 = tf(freq=1.0), with freq of:
1.0 = phraseFreq=1.0
  13.936991 = idf(), sum of:
9.3721075 = idf(docFreq=22, maxDocs=99469)
4.5648837 = idf(docFreq=2814, maxDocs=99469)
  0.25 = fieldNorm(doc=88)
'

Also, I don't know what Solr version you may be using, so your explain data
might look a bit different.

This link is a bit out of date but may help you understand how the scoring
works:
https://wiki.apache.org/solr/SolrRelevancyFAQ#How_are_documents_scored

Nick



On Mon, May 9, 2016 at 8:08 PM, Megha Bhandari 
wrote:

> To clarify on the debug information given earlier , we changed the query
> factor to the following to ignore title field completely
>
> metatag.description^9 h1^7 h2^6 h3^5 h4^4 _text_^1 id^0.5"
>
> But still title results are coming on top
>
> Full response with debug on:
>
> Full response
>
> {
>   "responseHeader":{
> "status":0,
> "QTime":13,
> "params":{
>   "mm":"100%",
>   "q":"Foo",
>   "tie":"0.99",
>   "defType":"edismax",
>   "q.alt":"Foo",
>   "indent":"on",
>   "qf":"metatag.description^9 h1^7 h2^6 h3^5 h4^4 _text_^1 id^0.5",
>   "wt":"json",
>   "debugQuery":"on",
>   "_":"1462810987788"}},
>   "response":{"numFound":3,"start":0,"maxScore":0.8430033,"docs":[
>   {
> "h2":["Looks like your browser is a little out-of-date."],
> "h3":["Already a member?"],
> "title":"Foo Custon",
> "id":"
> http://localhost:4503/content/uhcdotcom/en/home/waysin/poc/Foo-custon.html
> ",
> "tstamp":"2016-05-09T17:15:57.604Z",
> "metatag.hideininternalsearch":[false],
> "segment":[20160509224553],
> "digest":["844296a63233b3e4089424fe1ec9d036"],
> "boost

Re: Solr edismax field boosting

2016-05-10 Thread Nick D
Megha,

What are the field types for the fields you are trying to search through?
Grab a copy of the schema.xml and paste the relevant fields.

My guess is you have _text_ as a copyField destination for everything else
and have it stored=false, correct? I am not seeing that field in the output
above. Also, in your first post you show the /elevate requestHandler
definition; is that your default request handler, or did you paste in the
incorrect handler?

The simple reason the boosting isn't working is that Solr isn't finding a
match in the query fields you are applying a boost to; it is only finding
the values in the _text_ field.

Also, you should probably read up on BM25Similarity, as this is the default
in the version of Solr you are using.


Nick




On Tue, May 10, 2016 at 12:27 AM, Megha Bhandari 
wrote:

> Thanks Nick, got the response formatted. We are using Solr 5.5.
> Not able to understand why it is ignoring the boosts completely. What
> configuration is being missed? As you correctly pointed out it is only
> calculating based on the _text_ field.
>
> Query:
>
> http://10.203.101.42:8983/solr/uhc/select?defType=edismax&indent=on&mm=1&q=upendra&qf=h1
> ^9.0%20_text_^1.0&wt=ruby&debug=true
>
> Response with debug on:
> {
>   'responseHeader'=>{
> 'status'=>0,
> 'QTime'=>6,
> 'params'=>{
>   'mm'=>'1',
>   'q'=>'upendra',
>   'defType'=>'edismax',
>   'debug'=>'true',
>   'indent'=>'on',
>   'qf'=>'h1^9.0 _text_^1.0',
>   'wt'=>'ruby'}},
>   'response'=>{'numFound'=>6,'start'=>0,'maxScore'=>0.14641379,'docs'=>[
>   {
> 'h2'=>['Looks like your browser is a little out-of-date.'],
> 'h3'=>['Already a member?'],
> 'strtitle'=>['I m increasiing the the page title content Upendra
> Custon'],
> 'id'=>'http://localhost:4503/baseurl/upendra-custon.html',
> 'tstamp'=>'2016-05-10T05:50:22.316Z',
> 'metataghideininternalsearch'=>false,
> 'metatagtopresultthumbnailalt'=>',',
> 'segment'=>[20160510112017],
> 'digest'=>['fb988351afceb26a835fba68e2bcc33f'],
> 'boost'=>[1.4142135],
> 'lang'=>'en',
> 'metatagkeywords'=>[','],
> '_version_'=>1533919301006786560,
> 'host'=>'localhost',
> 'url'=>'http://localhost:4503/baseurl/upendra-custon.html',
> 'score'=>0.14641379},
>   {
> 'metatagdescription'=>['test'],
> 'h1'=>['Upendra'],
> 'h2'=>['Looks like your browser is a little out-of-date.'],
> 'h3'=>['Already a member?'],
> 'strtitle'=>['health care body content'],
> 'id'=>'
> http://localhost:4503/baseurl/upendra-custon/care-body-content.html',
> 'tstamp'=>'2016-05-10T05:50:22.269Z',
> 'metataghideininternalsearch'=>false,
> 'metatagtopresultthumbnailalt'=>',',
> 'segment'=>[20160510112017],
> 'digest'=>['dd4ef8879be2d4d3f28e24928e9b84c5'],
> 'boost'=>[1.4142135],
> 'lang'=>'en',
> 'metatagkeywords'=>[','],
> '_version_'=>1533919301071798272,
> 'host'=>'localhost',
> 'url'=>'
> http://localhost:4503/baseurl/upendra-custon/care-body-content.html',
> 'score'=>0.13738367},
>   {
> 'metatagdescription'=>['test'],
> 'h1'=>['health care keyword'],
> 'h2'=>['Looks like your browser is a little out-of-date.'],
> 'h3'=>['Already a member?'],
> 'strtitle'=>['health care keyword'],
> 'id'=>'
> http://localhost:4503/baseurl/upendra-custon/care-keyword.html',
> 'tstamp'=>'2016-05-10T05:50:22.3

Re: How to search in solr for words like %rek Dr%

2016-05-10 Thread Nick D
You can use a combination of ngram or edgengram fields, and possibly the
shingle factory if you want to combine words. You might also want an exact
text field with no query slop if the two words, even as partial text, need
to be right next to each other. Edge is great for left-to-right; ngram is
great just to split up by a size. There are a number of tokenizers you can
try out.

Nick
On May 10, 2016 9:22 AM, "Thrinadh Kuppili"  wrote:

> I am trying to search a field named Address which has a space in it.
> Example :
> Address has the below values in it.
> 1. 2000 North Derek Dr Fullerton
> 2. 2011 N Derek Drive Fullerton
> 3. 2108 N Derek Drive Fullerton
> 4. 2100 N Derek Drive Fullerton
> 5. 2001 N Drive Derek Fullerton
>
> Search Query:- Derek Drive or rek Dr
> Expectation is it should return all  2,3,4 and it should not return 1 & 5 .
>
> Finally i am trying to find a word which can search similar to database
> search of %N Derek%
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-search-in-solr-for-words-like-rek-Dr-tp4275854.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: How to search in solr for words like %rek Dr%

2016-05-10 Thread Nick D
I don't really get what "Q= {!dismax qf=address} "rek Dr*" - It is not
allowed since prefix in quotes is not allowed" means; why can't you use
exact phrase matching? Do you have some limitation on quoting? As you are
specifically looking for an exact phrase, I don't see why you wouldn't want
exact matching.


Anyways

You can look into using another type of tokenizer; my guess is you are
probably using the standard tokenizer or possibly the whitespace tokenizer.
You may want to try a different one and see what results you get. Also, you
probably won't need to use the wildcards if you set up your gram sizes the
way you want.

The shingle factory can do stuff like this (my memory is a bit fuzzy on it,
but I play with it in the admin page):

This is a sentence
shingle = 4
this_is_a_sentence

Combine that with your ngram factory (say minGramSize=4, maxGramSize=50)
and you can do something like:
this
this_i
this_is

this_is_a_sentence

his_i
his_is

his_is_a_sentence

etc.


Then apply the shingle factory on query to take something like

his is-> his_is and you will get that phrase back.

My personal favorite is just using edgengram for something like the below,
but the concept is the same with regular old ngram:

2001 N Drive Derek Fullerton

term       raw bytes                      start  end  posLen  type  position
2          [32]                           0      1    1       word  1
20         [32 30]                        0      2    1       word  1
200        [32 30 30]                     0      3    1       word  1
2001       [32 30 30 31]                  0      4    1       word  1
n          [6e]                           5      6    1       word  2
d          [64]                           7      8    1       word  3
dr         [64 72]                        7      9    1       word  3
dri        [64 72 69]                     7      10   1       word  3
driv       [64 72 69 76]                  7      11   1       word  3
drive      [64 72 69 76 65]               7      12   1       word  3
d          [64]                           13     14   1       word  4
de         [64 65]                        13     15   1       word  4
der        [64 65 72]                     13     16   1       word  4
dere       [64 65 72 65]                  13     17   1       word  4
derek      [64 65 72 65 6b]               13     18   1       word  4
f          [66]                           19     20   1       word  5
fu         [66 75]                        19     21   1       word  5
ful        [66 75 6c]                     19     22   1       word  5
full       [66 75 6c 6c]                  19     23   1       word  5
fulle      [66 75 6c 6c 65]               19     24   1       word  5
fuller     [66 75 6c 6c 65 72]            19     25   1       word  5
fullert    [66 75 6c 6c 65 72 74]         19     26   1       word  5
fullerto   [66 75 6c 6c 65 72 74 6f]      19     27   1       word  5
fullerton  [66 75 6c 6c 65 72 74 6f 6e]   19     28   1       word  5

Works great for a quick type-ahead field type.
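
For reference, an untested sketch of the sort of fieldType that produces
output like the above (gram sizes are illustrative):

  <fieldType name="text_edge" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Note the query-side analyzer skips the gram filter, so the user's prefix is
matched as typed.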

Oh, and by the way, your ngram size is too small for _rek_ to be split out
of _derek_.


Setting up a few different field types and playing with the analyzer in the
admin page can give you a good idea of what both index- and query-time
results will be; with your tiny data set that is the best way I can think
of to see instant results with your new field types.

Nick

On Tue, May 10, 2016 at 10:01 AM, Thrinadh Kuppili 
wrote:

> I have tried with  maxGramSize="12"/> and search using the Extended Dismax
>
> Q= {!dismax qf=address} rek Dr* - It did not work as expected since i am
> getting all the records which has rek, Dr .
>
> Q= {!dismax qf=address} "rek Dr*" - It is not allowed since perfix in
> Quotes
> is not allowed.
>
> Q= {!complexphrase inOrder=true}address:"rek dr*" - It did not work since
> it
> is searching for words starts with rek
>
> I am not aware of shingle factory as of now will try to use and findout how
> i can use.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-search-in-solr-for-words-like-rek-Dr-tp4275854p4275859.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Dynamically change solr suggest field

2016-05-11 Thread Nick D
There are only two ways I can think of to accomplish this, and neither of
them is dynamically setting the suggester field: as far as I can tell from
the doc (which does sometimes have gaps, so I might be wrong), you cannot
set something like *suggest.fl=combo_box_field* at query time. But maybe
these can help you get started.

1. Multiple suggester request handlers, one for each option in the combo
box. This way you just change the request handler in the query you submit
based on the context.

2. Use copyFields to put all possible suggestions into the same field name
(so no more dynamic field settings), with another field holding whatever
the combo box option would be for that document, and use context filters,
which can be passed at query time to limit the suggestions to those
matching what's in the combo box (rough sketch after the link below):
https://cwiki.apache.org/confluence/display/solr/Suggester#Suggester-ContextFiltering
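
A rough, untested sketch of option 2 (all field, suggester, and context
values are made up):

  <searchComponent name="suggest" class="solr.SuggestComponent">
    <lst name="suggester">
      <str name="name">comboSuggester</str>
      <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
      <str name="dictionaryImpl">DocumentDictionaryFactory</str>
      <str name="field">suggest_all</str>
      <str name="contextField">combo_option</str>
      <str name="suggestAnalyzerFieldType">text_general</str>
    </lst>
  </searchComponent>

and then pass the combo box value as the context filter at query time:

  /suggest?suggest=true&suggest.dictionary=comboSuggester&suggest.q=lap&suggest.cfq=electronics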

Hope this helps a bit

Nick

On Wed, May 11, 2016 at 7:05 AM, Lasitha Wattaladeniya 
wrote:

> Hello devs,
>
> I'm trying to implement auto complete text suggestions using solr. I have a
> text box and next to that there's a combo box. So the auto complete should
> suggest based on the value selected in the combo box.
>
> Basically I should be able to change the suggest field based on the value
> selected in the combo box. I was trying to solve this problem whole day but
> not much luck. Can anybody tell me is there a way of doing this ?
>
> Regards,
> Lasitha.
>
> Lasitha Wattaladeniya
> Software Engineer
>
> Mobile : +6593896893
> Blog : techreadme.blogspot.com
>


Re: Re-indexing in SolRCloud while keeping the collection online -- Best practice?

2016-05-11 Thread Nick Vasilyev
Aliasing works great, I implemented it after upgrading to Solr 5 and it
allows us to do this exact thing. The only thing you have to watch out for
is indexing new items (if they overwrite old ones) while you are
re-indexing.

I took it a step further for another collection that stores a lot of
time-based data from logs. I have two aliases for that collection, logs and
logs_indexing; every month a new collection gets created, called logs_201605
or something like that, and both aliases get updated. logs_indexing now only
points to the newest collection (that's where all the indexing is going);
the logs alias gets updated to include the new collection as well (since
aliases can point to multiple collections).
aliases can point to multiple collections).

Here is the link to the documentation.
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api4
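
For example, the monthly cutover boils down to two Collections API calls
(host and collection names follow my naming above); CREATEALIAS on an
existing alias simply repoints it:

  http://host:8983/solr/admin/collections?action=CREATEALIAS&name=logs_indexing&collections=logs_201605
  http://host:8983/solr/admin/collections?action=CREATEALIAS&name=logs&collections=logs_201604,logs_201605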

On Tue, May 10, 2016 at 12:55 PM, Horváth Péter Gergely <
peter.gergely.horv...@gmail.com> wrote:

> Hi Erick,
>
> Most of the time we have to do a full re-index: I do love your second idea,
> I will take a look at the details of that. Thank you! :)
>
> Cheers,
> Peter
>
> 2016-05-10 17:10 GMT+02:00 Erick Erickson :
>
> > Peter:
> >
> > Yeah, that would work, but there are a couple of alternatives:
> > 1> If there's any way to know what the subset of docs that's
> >  changed, just re-index _them_. The problem here is
> >  picking up deletes. In the RDBMS case this is often done
> >  by creating a trigger for deletes and then the last step
> >  in your update is to remove the docs since the last time
> >  you indexed using the deleted_docs table (or whatever).
> >  This falls down if a> you require an instantaneous switch
> >  from _all_ the old data to the new or b> you can't get a
> >  list of deleted docs.
> >
> > 2> Use collection aliasing. The pattern is this: you have your
> >  "Hot" collection (col1) serving queries that is pointed to
> >  by alias "hot". You create a new collection (col2) and index
> >  to it in the background. When done, use CREATEALIAS
> >  to point "hot" to "col2". Now you can delete col1. There are
> >  no restrictions on where these collections live, so this
> >  allows you to move your collections around as you want. Plus
> >  this keeps a better separation of old and new data...
> >
> > Best,
> > Erick
> >
> > On Tue, May 10, 2016 at 4:32 AM, Horváth Péter Gergely
> >  wrote:
> > > Hi Everyone,
> > >
> > > I am wondering if there is any best practice regarding re-indexing
> > > documents in SolrCloud 6.0.0 without making the data (or the underlying
> > > collection) temporarily unavailable. Wiping all documents in a
> collection
> > > and performing a full re-indexing is not a viable alternative for us.
> > >
> > > Say we had a massive Solr Cloud cluster with a number of separate nodes
> > > that are used to host *multiple hundreds* of collections, with document
> > > counts ranging from a couple of thousands to multiple (say up to 20)
> > > millions of documents, each with 200-300 fields and a background batch
> > > loader job that fetches data from a variety of source systems.
> > >
> > > We have to retain the cluster and ALL collections online all the time
> > (365
> > > x 24): We cannot allow queries to be blocked while data in a collection
> > is
> > > being updated and we cannot load everything in a single-shot jumbo
> commit
> > > (the replication could overload the cluster).
> > >
> > > One solution I could imagine is storing an additional field "load
> > > time-stamp" in all documents and the client (interactive query)
> > application
> > > extending all queries with an additional restriction, which requires
> > > documents "load time-stamp" to be the latest known completed "load
> > > time-stamp".
> > >
> > > This concept would work according to the following:
> > > 1.) The batch job would simply start loading new documents, with the
> new
> > > "load time-stamp". Existing documents would not be touched.
> > > 2.) The client (interactive query) application would still use the old
> > data
> > > from the previous load (since all queries are restricted with the old
> > "load
> > > time-stamp")
> > > 3.) The batch job would store the new "load time-stamp" as the one to
> be
> > > used (e.g. in a separate collection etc.) -- after this, all queries
> > would
> > > return the most up-to-data documents
> > > 4.) The batch job would purge all documents from the collection, where
> > > the "load time-stamp" is not the same as the last one.
> > >
> > > This approach seems to be implementable, however, I definitely want to
> > > avoid reinventing the wheel myself and wondering if there is any better
> > > solution or built-in Solr Cloud feature to achieve the same or
> something
> > > similar.
> > >
> > > Thanks,
> > > Peter
> >
>


Re: More Like This on not new documents

2016-05-13 Thread Nick D
https://wiki.apache.org/solr/MoreLikeThisHandler

Bottom of the page, using content streams. I believe this still works in
newer versions of Solr, although I have not tested it on a recent version.
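
For example, something along these lines per that page (handler path and
params are illustrative, and untested on recent versions):

  curl "http://localhost:8983/solr/collection1/mlt?mlt.fl=title,body&mlt.mintf=1&mlt.mindf=1&stream.body=text+of+the+new+document"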

But if you plan on indexing the document anyway, then just indexing it and
then passing the ID to mlt isn't a bad thing at all.

Nick

On Fri, May 13, 2016 at 2:23 AM, Vincenzo D'Amore 
wrote:

> Hi all,
>
> anybody know if is there a chance to use the mlt component with a new
> document not existing in the collection?
>
> In other words, if I have a new document, should I always first add it to
> my collection and only then, using the mlt component, have the list of
> similar documents?
>
>
> Best regards,
> Vincenzo
>
>
> --
> Vincenzo D'Amore
> email: v.dam...@gmail.com
> skype: free.dev
> mobile: +39 349 8513251
>


json.facet streaming

2016-05-17 Thread Nick Vasilyev
I am on the nightly build of 6.1 and I am experimenting with json.facet
streaming; however, the response I am getting back looks like a regular
query response. I was expecting something like the streaming API. Is this
right, or am I missing something?

Here is the json.facet string.

'json.facet':str({ "groups":{
"type": "terms",
"field": "group",
"method":"stream"
}}),

The group field is a string field with DocValues enabled.

Thanks


Re: json.facet streaming

2016-05-17 Thread Nick Vasilyev
I enabled query debugging, here is the facet-trace snippet.

"facet-trace":{
  "processor":"FacetQueryProcessor",
  "elapse":0,
  "query":null,
  "domainSize":43046041,
  "sub-facet":[{
  "processor":"FacetFieldProcessorStream",
  "elapse":0,
  "field":"group",
  "limit":10,
  "domainSize":8980542},
{
  "processor":"FacetFieldProcessorStream",
  "elapse":0,
  "field":"group",
  "limit":10,
  "domainSize":9005295},
{
  "processor":"FacetFieldProcessorStream",
  "elapse":0,
  "field":"group",
  "limit":10,
  "domainSize":7555021},
{
  "processor":"FacetFieldProcessorStream",
  "elapse":0,
  "field":"group",
  "limit":10,
  "domainSize":8928379},
{
  "processor":"FacetFieldProcessorStream",
  "elapse":0,
  "field":"group",
  "limit":10,
  "domainSize":8576804}]},
"json":{"facet":{"groups":{
  "type":"terms",
  "field":"group",
  "method":"stream"}}},

On Tue, May 17, 2016 at 8:42 AM, Yonik Seeley  wrote:

> Perhaps try turning on request debugging and see what is actually
> being received by Solr?
>
> -Yonik
>
>
> On Tue, May 17, 2016 at 8:33 AM, Nick Vasilyev 
> wrote:
> > I am on the nightly build of 6.1 and I am experimenting with json.facet
> > streaming, however the response I am getting back looks like regular
> query
> > response. I was expecting something like the streaming api. Is this right
> > or am I missing something?
> >
> > Here is the json.facet string.
> >
> > 'json.facet':str({ "groups":{
> > "type": "terms",
> > "field": "group",
> > "method":"stream"
> > }}),
> >
> > The group field is a string field with DocValues enabled.
> >
> > Thanks
>


Re: json.facet streaming

2016-05-17 Thread Nick Vasilyev
Hi Yonik, I do see them in the response, but the JSON format is like the
standard facet output. I am not sure what a streaming facet response would
look like, but I expected it to be similar to the streaming API. Is this
the case?

On Tue, May 17, 2016 at 9:35 AM, Yonik Seeley  wrote:

> So it looks like facets are being computed... do you not see them in
> the response?
> -Yonik
>
>
> On Tue, May 17, 2016 at 9:12 AM, Nick Vasilyev 
> wrote:
> > I enabled query debugging, here is the facet-trace snippet.
> >
> > "facet-trace":{
> >   "processor":"FacetQueryProcessor",
> >   "elapse":0,
> >   "query":null,
> >   "domainSize":43046041,
> >   "sub-facet":[{
> >   "processor":"FacetFieldProcessorStream",
> >   "elapse":0,
> >   "field":"group",
> >   "limit":10,
> >   "domainSize":8980542},
> > {
> >   "processor":"FacetFieldProcessorStream",
> >   "elapse":0,
> >   "field":"group",
> >   "limit":10,
> >   "domainSize":9005295},
> > {
> >   "processor":"FacetFieldProcessorStream",
> >   "elapse":0,
> >   "field":"group",
> >   "limit":10,
> >   "domainSize":7555021},
> > {
> >   "processor":"FacetFieldProcessorStream",
> >   "elapse":0,
> >   "field":"group",
> >   "limit":10,
> >   "domainSize":8928379},
> > {
> >   "processor":"FacetFieldProcessorStream",
> >   "elapse":0,
> >   "field":"group",
> >   "limit":10,
> >   "domainSize":8576804}]},
> > "json":{"facet":{"groups":{
> >   "type":"terms",
> >   "field":"group",
> >   "method":"stream"}}},
> >
> > On Tue, May 17, 2016 at 8:42 AM, Yonik Seeley  wrote:
> >
> >> Perhaps try turning on request debugging and see what is actually
> >> being received by Solr?
> >>
> >> -Yonik
> >>
> >>
> >> On Tue, May 17, 2016 at 8:33 AM, Nick Vasilyev <
> nick.vasily...@gmail.com>
> >> wrote:
> >> > I am on the nightly build of 6.1 and I am experimenting with
> json.facet
> >> > streaming, however the response I am getting back looks like regular
> >> query
> >> > response. I was expecting something like the streaming api. Is this
> right
> >> > or am I missing something?
> >> >
> >> > Here is the json.facet string.
> >> >
> >> > 'json.facet':str({ "groups":{
> >> > "type": "terms",
> >> > "field": "group",
> >> > "method":"stream"
> >> > }}),
> >> >
> >> > The group field is a string field with DocValues enabled.
> >> >
> >> > Thanks
> >>
>


Re: json.facet streaming

2016-05-17 Thread Nick Vasilyev
Got it. Thanks for clarifying.

On Tue, May 17, 2016 at 9:58 AM, Yonik Seeley  wrote:

> On Tue, May 17, 2016 at 9:41 AM, Nick Vasilyev 
> wrote:
> > Hi Yonik, I do see them in the response, but the JSON format is like
> > standard facet output. I am not sure what streaming facet response would
> > look like, but I expected it to be similar to the streaming API. Is this
> > the case?
>
> Nope.
> The method is an execution hint (calculate the facets via this
> method), and should not normally affect what the response looks like.
>
> -Yonik
>


Re: API call for optimising a collection

2016-05-17 Thread Nick Vasilyev
As far as I know, you have to run it on each core.
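
e.g. something like this against each core (host and core names
illustrative):

  curl "http://localhost:8983/solr/core_name/update?optimize=true&maxSegments=1"
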
On May 18, 2016 1:04 AM, "Binoy Dalal"  wrote:

> Is there no api call that can optimize an entire collection?
>
> I tried the collections api page on the confluence wiki but couldn't find
> anything, and a Google search also yielded no meaningful results.
> --
> Regards,
> Binoy Dalal
>


Solr 5.5.2

2016-05-26 Thread Nick Vasilyev
Is there an anticipated release date for 5.5.2? I know 5.5.1 was released
just a while ago, and although it fixes the faceting performance issue
(SOLR-8096), distributed grouping is broken (SOLR-8940).

I just need a solid 5.x release that is stable and has all core
functionality working.

Thanks


Re: Solr 5.5.2

2016-05-26 Thread Nick Vasilyev
Thanks Erick, option 4 is my favorite so far :)

On Thu, May 26, 2016 at 2:15 PM, Erick Erickson 
wrote:

> There is no plan to release 5.5.2, development has moved to trunk and
> 6.x. Also, while there
> is a patch for that JIRA it hasn't been committed even in trunk/6.0.
>
> So I think your choices are:
> 1> find a work-around
> 2> see about moving to Solr 6.0.1 (in release process now),
> assuming that it solves the problem.
> 3> See if the patch supplied with SOLR-8940 works for you and compile
> it locally.
> 4> agitate for a 5.5.2 that includes this fix (after the fix has been
> vetted).
>
> Best,
> Erick
>
> On Thu, May 26, 2016 at 11:08 AM, Nick Vasilyev
>  wrote:
> > Is there an anticipated release date for 5.5.2? I know 5.5.1 was
> > released just a while ago and although it fixes the faceting performance
> > issue (SOLR-8096), distributed grouping is broken (SOLR-8940).
> >
> > I just need a solid 5.x release that is stable and with all core
> > functionality working.
> >
> > Thanks
>


Re: Facet data type

2016-05-26 Thread Nick D
Although you did mention that you won't need to sort and you are using
multivalued=true. On the off chance you do change something like
multivalued=false docValues=false then this will come into play:

https://issues.apache.org/jira/browse/SOLR-7495

This has been a rather large pain to deal with in terms of faceting. (the
Lucene change that caused a number of issues is also referenced in this
Jira).

Nick


On Thu, May 26, 2016 at 11:45 AM, Erick Erickson 
wrote:

> I always prefer ints to strings, they can't help but take
> up less memory, comparing two ints is much faster than
> two strings etc. Although Lucene can play some tricks
> to make that less noticeable.
>
> Although if these are just a few values, it'll be hard to
> actually measure the perf difference.
>
> And if it's a _lot_ of unique values, you have other problems
> than the int/string distinction. Faceting on very high
> cardinality fields is something that can have performance
> implications.
>
> But I'd certainly add docValues="true" to the definition no matter
> which you decide on.
>
> Best,
> Erick
>
> On Wed, May 25, 2016 at 9:29 AM, Steven White 
> wrote:
> > Hi everyone,
> >
> > I will be faceting on data of type integers and I'm wondering if there is
> any
> > difference on how I design my schema.  I have no need to sort or use
> range
> > facet, given this, in terms of Lucene performance and index size, does it
> > make any difference if I use:
> >
> > #1: <field name="..." type="string" indexed="true"
> > required="true" stored="false"/>
> >
> > Or
> >
> > #2: <field name="..." type="int" indexed="true" required="true" stored="false"/>
> >
> > (notice how I changed the "type" from "string" to "int" in #2)
> >
> > Thanks in advanced.
> >
> > Steve
>


Re: Facet data type

2016-05-27 Thread Nick D
Steven,

The case that I was pointing to was specifically talking about the need for
an int to be set to multivalued=true for the field to be used as a
facet.field. I personally ran into it when upgrading to 5.x from 4.10.2. I
believe setting docValues=true will not have an effect (untested by me, but
there was mention of that in the Jira). Also, there are some linked Jiras
that talk about other issues with facets in 5.x, but my guess is if you
aren't upgrading from 4.x to 5.x then you probably won't hit the issue,
though there are some things people are finding with docValues and
performance in 4.x upgrades.

I think there are some even more knowledgeable people on here who could
chime in with a more detailed explanation or correct me if I misspoke.

Nick

On Fri, May 27, 2016 at 12:11 PM, Steven White  wrote:

> Thanks Erick.
>
> What about Solr defect SOLR-7495 that Nick mentioned?  It sounds like
> > because of this defect, I should NOT set docValues="true" on a field when:
> a) type="int" and b) multiValued="true".  Can you confirm that I got this
> right?  I'm on Solr 5.2.1
>
> Steve
>
>
> On Fri, May 27, 2016 at 1:30 PM, Erick Erickson 
> wrote:
>
> > bq: my index size grew by 20%.  Is this expected
> >
> > Yes. But don't worry about it ;). Basically, you've serialized
> > to disk the "uninverted" form of the field. But, that is
> > accessed through Lucene by MMapDirectory, see:
> > http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >
> > If you don't use DocValues, the uninverted version
> > is built in Java's memory, which is much more expensive
> > for a variety of reasons. What you lose in disk size you gain
> > in a lower JVM footprint, fewer GC problems etc.
> >
> > But the implication is, indeed, that you should use DocValues
> > for field you intend to facet and/or sort etc on. If you only search
> > it's just wasted space.
> >
> > Best,
> > Erick
> >
> > On Fri, May 27, 2016 at 6:25 AM, Steven White 
> > wrote:
> > > Thank you Erick for pointing out about DocValues.  I re-indexed my data
> > > with it set to true and my index size grew by 20%.  Is this expected?
> > >
> > > Hi Nick, I'm not clear about SOLR-7495.  Are you saying I should not
> use
> > > docValues=true if type="int" and multiValued="true"?  I'm on Solr 5.2.1.
> > > Thanks.
> > >
> > > Steve
> > >
> > > On Thu, May 26, 2016 at 9:29 PM, Nick D  wrote:
> > >
> > >> Although you did mention that you won't need to sort and you are using
> > >> multivalued=true. On the off chance you do change something like
> > >> multivalued=false docValues=false then this will come into play:
> > >>
> > >> https://issues.apache.org/jira/browse/SOLR-7495
> > >>
> > >> This has been a rather large pain to deal with in terms of faceting.
> > (the
> > >> Lucene change that caused a number of Issues is also referenced in
> this
> > >> Jira).
> > >>
> > >> Nick
> > >>
> > >>
> > >> On Thu, May 26, 2016 at 11:45 AM, Erick Erickson <
> > erickerick...@gmail.com>
> > >> wrote:
> > >>
> > >> > I always prefer ints to strings, they can't help but take
> > >> > up less memory, comparing two ints is much faster than
> > >> > two strings etc. Although Lucene can play some tricks
> > >> > to make that less noticeable.
> > >> >
> > >> > Although if these are just a few values, it'll be hard to
> > >> > actually measure the perf difference.
> > >> >
> > >> > And if it's a _lot_ of unique values, you have other problems
> > >> > than the int/string distinction. Faceting on very high
> > >> > cardinality fields is something that can have performance
> > >> > implications.
> > >> >
> > >> > But I'd certainly add docValues="true" to the definition no matter
> > >> > which you decide on.
> > >> >
> > >> > Best,
> > >> > Erick
> > >> >
> > >> > On Wed, May 25, 2016 at 9:29 AM, Steven White  >
> > >> > wrote:
> > >> > > Hi everyone,
> > >> > >
> > >> > > I will be faceting on data of type integers and I'm wondering if
> there
> > is
> > >> > any
> > >> > > difference on how I design my schema.  I have no need to sort or
> use
> > >> > range
> > >> > > facet, given this, in terms of Lucene performance and index size,
> > does
> > >> it
> > >> > > make any difference if I use:
> > >> > >
> > >> > > #1: <field name="..." type="string" indexed="true"
> > >> > > required="true" stored="false"/>
> > >> > >
> > >> > > Or
> > >> > >
> > >> > > #2: <field name="..." type="int" indexed="true"
> > >> > > required="true" stored="false"/>
> > >> > >
> > >> > > (notice how I changed the "type" from "string" to "int" in #2)
> > >> > >
> > >> > > Thanks in advanced.
> > >> > >
> > >> > > Steve
> > >> >
> > >>
> >
>


Re: Use of solr + banana for faceted search

2016-07-20 Thread Nick Vasilyev
Banana has a facet panel that allows you to configure several fields to
facet on; you can have multiple fields and they will show up as an
accordion. However, keep in mind that the field should not be tokenized
for faceting (i.e. use a string type), and upon selection the filter is
added to the fq parameter in the Solr query. Let me know if that helps.

On Wed, Jul 20, 2016 at 12:40 PM, Darshan Pandya 
wrote:

> Hello folks,
>
> I am fairly new to solr + banana, especially banana.
>
>
> I am trying to configure banana for faceted search for a collection in
> solr.
> I want to be able to have multiple facets parameters on the left and see
> the results of selections on my data table on the right. Exactly like
> guided Nav.
>
> Please let me know if anyone has done this and/or if there is a tutorial
> for this.
>
> --
> Sincerely,
> Darshan
>


Re: Use of solr + banana for faceted search

2016-07-21 Thread Nick Vasilyev
Not that I know of, but it is an open source project so it's easy to extend.

On Jul 21, 2016 11:01 AM, "Darshan Pandya"  wrote:

> Thanks Nick, once again.
> I was able to use Facet panel.
>
> I also wanted to ask the group if there is a repository of custom panels
> for Banana which we can benefit from ?
>
> Sincerely,
> Darshan
>
> On Wed, Jul 20, 2016 at 11:55 AM, Darshan Pandya 
> wrote:
>
> > Nick, Thanks for your help. I'll test it out and respond back.
> >
> > On Wed, Jul 20, 2016 at 11:52 AM, Nick Vasilyev <
> nick.vasily...@gmail.com>
> > wrote:
> >
> >> Banana has a facet panel that allows you to configure several fields to
> >> facet on, you can have multiple fields and they will show up as an
> >> accordion. However, keep in mind that the field should not be tokenized
> for
> >> faceting (i.e. use a string type) and upon selection the filter is added to the fq
> >> parameter in the Solr query. Let me know if that helps.
> >>
> >> On Wed, Jul 20, 2016 at 12:40 PM, Darshan Pandya <
> darshanpan...@gmail.com
> >> >
> >> wrote:
> >>
> >> > Hello folks,
> >> >
> >> > I am fairly new to solr + banana, especially banana.
> >> >
> >> >
> >> > I am trying to configure banana for faceted search for a collection in
> >> > solr.
> >> > I want to be able to have multiple facets parameters on the left and
> see
> >> > the results of selections on my data table on the right. Exactly like
> >> > guided Nav.
> >> >
> >> > Please let me know if anyone has done this and/or if there is a
> tutorial
> >> > for this.
> >> >
> >> > --
> >> > Sincerely,
> >> > Darshan
> >> >
> >>
> >
> >
> >
> > --
> > Sincerely,
> > Darshan
> >
> >
>
>
> --
> Sincerely,
> Darshan
>


Solr Rounding Issue On Float fields.

2016-07-21 Thread Nick Vasilyev
Hi, I am running into a weird rounding issue on Solr 5.2.1. I have a float
field (I also tried tfloat). I am indexing 154035.26 into it (confirmed in
the data), but at query time I get back 154035.27 (.01 more).
Additionally, when I query for the document and include this number in the q
parameter, it comes up with both values, .26 and .27.

I've fed the values through the analyzer and I get this bizarre behavior
per the screenshot below. The field is a single-value float or tfloat
field.

Any help would be much appreciated, thanks in advance

[image: Inline image 1]


Re: Solr Rounding Issue On Float fields.

2016-07-21 Thread Nick Vasilyev
I did a bit more investigating here is something that may help
troubleshooting:

- It seems that numbers above 131071 are impacted. 131071.26 is fine,
but 131072.26 is not. 131071 is 2^17 - 1, a Mersenne prime.

- 131072.24 gets rounded down to 131072.23, while 131072.26 gets rounded up
to 131072.27. Similarly, 131072.76 gets rounded up to 131072.77
and 131072.74 gets rounded down to 131072.73. 131072.49 gets rounded down
to 131072.48 and 131072.51 gets rounded up to 131072.52.

I haven't validated this in code, just doing some manual digging.
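A quick way to see the underlying behavior outside Solr (a sketch using
only Python's standard library, assuming Solr/Lucene stores these values
as 32-bit IEEE floats):

import struct

def as_float32(s):
    # Round-trip a decimal string through a 32-bit IEEE float,
    # the width Lucene uses to store a float field.
    return struct.unpack('f', struct.pack('f', float(s)))[0]

# Above 2^17 = 131072, consecutive 32-bit floats are 1/64 = 0.015625 apart,
# so a value with two decimals can land nearer to a different hundredth.
print(as_float32('154035.26'))                             # 154035.265625 - rounds to .27
print(as_float32('154035.26') == as_float32('154035.27'))  # True
print(as_float32('131071.26'))                             # 131071.2578125 - rounds back to .26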



On Thu, Jul 21, 2016 at 1:48 PM, Nick Vasilyev 
wrote:

> Hi, I am running into a weird rounding issue on Solr 5.2.1. I have a float
> field (also tried tfloat), I am indexing 154035.26 into it (confirmed in
> the data),  but at query time, I get back 154035.27 (.01 more).
> Additionally when I query for the document and include this number in the q
> parameter, it comes up with both values, .26 and .27.
>
> I've fed the values through the analyzer and I get this bizarre behavior
> per the screenshot below. The field is a single value float or tfloat
> field.
>
> Any help would be much appreciated, thanks in advance
>
> [image: Inline image 1]
>


Re: Solr Rounding Issue On Float fields.

2016-07-21 Thread Nick Vasilyev
Thanks Chris.

Searching for both values and retrieving the documents would be alright as
long as the data was correct. In this case, the data that I am indexing
into Solr is not the same data that I am pulling out at query time. That is
the real impact here.

On Thu, Jul 21, 2016 at 6:12 PM, Chris Hostetter 
wrote:

>
> : Hi, I am running into a weird rounding issue on Solr 5.2.1. I have a
> float
> : field (also tried tfloat), I am indexing 154035.26 into it (confirmed in
> : the data),  but at query time, I get back 154035.27 (.01 more).
> : Additionally when I query for the document and include this number in
> the q
> : parameter, it comes up with both values, .26 and .27.
>
> Pretty sure what you are observing is just the normal consequences of IEEE
> floats (as used by java) being base2 -- not every base10 decimal value
> has a precise base2 representation.
>
> Querying for 154035.27 and 154035.26 will both match the same docs, because
> the String->Float parsing in both cases will produce the closest *legal*
> float value, which is identical for both inputs.
>
> If you need precise decimal values in Solr, you need to either use 2
> ints/longs (ie num_base="154035", num_decimal="26") or use one int/long
> and multiply/divide by a power of 10 corresponding to the number of
> significant digits you want in the client (ie: "15403526" divided by 100)
>
>
> Some good reading linked to from here...
>
> http://perlmonks.org/?node_id=203257
>
> And of course, if you really want to bang java against your head,
this is a classic (all of which is still applicable, I believe) ...
>
> https://people.eecs.berkeley.edu/~wkahan/JAVAhurt.pdf
>
>
>
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: How to re-index SOLR data

2016-08-09 Thread Nick Vasilyev
Hi, I work on a Python Solr Client library and there is a
reindexing helper module that you can use if you are on Solr 4.9+. I use it
all the time and I think it works pretty well. You can re-index all
documents from a collection into another collection or dump them to the
filesystem as JSON. It also supports parallel execution and can run
independently on each shard. There is also a way to resume if your job
craps out halfway through, provided your existing schema is set up with a
good date field and unique id.

You can read the documentation here:
http://solrclient.readthedocs.io/en/latest/Reindexer.html

Code is pretty short and is here:
https://github.com/moonlitesolutions/SolrClient/blob/master/SolrClient/helpers/reindexer.py

Here is a sample:

from SolrClient import SolrClient
from SolrClient.helpers import Reindexer

r = Reindexer(SolrClient('http://source_solr:8983/solr'),
              SolrClient('http://destination_solr:8983/solr'),
              source_coll='source_collection',
              dest_coll='destination-collection')
r.reindex()






On Tue, Aug 9, 2016 at 9:56 AM, Shawn Heisey  wrote:

> On 8/9/2016 1:48 AM, bharath.mvkumar wrote:
> > What would be the best way to re-index the data in the SOLR cloud? We
> > have around 65 million documents and we are planning to change the schema
> > by changing the unique key type from long to string. How long does it
> > take to re-index 65 million documents in SOLR and can you please
> > suggest how to do that?
>
> There is no magic bullet.  And there's no way for anybody but you to
> determine how long it's going to take.  There are people who have
> achieved over 50K inserts per second, and others who have difficulty
> reaching 1000 per second.  Many factors affect indexing speed, including
> the size of your documents, the complexity of your analysis, the
> capabilities of your hardware, and how many threads/processes you are
> using at the same time when you index.
>
> Here's some more detailed info about reindexing, but it's probably not
> what you wanted to hear:
>
> https://wiki.apache.org/solr/HowToReindex
>
> Thanks,
> Shawn
>
>


Discrepancy in json.facet unique and group.ngroups

2016-09-05 Thread Nick Vasilyev
Hi, I need to get the number of distinct values of a field and I am getting
different counts between the json.facet interface and group.ngroups. Here
are the two queries:

{'q': '*:*',
 'rows': 0,
 'json.facet': '{"mfr": "unique(mfr)"}'}

This brings up around 6,000 in the mfr field.

However, if I run the following query, I get around 22,000:
{'q': '*:*',
 'rows': 0,
 'group': 'true',
 'group.ngroups': 'true',
 'group.field': 'mfr' }

I am running Solr 6.1.0 with 4 shards. I ran through some estimates and it
looks like each shard has around 6k manufacturers. Does anyone have any
ideas why this is happening?

Thanks


Re: Discrepancy in json.facet unique and group.ngroups

2016-09-06 Thread Nick Vasilyev
Thanks Alexandre, that does sound related. I wouldn't imagine the
discrepancy would be that much, but I also realized that related items
aren't grouped on the same shard. This may be why my grouped counts are
off.

I will do some manual verification of the counts.

On Mon, Sep 5, 2016 at 12:22 PM, Alexandre Rafalovitch 
wrote:

> Perhaps https://issues.apache.org/jira/browse/SOLR-7452 ?
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 5 September 2016 at 23:07, Nick Vasilyev 
> wrote:
> > Hi, I need to get the number of distinct values of a field and I am
> getting
> > different counts between the json.facet interface and group.ngroups. Here
> > are the two queries:
> >
> > {'q': '*:*',
> >  'rows': 0,
> >  'json.facet': '{"mfr": "unique(mfr)"}'}
> >
> > This brings up around 6,000 in the mfr field.
> >
> > However, if I run the following query, I get around 22,000:
> > {'q': '*:*',
> >  'rows': 0,
> >  'group': 'true',
> >  'group.ngroups': 'true',
> >  'group.field': 'mfr' }
> >
> > I am running solr 6.1.0 with 4 shards, I ran through some estimates and
> it
> > looks like each shard has around 6k manufacturers. Does anyone have any
> > ideas why this is happening?
> >
> > Thanks
>


Re: Miserable Experience Using Solr. Again.

2016-09-15 Thread Nick Vasilyev
Just wanted to chime in on the technical set-up of the Solr "petting zoo",
I think I can help here; just let me know what you need.

Here is the idea: just have a Vagrant box with Ansible provisioning
ZooKeeper and Solr, creating collections, etc. That way anyone
starting out can just clone the repo, 'vagrant up' and have a fully
functional environment in no time. Setting up Solr is not the hard part,
and I think automating it takes a little something from the experience,
but if it would help someone get started, just send me an e-mail offline
and let me know.

I do some work on an open source Solr python library and I use a similar
instance to run through unit tests on supported versions of python with
some of the latest versions of Solr; it works great and most of the work is
already done.


On Thu, Sep 15, 2016 at 2:39 PM, Shawn Heisey  wrote:

> On 9/15/2016 8:24 AM, Alexandre Rafalovitch wrote:
> > The WIKI may be an official community-contributing forum, but its
> > technological implementation has gotten so bad it is impossible to
> > update. Every time I change the page, it takes minutes (and feels like
> > hours) for the update to come through. No clue what to do about that
> > though.
>
> Interestingly, even though it takes several minutes for the change
> request to finish, the wiki actually updates almost immediately after
> pushing the button.  The page load (and the resulting email to the
> mailing list) just takes forever.  I discovered this by looking at the
> page in another tab while waiting for the page load to get done.
>
> As I understand it, MoinMoin is entirely filesystem-based, a typical
> config doesn't use a database.  Apache has a LOT of MoinMoin installs
> running on wiki.apache.org.  I think the performance woes are a case of
> a technology that's not scalable enough for how it's being used.
>
> > I feel that it would be cool to have a live tutorial. Perhaps a
> > special collection that, when bootstrapped from, provides tutorial,
> > supporting data, smart interface to play with that data against that
> > same instance, etc. It could also have a static read-only export, but
> > the default experience should be interactive ("bin/solr start -e
> > tutorial" or even "bin/solr start -e
> > http://www.example.com/tutorial").
>
> That is an interesting idea.  I can envision a tutorial example, a
> canned source directory for indexing data into it, and a third volume of
> documentation, specifically for learning with that index.  It could
> include a section on changing the schema, reindexing, and seeing how
> those changes affect indexing and queries.
>
> > And it should be something that very strongly focuses on teaching new
> > users to fish, not just use the variety of seafood Solr comes with. A
> > narrative showing how different parts of Solr come together and how to
> > troubleshoot those, as opposed to taking each element (e.g. Query
> > Parser) individually and covering them super-comprehensively. That
> > last one is perfect in the reference guide, but less than friendly to
> > a beginner.
>
> Yes, yes, yes.
>
> Thanks,
> Shawn
>
>


mm being ignored by edismax

2016-10-06 Thread Nick Hall
Hello,

I'm working on upgrading a Solr installation from 4.0 to 6.2.1 and have
everything mostly working but have hit a snag. I kept the schema basically
the same, just made some minor changes to allow it to work with the new
version, but one of my queries is working differently with the new version
and I'm not sure why.

In version 4.0 when I do a query with edismax like:

"params":{
  "mm":"3",
  "debugQuery":"on",
  "indent":"on",
  "q":"string1 string2 string3 string4 string5",
  "qf":"vehicle_string_t^1",
  "wt":"json",
  "defType":"edismax"}},

I get the results I expect, and the debugQuery shows:

"rawquerystring":"string1 string2 string3 string4 string5",
"querystring":"string1 string2 string3 string4 string5",
"parsedquery":"+((DisjunctionMaxQuery((vehicle_string_t:\"string 1\"))
DisjunctionMaxQuery((vehicle_string_t:\"string 2\"))
DisjunctionMaxQuery((vehicle_string_t:\"string 3\"))
DisjunctionMaxQuery((vehicle_string_t:\"string 4\"))
DisjunctionMaxQuery((vehicle_string_t:\"string 5\")))~3)",
"parsedquery_toString":"+(((vehicle_string_t:\"string 1\")
(vehicle_string_t:\"string 2\") (vehicle_string_t:\"string 3\")
(vehicle_string_t:\"string 4\") (vehicle_string_t:\"string 5\"))~3)",


But when I run the same query with version 6.2.1, debugQuery shows:

"rawquerystring":"string1 string2 string3 string4 string5",
"querystring":"string1 string2 string3 string4 string5",
"parsedquery":"(+(+DisjunctionMaxQuery((vehicle_string_t:\"string 1\"))
+DisjunctionMaxQuery((vehicle_string_t:\"string 2\"))
+DisjunctionMaxQuery((vehicle_string_t:\"string 3\"))
+DisjunctionMaxQuery((vehicle_string_t:\"string 4\"))
+DisjunctionMaxQuery((vehicle_string_t:\"string 5\"/no_coord",
"parsedquery_toString":"+(+(vehicle_string_t:\"string 1\")
+(vehicle_string_t:\"string 2\") +(vehicle_string_t:\"string 3\")
+(vehicle_string_t:\"string 4\") +(vehicle_string_t:\"string 5\"))",


You can see that the key difference is that in version 4 it uses the "~3"
to indicate the mm, but in 6.2.1 it doesn't matter what I have mm set to,
it always ends with "/no_coord" and is trying to match all 5 strings even
if mm is set to 1, so mm is being completely ignored.

I imagine there is some behavior that changed between 4 and 6.2.1 that I
need to adjust something in my configuration to account for, but I'm
scratching my head right now. Has anyone else seen this and can point me in
the right direction? Thanks,

Nick


Re: mm being ignored by edismax

2016-10-07 Thread Nick Hall
Thanks. I read through this discussion and got it to work by setting
q.op=OR when mm is set; it then behaved as it did previously.
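
For reference, the working request is the same as the one in my original
message, with q.op added:

"params":{
  "mm":"3",
  "q.op":"OR",
  "q":"string1 string2 string3 string4 string5",
  "qf":"vehicle_string_t^1",
  "wt":"json",
  "defType":"edismax"}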

I have two suggestions that may clarify things a little going forward.
First, as I read the documentation it does not seem clear to me that q.op
is intended to be used with the edismax (or dismax) query parsers. The
"common query parameters" page:
https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters
does not list q.op as a parameter. This parameter is listed on the
"standard query parameters" page:
https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser
but not on the dismax page:
https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
For clarity it seems like q.op should be added to the dismax page with a
note about how its behavior relates to mm?

Also, I use the Solr web interface to do test queries while debugging. This
web interface has no field for q.op as far as I can see, so with (e)dismax
the mm field does not work effectively with the web interface.

Thank you for your help,
Nick


On Thu, Oct 6, 2016 at 10:53 PM, Alexandre Rafalovitch 
wrote:

> I think it is the change in the OR and AND treatment that had been
> confusing a number of people. There were discussions before on the
> mailing list about it, for example
> http://search-lucene.com/m/eHNlzBMAHdfxcv1
>
> Regards,
>Alex.
> 
> Solr Example reading group is starting November 2016, join us at
> http://j.mp/SolrERG
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 7 October 2016 at 10:24, Nick Hall  wrote:
> > Hello,
> >
> > I'm working on upgrading a Solr installation from 4.0 to 6.2.1 and have
> > everything mostly working but have hit a snag. I kept the schema
> basically
> > the same, just made some minor changes to allow it to work with the new
> > version, but one of my queries is working differently with the new
> version
> > and I'm not sure why.
> >
> > In version 4.0 when I do a query with edismax like:
> >
> > "params":{
> >   "mm":"3",
> >   "debugQuery":"on",
> >   "indent":"on",
> >   "q":"string1 string2 string3 string4 string5",
> >   "qf":"vehicle_string_t^1",
> >   "wt":"json",
> >   "defType":"edismax"}},
> >
> > I get the results I expect, and the debugQuery shows:
> >
> > "rawquerystring":"string1 string2 string3 string4 string5",
> > "querystring":"string1 string2 string3 string4 string5",
> > "parsedquery":"+((DisjunctionMaxQuery((vehicle_string_t:\"string
> 1\"))
> > DisjunctionMaxQuery((vehicle_string_t:\"string 2\"))
> > DisjunctionMaxQuery((vehicle_string_t:\"string 3\"))
> > DisjunctionMaxQuery((vehicle_string_t:\"string 4\"))
> > DisjunctionMaxQuery((vehicle_string_t:\"string 5\")))~3)",
> > "parsedquery_toString":"+(((vehicle_string_t:\"string 1\")
> > (vehicle_string_t:\"string 2\") (vehicle_string_t:\"string 3\")
> > (vehicle_string_t:\"string 4\") (vehicle_string_t:\"string 5\"))~3)",
> >
> >
> > But when I run the same query with version 6.2.1, debugQuery shows:
> >
> > "rawquerystring":"string1 string2 string3 string4 string5",
> > "querystring":"string1 string2 string3 string4 string5",
> > "parsedquery":"(+(+DisjunctionMaxQuery((vehicle_string_t:\"string
> 1\"))
> > +DisjunctionMaxQuery((vehicle_string_t:\"string 2\"))
> > +DisjunctionMaxQuery((vehicle_string_t:\"string 3\"))
> > +DisjunctionMaxQuery((vehicle_string_t:\"string 4\"))
> > +DisjunctionMaxQuery((vehicle_string_t:\"string 5\"/no_coord",
> >     "parsedquery_toString":"+(+(vehicle_string_t:\"string 1\")
> > +(vehicle_string_t:\"string 2\") +(vehicle_string_t:\"string 3\")
> > +(vehicle_string_t:\"string 4\") +(vehicle_string_t:\"string 5\"))",
> >
> >
> > You can see that the key difference is that in version 4 it uses the "~3"
> > to indicate the mm, but in 6.2.1 it doesn't matter what I have mm set to,
> > it always ends with "/no_coord" and is trying to match all 5 strings even
> > if mm is set to 1, so mm is being completely ignored.
> >
> > I imagine there is some behavior that changed between 4 and 6.2.1 that I
> > need to adjust something in my configuration to account for, but I'm
> > scratching my head right now. Has anyone else seen this and can point me
> in
> > the right direction? Thanks,
> >
> > Nick
>


Re: How to retrieve 200K documents from Solr 4.10.2

2016-10-12 Thread Nick Vasilyev
Check out cursorMark, it should be available in your release. There is some
good information on this page:

https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
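
A rough sketch of the fetch loop (Python with the requests package; the
host, collection, and 'url' field are from your example, and 'id asc'
assumes your uniqueKey is 'id'):

import requests

params = {
    'q': '*:*',
    'fl': 'url',
    'rows': 1000,
    'sort': 'id asc',   # cursorMark requires a sort on the uniqueKey
    'cursorMark': '*',
    'wt': 'json',
}
while True:
    resp = requests.get('http://SOLR_HOST/solr/abc/select', params=params).json()
    for doc in resp['response']['docs']:
        print(doc.get('url'))
    next_cursor = resp['nextCursorMark']
    if next_cursor == params['cursorMark']:
        break  # cursor did not advance, so all documents have been fetched
    params['cursorMark'] = next_cursor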


On Wed, Oct 12, 2016 at 5:46 PM, Salikeen, Obaid <
obaid.salik...@iacpublishinglabs.com> wrote:

> Hi,
>
> I am using Solr 4.10.2. I have 200K documents sitting on Solr cluster (it
> has 3 nodes), and let me first state that I am new Solr. I want to retrieve
> all documents from Sold (essentially just one field from each document).
>
> What is the best way of fetching this much data without overloading Solr
> cluster?
>
>
> Approach I tried:
> I tried using the following API (running every minute) to fetch a batch of
> 1000 documents. On each run, I increment start by 1000.
> http://SOLR_HOST/solr/abc/select?q=*:*&fq=&start=1&rows=
> 1000&fl=url&wt=csv&csv.header=false&hl=false
>
> However, with the above approach, I have two issues:
>
> 1.   The Solr cluster gets overloaded, i.e. it slows down
>
> 2.   I am not sure if start=X&rows=1000 would give me the correct
> results (changing rows=2 or rows=4 gives me totally different results,
> which is why I am not confident if I will get the correct results).
>
>
> Thanks
> Obaid
>
>


I want "john smi" to find "john smith" in my custom "fullname_s" field

2017-06-06 Thread Nick Way
Hi - I have a Solr collection with a custom field "fullname_s" (a string).

I want "john smi" to find "john smith" (I lower-cased the names upon
indexing them)

I have tried

fullname_s:"john smi*"
fullname_s:john smi*
fullname_s:"john smi?"
fullname_s:john smi?


but nothing gives the expected result - am I missing something? I spent
hours on this one point yesterday so if anyone can please point me in the
right direction I'd be really grateful.

I'm using Solr with Adobe Coldfusion by the way but I think the principles
are the same.

Thank you!

Nick


Re: I want "john smi" to find "john smith" in my custom "fullname_s" field

2017-06-06 Thread Nick Way
Fantastic, thank you so much; I now have 'fullname_s:#string.spacesescaped#*
or email_s:#string.spacesescaped#*' which is working like a dream - really
appreciate your help.

Thank you also Amrit.

Nick

On 6 June 2017 at 10:40, Erik Hatcher  wrote:

> Nick - try escaping the space, so that your query is q=fullname_s:john\
> smi*
>
> However, whitespace and escaping is problematic.  There is a handy prefix
> query parser, so this would work on a string field with spaces:
>
> q={!prefix f=fullname_s}john smi
>
> note no trailing asterisk on that one.   Even better, IMO, is to separate
> the query string from the query parser:
>
> q={!prefix f=fullname_s v=$qq}&qq=john smi
>
> Erik
>
> 
>
> Amrit - the issue with your example below is that q=fullname_s:john smi*
> parses “john” against fullname_s and “smi” as a prefix query against the
> default field, not likely fullname_s.   Check your parsed query to see
> exactly how it parsed.It works for you because… magic!   (copyField *
> => _text_)
>
>
>
>
> > On Jun 6, 2017, at 5:14 AM, Amrit Sarkar  wrote:
> >
> > Nick,
> >
> > "string" is a primitive data-type and the entire value of a field is
> > indexed as a single token. The regex matching happens against the tokens
> for
> > text fields and against the full content for string fields. So once a
> piece
> > of text is tokenized, there is no way to perform a regex query across
> word
> > boundaries.
> >
> > fullname_s:john smi* is working for me.
> >
> > {
> >  "responseHeader":{
> >"zkConnected":true,
> >"status":0,
> >"QTime":16,
> >"params":{
> >  "q":"fullname_s:john smi*",
> >  "indent":"on",
> >  "wt":"json"}},
> >  "response":{"numFound":1,"start":0,"maxScore":1.0,"docs":[
> >  {
> >"id":"1",
> >"fullname_s":"john smith",
> >"_version_":1569446064473243648}]
> >  }}
> >
> > I am on Solr 6.5.0. What version you are on?
> >
> >
> > Amrit Sarkar
> > Search Engineer
> > Lucidworks, Inc.
> > 415-589-9269
> > www.lucidworks.com
> > Twitter http://twitter.com/lucidworks
> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >
> > On Tue, Jun 6, 2017 at 1:30 PM, Nick Way 
> > wrote:
> >
> >> Hi - I have a Solr collection with a custom field "fullname_s" (a
> string).
> >>
> >> I want "john smi" to find "john smith" (I lower-cased the names upon
> >> indexing them)
> >>
> >> I have tried
> >>
> >> fullname_s:"john smi*"
> >> fullname_s:john smi*
> >> fullname_s:"john smi?"
> >> fullname_s:john smi?
> >>
> >>
> >> but nothing gives the expected result - am I missing something? I spent
> >> hours on this one point yesterday so if anyone can please point me in
> the
> >> right direction I'd be really grateful.
> >>
> >> I'm using Solr with Adobe Coldfusion by the way but I think the
> principles
> >> are the same.
> >>
> >> Thank you!
> >>
> >> Nick
> >>
>
>


Solr list operator

2017-09-06 Thread Nick Way
Hi, I have a custom field "listOfIDs" = "1,2,4,33"

I want the equivalent of:

select * where '1' IN (listOfIDs)  --> should get a match

select * where '33' IN (listOfIDs)  --> should get a match

select * where '3' IN (listOfIDs)  --> should NOT get a match


Can anyone help me out please as I can't seem to find any documentation on
this. Thanks very much in advance.

Kind regards,


Nick Way


Re: Solr list operator

2017-09-12 Thread Nick Way
Thank you very much Erik, Walter and Susheel.

To be honest I didn't really understand the suggested routes (due to my
limited knowledge) but managed to get things working by inserting my data
with a double comma at the beginning eg:

custom field "listOfIDs" = ",,1,2,4,33"

and then searching for "*,myVal,*" which seems to work.
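
(For reference, the terms query parser Susheel mentioned would look
something like this, assuming the IDs had been indexed as a multiValued
field with one ID per value instead of a single comma-separated string:

fq={!terms f=listOfIDs}1,33

which matches documents whose listOfIDs contains 1 or 33.)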

Out of interest does anyone have experience accessing Solr via Adobe
Coldfusion (as this is what we do) - and it would be helpful to have a
contact for some Solr consulting from time to time, if anyone might be
interested in that?

​Thank you very much for your help which was much appreciated.

Best,

Nick

On 6 September 2017 at 16:46, Erick Erickson 
wrote:

> You'll have to split up the input on commas if you don't just do it
> the multiValued way Walter suggests, perhaps one of the pattern
> tokenizers mentioned here:
>
> https://cwiki.apache.org/confluence/display/solr/Tokenizers
>
> Best,
> Erick
>
> On Wed, Sep 6, 2017 at 6:29 AM, Walter Underwood 
> wrote:
> > Use a multivalued field. Search for listOfIds:1. Or search for
> listOfIds:33. This is one of the simplest things that Solr can do.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> >> On Sep 6, 2017, at 6:07 AM, Susheel Kumar 
> wrote:
> >>
> >> Nick, checkout terms query parser
> >> http://lucene.apache.org/solr/guide/6_6/other-parsers.html or streaming
> >> expressions.
> >>
> >> Thnx
> >>
> >> On Wed, Sep 6, 2017 at 8:33 AM, alex goretoy  wrote:
> >>
> >>> https://www.youtube.com/watch?v=pNe1wWeaHOU&list=
> >>> PLYI8318YYdkCsZ7dsYV01n6TZhXA6Wf9i&index=1
> >>> https://www.youtube.com/watch?v=pNe1wWeaHOU&list=
> >>> PLYI8318YYdkCsZ7dsYV01n6TZhXA6Wf9i&index=1
> >>>
> >>> http://audiobible.life CHECK IT OUT!
> >>>
> >>>
> >>> On Wed, Sep 6, 2017 at 5:57 PM, Nick Way  >
> >>> wrote:
> >>>> Hi, I have a custom field "listOfIDs" = "1,2,4,33"
> >>>>
> >>>> I want the equivalent of:
> >>>>
> >>>> select * where '1' IN (listOfIDs)  --> should get a match
> >>>>
> >>>> select * where '33' IN (listOfIDs)  --> should get a match
> >>>>
> >>>> select * where '3' IN (listOfIDs)  --> should NOT get a match
> >>>>
> >>>>
> >>>> Can anyone help me out please as I can't seem to find any
> documentation
> >>> on
> >>>> this. Thanks very much in advance.
> >>>>
> >>>> Kind regards,
> >>>>
> >>>>
> >>>> Nick Way
> >>>
> >
>


Re: Best python 3 client for solrcloud

2016-11-24 Thread Nick Vasilyev
I am a committer for https://github.com/moonlitesolutions/SolrClient.

I think it's pretty good; my aim with it is to provide several reusable
modules for working with Solr in Python - not just querying, but working
with collections, indexing, reindexing, etc.

Check it out and let me know what you think.

On Nov 24, 2016 3:51 PM, "Dorian Hoxha"  wrote:

> Hi searchers,
>
> I see multiple clients for solr in python but each one looks like misses
> many features. What I need is for at least the low-level api to work with
> cloud (like retries on different nodes and nice exceptions). What is the
> best that you use currently ?
>
> Thank You!
>


Re: Find groups where at least one item matches a query

2017-02-05 Thread Nick Vasilyev
Check out the group.limit argument.
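
If you need the whole group back when only one member matches, the
collapse/expand approach Erick links to below may fit better. A rough
sketch with the field names from your example (expand.q support depends
on your Solr version):

q=pathology:Normal&fq={!collapse field=groupId}&expand=true&expand.q=*:*

The collapse filter keeps one matching document per groupId, and the
expand section then returns the other documents in those groups.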

On Feb 5, 2017 12:10 PM, "Cristian Popovici" 
wrote:

> Erick, thanks for you answer.
>
> Sorry - I forgot to mention that I do not know the group id when I perform
> the query.
> Grouping - I think - does not help for me as it filters out the documents
> that do not meet the filter criteria.
>
> Example:
> *q=pathology:Normal&group=true&group.field=groupId*  will miss out the
> "pathology":
> "Metastasis".
>
> I need to retrieve both documents in the same group even if only one meets
> the search criteria.
>
> Thanks!
>
> On Sun, Feb 5, 2017 at 6:54 PM, Erick Erickson 
> wrote:
>
> > Isn't this just "&fq=groupId:223"?
> >
> > Or do you mean you need multiple _groups_? In which case you can use
> > grouping, see:
> > https://cwiki.apache.org/confluence/display/solr/
> > Collapse+and+Expand+Results
> > and/or
> > https://cwiki.apache.org/confluence/display/solr/Result+Grouping
> >
> > but do note there are some limitations in distributed mode.
> >
> > Best,
> > Erick
> >
> > On Sun, Feb 5, 2017 at 1:49 AM, Cristian Popovici
> >  wrote:
> > > Hi all,
> > >
> > > I'm new to Solr and I need a bit of help.
> > >
> > > I have a structure of documents indexed in Solr that are grouped
> together
> > > by a property. I need to retrieve all groups where at least one entry
> in
> > > the group matches a query.
> > >
> > > Example:
> > > I have two documents indexed and both share the *groupId *property that
> > > defines the grouping field.
> > >
> > > *{*
> > > *"groupId": "223",*
> > > *"modality": "Computed Tomography",*
> > > *"anatomy": "Subcutaneous fat",*
> > > *"pathology": "Metastasis",*
> > > *}*
> > >
> > > *{*
> > > *"groupId": "223",*
> > > *"modality": "Computed Tomography",*
> > > *"anatomy": "Subcutaneous fat",*
> > > *"pathology": "Normal",*
> > > *}*
> > >
> > > I need to retrieve both entries in the group when performing a query
> > like:
> > >
> > > *(pathology:Normal)*
> > > Is this possible in solr?
> > >
> > > Thanks!
> >
>


Invalid UTF-8 character 0xffff at char #17373581, byte #17539047

2017-02-28 Thread Nick Way
Hello everyone,

We use Solr (with Adobe Coldfusion) to index circa 60,000 pdfs, however the
daily refresh has been failing with this error "Invalid UTF-8 character
0xffff at char #17373581, byte #17539047" [truncated - full error
message is posted below]

   - Can Solr be configured to skip problematic documents (eg those
   containing an invalid character)?
   - Can Solr be configured to log which document it had a problem indexing?
   - If no to both of the above, do you have any suggestions for how I can
   either detect the problematic document or stop Solr erroring on it?


Thank you very much indeed.

Kind regards,


Nick Way

full error message:

[was class java.io.CharConversionException] Invalid UTF-8 character 0xffff
at char #17373581, byte #17539047) java.lang.RuntimeException: [was class
java.io.CharConversionException] Invalid UTF-8 character 0xffff at char
#17373581, byte #17539047) at
com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731) at
com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809) at
org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:301) at
org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:157) at
org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79) at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326) at
org.mortbay.jetty.HttpConnection.handleRequest(H [was class
java.io.CharConversionException] Invalid UTF-8 character 0xffff at char
#17373581, byte #17539047) java.lang.RuntimeException: [was class
java.io.CharConversionException] Invalid UTF-8 character 0xffff at char
#17373581, byte #17539047) at
com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731) at
com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809) at
org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:301) at
org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:157) at
org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79) at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326) at
org.mortbay.jetty.HttpConnection.handleRequest(H request:
http://localhost:8985/solr/solr77b/update?commit=true&waitFlush=false&waitSearcher=false&wt=xml&version=2.2


Re: millions of records problem

2011-10-17 Thread Nick Veenhof
You could use this technique - I'm currently reading up on it:
http://khaidoan.wikidot.com/solr-common-gram-filter


On 17 October 2011 12:57, Jan Høydahl  wrote:
> Hi,
>
> What exactly do you mean by "slow" search? 1s? 10s?
> Which operating system, how many CPUs, which servlet container and how much 
> RAM have you allocated to your JVM? (-Xmx)
> What kind and size of docs? Your numbers indicate about 100bytes per doc?
> What kind of searches? Facets? Sorting? Wildcards?
> Have you tried to "slim down" you schema by setting indexed="false" and 
> stored="false" wherever possible?
>
> First thought is that it's really impressive if you've managed to get 500mill 
> docs into one index with only 8Gb RAM!! I would expect that to fail or best 
> case be very slow. If you have a beefy server I'd first try putting in 64Gb 
> RAM, slim down your schema and perhaps even switch to Solr4.0(trunk) which is 
> more RAM efficient.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 17. okt. 2011, at 12:19, Jesús Martín García wrote:
>
>> Hi,
>>
>> I've got 500 million documents in Solr, each with the same number of 
>> fields and similar width. The version of Solr which I use is 1.4.1 with 
>> Lucene 2.9.3.
>>
>> I don't have the option to use shards so the whole index has to be in a 
>> machine...
>>
>> The size of the index is about 50Gb and the RAM is 8Gb. Everything is 
>> working but the searches are very slow, although I tried different 
>> configurations of the solrconfig.xml such as:
>>
>> - configure a first searcher with the most used searches
>> - configure the caches (query, filter and document) with great numbers...
>>
>> but everything is still slow, so do you have any ideas to boost 
>> the searches without the penalty of using much more RAM?
>>
>> Thanks in advance,
>>
>> Jesús
>>
>> --
>> ...
>>      __
>>    /   /       Jesús Martín García
>> C E / S / C A   Tècnic de Projectes
>>  /__ /         Centre de Serveis Científics i Acadèmics de Catalunya
>>
>> Gran Capità, 2-4 (Edifici Nexus) · 08034 Barcelona
>> T. 93 551 6213 · F. 93 205 6979 · jmar...@cesca.cat
>> ...
>>
>
>


Re: Solr Distributed Search vs Hadoop

2011-12-23 Thread Nick Vincent
For data of this size you may want to look at something like Apache
Cassandra, which is made specifically to handle data at this kind of
scale across many machines.

You can still use Hadoop to analyse and transform the data in a
performant manner, however it's probably best to do some research on
this on the relevant technical forums for those technologies.

Nick


DisMax, multi fields, and phrase fields

2010-07-01 Thread Nick Hall
In my application, I have documents like:

DOCUMENT 1:
part_num: ABC123 Spark Plug
application: 2008 Toyota Corolla
application: 2007 Honda Civic

DOCUMENT 2:
part_num: FGH234 Spark Plug
application: 2007 Toyota Corolla
application: 2008 Honda Civic

The "application" field is set up to be a multi-valued field, and I am using
the DisMax request handler.

My goal is to be able to have the user search for something like:

2008 Toyota Corolla Spark Plug

and have it match Document 1 in this case. This currently works by using
DisMax and having it search both the part_num and application field.
However, this search also finds Document 2 because all the terms, "2008",
"Toyota", and "Corolla" all appear in the application fields, even though
they do not belong together in this case.

I understand that it may be hard to eliminate Document 2 from the search
results because the search has to be allowed to be a little fuzzy, but if I
check the scores of the documents, Document 1 is just barely ahead of
Document 2 in its score. I would like to figure out a way to get Document 1
to score higher in this case, since part of the query matches the phrase in
its application exactly.

I've been playing around with the phrase fields (pf) and phrase slop (ps)
parameters to try to get it to realize that "2008 Toyota Corolla" is a
phrase, in this example, and weight it higher for Document 1, but I haven't
been able to get Solr to identify this as a phrase. I've been looking at the
debug query and it will identify it as a phrase if the user only types in
something like:

2008 Toyota Corolla

but as soon as the Spark Plug terms are added, it looks like Solr is trying
to make the entire search expression into one long phrase.

Does anyone have a recommendation of how this can be done, so it can break
the search expression down and automatically make a phrase out of part of
it? Or, should I approach this whole problem from a different angle? Thanks.
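
(For reference, the kind of request I have been experimenting with looks
roughly like this, with pf and ps being the knobs in question:

q=2008 Toyota Corolla Spark Plug&defType=dismax&qf=part_num application&pf=application^10&ps=1

)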


Re: Solr Javascript+JSON not optimized for SEO

2010-10-25 Thread Nick Jenkin
The solution is to offer both, and provide fallback for browsers that
don't support javascript (e.g. Googlebot)
I would also ponder the question "how does this ajax feature help my
users?". If you can't find a good answer to that, you should probably
just not use ajax. (NB: "it's faster" is not a valid answer!)
-Nick

On Sun, Oct 24, 2010 at 12:30 AM, PeterKerk  wrote:
>
> Unfortunately its not online yet, but is there anything I can clarify in more
> detail?
>
> Thanks!
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-Javascript-JSON-not-optimized-for-SEO-tp1751641p1758054.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Concatenate multiple tokens into one

2010-11-11 Thread Nick Martin
Hi Robert, All,

I have a similar problem; here is my fieldType: 
http://paste.pocoo.org/show/289910/
I want to include stopword removal and lowercase the incoming terms. The idea 
being to take "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the EdgeNGram 
filter factory.
If anyone can tell me a simple way to concatenate tokens into one token again, 
similar to the KeywordTokenizer, that would be super helpful.

Many thanks

Nick

On 11 Nov 2010, at 00:23, Robert Gründler wrote:

> 
> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
> 
>> Are you sure you really want to throw out stopwords for your use case?  I 
>> don't think autocompletion will work how you want if you do. 
> 
> in our case i think it makes sense. the content is targetting the electronic 
> music / dj scene, so we have a lot of words like "DJ" or "featuring" which
> make sense to throw out of the query. Also searches for "the beastie boys" 
> and "beastie boys" should return a match in the autocompletion.
> 
>> 
>> And if you don't... then why use the WhitespaceTokenizer and then try to jam 
>> the tokens back together? Why not just NOT tokenize in the first place. Use 
>> the KeywordTokenizer, which really should be called the 
>> NonTokenizingTokenizer, becaues it doesn't tokenize at all, it just creates 
>> one token from the entire input string. 
> 
> I started out with the KeywordTokenizer, which worked well, except the 
> StopWord problem.
> 
> For now, i've come up with a quick-and-dirty custom "ConcatFilter", which 
> does what i'm after:
> 
> public class ConcatFilter extends TokenFilter {
> 
>   private TokenStream tstream;
> 
>   protected ConcatFilter(TokenStream input) {
>   super(input);
>   this.tstream = input;
>   }
> 
>   @Override
>   public Token next() throws IOException {
>   
>   Token token = new Token();
>   StringBuilder builder = new StringBuilder();
>   
>   TermAttribute termAttribute = (TermAttribute) 
> tstream.getAttribute(TermAttribute.class);
>   TypeAttribute typeAttribute = (TypeAttribute) 
> tstream.getAttribute(TypeAttribute.class);
>   
>   boolean incremented = false;
>   
>   while (tstream.incrementToken()) {
>   
>   if (typeAttribute.type().equals("word")) {
>   builder.append(termAttribute.term());   
> 
>   }
>   incremented = true;
>   }
>   
>   token.setTermBuffer(builder.toString());
>   
>   if (incremented == true)
>   return token;
>   
>   return null;
>   }
> }
> 
> I'm not sure if this is a safe way to do this, as i'm not familar with the 
> whole solr/lucene implementation after all.
> 
> 
> best
> 
> 
> -robert
> 
> 
> 
> 
>> 
>> Then lowercase, remove whitespace (or not), do whatever else you want to do 
>> to your single token to normalize it, and then edgengram it. 
>> 
>> If you include whitespace in the token, then when making your queries for 
>> auto-complete, be sure to use a query parser that doesn't do 
>> "pre-tokenization", the 'field' query parser should work well for this. 
>> 
>> Jonathan
>> 
>> 
>> 
>> 
>> From: Robert Gründler [rob...@dubture.com]
>> Sent: Wednesday, November 10, 2010 6:39 PM
>> To: solr-user@lucene.apache.org
>> Subject: Concatenate multiple tokens into one
>> 
>> Hi,
>> 
>> i've created the following filterchain in a field type, the idea is to use 
>> it for autocompletion purposes:
>> 
>> <analyzer>
>>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>   <filter class="solr.StopFilterFactory"
>>     words="stopwords.txt" enablePositionIncrements="true" />
>>   <filter class="solr.PatternReplaceFilterFactory" pattern="..."
>>     replacement="" replace="all" />
>>   ...
>>   <filter class="solr.EdgeNGramFilterFactory" ... />
>> </analyzer>
>> 
>> With that kind of filterchain, the EdgeNGramFilterFactory will receive 
>> multiple tokens on input strings with whitespaces in it. This leads to the 
>> following results:
>> Input Query: "George Cloo"
>> Matches:
>> - "George Harrison"
>> - "John Clooridge"
>> - "George Smith"
>> -"George Clooney"
>> - etc
>> 
>> However, only "George Clooney" should match in the autocompletion use case.
>> Therefore, i'd like to add a filter before the EdgeNGramFilterFactory, which 
>> concatenates all the tokens generated by the WhitespaceTokenizerFactory.
>> Are there filters which can do such a thing?
>> 
>> If not, are there examples of how to implement a custom TokenFilter?
>> 
>> thanks!
>> 
>> -robert
>> 
>> 
>> 
>> 
> 



Re: Concatenate multiple tokens into one

2010-11-11 Thread Nick Martin
Thanks Robert, I had been trying to get your ConcatFilter to work, but I'm not 
sure what i need in the classpath and where Token comes from.
Will check the thread you mention.

Best

Nick

On 11 Nov 2010, at 18:13, Robert Gründler wrote:

> I've posted a ConcatFilter in my previous mail which does concatenate tokens. 
> This works fine, but I
> realized that what I wanted to achieve is implemented more easily in another 
> way (by using 2 separate field types).
> 
> Have a look at a previous mail i wrote to the list and the reply from Ahmet 
> Arslan (topic: "EdgeNGram relevancy).
> 
> 
> best
> 
> 
> -robert
> 
> 
> 
> 
> On Nov 11, 2010, at 5:27 PM, Nick Martin wrote:
> 
>> Hi Robert, All,
>> 
>> I have a similar problem, here is my fieldType, 
>> http://paste.pocoo.org/show/289910/
>> I want to include stopword removal and lowercase the incoming terms. The 
>> idea being to take, "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the 
>> EdgeNgram filter factory.
>> If anyone can tell me a simple way to concatenate tokens into one token 
>> again, similar too the KeyWordTokenizer that would be super helpful.
>> 
>> Many thanks
>> 
>> Nick
>> 
>> On 11 Nov 2010, at 00:23, Robert Gründler wrote:
>> 
>>> 
>>> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
>>> 
>>>> Are you sure you really want to throw out stopwords for your use case?  I 
>>>> don't think autocompletion will work how you want if you do. 
>>> 
>>> in our case i think it makes sense. the content is targetting the 
>>> electronic music / dj scene, so we have a lot of words like "DJ" or 
>>> "featuring" which
>>> make sense to throw out of the query. Also searches for "the beastie boys" 
>>> and "beastie boys" should return a match in the autocompletion.
>>> 
>>>> 
>>>> And if you don't... then why use the WhitespaceTokenizer and then try to 
>>>> jam the tokens back together? Why not just NOT tokenize in the first 
>>>> place. Use the KeywordTokenizer, which really should be called the 
>>>> NonTokenizingTokenizer, becaues it doesn't tokenize at all, it just 
>>>> creates one token from the entire input string. 
>>> 
>>> I started out with the KeywordTokenizer, which worked well, except the 
>>> StopWord problem.
>>> 
>>> For now, i've come up with a quick-and-dirty custom "ConcatFilter", which 
>>> does what i'm after:
>>> 
>>> public class ConcatFilter extends TokenFilter {
>>> 
>>> private TokenStream tstream;
>>> 
>>> protected ConcatFilter(TokenStream input) {
>>> super(input);
>>> this.tstream = input;
>>> }
>>> 
>>> @Override
>>> public Token next() throws IOException {
>>> 
>>> Token token = new Token();
>>> StringBuilder builder = new StringBuilder();
>>> 
>>> TermAttribute termAttribute = (TermAttribute) 
>>> tstream.getAttribute(TermAttribute.class);
>>> TypeAttribute typeAttribute = (TypeAttribute) 
>>> tstream.getAttribute(TypeAttribute.class);
>>> 
>>> boolean incremented = false;
>>> 
>>> while (tstream.incrementToken()) {
>>> 
>>> if (typeAttribute.type().equals("word")) {
>>> builder.append(termAttribute.term());   
>>> 
>>> }
>>> incremented = true;
>>> }
>>> 
>>> token.setTermBuffer(builder.toString());
>>> 
>>> if (incremented == true)
>>> return token;
>>> 
>>> return null;
>>> }
>>> }
>>> 
>>> I'm not sure if this is a safe way to do this, as i'm not familar with the 
>>> whole solr/lucene implementation after all.
>>> 
>>> 
>>> best
>>> 
>>> 
>>> -robert
>>> 
>>> 
>>> 
>>> 
>>>> 
>>>> Then lowercase, remove whitespace (or not), do whatever else you want to 
>>>> do to your single token to no

Re: EdgeNGram relevancy

2010-11-11 Thread Nick Martin

On 12 Nov 2010, at 01:46, Ahmet Arslan  wrote:

>> This setup now makes troubles regarding StopWords, here's
>> an example:
>> 
>> Let's say the index contains 2 Strings: "Mr Martin
>> Scorsese" and "Martin Scorsese". "Mr" is in the stopword
>> list.
>> 
>> Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0
>> 
>> This way, the only result i get is "Mr Martin Scorsese",
>> because the strict field edgytext2 is boosted by 2.0. 
>> 
>> Any idea why in this case "Martin Scorsese" is not in the
>> result at all?
> 
> Did you run your query without using the () and "" operators? If yes, can you
> try this?
> &q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0
> 
> If not, can you paste the output of &debugQuery=on?
> 
> 
> 

This would still not deal with the problem of removing stop words from the 
indexing and query analysis stages.

I really need something that will allow that and give a single token as in the 
example below.

Best

Nick

Re: Error: _version_field must exist in schema

2012-11-22 Thread Nick Zadrozny
On Wed, Oct 17, 2012 at 3:20 PM, Dotan Cohen  wrote:

> I do have a Solr 4 Beta index running on Websolr that does not have
> such a field. It works, but throws many "Service Unavailable" and
> "Communication Error" errors. Might the lack of the _version_ field be
> the reason?
>

Belated reply, but this is probably something you should let us know about
directly at supp...@onemorecloud.com if it happens again. Cheers.

-- 
Nick Zadrozny

Cofounder, One More Cloud

websolr.com <https://websolr.com/home> • bonsai.io <http://bonsai.io/home>

Hassle-free hosted full-text search,
powered by Apache Solr and ElasticSearch.


Re: Help! Confused about using Jquery for the Search query - Want to ditch it

2012-06-07 Thread Nick Chase



On 6/7/2012 1:53 PM, Spadez wrote:

Hi Ben,

Thank you for the reply. So, if I don't want to use Javascript and I want
the entire page to reload each time, is it being done like this?

1. User submits form via GET
2. Solr server queried via GET
3. Solr server completes query
4. Solr server returns XML output
5. XML data put into results page
6. User shown new results page

Is this basically how it would work if we wanted Javascript out of the
equation?


Seems to me that you'd still have to have Javascript turn the XML into 
HTML -- unless you use the XsltResponseWriter 
(http://wiki.apache.org/solr/XsltResponseWriter) to use XSLT to turn the 
raw XML into your actual results HTML.


The other option is to create a python page that does the call to Solr 
and spits out just the HTML for your results, then call THAT rather than 
calling Solr directly.


  Nick
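
For reference, the XsltResponseWriter route mentioned above is driven by a
pair of query parameters; a hedged sketch (the stylesheet name is
illustrative and would live in the core's conf/xslt/ directory):

http://localhost:8983/solr/select?q=*:*&wt=xslt&tr=example.xsl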


Re: Help! Confused about using Jquery for the Search query - Want to ditch it

2012-06-07 Thread Nick Chase
+1 on that!  If you do want to provide direct results, ALWAYS send 
requests through a proxy that can verify that a) all requests are coming 
from your web app, and b) only "acceptable" queries are being passed on.


  Nick

On 6/7/2012 2:50 PM, Michael Della Bitta wrote:

On Thu, Jun 7, 2012 at 1:59 PM, Nick Chase  wrote:

The other option is to create a python page that does the call to Solr and 
spits out just the HTML for your results, then call THAT rather than calling 
Solr directly.


This is the *only* option if you're listening to Walter and me. Don't
give end users direct access to your Solr box!
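
A minimal sketch of that proxy idea in Java servlet form using SolrJ. The
Solr URL, the "title" field, and the sanitizing policy are illustrative
assumptions, not anything from this thread:

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class SearchProxyServlet extends HttpServlet {

    // Solr itself stays behind the firewall; only this servlet is public.
    private final HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String q = req.getParameter("q");
        // Verify the request looks like a plain keyword query before passing it on.
        if (q == null || q.isEmpty() || q.contains("{!")) {
            resp.sendError(HttpServletResponse.SC_BAD_REQUEST, "bad query");
            return;
        }
        try {
            StringBuilder html = new StringBuilder("<ul>");
            for (SolrDocument doc : solr.query(new SolrQuery(q)).getResults()) {
                // Real code should HTML-escape the field value.
                html.append("<li>").append(doc.getFieldValue("title")).append("</li>");
            }
            html.append("</ul>");
            resp.setContentType("text/html");
            resp.getWriter().println(html);
        } catch (SolrServerException e) {
            resp.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR, "search failed");
        }
    }
}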


Re: Question on addBean and deleteByQuery

2012-06-07 Thread Nick Zadrozny
On Wed, Jun 6, 2012 at 8:51 PM, Darin Pope  wrote:

> When using SolrJ (1.4.1 or 3.5.0) and calling either addBean or
> deleteByQuery, the POST body has numbers before and after the XML (47 and 0
> as noted in the example below):
>

It looks like this is HTTP chunked transfer encoding. As to whether that's
configurable in SolrJ, I defer to the experts on the list.

http://en.wikipedia.org/wiki/Chunked_transfer_encoding

-- 
Nick Zadrozny

http://websolr.com — hassle-free hosted search, powered by Apache Solr


Re: Question Solr Index main in RAM

2011-02-27 Thread Nick Jenkin
You could also try using a ram disk:
mkdir /var/ramdisk
mount -t tmpfs none /var/ramdisk -o size=<size>m

Obviously, if you lose power you will lose everything.

On Mon, Feb 28, 2011 at 11:37 AM, Lance Norskog  wrote:
> This sounds like a great idea but rarely works out. Garbage collection
> has to work around the data stored in memory, and most of the data you
> want to hit frequently is already indexed and cached. The operating
> system is very smart about keeping the popular parts of the index in
> memory, and there is no garbage collection there.
>
> I do not know if the RAMDirectoryFactory in current development has
> disk-backed persistence.
>
> On Thu, Feb 24, 2011 at 7:26 AM, Bill Bell  wrote:
>> How to use this?
>>
>> Bill Bell
>> Sent from mobile
>>
>>
>> On Feb 24, 2011, at 7:19 AM, Koji Sekiguchi  wrote:
>>
>>> (11/02/24 21:38), Andrés Ospina wrote:

 Hi,

 My name is Felipe and I want to keep Solr's main index in RAM.

 How is that possible? I have Solr 1.4.

 Thank you!

 Felipe
>>>
>>> Welcome Felipe!
>>>
>>> If I understand your question correctly, you can use RAMDirectoryFactory:
>>>
>>> https://hudson.apache.org/hudson/job/Solr-3.x/javadoc/org/apache/solr/core/RAMDirectoryFactory.html
>>>
>>> But I believe it will be available in 3.1 (to be released soon...).
>>>
>>> Koji
>>> --
>>> http://www.rondhuit.com/en/
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>
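
For reference, wiring in the RAMDirectoryFactory that Koji mentions is a
one-line solrconfig.xml change; a hedged sketch (and, as noted above, the
index then lives only in RAM and is lost on restart):

<directoryFactory name="DirectoryFactory" class="solr.RAMDirectoryFactory"/>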


SOLR 4 Alpha Out Of Mem Err

2012-07-14 Thread Nick Koton
I have been experiencing out of memory errors when indexing via solrj into a
4 alpha cluster.  It seems when I delegate commits to the server (either
auto commit or commit within) there is nothing to throttle the solrj clients
and the server struggles to fan out the work.  However, when I handle
commits entirely within the client, the indexing rate is very restricted.

Any suggestions would be appreciated

Nick Cotton
nick.ko...@gmail.com







RE: SOLR 4 Alpha Out Of Mem Err

2012-07-14 Thread Nick Koton
    at java.lang.Thread.start(Thread.java:640)
    at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:657)
    at java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:152)
    at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:340)
    at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:296)
    at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:228)
    at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:101)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:329)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:952)
    at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:176)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
    ... 25 more

Jul 14, 2012 9:20:57 PM org.apache.solr.common.SolrException log
-----Original Message-----
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Saturday, July 14, 2012 2:44 PM
To: solr-user@lucene.apache.org
Subject: Re: SOLR 4 Alpha Out Of Mem Err

Can you give more info? How much RAM are you giving Solr with Xmx? 

Can you be more specific about the behavior you are seeing with auto commit vs 
client commit? 

How often are you trying to commit? With the client? With auto commit?

Are you doing soft commits? Std commits? A mix?

What's the stack trace for the OOM?

What OS are you using?

Anything else you can add? 

-- 
Mark Miller



On Saturday, July 14, 2012 at 4:21 PM, Nick Koton wrote:

> I have been experiencing out of memory errors when indexing via solrj into a
> 4 alpha cluster. It seems when I delegate commits to the server (either
> auto commit or commit within) there is nothing to throttle the solrj clients
> and the server struggles to fan out the work. However, when I handle
> commits entirely within the client, the indexing rate is very restricted.
> 
> Any suggestions would be appreciated
> 
> Nick Cotton
> nick.ko...@gmail.com
> 
> 





RE: SOLR 4 Alpha Out Of Mem Err

2012-07-15 Thread Nick Koton
> Do you have the following hard autoCommit in your config (as the stock
> server does)?
> 
> <autoCommit>
>   <maxTime>15000</maxTime>
>   <openSearcher>false</openSearcher>
> </autoCommit>

I have tried with and without that setting.  When I described running with
auto commit, that setting is what I mean.  I have varied the time in the
range 10,000-60,000 msec.  I have tried this setting with and without soft
commit in the server config file.

I have tried without this setting, but specifying the commit within time in
the solrj client in the add method.

In both these cases, the client seems to overrun the server and out of
memory in the server results.  One clarification I should make is that after
the server gets out of memory, the solrj client does NOT receive an error.
However, the documents indexed do not reliably appear in queries.

Approach #3 is to remove the autocommit in the server config, issue the add
method without commit within, but issue commits in the solrj client with
wait for sync and searcher set to true.  In case #3, I do not see the out of
memory in the server.  However, document index rates are restricted to about
1,000 per second.

 Nick

-----Original Message-----
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Sunday, July 15, 2012 5:15 AM
To: solr-user@lucene.apache.org
Subject: Re: SOLR 4 Alpha Out Of Mem Err

Do you have the following hard autoCommit in your config (as the stock
server does)?

 
<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

This is now fairly important since Solr now tracks information on every
uncommitted document added.
At some point we should probably hardcode some mechanism based on number of
documents or time.

-Yonik
http://lucidimagination.com



RE: SOLR 4 Alpha Out Of Mem Err

2012-07-15 Thread Nick Koton
>> Solrj multi-threaded client sends several thousand docs/sec

>Can you expand on that?  How many threads at once are sending docs to solr?
Is each request a single doc or multiple?
I realize, after the fact, that my solrj client is much like
org.apache.solr.client.solrj.LargeVolumeTestBase.  The number of threads is
configurable at run time, as are the various commit parameters.  Most of the
tests have been in the 4-16 threads range.  Most of my testing has been with
the single document SolrServer::add(SolrInputDocument doc) method.  When I
realized what LargeVolumeTestBase is doing, I converted my program to use
the SolrServer::add(Collection<SolrInputDocument> docs) method with 100
documents in each add batch.  Unfortunately, the out of memory errors still
occur without client side commits.

If you agree my three approaches to committing are logical, would it be
useful for me to try to reproduce this with the "example" schema in a small
cloud configuration using LargeVolumeTestBase or the like?  It will take me
a couple of days to work it in.  Or perhaps this sort of test is already run?

Best 
Nick

-----Original Message-----
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Sunday, July 15, 2012 11:05 AM
To: Nick Koton
Cc: solr-user@lucene.apache.org
Subject: Re: SOLR 4 Alpha Out Of Mem Err

On Sun, Jul 15, 2012 at 11:52 AM, Nick Koton  wrote:
>> Do you have the following hard autoCommit in your config (as the stock
>> server does)?
>>
>> <autoCommit>
>>   <maxTime>15000</maxTime>
>>   <openSearcher>false</openSearcher>
>> </autoCommit>
>
> I have tried with and without that setting.  When I described running 
> with auto commit, that setting is what I mean.

OK cool.  You should be able to run the stock server (i.e. with this
autocommit) and blast in updates all day long - it looks like you have more
than enough memory.  If you can't, we need to fix something.  You shouldn't
need explicit commits unless you want the docs to be searchable at that
point.

> Solrj multi-threaded client sends several thousand docs/sec

Can you expand on that?  How many threads at once are sending docs to solr?
Is each request a single doc or multiple?

-Yonik
http://lucidimagination.com



RE: SOLR 4 Alpha Out Of Mem Err

2012-07-16 Thread Nick Koton
> That suggests you're running out of threads
Michael,
Thanks for this useful observation.  What I found just prior to the "problem
situation" was literally thousands of threads in the server JVM.  I have
pasted a few samples below obtained from the admin GUI.  I spent some time
today using this barometer, but I don't have enough to share right now.  I'm
looking at the difference between ConcurrentUpdateSolrServer and
HttpSolrServer and how my client may be misusing them.  I'll assume my
client is misbehaving and driving the server crazy for now.  If I figure out
how, I will share it so perhaps a safe guard can be put in place.

Nick


Server threads - very roughly 0.1 %:

cmdDistribExecutor-9-thread-7161 (10096)
java.util.concurrent.SynchronousQueue$TransferStack@17b90c55
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
    java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:424)
    java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:323)
    java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:874)
    java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:945)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
    java.lang.Thread.run(Thread.java:662)

cmdDistribExecutor-9-thread-7160 (10086)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@5509b56
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
    org.apache.http.impl.conn.tsccm.WaitingThread.await(WaitingThread.java:158)
    org.apache.http.impl.conn.tsccm.ConnPoolByRoute.getEntryBlocking(ConnPoolByRoute.java:403)
    org.apache.http.impl.conn.tsccm.ConnPoolByRoute$1.getPoolEntry(ConnPoolByRoute.java:300)
    org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager$1.getConnection(ThreadSafeClientConnManager.java:224)
    org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:401)
    org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
    org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
    org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732)
    org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:351)
    org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:182)
    org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:325)
    org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
    java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    java.util.concurrent.FutureTask.run(FutureTask.java:138)
    java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    java.util.concurrent.FutureTask.run(FutureTask.java:138)
    java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    java.lang.Thread.run(Thread.java:662)

cmdDistribExecutor-9-thread-7159 (10085)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6f062dd3
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
    org.apache.http.impl.conn.tsccm.WaitingThread.await(WaitingThread.java:158)
    org.apache.http.impl.conn.tsccm.ConnPoolByRoute.getEntryBlocking(ConnPoolByRoute.java:403)
    org.apache.http.impl.conn.tsccm.ConnPoolByRoute$1.getPoolEntry(ConnPoolByRoute.java:300)
    org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager$1.getConnection(ThreadSafeClientConnManager.java:224)
    org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:401)
    org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
    org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
    org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732)
    org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:351)
    org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:182)
    org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:325)
    org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
    java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    java.util.concurrent.F

RE: SOLR 4 Alpha Out Of Mem Err

2012-07-17 Thread Nick Koton
After trying a number of things, I have succeeded in letting the server
auto commit without it hitting thread/memory errors.  I have isolated
the required client change to replacing ConcurrentUpdateSolrServer with
HttpSolrServer.  I am able to maintain index rates of 3,000 documents/sec
with 6 shards and two servers per shard.  The servers receiving the index
requests hit steady state with approximately 800 threads per server.

So could there be something amiss in the server side implementation of
ConcurrentUpdateSolrServer?

Best regards,
Nick

-----Original Message-----
From: Nick Koton [mailto:nick.ko...@gmail.com] 
Sent: Monday, July 16, 2012 5:53 PM
To: 'solr-user@lucene.apache.org'
Subject: RE: SOLR 4 Alpha Out Of Mem Err

> That suggests you're running out of threads
Michael,
Thanks for this useful observation.  What I found just prior to the "problem
situation" was literally thousands of threads in the server JVM.  I have
pasted a few samples below obtained from the admin GUI.  I spent some time
today using this barometer, but I don't have enough to share right now.  I'm
looking at the difference between ConcurrentUpdateSolrServer and
HttpSolrServer and how my client may be misusing them.  I'll assume my
client is misbehaving and driving the server crazy for now.  If I figure out
how, I will share it so perhaps a safe guard can be put in place.

Nick


Server threads - very roughly 0.1 %:

cmdDistribExecutor-9-thread-7161 (10096)
java.util.concurrent.SynchronousQueue$TransferStack@17b90c55
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
    java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:424)
    java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:323)
    java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:874)
    java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:945)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
    java.lang.Thread.run(Thread.java:662)

cmdDistribExecutor-9-thread-7160 (10086)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@5509b56
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
    org.apache.http.impl.conn.tsccm.WaitingThread.await(WaitingThread.java:158)
    org.apache.http.impl.conn.tsccm.ConnPoolByRoute.getEntryBlocking(ConnPoolByRoute.java:403)
    org.apache.http.impl.conn.tsccm.ConnPoolByRoute$1.getPoolEntry(ConnPoolByRoute.java:300)
    org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager$1.getConnection(ThreadSafeClientConnManager.java:224)
    org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:401)
    org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
    org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
    org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732)
    org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:351)
    org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:182)
    org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:325)
    org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
    java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    java.util.concurrent.FutureTask.run(FutureTask.java:138)
    java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    java.util.concurrent.FutureTask.run(FutureTask.java:138)
    java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    java.lang.Thread.run(Thread.java:662)

cmdDistribExecutor-9-thread-7159 (10085)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6f062dd3
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
    org.apache.http.impl.conn.tsccm.WaitingThread.await(WaitingThread.java:158)
    org.apache.http.impl.conn.tsccm.ConnPoolByRoute.getEntryBlocking(ConnPoolByRoute.java:403)
    org.apache.http.impl.conn.tsccm.ConnPoolByRoute$1.getPoolEntry(ConnPoolByRoute.java:300)
    org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager$1.getConnection(ThreadSafeClientConnManager.java:224)
    org.apache.http.impl.client.DefaultRequestDirect
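
A minimal sketch of the client pattern that ended up working in this thread:
batched adds through HttpSolrServer, with commits left entirely to the
server's <autoCommit>. The URL, document count, and field names are
illustrative assumptions; the batch size of 100 comes from the thread:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {

    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("name", "document " + i);
            batch.add(doc);
            if (batch.size() == 100) {
                solr.add(batch);   // no explicit commit; the server's autoCommit handles it
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);
        }
        solr.shutdown();           // release the underlying HttpClient connections
    }
}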

SOLR 4 ALPHA /terms /browse

2012-07-18 Thread Nick Koton
When I set up a 2-shard cluster using the example and run it through its
paces, I find two features that do not work as I expect.  Any suggestions on
adjusting my configuration or expectations would be appreciated.

/terms does not return any terms when issued as follows:
http://hostname:8983/solr/terms?terms.fl=name&terms=true&terms.limit=-1&isShard=true&terms.sort=index&terms.prefix=s
but does return reasonable results when distrib is turned off, like so:
http://hostname:8983/solr/terms?terms.fl=name&terms=true&distrib=false&terms.limit=-1&isShard=true&terms.sort=index&terms.prefix=s

/browse returns this stack trace to the browser:
HTTP ERROR 500

Problem accessing /solr/browse. Reason:

{msg=ZkSolrResourceLoader does not support getConfigDir() - likely, what
you are trying to do is not supported in ZooKeeper mode,
trace=org.apache.solr.common.cloud.ZooKeeperException: ZkSolrResourceLoader
does not support getConfigDir() - likely, what you are trying to do is not
supported in ZooKeeper mode
    at org.apache.solr.cloud.ZkSolrResourceLoader.getConfigDir(ZkSolrResourceLoader.java:99)
    at org.apache.solr.response.VelocityResponseWriter.getEngine(VelocityResponseWriter.java:117)
    at org.apache.solr.response.VelocityResponseWriter.write(VelocityResponseWriter.java:40)
    at org.apache.solr.core.SolrCore$LazyQueryResponseWriterWrapper.write(SolrCore.java:1990)
    at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:398)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
    at org.eclipse.jetty.server.Server.handle(Server.java:351)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
    at java.lang.Thread.run(Thread.java:662)
,code=500}

Best regards,
Nick Koton





SOLR 4 BETA - More Like This

2012-08-28 Thread Nick Koton
I am having difficulty getting MLT to work with a cloud configuration in
SOLR 4 beta.  I have reproduced it with the "example" schema and data from
the distribution:
http://hostname:8983/solr/example/select?q=id:VDBDB1A16&mlt=true&mlt.fl=text,features,name,sku,id,manu,cat,title,description,keywords,author,resourcename

When I direct this at a single instance SOLR server I get the first response
below while the second response shows the cloud system's response.  The
"moreLikeThis" section is missing.  However, if I make an error in the query
syntax (like missing mlt.fl) I see the error from both.

Is anyone else seeing similar behavior?

Nick Koton


STANDALONE RESPONSE

[XML response garbled in the archive: the standalone server returns the
matched document VDBDB1A16 plus a moreLikeThis section listing similar
documents (VS1GB400C3, TWINX2048-3200PRO, 0579B002, EN7800GTX/2DHTV/256M).]

CLOUD RESPONSE

[XML responses garbled in the archive: the cloud system returns only the
matched document VDBDB1A16; the moreLikeThis section is missing.]

SOLR 4 BETA facet.pivot and cloud

2012-10-04 Thread Nick Cotton
Please pardon this if it is an FAQ, but after searching the archives I
cannot get a clear answer.

Does the new facet.pivot work with SOLRCloud?  When I run SOLR 4 BETA
with zookeeper, even if I specify shards=1, pivoting does not seem to
work.  The quickest way to demo this is with the velocity browse page
on the example data.  The pivot facet for "cat,inStock" only appears
if I run without zookeeper.

If this is known, can you please let me know whether this is a defect in
the beta that is expected to be fixed in GA, or whether it will remain a
limitation for some time.

regards,

Nick Koton


SolrCloud failover behavior

2012-11-03 Thread Nick Chase
I think there's a change in the behavior of SolrCloud vs. what's in the 
wiki, but I was hoping someone could confirm for me.  I checked JIRA and 
there were a couple of issues requesting partial results if one server 
comes down, but that doesn't seem to be the issue here.  I also checked 
CHANGES.txt and don't see anything that seems to apply.


I'm running "Example B: Simple two shard cluster with shard replicas" 
from the wiki at https://wiki.apache.org/solr/SolrCloud and everything 
starts out as expected.  However, when I get to the part about failover
behavior, things get a little wonky.


I added data to the shard running on 7475.  If I kill 7500, a query to 
any of the other servers works fine.  But if I kill 7475, rather than 
getting zero results on a search to 8983 or 8900, I get a 503 error:



   
<response>
  <lst name="responseHeader">
    <int name="status">503</int>
    <int name="QTime">5</int>
    <lst name="params">
      <str name="q">*:*</str>
    </lst>
  </lst>
  <lst name="error">
    <str name="msg">no servers hosting shard:</str>
    <int name="code">503</int>
  </lst>
</response>


I don't see any errors in the consoles.

Also, if I kill 8983, which includes the Zookeeper server, everything 
dies, rather than just staying in a steady state; the other servers 
continually show:


Nov 03, 2012 11:39:34 AM org.apache.zookeeper.ClientCnxn$SendThread startConnect
INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:9983
Nov 03, 2012 11:39:35 AM org.apache.zookeeper.ClientCnxn$SendThread run
WARNING: Session 0x13ac6cf87890002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1143)

Nov 03, 2012 11:39:35 AM org.apache.zookeeper.ClientCnxn$SendThread startConnect


over and over again, and a call to any of the servers shows a connection 
error to 8983.


This is the current 4.0.0 release, running on Windows 7.

If this is the proper behavior and the wiki needs updating, fine; I just 
need to know.  Otherwise if anybody has any clues as to what I may be 
missing, I'd be grateful. :)


Thanks...

---  Nick


Re: SolrCloud failover behavior

2012-11-06 Thread Nick Chase
Thanks a million, Erick!  You're right about killing both nodes hosting 
the shard.  I'll get the wiki corrected.


  Nick

On 11/3/2012 10:51 PM, Erick Erickson wrote:

SolrCloud doesn't work unless every shard has at least one server that is
up and running.

I _think_ you might be killing both nodes that host one of the shards. The
admin
page has a link showing you the state of your cluster. So when this happens,
does that page show both nodes for that shard being down?

And yeah, SolrCloud requires a quorum of ZK nodes up. So with only one ZK
node, killing that will bring down the whole cluster. Which is why the
usual
recommendation is that ZK be run externally and usually an odd number of ZK
nodes (three or more).

Anyone can create a login and edit the Wiki, so any clarifications are
welcome!

Best
Erick


On Sat, Nov 3, 2012 at 12:17 PM, Nick Chase  wrote:


I think there's a change in the behavior of SolrCloud vs. what's in the
wiki, but I was hoping someone could confirm for me.  I checked JIRA and
there were a couple of issues requesting partial results if one server
comes down, but that doesn't seem to be the issue here.  I also checked
CHANGES.txt and don't see anything that seems to apply.

I'm running "Example B: Simple two shard cluster with shard replicas" from
the wiki at 
https://wiki.apache.org/solr/**SolrCloud<https://wiki.apache.org/solr/SolrCloud>and
 everything starts out as expected.  However, when I get to the part
about fail over behavior is when things get a little wonky.

I added data to the shard running on 7475.  If I kill 7500, a query to any
of the other servers works fine.  But if I kill 7475, rather than getting
zero results on a search to 8983 or 8900, I get a 503 error:



<response>
  <lst name="responseHeader">
    <int name="status">503</int>
    <int name="QTime">5</int>
    <lst name="params">
      <str name="q">*:*</str>
    </lst>
  </lst>
  <lst name="error">
    <str name="msg">no servers hosting shard:</str>
    <int name="code">503</int>
  </lst>
</response>


I don't see any errors in the consoles.

Also, if I kill 8983, which includes the Zookeeper server, everything
dies, rather than just staying in a steady state; the other servers
continually show:

Nov 03, 2012 11:39:34 AM org.apache.zookeeper.ClientCnxn$SendThread startConnect
INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:9983
Nov 03, 2012 11:39:35 AM org.apache.zookeeper.ClientCnxn$SendThread run
WARNING: Session 0x13ac6cf87890002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1143)

Nov 03, 2012 11:39:35 AM org.apache.zookeeper.ClientCnxn$SendThread startConnect

over and over again, and a call to any of the servers shows a connection
error to 8983.

This is the current 4.0.0 release, running on Windows 7.

If this is the proper behavior and the wiki needs updating, fine; I just
need to know.  Otherwise if anybody has any clues as to what I may be
missing, I'd be grateful. :)

Thanks...

---  Nick
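
Following Erick's advice above, a production-style setup runs an external
ZooKeeper ensemble (an odd number of nodes, three or more) and points every
Solr node at all of them via the zkHost property; a sketch with illustrative
hostnames:

java -DzkHost=zk1:2181,zk2:2181,zk3:2181 -jar start.jar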





Internal Vs. External ZooKeeper

2012-11-11 Thread Nick Chase
OK, I can't find a definitive answer on this.  The wiki says not to use 
the embedded ZooKeeper servers for production.  But my question is: why 
not?  Basically, what are the reasons and circumstances that make you 
better off using an external ZooKeeper ensemble?


Thanks...

 Nick


Re: Internal Vs. External ZooKeeper

2012-11-11 Thread Nick Chase
Thanks, Jack, this is a great explanation!  And since a greater number 
of ZK nodes tends to degrade write performance, that would be a factor
against making every Solr node a ZK node as well.  Much obliged!


  Nick

On 11/11/2012 10:45 AM, Jack Krupansky wrote:

"Production" typically implies "high availability" and in a distributed
system the goal is that the overall cluster integrity and performance
should not be compromised just because a few "worker" nodes go down.
Solr nodes do a lot of complex operations and are quite prone to running
into "issues" that compromise their integrity and require that they be
taken down, restarted, etc. In fact, taking down a "bunch" of Solr
"worker" nodes should not be a big deal (unless they are all of the
nodes/replicas from a single shard/slice), while taking down a "bunch"
of zookeepers could be catastrophic to maintaining the integrity of the
zookeeper ensemble. (OTOH, if every Solr node is also a zookeeper node,
a "bunch" of Solr nodes would generally be less than a quorum, so maybe
that is not an absolute issue per se.) Zookeeper nodes are categorically
distinct in terms of their importance to maintaining the integrity and
availability of the overall cluster. They are special in that sense. And
they are special because they are maintaining the integrity of the
cluster's configuration information. Even for large clusters their
number will be relatively "few" compared to the "many" of "worker" nodes
(replicas), so zookeeper nodes need to be "protected" from the vagaries
that can disrupt and take Solr nodes down, not the least of which is
incoming traffic.

I'm not sure what the implications would be if you had a large cluster
and, because Zookeeper was embedded, a large number of zookeepers.
Any of the inter-zookeeper operations would take longer and could be
compromised by even a single busy/overloaded/dead Solr node. OTOH, the
Zookeeper ensemble design is supposed to be able to handle a fair number
of missing zookeeper nodes.

OTOH, if high availability is not a requirement for a production cluster
(use case?), then non-embedded zookeepers are certainly an annoyance.

Maybe you could think of embedded zookeeper like every employee having
their manager sitting right next to them all the time. How could that be
anything but a bad idea in terms of maximizing worker output - and
distracting/preventing managers from focusing on their own "work"?

-- Jack Krupansky

-----Original Message----- From: Nick Chase
Sent: Sunday, November 11, 2012 7:12 AM
To: solr-user@lucene.apache.org
Subject: Internal Vs. External ZooKeeper

OK, I can't find a definitive answer on this.  The wiki says not to use
the embedded ZooKeeper servers for production.  But my question is: why
not?  Basically, what are the reasons and circumstances that make you
better off using an external ZooKeeper ensemble?

Thanks...

 Nick



zkcli issues

2012-11-11 Thread Nick Chase

OK, so this is my ZooKeeper week, sorry. :)

So I'm trying to use ZkCLI without success.  I DID start and stop Solr 
in non-cloud mode, so everything is extracted and it IS finding 
zookeeper*.jar.  However, now it's NOT finding SolrJ.


I even tried to run it from the provided script (in cloud-scripts) with 
no success.  Here's what I've got:


>>> cd 

>>> .\example\cloud-scripts\zkcli.bat -cmd upconfig -zkhost localhost:9983 -confdir example/solr/collection/conf -confname conf1 -solrhome example/solr


>>>set JVM=java

>>>set SDIR=C:\sw\apache-solr-4.0.0\example\cloud-scripts\

>>>if "\" == "\" set SDIR=C:\sw\apache-solr-4.0.0\example\cloud-scripts

>>>"java" -classpath 
"C:\sw\apache-solr-4.0.0\example\cloud-scripts\..\solr-webapp\webapp\WEB-INF\lib\*" 
org.apache.solr.cloud.ZkCLI -cmd upconfig -zkhost localhost:9983 
-confdir example/solr/collection/conf -confname conf1 -solrhome example/solr


Error: Could not find or load main class C:\sw\apache-solr-4.0.0\example\cloud-scripts\..\solr-webapp\webapp\WEB-INF\lib\apache-solr-solrj-4.0.0.jar


I've verified that C:\sw\apache-solr-4.0.0\example\cloud-scripts\..\solr-webapp\webapp\WEB-INF\lib\apache-solr-solrj-4.0.0.jar
exists, so I'm really at a loss here.


Thanks...

  Nick


Re: zkcli issues

2012-11-15 Thread Nick Chase
Unfortunately, this doesn't seem to solve the issue; now I'm beginning 
to wonder if maybe it's because I'm on Windows.  Has anyone successfully 
run ZkCLI on Windows?


  Nick

On 11/12/2012 2:27 AM, Jeevanandam Madanagopal wrote:

Nick - Sorry, embedded links are not shown in previous email. I'm mentioning 
below.


Handy SolrCloud ZkCLI Commands 
(http://www.myjeeva.com/2012/10/solrcloud-cluster-single-collection-deployment/#handy-solrcloud-cli-commands)



Uploading Solr Configuration into ZooKeeper ensemble 
(http://www.myjeeva.com/2012/10/solrcloud-cluster-single-collection-deployment/#uploading-solrconfig-to-zookeeper)



Cheers,
Jeeva


On Nov 12, 2012, at 12:48 PM, Jeevanandam Madanagopal  wrote:


Nick -

I believe you're experiencing difficulties with the SolrCloud CLI commands for
interacting with ZooKeeper.
Please have a look at the links below; they should point you in the right direction.
Handy SolrCloud ZkCLI Commands
Uploading Solr Configuration into ZooKeeper ensemble

Cheers,
Jeeva

On Nov 12, 2012, at 4:45 AM, Mark Miller  wrote:


On 11/11/2012 04:47 PM, Yonik Seeley wrote:

On Sun, Nov 11, 2012 at 10:39 PM, Nick Chase  wrote:

So I'm trying to use ZkCLI without success.  I DID start and stop Solr in
non-cloud mode, so everything is extracted and it IS finding zookeeper*.jar.
However, now it's NOT finding SolrJ.


Re: zkcli issues

2012-11-16 Thread Nick Chase
I agree that it *shouldn't* be OS specific. :)  Anyway, thanks for the 
suggestion, but that's not it.  I get the same error with the script 
right out of the box:


Error: Could not find or load main class C:\sw\apache-solr-4.0.0\example\cloud-scripts\..\solr-webapp\webapp\WEB-INF\lib\apache-solr-solrj-4.0.0.jar


And anyway, it's a weird error, referencing a jar as a class, isn't it? 
 Start up a JIRA?


-  Nick

On 11/16/2012 10:42 AM, Mark Miller wrote:

I *think* I tested the script on windows once way back.

Anyway, the code itself should not be OS specific.

One thing you might want to check if you are copying unix cmd line
stuff - I think windows separates classpath entries with ; rather than
: - so you'll likely need to change that. You'd think java could have
been smart enough to accept either/or at worst, but meh.

For example:
.:/Users/jeeva/dc-1/solr-cli-lib/*
should be
.;/Users/jeeva/dc-1/solr-cli-lib/*

- Mark

On Thu, Nov 15, 2012 at 8:53 PM, Nick Chase  wrote:

Unfortunately, this doesn't seem to solve the issue; now I'm beginning to
wonder if maybe it's because I'm on Windows.  Has anyone successfully run
ZkCLI on Windows?

  Nick


On 11/12/2012 2:27 AM, Jeevanandam Madanagopal wrote:


Nick - Sorry, embedded links are not shown in previous email. I'm
mentioning below.


Handy SolrCloud ZkCLI Commands
(http://www.myjeeva.com/2012/10/solrcloud-cluster-single-collection-deployment/#handy-solrcloud-cli-commands)




Uploading Solr Configuration into ZooKeeper ensemble
(http://www.myjeeva.com/2012/10/solrcloud-cluster-single-collection-deployment/#uploading-solrconfig-to-zookeeper)




Cheers,
Jeeva


On Nov 12, 2012, at 12:48 PM, Jeevanandam Madanagopal 
wrote:


Nick -

I believe you're experiencing difficulties with the SolrCloud CLI commands
for interacting with ZooKeeper.
Please have a look at the links below; they should point you in the right direction.
Handy SolrCloud ZkCLI Commands
Uploading Solr Configuration into ZooKeeper ensemble

Cheers,
Jeeva

On Nov 12, 2012, at 4:45 AM, Mark Miller  wrote:


On 11/11/2012 04:47 PM, Yonik Seeley wrote:


On Sun, Nov 11, 2012 at 10:39 PM, Nick Chase 
wrote:


So I'm trying to use ZkCLI without success.  I DID start and stop Solr
in
non-cloud mode, so everything is extracted and it IS finding
zookeeper*.jar.
However, now it's NOT finding SolrJ.






Re: Seattle / PNW Hadoop/Lucene/HBase Meetup, Wed Sep 30th

2009-09-30 Thread Nick Dimiduk
As Bradford is out of town this evening, I will take up the mantle of
Person-on-Point. Contact me with questions re: tonight's gathering.

See you tonight!

-Nick
614.657.0267

On Mon, Sep 28, 2009 at 4:33 PM, Bradford Stephens <
bradfordsteph...@gmail.com> wrote:

> Hello everyone!
> Don't forget that the Meetup is THIS Wednesday! I'm looking forward to
> hearing about Hive from the Facebook team ... and there might be a few
> other
> interesting talks as well. Here's the details in the wiki:
> http://wiki.apache.org/hadoop/PNW_Hadoop_%2B_Apache_Cloud_Stack_User_Group
>
> Cheers,
> Bradford
>
> On Mon, Sep 14, 2009 at 11:35 AM, Bradford Stephens <
> bradfordsteph...@gmail.com> wrote:
>
> > Greetings,
> >
> > It's time for another Hadoop/Lucene/Apache"Cloud"  Stack meetup!
> > This month it'll be on Wednesday, the 30th, at 6:45 pm.
> >
> > We should have a few interesting guests this time around -- someone from
> > Facebook may be stopping by to talk about Hive :)
> >
> > We've had great attendance in the past few months, let's keep it up! I'm
> > always
> > amazed by the things I learn from everyone.
> >
> > We're back at the University of Washington, Allen Computer Science
> > Center (not Computer Engineering)
> > Map: http://www.washington.edu/home/maps/?CSE
> >
> > Room: 303 -or- the Entry level. If there are changes, signs will be
> posted.
> >
> > More Info:
> >
> > The meetup is about 2 hours (and there's usually food): we'll have two
> > in-depth talks of 15-20
> > minutes each, and then several "lightning talks" of 5 minutes. If no
> > one offers, we'll just have general discussion and 'social time'.
> > Let us know if you're interested in speaking
> > or attending. We'd like to focus on education, so every presentation
> > *needs* to ask some questions at the end. We can talk about these
> > after the presentations, and I'll record what we've learned in a wiki
> > and share that with the rest of us.
> >
> > Contact: Bradford Stephens, 904-415-3009, bradfordsteph...@gmail.com
> >
> > Cheers,
> > Bradford
> > --
> > http://www.roadtofailure.com -- The Fringes of Scalability, Social
> > Media, and Computer Science
> >
>
>
>
> --
> http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
> and Computer Science
>


Re: Seattle / PNW Hadoop/Lucene/HBase Meetup, Wed Sep 30th

2009-10-07 Thread Nick Dimiduk
Hey PNW Clouders! I'd really like to chat further with the crew doing
distributed Solr. Give me a ring or shoot me an email, let's do lunch!
-Nick

On Wed, Sep 30, 2009 at 2:10 PM, Nick Dimiduk  wrote:

> As Bradford is out of town this evening, I will take up the mantle of
> Person-on-Point. Contact me with questions re: tonight's gathering.
>
> See you tonight!
>
> -Nick
> 614.657.0267
>
>
> On Mon, Sep 28, 2009 at 4:33 PM, Bradford Stephens <
> bradfordsteph...@gmail.com> wrote:
>
>> Hello everyone!
>> Don't forget that the Meetup is THIS Wednesday! I'm looking forward to
>> hearing about Hive from the Facebook team ... and there might be a few
>> other
>> interesting talks as well. Here's the details in the wiki:
>> http://wiki.apache.org/hadoop/PNW_Hadoop_%2B_Apache_Cloud_Stack_User_Group
>>
>> Cheers,
>> Bradford
>>
>> On Mon, Sep 14, 2009 at 11:35 AM, Bradford Stephens <
>> bradfordsteph...@gmail.com> wrote:
>>
>> > Greetings,
>> >
>> > It's time for another Hadoop/Lucene/Apache"Cloud"  Stack meetup!
>> > This month it'll be on Wednesday, the 30th, at 6:45 pm.
>> >
>> > We should have a few interesting guests this time around -- someone from
>> > Facebook may be stopping by to talk about Hive :)
>> >
>> > We've had great attendance in the past few months, let's keep it up! I'm
>> > always
>> > amazed by the things I learn from everyone.
>> >
>> > We're back at the University of Washington, Allen Computer Science
>> > Center (not Computer Engineering)
>> > Map: http://www.washington.edu/home/maps/?CSE
>> >
>> > Room: 303 -or- the Entry level. If there are changes, signs will be
>> posted.
>> >
>> > More Info:
>> >
>> > The meetup is about 2 hours (and there's usually food): we'll have two
>> > in-depth talks of 15-20
>> > minutes each, and then several "lightning talks" of 5 minutes. If no
>> > one offers, we'll just have general discussion and 'social time'.
>> > Let us know if you're interested in speaking
>> > or attending. We'd like to focus on education, so every presentation
>> > *needs* to ask some questions at the end. We can talk about these
>> > after the presentations, and I'll record what we've learned in a wiki
>> > and share that with the rest of us.
>> >
>> > Contact: Bradford Stephens, 904-415-3009, bradfordsteph...@gmail.com
>> >
>> > Cheers,
>> > Bradford
>> > --
>> > http://www.roadtofailure.com -- The Fringes of Scalability, Social
>> > Media, and Computer Science
>> >
>>
>>
>>
>> --
>> http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
>> and Computer Science
>>
>
>


MoreLikeThis and interesting terms

2009-10-16 Thread Nick Spacek
Hi folks,

I'm having an issue where I want MLT to operate on multiple fields, one of
which contains a large number of terms (that is, each document in the index
has many terms for this field) and the others only a few terms per document.
In my situation, I am boosting the fields with fewer terms, because they
are what I'm particularly interested in, and the field with many terms is
the "fallback" (so it gets a lower boost).

The problem is that because there are so many common terms in the one field
across the documents in the index, MLT is returning only interesting terms
from this field. If I increase the mlt.maxqt I can find the terms from the
other (more important) fields, but then I have included many more terms from
the already over-influencing field.

Hopefully I have explained this well enough. My question is whether there is
a mechanism in place (I haven't been able to find one) to make the fields I
have boosted the ones that come up first in the list of interesting terms.

Thanks,
Nick Spacek
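
For reference, the MoreLikeThisHandler does accept per-field boosts through
mlt.qf (the same syntax as the dismax qf parameter); a hedged sketch of the
request described above, with illustrative host, core, field boosts, and id.
Note this weights the fields in the generated MLT query, which, as the
follow-up below discusses, is not the same as reordering the interesting
terms list:

http://localhost:8983/solr/mlt?q=id:12345&mlt.fl=description,city,province&mlt.qf=city^10+province^10+description^0.5&mlt.mintf=1&mlt.mindf=1&mlt.interestingTerms=details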


MoreLikeThis support Dismax parameters

2009-10-19 Thread Nick Spacek
From what I've read/found, MoreLikeThis doesn't support the dismax
parameters that are available in the StandardRequestHandler (such as bq). Is
it possible that we might get support for those parameters some time? What
are the issues with MLT Handler inheriting from the StandardRequestHandler
instead of RequestHandlerBase?

Nick Spacek


Re: MoreLikeThis support Dismax parameters

2009-11-03 Thread Nick Spacek
>
> As i said: that may be what you're looking for (it's hard to tell based on
> your email) but the other possibility is that you want to be able to
> specify bq (and maybe bf) type parrams to influence the MLT portion of the
> request (ie: apply a bias so docs matching a particular query/func are
> mosre likely to be suggested) ... this is an area that hasn't really been
> very well explored as far as i can remember.
>

Right, so I have a field with many terms in it and I want to find similar
documents using this against a number of other fields. In my situation, I
want to take the description field and look in description, city, and
province. I want the city and province fields to be "more important". I have
applied a boost to them, but even though they have higher values they are
not considered by Solr to be as "interesting", I think because they do not
occur as frequently. What ends up happening is that all of the matching
terms in the description field end up pushing the matching terms from city
and province to the bottom of the "interesting" list.

I think that's what you were saying in the second paragraph, right? There
currently doesn't seem to be a way to influence the ordering of the
interesting terms.

Thanks,
Nick


Re: [POLL] - A new logo for Solr

2008-08-01 Thread Nick Jenkin
Is there an option to keep the current one?
-Nick
On 8/2/08, Shalin Shekhar Mangar <[EMAIL PROTECTED]> wrote:
> On Fri, Aug 1, 2008 at 8:39 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
>  > On Fri, Aug 1, 2008 at 10:44 AM, Shalin Shekhar Mangar
>  > <[EMAIL PROTECTED]> wrote:
>  > > The design with the most number of total (first+second place) votes will
>  > be
>  > > accepted as the community's choice
>  >
>  > Thanks for setting this up Shalin.
>  > I'm not sure first+second votes should simply be added though
>  > (something that *no* one voted for as 1st could win)... lets just wait
>  > and see the results and hopefully it will be obvious this time.
>  >
>  > -Yonik
>  >
>
>
> Sure Yonik. No problem :)
>
>  --
>  Regards,
>
> Shalin Shekhar Mangar.
>


Indexing Only Parts of HTML Pages

2008-08-13 Thread Nick Tkach
I'm wondering, is there some way ("out of the box") to tell Solr that 
we're only interested in indexing certain parts of a page?  For example, 
let's say I have a bunch of pages in my site that contain some common 
navigation elements, roughly like this:



  
  

  Stuff here about parts of my site


  More stuff about other parts of the site

A bunch of stuff particular to each individual page...
  


Is there some way to either tell Solr to not index what's in the two 
divs whenever it encounters them (and it will-in nearly every page) or, 
failing that, to somehow easily give content in those areas a large 
negative score in order to get the same effect?


FWIW, we are using Nutch to do the crawling, but as I understand it 
there's no way to get Nutch to skip only parts of pages without writing 
custom code, right?


Re: Solr Logo thought

2008-08-21 Thread Nick Jenkin
I like the O: it is both the sun and it looks like an eye, which fits
in with the search.
Good stuff.
-Nick
On 8/21/08, Lukáš Vlček <[EMAIL PROTECTED]> wrote:
> Hi,
>
>  Well, the eye-looking O is not intentional. It is more a result of the
>  technique I used when doing the initial sketch. Believe it or not, this design
>  started at a magnetic drawing board (http://www.reggies.co.za/nov/nov274.jpg)
>  which I use now when playing with my 2 year old daughter. It is an excellent
>  piece of hardware, and though it lacks in terms of output resolution, is
>  not pressure sensitive, and its undo capabilities are very limited, it
>  outperforms my A4+ Wacom tablet in terms of bootup time and is absolutely
>  *green energy* equipment. But as I mentioned, its resolution is quite low
>  and thus the vectorized version is not perfect yet.
>
>  Anyway, I think the eye-looking O is an interesting observation; I will work
>  on this because I see that it can be confusing.
>
>  Regards,
>  Lukas
>
>
>  On Wed, Aug 20, 2008 at 9:01 PM, Mike Klaas <[EMAIL PROTECTED]> wrote:
>
>  > Nice job Lukas; the professionalism and quality of work is evident.  I like
>  > aspects of the logo, but too am having trouble getting past the eye-looking
>  > O.  Is it intentional (eye:look:search, etc)?
>  >
>  > -Mike
>  >
>  >
>  > On 20-Aug-08, at 5:25 AM, Mark Miller wrote:
>  >
>  >  I went through the same thought process - it took a couple minutes for the
>  >> whole thing to grow on me. Perhaps a tweak to the O if you're looking for
>  >> some constructive criticism?
>  >>
>  >> Again though, I really think its an awesome multipurpose logo. Works well
>  >> in color, b/w, large, small, and just the sun part as a facicon/other.
>  >>
>  >> Grant Ingersoll wrote:
>  >>
>  >>> It's pretty good, for me.  My first thought is it is an eye (the orange
>  >>> reminds me of eyelashes), and then the second thought is it is the Sun.
>  >>> Take that w/ a grain of salt, though, there's a reason why I do server-side
>  >>> code and not user interfaces and graphic design. :-)
>  >>>
>  >>> -Grant
>  >>>
>  >>> On Aug 20, 2008, at 3:48 AM, Lukáš Vlček wrote:
>  >>>
>  >>>  Hi,
>  >>>>
>  >>>> Only a few have responded so far. How can we get more feedback? Do you think I
>  >>>> should work on the proposal a little bit more and then attach it to
>  >>>> SOLR-84?
>  >>>>
>  >>>> Regards,
>  >>>> Lukas
>  >>>>
>  >>>> On Mon, Aug 18, 2008 at 6:14 PM, Otis Gospodnetic <
>  >>>> [EMAIL PROTECTED]> wrote:
>  >>>>
>  >>>>  I like it, even its asymmetry. :)
>  >>>>>
>  >>>>>
>  >>>>> Otis
>  >>>>> --
>  >>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>  >>>>>
>  >>>>>
>  >>>>>
>  >>>>> - Original Message 
>  >>>>>
>  >>>>>> From: Lukáš Vlček <[EMAIL PROTECTED]>
>  >>>>>> To: solr-user@lucene.apache.org
>  >>>>>> Sent: Sunday, August 17, 2008 7:02:25 PM
>  >>>>>> Subject: Re: Solr Logo thought
>  >>>>>>
>  >>>>>> Hi,
>  >>>>>>
>  >>>>>> My initial draft of Solr logo can be found here:
>  >>>>>> http://picasaweb.google.com/lukas.vlcek/Solr
>  >>>>>> The reason why I haven't attached it to SOLR-84 for now is that this
>  >>>>>> is
>  >>>>>>
>  >>>>> just
>  >>>>>
>  >>>>>> draft and not final design (there are a lot of unfinished details). I
>  >>>>>>
>  >>>>> would
>  >>>>>
>  >>>>>> like to get some feedback before I spend more time on it.
>  >>>>>>
>  >>>>>> I had several ideas but in the end I found that the simplicity works
>  >>>>>>
>  >>>>> best.
>  >>>>>
>  >>>>>> Simple font, sun motive, just two colors. Should look fine in both the
>  >>>>>>
>  >>>>> large
>  >>>>>
>  >>>>>> and small formats. As for the favicon I would use the sun motive only
>  >>>>>> -
>  >>>>>>
>  >>>>> it
>  >>>>>
>  >>>>>> means the O letter with the beams. The logo font still needs a lot of
>  >>>>>>
>  >>>>> small
>  >>>>>
>  >>>>>> (but important) touches. For now I would like to get feedback mostly
>  >>>>>>
>  >>>>> about
>  >>>>>
>  >>>>>> the basic idea.
>  >>>>>>
>  >>>>>> Regards,
>  >>>>>> Lukas
>  >>>>>>
>  >>>>>> On Sat, Aug 9, 2008 at 8:21 PM, Mark Miller wrote:
>  >>>>>>
>  >>>>>>  Plenty left, but here is a template to get things started:
>  >>>>>>> http://wiki.apache.org/solr/LogoContest
>  >>>>>>>
>  >>>>>>> Speaking of which, if we want to maintain the momentum of interest in
>  >>>>>>>
>  >>>>>> this
>  >>>>>
>  >>>>>> topic, someone (ie: not me) should setup a "LogoContest" wiki page
>  >>>>>>>>
>  >>>>>>> with some
>  >>>>>
>  >>>>>> of the "goals" discussed in the various threads on solr-user and
>  >>>>>>>>
>  >>>>>>> solr-dev
>  >>>>>
>  >>>>>> recently, as well as draft up some good guidelines for how we should
>  >>>>>>>>
>  >>>>>>> run the
>  >>>>>
>  >>>>>> contest
>  >>>>>>>>
>  >>>>>>>>
>  >>>>>>>
>  >>
>  >
>
>
>
> --
>
> http://blog.lukas-vlcek.com/
>

