ApacheCon North America 2018 schedule is now live.

2018-05-01 Thread Rich Bowen

Dear Apache Enthusiast,

We are pleased to announce our schedule for ApacheCon North America 
2018. ApacheCon will be held September 23-27 at the Montreal Marriott 
Chateau Champlain in Montreal, Canada.


Registration is open! The early bird rate of $575 lasts until July 21, 
at which time it goes up to $800. And the room block at the Marriott 
($225 CAD per night, including wifi) closes on August 24th.


We will be featuring more than 100 sessions on Apache projects. The 
schedule is now online at https://apachecon.com/acna18/


The schedule includes full tracks of content from CloudStack[1], 
Tomcat[2], and our GeoSpatial community[3].


We will have 4 keynote speakers, two of whom are Apache members, and two 
from the wider community.


On Tuesday, Apache member and former board member Cliff Schmidt will be 
speaking about how Amplio uses technology to educate and improve the 
quality of life of people living in very difficult parts of the 
world[4]. And Apache Fineract VP Myrle Krantz will speak about how Open 
Source banking is helping the global fight against poverty[5].


Then, on Wednesday, we’ll hear from Bridget Kromhout, Principal Cloud 
Developer Advocate from Microsoft, about the really hard problem in 
software - the people[6]. And Euan McLeod, VP VIPER at Comcast, will 
show us the many ways that Apache software delivers your favorite shows 
to your living room[7].


ApacheCon will also feature old favorites like the Lightning Talks, the 
Hackathon (running the duration of the event), PGP key signing, and lots 
of hallway-track time to get to know your project community better.


Follow us on Twitter, @ApacheCon, and join the disc...@apachecon.com 
mailing list (send email to discuss-subscr...@apachecon.com) to stay up 
to date with developments. And if your company wants to sponsor this 
event, get in touch at h...@apachecon.com for opportunities that are 
still available.


See you in Montreal!

Rich Bowen
VP Conferences, The Apache Software Foundation
h...@apachecon.com
@ApacheCon

[1] http://cloudstackcollab.org/
[2] http://tomcat.apache.org/conference.html
[3] http://apachecon.dukecon.org/acna/2018/#/schedule?search=geospatial
[4] 
http://apachecon.dukecon.org/acna/2018/#/scheduledEvent/df977fd305a31b903
[5] 
http://apachecon.dukecon.org/acna/2018/#/scheduledEvent/22c6c30412a3828d6
[6] 
http://apachecon.dukecon.org/acna/2018/#/scheduledEvent/fbbb2384fa91ebc6b
[7] 
http://apachecon.dukecon.org/acna/2018/#/scheduledEvent/88d50c3613852c2de


Re: Load Balancing between Two Cloud Clusters

2018-05-01 Thread Monica Skidmore
Thank you, Erick.  This is exactly the information I needed but hadn't 
correctly parsed as a new Solr cloud user.  You've just made setting up our new 
configuration much easier!!

Monica Skidmore
Senior Software Engineer
 

 
On 4/30/18, 7:29 PM, "Erick Erickson"  wrote:

"We need a way to determine that a node is still 'alive' and should be
in the load balancer, and we need a way to know that a new node is now
available and fully ready with its replicas to add to the load
balancer."

Why? If a Solr node is running but the replicas aren't up yet, it'll
pass the request along to a node that _does_ have live replicas; you
don't have to do anything. As for knowing the node is alive, there are
lots of ways: any API endpoint has to have a Solr node to field it,
so perhaps just use the Collections LIST command?
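
For example, a load-balancer health probe could hit the Collections API
on each node (hypothetical host and port; an HTTP 200 with a collections
list is a reasonable "alive" signal):

    curl "http://solr-node1:8983/solr/admin/collections?action=LIST&wt=json"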

"How does ZooKeeper make this determination?  Does it do something
different if multiple collections are on a single cluster?  And, even
with just one cluster, what is best practice for keeping a current
list of active nodes in the cluster, especially for extremely high
query rates?"

This is a common misconception. ZooKeeper isn't interested in Solr at
all. ZooKeeper will ping the nodes it knows about and, perhaps, remove
a node from the live_nodes list, but that's all. It isn't involved in
Solr's operation in terms of routing queries, updates or anything like
that.

_Solr_ keeps track of all this by _watching_ various znodes. Say Solr
hosts some replica in a collection. When it comes up it sets a "watch"
on the /collections/my_collection/state.json znode. It also publishes
its own state. So say it hosts three replicas for the collection: as
each one is loaded and ready for action, Solr posts an update to the
relevant state.json file.
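
As a rough sketch (heavily abbreviated; a real state.json carries more
fields), such an entry looks something like:

    {"my_collection": {
      "shards": {
        "shard1": {
          "replicas": {
            "core_node1": {"node_name": "node1:8983_solr", "state": "active"},
            "core_node2": {"node_name": "node2:8983_solr", "state": "down"}}}}}}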

ZooKeeper is then responsible for telling any other node that set a
watch that the znode has changed. ZK doesn't know or care whether
those are Solr nodes or not.

So when a request comes in to a Solr node, it knows what other Solr
nodes host what particular replicas and does all the sub-requests
itself, ZK isn't involved at all at that level.

So imagine node1 hosts S1R1 and S2R1, and node2 hosts S1R2 and S2R2 (for
collection A). When node1 comes up it updates the state in ZK to say
S1R1 and S2R1 are "active". Now say node2 is coming up but hasn't
loaded its cores yet. If it receives a request it can forward it on
to node1.

Now node2 loads both its cores. It updates the ZK node for the
collection, and since node1 is watching, it fetches the updated
state.json. From this point forward, both nodes have complete
information about all the replicas in the collection and don't need to
reference ZK any more at all.

In fact, ZK can completely go away and _queries_ can continue to work
off their cached state.json. Updates will fail since ZK quorums are
required for updates to indexes to prevent "split brain" problems.

Best,
Erick

On Mon, Apr 30, 2018 at 11:03 AM, Monica Skidmore
 wrote:
> Thank you, Erick.  That confirms our understanding for a single cluster,
> or once we select a node from one of the two clusters to query.
>
> As we try to set up an external load balancer to go between two
> clusters, though, we still have some questions.  We need a way to
> determine that a node is still 'alive' and should be in the load
> balancer, and we need a way to know that a new node is now available
> and fully ready with its replicas to add to the load balancer.
>
> How does ZooKeeper make this determination?  Does it do something
> different if multiple collections are on a single cluster?  And, even
> with just one cluster, what is best practice for keeping a current list
> of active nodes in the cluster, especially for extremely high query
> rates?
>
> Again, if there's some good documentation on this, I'd love a pointer...
>
> Monica Skidmore
> Senior Software Engineer
>
>
>
> On 4/30/18, 1:09 PM, "Erick Erickson"  wrote:
>
> Multiple clusters with the same dataset aren't load-balanced by Solr,
> you'll have to accomplish that from "outside", e.g. something that
> sends queries to each cluster.
>
> _Within_ a cluster (collection), as long as a request gets to any Solr
> node, sub-requests are distributed with an internal software LB. As far
> as a single collection, you're fine just sending any query to any node.
> Even if you send a query to a node that hosts no replicas for a
> collection, Solr will "do the right thing" and forward it appropriately.
>
> HTH,
> Erick
>
> On Mon, Apr 30, 2018 at 9:46 AM, Monica Skidmore <
> monica.skidm...@careerbuilder.com> wrote:
>
> > We are mig

Error when indexing against a specific dynamic field type

2018-05-01 Thread THADC
Hello,

We are migrating from Solr 4.7 to 7.3. When I encounter a data item that
matches a custom dynamic field from our 4.7 schema:

<dynamicField name="*_tsing" type="alphaOnlySort" indexed="true"
stored="true" multiValued="false"/>

I get the following exception:

Exception writing document id FULL_36265 to the index; possible analysis
error: Document contains at least one immense term in
field="gridFacts_tsing" (whose UTF8 encoding is longer than the max length
32766), all of which were skipped.  Please correct the analyzer to not
produce such terms.  The prefix of the first immense term is: '[108, 111,
114, 101, 109, 32, 105, 112, 115, 117, 109, 32, 100, 111, 108, 111, 114, 32,
115, 105, 116, 32, 97, 109, 101, 116, 44, 32, 99, 111]...', original
message: bytes can be at most 32766 in length; got 68144.

Any ideas are greatly appreciated. Thank you.







--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: 7.3 appears to leak

2018-05-01 Thread Markus Jelsma
Mạnh, Shalin,

I tried to reproduce it locally but I failed; it is not just a stream of 
queries and frequent updates/commits. We will temporarily abuse a production 
machine to run 7.3 and a control machine on 7.2 to rule some things out.

We have plenty of custom plugins, so when I can reproduce it again, we can 
rule stuff out and hopefully get back to you guys!
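
In the meantime, a cheap way to watch the suspect class count between
commits (assumes stock JDK tools on the box; <pid> is the Solr process):

    jmap -histo <pid> | grep SortedIntDocSet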

Thanks,
Markus
 
-Original message-
> From:Đạt Cao Mạnh 
> Sent: Monday 30th April 2018 4:07
> To: solr-user@lucene.apache.org
> Subject: Re: 7.3 appears to leak
> 
> Hi Markus,
> 
> I tried indexing documents and querying documents with queries and filter
> queries, but cannot find any leak problems. Can you give us more
> information about the leak?
> 
> Thanks!
> 
> On Fri, Apr 27, 2018 at 5:11 PM Shalin Shekhar Mangar <
> shalinman...@gmail.com> wrote:
> 
> > Hi Markus,
> >
> > Can you give an idea of what your filter queries look like? Any custom
> > plugins or things we should be aware of? Simple indexing artificial docs,
> > querying and committing doesn't seem to reproduce the issue for me.
> >
> > On Thu, Apr 26, 2018 at 10:13 PM, Markus Jelsma <
> > markus.jel...@openindex.io>
> > wrote:
> >
> > > Hello,
> > >
> > > We just finished upgrading our three separate clusters from 7.2.1 to 7.3,
> > > which went fine, except for our main text search collection, it appears
> > to
> > > leak memory on commit!
> > >
> > > After initial upgrade we saw the cluster slowly starting to run out of
> > > memory within about an hour and a half. We increased heap in case 7.3
> > just
> > > requires more of it, but the heap consumption graph is still growing on
> > > each commit. Heap space cannot be reclaimed by forcing the garbage
> > > collector to run, everything just piles up in the OldGen. Running with
> > this
> > > slightly larger heap, the first nodes will run out of memory in about two
> > > and a half hours after cluster restart.
> > >
> > > The heap eating cluster is a 2shard/3replica system on separate nodes.
> > > Each replica is about 50 GB in size and about 8.5 million documents. On
> > > 7.2.1 it ran fine with just a 2 GB heap. With 7.3 and 2.5 GB heap, it
> > will
> > > take just a little longer for it to run out of memory.
> > >
> > > I inspected reports shown by the sampler of VisualVM and spotted one
> > > peculiarity, the number of instances of SortedIntDocSet kept growing on
> > > each commit by about the same amount as the number of cached filter
> > > queries. But this doesn't happen on the logs cluster, SortedIntDocSet
> > > instances are neatly collected there. The number of instances also
> > > accounts for the number of commits since start-up times the cache
> > > sizes.
> > >
> > > Our other two clusters don't have this problem, one of them receives very
> > > few commits per day, but the other receives data all the time, it logs
> > user
> > > interactions so a large amount of data is coming in all the time. I
> > cannot
> > > reproduce it locally by indexing data and committing all the time, the
> > peak
> > > usage in OldGen stays about the same. But, I can reproduce it locally
> > > when I introduce queries and filter queries while indexing pieces of
> > > data and committing it.
> > >
> > > So, what is the problem? I dug in the CHANGES.txt of both Lucene and
> > Solr,
> > > but nothing really caught my attention. Does anyone here have an idea
> > where
> > > to look?
> > >
> > > Many thanks,
> > > Markus
> > >
> >
> >
> >
> > --
> > Regards,
> > Shalin Shekhar Mangar.
> >
> 


SolrCloud Heterogenous Hardware setup

2018-05-01 Thread Greenhorn Techie
Hi,

We are building a SolrCloud setup, which will index time-series data. Being
time-series data with write-once semantics, we are planning to have
multiple collections i.e. one collection per month. As per our use case,
end users should be able to query across last 12 months worth of data,
which means 12 collections (with one collection per month). To achieve
this, we are planning to leverage Solr collection aliasing such that the
search_alias collection will point to the 12 collections and indexing will
always happen to the latest collection.
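
As a sketch, the monthly roll would then just be alias updates
(hypothetical collection names):

    /admin/collections?action=CREATEALIAS&name=search_alias&collections=coll_2018_05,coll_2018_04,...,coll_2017_06
    /admin/collections?action=CREATEALIAS&name=index_alias&collections=coll_2018_05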

As it's write-once data, the question I have is whether it is
possible to have two different hardware profiles within the SolrCloud
cluster such that all the older collections (being read-only) will be
stored on the lower hardware spec, while the latest collection (being write
heavy) will be stored only on the higher hardware profile machines.

   - Is it possible to configure a collection such that the collection data
   is only stored on few nodes in the SolrCloud setup?
   - If this is possible, at the end of each month, what is the approach to
   be taken to “move” the latest collection from higher-spec hardware machines
   to the lower-spec ones?

TIA.


Re: SolrCloud Heterogenous Hardware setup

2018-05-01 Thread Erick Erickson
"Is it possible to configure a collection such that the collection
data is only stored on few nodes in the SolrCloud setup?"

Yes. There are "node placement rules", but also you can create a
collection with a createNodeSet that specifies the nodes that the
replicas are placed on.
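
For example (hypothetical collection and node names; nodes use the
host:port_solr form):

    /admin/collections?action=CREATE&name=coll_2018_05&numShards=2&replicationFactor=2&createNodeSet=fast1:8983_solr,fast2:8983_solr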

" If this is possible, at the end of each month, what is the approach
to be taken to “move” the latest collection from higher-spec hardware
machines to the lower-spec ones?"

There are a bunch of ways, in order of how long they've been around
(check your version). All of these are COLLECTIONS API calls; a sketch
of one follows below.
- ADDREPLICA/DELETEREPLICA
- MOVEREPLICA
- REPLACENODE
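
For instance, moving a single replica onto cheaper hardware might look
like this (hypothetical names):

    /admin/collections?action=MOVEREPLICA&collection=coll_2018_04&replica=core_node3&targetNode=slow1:8983_solr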

The other thing you may want to look at is that David Smiley has been
working on timeseries support in Solr, but that's quite recent so may
not be available in whatever version you're using. Nor do I know
enough details about it to know how (or if) it supports the
heterogeneous setup you're talking about. Check CHANGES.txt.

Best,
Erick



Re: Error when indexing against a specific dynamic field type

2018-05-01 Thread Erick Erickson
You're sending it a huge term. My guess is you're sending something
like base64-encoded data or perhaps just a single unbroken string in
your field.

Examine your document, it should jump out at you.

Best,
Erick



User queries end up in filterCache if facetting is enabled

2018-05-01 Thread Markus Jelsma
Hello,

We noticed the number of entries in the filterCache was higher than we 
expected. Using showItems="1024", something unexpected was listed among the 
filterCache entries: the complete Query.toString() of our user queries, 
massive entries, a lot of them.

We also spotted entries for all the fields we facet on, even though we don't 
use them as filters, but that is caused by facet.method=enum and should be 
expected, right?

Now, the user query entries are not expected. In the simplest setup, searching 
for something and only enabling the facet engine with facet=true causes it to 
appear in the cache as an entry. The following queries:

http://localhost:8983/solr/search/select?q=content_nl:nog&facet=true
http://localhost:8983/solr/search/select?q=*:*&facet=true

become listed as:

CACHE.searcher.filterCache.item_*:*:
org.apache.solr.search.BitDocSet@70051ee0

CACHE.searcher.filterCache.item_content_nl:nog:
org.apache.solr.search.BitDocSet@13150cf6
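
(These cache entries can also be dumped outside the admin UI, e.g. via
the mbeans handler, assuming the same core name:

    http://localhost:8983/solr/search/admin/mbeans?cat=CACHE&stats=true&wt=json )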

This is on 7.3, but 7.2.1 does this as well. 

So, should I expect this? Can I disable this? Bug?


Thanks,
Markus





Re: Load Balancing between Two Cloud Clusters

2018-05-01 Thread Erick Erickson
Glad to help. Yeah, I thought you might have been making it harder
than it needed to be ;).

In SolrCloud you're constantly running up against "it's just magic
until it's not"; knowing when the magic applies and when it doesn't can
be tricky, very tricky.

Basically when using LBs, people just throw nodes at the LB when they
come up. If the Solr end points aren't available, then they're skipped
etc.

I'll also add that SolrJ, the CloudSolrClient specifically, does all
this on the client side: it's ZK-aware, so it knows the topology of the
active Solr nodes and "does the right thing" via an internal LB.

Best,
Erick

On Tue, May 1, 2018 at 6:41 AM, Monica Skidmore
 wrote:
> Thank you, Erick.  This is exactly the information I needed but hadn't 
> correctly parsed as a new Solr cloud user.  You've just made setting up our 
> new configuration much easier!!
>
> Monica Skidmore
> Senior Software Engineer
>
>
>

Re: Error when indexing against a specific dynamic field type

2018-05-01 Thread Steve Rowe
The input in the error message starts “lorem ipsum”, so it contains spaces, but 
the alphaOnlySort field type (in Solr’s example schemas anyway) uses 
KeywordTokenizer, which tokenizes the entire input as a single token.

As Erick implied, you maybe should not be doing that with this kind of data - 
perhaps the analyzer used by this dynamic field should change?

Alternatively, you could:

a) truncate long values so that a prefix makes it through the indexing process, 
e.g. by adding TruncateTokenFilterFactory[1] to alphaOnlySort’s analyzer, or by 
adding TruncateFieldUpdateProcessorFactory[2] to your update request processor 
chain; or

b) entirely eliminate overly long values, e.g. using LengthFilterFactory[3].

[1] 
https://lucene.apache.org/core/7_3_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilterFactory.html
[2] 
https://lucene.apache.org/solr/7_3_0/solr-core/org/apache/solr/update/processor/TruncateFieldUpdateProcessorFactory.html
[3] 
https://lucene.apache.org/solr/guide/7_3/filter-descriptions.html#length-filter
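
For example, option (a) with the token filter might look like this in the
alphaOnlySort field type (a sketch based on the stock example schema; pick
a prefixLength that bounds your sort keys sensibly):

    <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.TruncateTokenFilterFactory" prefixLength="1024"/>
      </analyzer>
    </fieldType>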

--
Steve
www.lucidworks.com




Re: Error when indexing against a specific dynamic field type

2018-05-01 Thread Shawn Heisey
On 5/1/2018 8:40 AM, THADC wrote:
> I get the following exception:
>
> Exception writing document id FULL_36265 to the index; possible analysis
> error: Document contains at least one immense term in
> field="gridFacts_tsing" (whose UTF8 encoding is longer than the max length
> 32766), all of which were skipped.  Please correct the analyzer to not
> produce such terms.  The prefix of the first immense term is: '[108, 111,
> 114, 101, 109, 32, 105, 112, 115, 117, 109, 32, 100, 111, 108, 111, 114, 32,
> 115, 105, 116, 32, 97, 109, 101, 116, 44, 32, 99, 111]...', original
> message: bytes can be at most 32766 in length; got 68144.
>
> Any ideas are greatly appreciated. Thank you.

The error is not ambiguous.  It tells you precisely what the problem
is.  A single term in a Lucene index cannot be longer than about 32K,
and that one has a term that's more than twice that size.

I'm guessing that the fieldType named alphaOnlySort is one of two
things:  either the StrField class, or the TextField class with the
keyword tokenizer factory.

To fix this problem you will need to either reduce the size of the input
to the field, or use an analysis chain that splits the input into
smaller tokens.  The comma-separated numbers in the message are the
UTF-8 byte values of the term's prefix; decoded, it's ordinary prose
("lorem ipsum dolor sit amet, co..."), which probably should be
tokenized rather than treated as a single term.

Thanks,
Shawn



Re: SolrCloud Heterogenous Hardware setup

2018-05-01 Thread Greenhorn Techie
Thanks Erick. This information is very helpful. Will explore the node
placement rules within the Collections API further.

Many Thanks




Re: Error when indexing against a specific dynamic field type

2018-05-01 Thread THADC
Erick, thanks for the response. I have a number of documents in our database
where Solr is throwing the same exception against *_tsing types.

However, when I index the same document with our Solr 4.7, it is
successfully indexed, so I assume something is different between 4.7 and
7.3. I was assuming I could adjust the dynamic field somehow so that it
indexes these documents without errors when using 7.3.

I can't remove the offending documents. It's my customer's data.

Is there some adjustment I can make to the dynamic field?

Thanks again.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: SolrCloud Heterogenous Hardware setup

2018-05-01 Thread Deepak Goel
I had a similar problem some time back. Although it might not be the best
way, I used cron to move data from high-end-spec machines to
lower-end-spec ones. It worked beautifully.



Deepak
"The greatness of a nation can be judged by the way its animals are
treated. Please stop cruelty to Animals, become a Vegan"

+91 73500 12833
deic...@gmail.com

Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool

"Plant a Tree, Go Green"

Make In India : http://www.makeinindia.com/home

On Tue, May 1, 2018 at 10:02 PM, Greenhorn Techie  wrote:

> Thanks Erick. This information is very helpful. Will explore the node
> placement rules within the Collections API further.
>
> Many Thanks
>
>
>


Re: Error when indexing against a specific dynamic field type

2018-05-01 Thread Erick Erickson
Steve's comment is much more germane. KeywordTokenizer,
used in alphaOnlySort last I knew, is not appropriate at all.
Do you really want single tokens that consist of the entire
document for sorting purposes? Wouldn't the first 1K be enough?

It looks like this was put in in 4.0, so I'm guessing your analysis chain
is different now between the two versions.

It doesn't really matter though, this is not going to be changed.
You'll have to do something about your long fields or your
analysis chain. And/or revisit what you hope to accomplish
with using that field type on such a field, I'm almost certain
your use case is flawed.

Best,
Erick






Query Regarding Solr Garbage Collection

2018-05-01 Thread Greenhorn Techie
Hi,

Following the https://wiki.apache.org/solr/SolrPerformanceFactors article,
I understand that Garbage Collection might be triggered by a significant
increase in JVM heap usage unless a commit is performed. Given this
background, I am curious to understand the reasons / factors that
contribute to increased heap usage of the Solr JVM, which would thus
force a Garbage Collection cycle.

Especially, what are the factors that contribute to heap usage increase
during indexing time and what factors contribute during search/query time?

Thanks


Median Date

2018-05-01 Thread Jim Freeby
All,
We have a dateImported field in our schema.
I'd like to generate a statistic showing the median dateImported (actually we 
want median age of the documents, based on the dateImported value).
I have other stats that calculate the median value of numbers (like price).
This was achieved with something like:
rows=0&stats=true&stats.field={!tag=piv1 percentiles='50'}price&facet=true&facet.pivot={!stats=piv1}status

I have not found a way to calculate the median dateImported.  The mean
works, but we need median.
Any help would be appreciated.
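
One hedged sketch, in case percentiles aren't supported on date fields in
your version: index the same value as epoch milliseconds (say a plong field
dateImported_ms, populated at index time or by an update processor; the
field name is hypothetical) and reuse the same syntax:

    rows=0&stats=true&stats.field={!tag=piv1 percentiles='50'}dateImported_ms&facet=true&facet.pivot={!stats=piv1}status

The 50th percentile of the epoch values is the median date; median age is
then "now" minus that value, computed client-side.
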
Cheers,

Jim


Solr Heap usage

2018-05-01 Thread Greenhorn Techie
Hi,

Wondering what the considerations are to arrive at an optimal heap size
for the Solr JVM? Though I did discuss this on the IRC, I am still
unclear on how Solr uses the JVM heap space. Are there any pointers to
understand this aspect better?

Given that Solr requires an optimally configured heap, so that the
remaining unused memory can be used for OS disk cache, I wonder how to best
configure Solr heap. Also, on the IRC it was discussed that having 31GB of
heap is better than having 32GB due to Java’s internal usage of heap. Can
anyone guide further on heap configuration please?
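
For what it's worth, the 31GB-vs-32GB point is about compressed ordinary
object pointers (compressed oops): below roughly 32GB the JVM can use
32-bit references, above that every pointer doubles in size, so a 32GB
heap can effectively hold less than a 31GB one. A quick way to check
whether a given heap size keeps compressed oops enabled:

    java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops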

Thanks


Re: 7.3 appears to leak

2018-05-01 Thread Đạt Cao Mạnh
Thanks Markus,

So I will go ahead with 7.3.1 release.

On Tue, May 1, 2018 at 9:41 PM Markus Jelsma 
wrote:

> Mạnh, Shalin,
>
> I tried to reproduce it locally but I failed; it is not just a stream of
> queries and frequent updates/commits. We will temporarily abuse a
> production machine to run 7.3 and a control machine on 7.2 to rule some
> things out.
>
> We have plenty of custom plugins, so when I can reproduce it again, we
> can rule stuff out and hopefully get back to you guys!
>
> Thanks,
> Markus
>


Re: Learning to Rank (LTR) with grouping

2018-05-01 Thread ilayaraja
"Top K shouldn't start from the 'start' parameter, if it does, it is a bug."

1. I clearly see that LTR does re-rank based on the start parameter.
2. When reRankDocs=24 and pageSize=24, I still get the second page of results
re-ranked by the ltr plugin when I query with start=24.
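
For concreteness, the query shape in question looks roughly like this
(hypothetical model name):

    q=foo&start=24&rows=24&rq={!ltr model=myModel reRankDocs=24}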


Alessandro Benedetti wrote
> Are you using SolrCloud or any distributed search ?
> 
> If you are using just a single Solr instance, LTR should have no problem
> with pagination.
> The re-rank involves the top K and then you paginate.
> So if a document from the original score page 1 ends up in page 3, you
> will
> see it at page three.
> Have you verified that: "Say, if an item (Y) from second page is moved to
> first page after re-ranking, while an item (X) from first page is moved
> away from the first page"?
> Top K shouldn't start from the "start" parameter, if it does, it is a bug.
> 
> The situation change a little with distributed search where you can
> experiment this behaviour : 
> 
> *Pagination*
> Let’s explore the scenario on a single Solr node and on a sharded
> architecture.
> 
> SINGLE SOLR NODE
> 
> reRankDocs=15
> rows=10
> This means each page is composed by 10 results.
> What happens when we hit page 2?
> The first 5 documents in the search results will have been rescored and
> affected by the reranking.
> The latter 5 documents will preserve the original score and original
> ranking.
> 
> e.g.
> Doc 11 – score= 1.2
> Doc 12 – score= 1.1
> Doc 13 – score= 1.0
> Doc 14 – score= 0.9
> Doc 15 – score= 0.8
> Doc 16 – score= 5.7
> Doc 17 – score= 5.6
> Doc 18 – score= 5.5
> Doc 19 – score= 4.6
> Doc 20 – score= 2.4
> This means that score(15) could be < score(16), but documents 15 and 16 are
> still in the expected order.
> The reason is that the top 15 documents are rescored and reranked and the
> rest is left unchanged.
> 
> *SHARDED ARCHITECTURE*
> 
> reRankDocs=15
> rows=10
> Shards number=2
> When looking for page 2, Solr will trigger queries to the shards to
> collect 2 pages per shard :
> Shard1 : 10 ReRanked docs (page1) + 5 ReRanked docs + 5 OriginalScored
> docs
> (page2)
> Shard2 : 10 ReRanked docs (page1) + 5 ReRanked docs + 5 OriginalScored
> docs
> (page2)
> 
> Then the results will be merged, and possibly original-scored search
> results can rank above reranked docs.
> A possible solution could be to normalise the scores to prevent any
> possibility that a reranked result is surpassed by original scored ones.
> 
> Note: The problem is going to happen after you reach rows * page >
> reRankDocs. In situations when reRankDocs is quite high , the problem will
> occur only in deep paging.
> 
> 
> 
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html





-
--Ilay
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html