relaxed vs. improved validation in solr.TrieDateField

2016-04-29 Thread Uwe Reh

Hi,

doing some migration tests (4.10 to 6.0) I recognized an improved 
validation in TrieDateField.
Syntactically correct but impossible days are rejected now (stack trace 
at the end of the mail).


Examples:
- '1997-02-29T00:00:00Z'
- '2006-06-31T00:00:00Z'
- '2000-00-00T00:00:00Z'
The first two dates are formally OK, but the days do not exist. The 
third date is more suspicious, but it was also accepted by Solr 4.10.


I appreciate this improvement in principle, but I have to respect the 
original data. The dates might be intentionally wrong.
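
For illustration, the change resembles Java's lenient versus strict calendar
parsing: a lenient parser silently rolls '1997-02-29' over to '1997-03-01',
while a strict one rejects it. A minimal sketch of that difference (plain
Java, an analogy only, not Solr code):

import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class LenientDates {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        f.setLenient(true);
        // lenient, like Solr 4.10: 1997-02-29 rolls over to 1997-03-01
        System.out.println(f.parse("1997-02-29T00:00:00Z"));
        f.setLenient(false);
        // strict, like Solr 6: the same input now throws ParseException
        System.out.println(f.parse("1997-02-29T00:00:00Z"));
    }
}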


Is there an easy way to get the weaker validation back?

Regards
Uwe



Invalid Date in Date Math String:'1997-02-29T00:00:00Z'
at 
org.apache.solr.util.DateMathParser.parseMath(DateMathParser.java:254)
at org.apache.solr.schema.TrieField.createField(TrieField.java:726)
at org.apache.solr.schema.TrieField.createFields(TrieField.java:763)
at 
org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:47)




Re: solr | backup and restoration

2016-04-29 Thread Jan Verweij - Reeleez
Hi Prateek,

To me it feels like backup/restore is still an open item and should be
higher on the agenda.
Yes, there are work-arounds like copying data from/into the index folder,
but this doesn't seem very stable.

I'm using the following approach in solrcloud since I ran into an issue
with restoring the same backup multiple times:

rm $SOLRDATA_LOCATION/$INDEXNAME/data/index.properties
rm -rf $SOLRDATA_LOCATION/$INDEXNAME/data/restore.snapshot.
curl "
http://localhost:8983/solr/$INDEXNAME/replication?command=restore&location=$BACKUP_LOCATION&name=$INDEXNAME.$TIMESTAMP
"

Though this works, I still think a simple restore command should suffice,
with no need to tweak the current core/index directory/files.
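
For what it's worth, later versions of the ReplicationHandler also expose a
status command, so a script can poll until the restore has completed (an
assumption on my part; verify that your Solr version supports it):

curl "http://localhost:8983/solr/$INDEXNAME/replication?command=restorestatus"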




Met vriendelijke groet / Kind regards,

*Jan Verweij*
+31 (0)6 460 010 86
j...@reeleez.nl
www.reeleez.nl

On 27 April 2016 at 15:45, Prateek Jain J 
wrote:

>
> Manually copying files under index directory fixed the issue.
>
>
> Regards,
> Prateek Jain
>
> -Original Message-
> From: Prateek Jain J [mailto:prateek.j.j...@ericsson.com]
> Sent: 27 April 2016 02:08 PM
> To: solr-user@lucene.apache.org
> Subject: solr | backup and restoration
>
>
> Hi,
>
> We are using solr 4.8.1 in production and want to create backups at
> runtime. As per the reference guide, we can create a backup using something
> like this:
>
>
> http://localhost:8983/solr/myCore/replication?command=backup&location=/tmp/myBackup&numberToKeep=1
>
> and we verified that some files are being created in the /tmp/myBackup
> directory. The issue that we are facing is how to restore everything from
> this backup.
>  The admin guide does talk about "Merging Indexes" using two methods:
>
>
> a.   indexDir for example,
>
>
> http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&indexDir=/home/solr/core1/data/index&indexDir=/home/solr/core2/data/index
>
>
>
> b.  srcCore for example,
>
>
> http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&srcCore=core1&srcCore=core2
>
>
>
> these are not working in our case, as we want the entire data to come back,
> for example if we want to re-create a core from a snapshot. I do see that
> such functionality is available in later versions, as described
> here
>
>
> https://cwiki.apache.org/confluence/display/solr/Making+and+Restoring+Backups+of+SolrCores
>
>
> Regards,
> Prateek Jain
>
>


Facet ignoring repeated word

2016-04-29 Thread G, Rajesh
Hi,

I am trying to implement a word cloud using Solr. The problem I have is 
that a Solr facet query ignores repeated words in a document, e.g.

I have indexed the text:
It seems that the harder I work, the more work I get for the same compensation 
and reward. The more work I take on gets absorbed into my "normal" workload and 
I'm not recognized for working harder than my peers, which makes me not want to 
work to my potential. I am very underwhelmed by the evaluation process and 
bonus structure. I don't believe the current structure rewards strong 
performers. I am confident that the company could not hire someone with my 
talent to replace me if I left, but I don't think the company realizes that.

The indexed content has the word "my" with a count of 3, but when I run the query 
http://localhost:8182/solr/dev/select?facet=true&facet.field=comments&rows=0&indent=on&q=questionid:3956&wt=json
the count of the word "my" is 1 and not 3. Can you please help?

Also, please suggest if there is a better way to implement a word cloud in Solr 
other than using facets.

"facet_fields":{
  "comments":[
"absorbed",1,
"am",1,
"believe",1,
"bonus",1,
"company",1,
"compensation",1,
"confident",1,
"could",1,
"current",1,
"don't",1,
"evaluation",1,
"get",1,
"gets",1,
"harder",1,
"hire",1,
"i",1,
"i'm",1,
"left",1,
"makes",1,
"me",1,
"more",1,
"my",1,
"normal",1,
"peers",1,
"performers",1,
"potential",1,
"process",1,
"realizes",1,
"recognized",1,
"replace",1,
"reward",1,
"rewards",1,
"same",1,
"seems",1,
"someone",1,
"strong",1,
"structure",1,
"take",1,
"talent",1,
"than",1,
"think",1,
"underwhelmed",1,
"very",1,
"want",1,
"which",1,
"work",1,
"working",1,
"workload",1]
}








dataimport db-data-config.xml

2016-04-29 Thread kishor
I want to import data from a MySQL table and a CSV file at the same time, because
some data is in MySQL tables and some is in a CSV file. I want to match a
specific id from the MySQL table in the CSV file, then add the data to Solr.

What I think or want to do:

(the db-data-config.xml sample was stripped from the archived message)

Is this possible in Solr?

Please suggest how to import data from a CSV and a MySQL table at the same
time.











RE: Set router.field in unit tests

2016-04-29 Thread Markus Jelsma
Hi - any hints to share?

Thanks!
Markus

 
 
-Original message-
> From:Markus Jelsma 
> Sent: Thursday 28th April 2016 13:30
> To: solr-user 
> Subject: Set router.field in unit tests
> 
> Hi - i'm working on a unit test that requires the cluster's router.field to 
> be set to a field different than ID. But i can't find it?! How can i set 
> router.field with AbstractFullDistribZkTestBase?
> 
> Thanks!
> Markus
> 


Many to Many Mapping with Solr

2016-04-29 Thread Sandeep Mestry
Hi All,

Hope the day is going on well for you.

This question has been asked before, but I couldn't find an answer to my
specific request. I have a many-to-many relationship and the mapping table
has additional columns. What's the best way to model this as a Solr
entity?

For example: a user has many recordings and a recording belongs to many
users, but each user-recording has additional features like type, number, etc.
I'd like to fetch recordings for the user. If the user adds/updates/
deletes a recording, then that should be reflected in the search.

I have 2 options:
1) create a user entity, a recording entity, and a user_recording entity
- this is good, but it's like treating Solr like an RDBMS, which I mostly avoid

2) a user entity containing all the recordings information, and each recording
containing user information
- this has an impact on index size, but fetch and manipulation will be
faster.

Any guidance will be good..

Thanks,
Sandeep


Re: Set router.field in unit tests

2016-04-29 Thread GW
Not exactly sure what you mean, but I think you are wanting to change your
schema.xml

<field name="..." ... multiValued="false" />

to

<field name="..." ... required="true" multiValued="false" />


restart solr


On 29 April 2016 at 06:04, Markus Jelsma  wrote:

> Hi - any hints to share?
>
> Thanks!
> Markus
>
>
>
> -Original message-
> > From:Markus Jelsma 
> > Sent: Thursday 28th April 2016 13:30
> > To: solr-user 
> > Subject: Set router.field in unit tests
> >
> > Hi - i'm working on a unit test that requires the cluster's router.field
> to be set to a field different than ID. But i can't find it?! How can i set
> router.field with AbstractFullDistribZkTestBase?
> >
> > Thanks!
> > Markus
> >
>


Re: issues doing a spatial query

2016-04-29 Thread GW
I realise the world-wrap thing, but it is correct ~ they are coordinates
taken from Google Maps. It does not really matter though. I switched the
query to use geofilt and everything is fine.

Here's the kicker.

There is a post somewhere online that says you cannot use geofilt with
multivalued location_RPT. I lost months because I did not try it.

If I use geofilt with the coordinates in question (the last in the multivalue)
with a distance of 1 km I get a perfect result. In fact I can get a perfect
single direct hit on any of the values with geofilt + distance +
multivalued.
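
For reference, the working filter looks something like this (sfield and the
point are taken from this thread; d is the distance in kilometers):

q=*:*&fq={!geofilt sfield=locations pt=49.8522263,-97.1390697 d=1}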


People that don't know what they are talking about should not post.

Many thanks for your response.

GW


On 29 April 2016 at 00:40, David Smiley  wrote:

> Hi.
> This makes sense to me.  The point 49.8,-97.1 is in your query box.  The
> box is lower-left to upper-right, so your box is actually an almost
> world-wrapping one grabbing all longitudes except  -93 to -92.  Maybe you
> mean to switch your left & right.
>
> On Sun, Apr 24, 2016 at 8:03 PM GW  wrote:
>
> > I was not getting the results I expected so I started testing with the
> solr
> > webclient
> >
> > Maybe I don't understand things.
> >
> > simple test query
> >
> > q=*:*&fq=locations:[49,-92 TO 50,-93]
> >
> > I don't understand why I get a result set for longitude range -92 to -93
> > but should be zero results as far as I understand.
> >
> >
> >
> 
> >
> > {
> >   "responseHeader": {
> > "status": 0,
> > "QTime": 2,
> > "params": {
> >   "q": "*:*",
> >   "indent": "true",
> >   "fq": "locations:[49,-92 TO 50,-93]",
> >   "wt": "json",
> >   "_": "1461541195102"
> > }
> >   },
> >   "response": {
> > "numFound": 85,
> > "start": 0,
> > "docs": [
> >   {
> > "id": "data.spidersilk.co!337",
> > "entity_id": "337",
> > "type_id": "simple",
> > "gender": "Male",
> > "name": "Aviator Sunglasses",
> > "short_description": "A timeless accessory staple, the
> > unmistakable teardrop lenses of our Aviator sunglasses appeal to
> > everyone from suits to rock stars to citizens of the world.",
> > "description": "Gunmetal frame with crystal gradient
> > polycarbonate lenses in grey. ",
> > "size": "",
> > "color": "",
> > "zdomain": "magento.spidersilk.co",
> > "zurl":
> > "
> >
> http://magento.spidersilk.co/index.php/catalog/product/view/id/337/s/aviator-sunglasses/
> > ",
> > "main_image_url":
> > "
> >
> http://magento.spidersilk.co/media/catalog/product/cache/0/image/9df78eab33525d08d6e5fb8d27136e95/a/c/ace000a_1.jpg
> > ",
> > "keywords": "Eyewear  ",
> > "data_size": "851,564",
> > "category": "Eyewear",
> > "final_price_without_tax": "295,USD",
> > "image_url": [
> >   "
> > http://magento.spidersilk.co/media/catalog/product/a/c/ace000a_1.jpg";,
> >   "
> > http://magento.spidersilk.co/media/catalog/product/a/c/ace000b_1.jpg";
> > ],
> > "locations": [
> >   "37.4463603,-122.1591775",
> >   "42.5857514,-82.8873787",
> >   "41.6942622,-86.2697108",
> >   "49.8522263,-97.1390697"
> > ],
> > "_version_": 1532418847465799700
> >   },
> >
> >
> >
> > Thanks,
> >
> > GW
> >
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
>


Re: Many to Many Mapping with Solr

2016-04-29 Thread Alexandre Rafalovitch
You do not structure Solr to represent your database. You structure it
to represent what you will search.

In your case, it sounds like you want to return 'user-records', in
which case you will index the related information all together. Yes,
you will possibly need to recreate the multiple documents when you
update one record (or one user). And yes, you will have the same
information multiple times. But you can use indexed-only values or
docValues to reduce storage and duplication.

You may also want to have Solr return only the relevant IDs from the
search and recreate the m-to-m object structure from the database.
Then, you don't need to store much at all, just index.

Basically, don't think about your database as much when deciding Solr
structure. It does not map one-to-one.

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 29 April 2016 at 20:48, Sandeep Mestry  wrote:
> Hi All,
>
> Hope the day is going on well for you.
>
> This question has been asked before, but I couldn't find answer to my
> specific request. I have many to many relationship and the mapping table
> has additional columns. Whats the best way I can model this into solr
> entity?
>
> For example: a user has many recordings and a recording belongs to many
> users. But each user-recording has additional feature like type, number etc.
> I'd like to fetch recordings for the user. If the user adds/ updates/
> deletes a recording then that should be reflected in the search.
>
> I have 2 options:
> 1) to create user entity, recording entity and user_recording entity
> - this is good but it's like treating solr like rdbms which i mostly avoid..
>
> 2) user entity containing all the recordings information and each recording
> containing user information
> - this has impact on index size but the fetch and manipulation will be
> faster.
>
> Any guidance will be good..
>
> Thanks,
> Sandeep


Re: Many to Many Mapping with Solr

2016-04-29 Thread Joel Bernstein
We really still need to know more about your use case; in particular, what
types of questions will you be asking of the data? It's useful to do this
in plain English without mapping to any specific implementation.


Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Apr 29, 2016 at 9:43 AM, Alexandre Rafalovitch 
wrote:

> You do not structure Solr to represent your database. You structure it
> to represent what you will search.
>
> In your case, it sounds like you want to return 'user-records', in
> which case you will index the related information all together. Yes,
> you will possibly need to recreate the multiple documents when you
> update one record (or one user). And yes, you will have the same
> information multiple times. But you can use indexed-only values or
> docValues to reduce storage and duplication.
>
> You may also want to have Solr return only the relevant IDs from the
> search and you recreate the m-to-m object structure from the database.
> Then, you don't need to store much at all, just index.
>
> Basically, don't think about your database as much when deciding Solr
> structure. It does not map one-to-one.
>
> Regards,
>Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 29 April 2016 at 20:48, Sandeep Mestry  wrote:
> > Hi All,
> >
> > Hope the day is going on well for you.
> >
> > This question has been asked before, but I couldn't find answer to my
> > specific request. I have many to many relationship and the mapping table
> > has additional columns. Whats the best way I can model this into solr
> > entity?
> >
> > For example: a user has many recordings and a recording belongs to many
> > users. But each user-recording has additional feature like type, number
> etc.
> > I'd like to fetch recordings for the user. If the user adds/ updates/
> > deletes a recording then that should be reflected in the search.
> >
> > I have 2 options:
> > 1) to create user entity, recording entity and user_recording entity
> > - this is good but it's like treating solr like rdbms which i mostly
> avoid..
> >
> > 2) user entity containing all the recordings information and each
> recording
> > containing user information
> > - this has impact on index size but the fetch and manipulation will be
> > faster.
> >
> > Any guidance will be good..
> >
> > Thanks,
> > Sandeep
>


Re: Facet ignoring repeated word

2016-04-29 Thread Ahmet Arslan
Hi,

Depending on your requirements; StatsComponent, TermsComponent, 
LukeRequestHandler can also be used.


https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
https://wiki.apache.org/solr/LukeRequestHandler
https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
Ahmet



On Friday, April 29, 2016 11:56 AM, "G, Rajesh"  wrote:
Hi,

I am trying to implement a word cloud using Solr. The problem I have is 
that a Solr facet query ignores repeated words in a document, e.g.

I have indexed the text:
It seems that the harder I work, the more work I get for the same compensation 
and reward. The more work I take on gets absorbed into my "normal" workload and 
I'm not recognized for working harder than my peers, which makes me not want to 
work to my potential. I am very underwhelmed by the evaluation process and 
bonus structure. I don't believe the current structure rewards strong 
performers. I am confident that the company could not hire someone with my 
talent to replace me if I left, but I don't think the company realizes that.

The indexed content has the word "my" with a count of 3, but when I run the query 
http://localhost:8182/solr/dev/select?facet=true&facet.field=comments&rows=0&indent=on&q=questionid:3956&wt=json
the count of the word "my" is 1 and not 3. Can you please help?

Also, please suggest if there is a better way to implement a word cloud in Solr 
other than using facets.

"facet_fields":{
  "comments":[
"absorbed",1,
"am",1,
"believe",1,
"bonus",1,
"company",1,
"compensation",1,
"confident",1,
"could",1,
"current",1,
"don't",1,
"evaluation",1,
"get",1,
"gets",1,
"harder",1,
"hire",1,
"i",1,
"i'm",1,
"left",1,
"makes",1,
"me",1,
"more",1,
"my",1,
"normal",1,
"peers",1,
"performers",1,
"potential",1,
"process",1,
"realizes",1,
"recognized",1,
"replace",1,
"reward",1,
"rewards",1,
"same",1,
"seems",1,
"someone",1,
"strong",1,
"structure",1,
"take",1,
"talent",1,
"than",1,
"think",1,
"underwhelmed",1,
"very",1,
"want",1,
"which",1,
"work",1,
"working",1,
"workload",1]
}






Re: Decide on facets from results

2016-04-29 Thread Mark Robinson
Thanks much everyone!
Appreciate your responses.

Best,
Mark

On Thu, Apr 28, 2016 at 10:52 AM, Jay Potharaju 
wrote:

> On the same lines as Erik suggested, but using facet stats instead: you can
> get stats on your facet fields in the first pass and then include the
> facets that you need in the second pass.
>
>
> > On Apr 27, 2016, at 1:21 PM, Mark Robinson 
> wrote:
> >
> > Thanks Eric!
> > So that will mean another call will be definitely required to SOLR with
> the
> > facets,  before the results can be send back (with the facet fields being
> > derived traversing through the response).
> >
> > I was basically checking on whether in the "process" method (I believe
> > results will be accessed in the process method), we can dynamically
> > generate facets after traversing through the results and identifying the
> > fields for faceting, using some aggregation function or so, without
> having
> > to make another call using facet=on&facet.field=, before the
> > response is send back to the user.
> >
> > Cheers!
> >
> > On Wed, Apr 27, 2016 at 2:27 PM, Erik Hatcher 
> > wrote:
> >
> >> Results will vary based on how you indexed those fields, but sure…
> >> &facet=on&facet.field=... - with sufficient RAM, lots of fun
> >> to be had!
> >>
> >> —
> >> Erik Hatcher, Senior Solutions Architect
> >> http://www.lucidworks.com 
> >>
> >>
> >>
>  On Apr 27, 2016, at 12:13 PM, Mark Robinson 
> >>> wrote:
> >>>
> >>> Hi,
> >>>
> >>> If I don't have my facet list at query time, from the results can I
> >> select
> >>> some fields and by any means create a facet on them? ie after I get the
> >>> results I want to identify some fields as facets and send back facets
> for
> >>> them in the response.
> >>>
> >>> A kind of very dynamic faceting based on the results!
> >>>
> >>> Could someone please share their ideas.
> >>>
> >>> Thanks!
> >>> Anil.
> >>
> >>
>


Re: Tuning solr for large index with rapid writes

2016-04-29 Thread Erick Erickson
Good luck!

You have one huge advantage when doing prototyping: you can
mine your current logs for real user queries. It's actually
surprisingly difficult to generate, say, 10,000 "realistic" queries, and
IMO you need something approaching that number to ensure that
your queries don't just hit the caches etc.
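
A rough sketch of that log mining (assuming the default log location and that
queries show up as q= parameters; adjust the path and pattern to your setup):

grep -o 'q=[^& ]*' /var/solr/logs/solr.log | sort | uniq -c | sort -rn | head -n 10000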

Anyway, sounds like you're off and running.

Best,
Erick

On Wed, Apr 27, 2016 at 10:12 AM, Stephen Lewis  wrote:
>>
> If I'm reading this right, you have 420M docs on a single shard?
> Yep, you were reading it right. Thanks for your guidance. We will do
> various prototyping following "the sizing exercise".
>
> Best,
> Stephen
>
> On Tue, Apr 26, 2016 at 6:17 PM, Erick Erickson 
> wrote:
>
>>
>> If I'm reading this right, you have 420M docs on a single shard? If that's
>> true
>> you are pushing the envelope of what I've seen work and be performant. Your
>> OOM errors are the proverbial 'smoking gun' that you're putting too many
>> docs
>> on too few nodes.
>>
>> You say that the document count is "growing quite rapidly". My expectation
>> is
>> that your problems will only get worse as you cram more docs into your
>> shard.
>>
>> You're correct that adding more memory (and consequently more JVM
>> memory?) only gets you so far before you start running into GC trouble,
>> when you hit full GC pauses they'll get longer and longer which is its own
>> problem. And you don't want to have huge JVM memory at the expense
>> of op system memory due to the fact that Lucene uses MMapDirectory, see
>> Uwe's excellent blog:
>> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>
>> I'd _strongly_ recommend you do "the sizing exercise". There are lots of
>> details here:
>>
>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>
>> You've already done some of this inadvertently, unfortunately it sounds
>> like
>> it's in production. If I were going to guess, I'd say the maximum number of
>> docs on any shard should be less than half what you currently have. So you
>> need to figure out how many docs you expect to host in this collection
>> eventually
>> and have N/200M shards. At least.
>>
>> There are various strategies when the answer is "I don't know", you
>> might add new
>> collections when you max out and then use "collection aliasing" to
>> query them etc.
>>
>> Best,
>> Erick
>>
>> On Tue, Apr 26, 2016 at 3:49 PM, Stephen Lewis  wrote:
>> > Hello,
>> >
>> > I'm looking for some guidance on the best steps for tuning a solr cloud
>> > cluster which is heavy on writes. We are currently running a solr cloud
>> > fleet composed of one core, one shard, and three nodes. The cloud is
>> hosted
>> > in AWS, and each solr node is on its own linux r3.2xl instance with 8 cpu
>> > and 61 GiB mem, and a 2TB EBS volume attached. Our index is currently 550
>> > GiB over 420M documents, and growing quite rapidly. We are currently
>> doing
>> > a bit more than 1000 document writes/deletes per second.
>> >
>> > Recently, we've hit some trouble with our production cloud. We have had
>> the
>> > process on individual instances die a few times, and we see the following
>> > error messages being logged (expanded logs at the bottom of the email):
>> >
>> > ERROR - 2016-04-26 00:56:43.873; org.apache.solr.common.SolrException;
>> > null:org.eclipse.jetty.io.EofException
>> >
>> > WARN  - 2016-04-26 00:55:29.571;
>> org.eclipse.jetty.servlet.ServletHandler;
>> > /solr/panopto/select
>> > java.lang.IllegalStateException: Committed
>> >
>> > WARN  - 2016-04-26 00:55:29.571; org.eclipse.jetty.server.Response;
>> > Committed before 500 {trace=org.eclipse.jetty.io.EofException
>> >
>> >
>> > Another time we saw this happen, we had java OOM errors (expanded logs at
>> > the bottom):
>> >
>> > WARN  - 2016-04-25 22:58:43.943;
>> org.eclipse.jetty.servlet.ServletHandler;
>> > Error for /solr/panopto/select
>> > java.lang.OutOfMemoryError: Java heap space
>> > ERROR - 2016-04-25 22:58:43.945; org.apache.solr.common.SolrException;
>> > null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap
>> space
>> > ...
>> > Caused by: java.lang.OutOfMemoryError: Java heap space
>> >
>> >
>> > When the cloud goes into recovery during live indexing, it takes about
>> 4-6
>> > hours for a node to recover, but when we turn off indexing, recovery only
>> > takes about 90 minutes.
>> >
>> > Moreover, we see that deletes are extremely slow. We do batch deletes of
>> > about 300 documents based on two value filters, and this takes about one
>> > minute:
>> >
>> > Research online suggests that a larger disk cache could be helpful,
>> > but I also see from an older page on tuning for Lucene that turning
>> > down the swappiness on our Linux instances may be preferred to simply
>> > increasing space for the disk cache.
>> >
>> > Moreover, to scale 

Re: Solr 5.2.1 on Java 8 GC

2016-04-29 Thread Nick Vasilyev
Not sure if it helps anyone, but I am seeing decent results with the
following.

It was mostly a result of trial and error; I am not familiar with Java GC
or even Java itself. I added my interpretation of what was happening, but I
am not sure if it is right, so take it for what it's worth. It'd be nice if
someone could provide a better technical explanation. We are about to hit
daily peak load, and so far it doesn't look like there is any negative
performance impact.

-XX:NewRatio=2 \ #Increases the size of the young generation
-XX:SurvivorRatio=3 \ #Increases the size of the survivor spaces
-XX:TargetSurvivorRatio=90 \
-XX:MaxTenuringThreshold=8 \
-XX:+UseConcMarkSweepGC \
-XX:+CMSScavengeBeforeRemark \
-XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \ #Increasing these didn't
really help
-XX:PretenureSizeThreshold=512m \ # I am not sure what the full impact of
this is yet, but I am assuming it will put less stuff in the eden space
-XX:CMSFullGCsBeforeCompaction=1 \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSInitiatingOccupancyFraction=70 \
-XX:CMSMaxAbortablePrecleanTime=6000 \
-XX:+CMSParallelRemarkEnabled \
-XX:+ParallelRefProcEnabled \
-XX:+UseLargePages \
-XX:+AggressiveOpts \

Here are the GC times over 1 second.

Before:
2016-04-26T04:30:53.175-0400: 244734.856: Total time for which application
threads were stopped: 12.6587130 seconds, Stopping threads took: 0.0024770
seconds
2016-04-26T04:31:22.808-0400: 244764.489: Total time for which application
threads were stopped: 10.6840330 seconds, Stopping threads took: 0.0004840
seconds
2016-04-26T04:31:48.586-0400: 244790.267: Total time for which application
threads were stopped: 10.8198760 seconds, Stopping threads took: 0.0010340
seconds
2016-04-26T04:32:10.095-0400: 244811.777: Total time for which application
threads were stopped: 9.5644690 seconds, Stopping threads took: 0.0006750
seconds
2016-04-26T04:32:32.600-0400: 244834.282: Total time for which application
threads were stopped: 10.0890420 seconds, Stopping threads took: 0.0009930
seconds
2016-04-26T04:32:55.747-0400: 244857.429: Total time for which application
threads were stopped: 10.3426480 seconds, Stopping threads took: 0.0008190
seconds
2016-04-26T04:33:20.522-0400: 244882.203: Total time for which application
threads were stopped: 10.7531070 seconds, Stopping threads took: 0.0013280
seconds
2016-04-26T04:33:45.853-0400: 244907.535: Total time for which application
threads were stopped: 10.3933700 seconds, Stopping threads took: 0.0013970
seconds
2016-04-26T04:34:15.634-0400: 244937.316: Total time for which application
threads were stopped: 10.5744420 seconds, Stopping threads took: 0.0008980
seconds
2016-04-26T04:34:53.802-0400: 244975.484: Total time for which application
threads were stopped: 10.4964470 seconds, Stopping threads took: 0.0013830
seconds
2016-04-26T04:35:19.276-0400: 245000.957: Total time for which application
threads were stopped: 9.8195470 seconds, Stopping threads took: 0.0016110
seconds
2016-04-26T04:35:43.617-0400: 245025.299: Total time for which application
threads were stopped: 9.4856600 seconds, Stopping threads took: 0.0014980
seconds
2016-04-26T04:36:06.540-0400: 245048.222: Total time for which application
threads were stopped: 9.5009880 seconds, Stopping threads took: 0.0009080
seconds
2016-04-26T04:36:32.843-0400: 245074.525: Total time for which application
threads were stopped: 9.637 seconds, Stopping threads took: 0.0011770
seconds
2016-04-26T04:36:57.114-0400: 245098.795: Total time for which application
threads were stopped: 10.0064990 seconds, Stopping threads took: 0.0011480
seconds
2016-04-26T04:37:21.074-0400: 245122.755: Total time for which application
threads were stopped: 9.7061140 seconds, Stopping threads took: 0.0009760
seconds
2016-04-26T04:37:45.716-0400: 245147.398: Total time for which application
threads were stopped: 9.910 seconds, Stopping threads took: 0.0008220
seconds
2016-04-26T04:38:11.412-0400: 245173.094: Total time for which application
threads were stopped: 10.6839560 seconds, Stopping threads took: 0.0015370
seconds
2016-04-26T04:38:37.177-0400: 245198.859: Total time for which application
threads were stopped: 10.0646910 seconds, Stopping threads took: 0.0013740
seconds
2016-04-26T04:39:00.516-0400: 245222.197: Total time for which application
threads were stopped: 9.8280250 seconds, Stopping threads took: 0.0008900
seconds
2016-04-26T04:39:25.255-0400: 245246.937: Total time for which application
threads were stopped: 10.8429080 seconds, Stopping threads took: 0.0007120
seconds
2016-04-26T04:41:06.937-0400: 245348.619: Total time for which application
threads were stopped: 9.8060420 seconds, Stopping threads took: 0.0006300
seconds
2016-04-26T04:41:43.370-0400: 245385.052: Total time for which application
threads were stopped: 10.8144800 seconds, Stopping threads took: 0.0002260
seconds
2016-04-26T04:42:09.479-0400: 245411.161: Total time for which application
threads were stopped: 9.4059640 seconds, Stopping threads took:

Re: Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

2016-04-29 Thread Erick Erickson
Well, there have been lots of improvements since 4.6. You're right:
logically, when things come back up and are all reachable, it seems
like it should be possible to bring a node back up. There
have been situations where that doesn't happen, and various fixes
have been implemented as they've been identified.

You might try reloading the core from the core admin (that's
about the only thing you should try in SolrCloud from the
core admin screen)

Best,
Erick

On Wed, Apr 27, 2016 at 10:58 AM, Li Ding  wrote:
> Hi Erick,
>
> I don't have the GC log.  But after the GC finished, shouldn't the zk ping
> succeed and the core go back to the normal state?  From the log I
> posted, the sequence is:
>
> 1) Solr Detects itself can't connect to ZK and reconnect to ZK
> 2) Solr marked all cores are down
> 3) Solr recovery each cores, some succeeds, some failed.
> 4) After 30 minutes, the cores that are failed still marked as down.
>
> So my question is: during the 30-minute interval, if GC took too long,
> all cores should have failed.  And GC doesn't take longer than a minute, since
> all requests to the other cores succeed, and the next zk ping should
> bring the core back to normal, right?  We have an active monitor running at
> the same time querying every core in distrib=false mode, and every query
> succeeds.
>
> Thanks,
>
> Li
>
> On Tue, Apr 26, 2016 at 6:20 PM, Erick Erickson 
> wrote:
>
>> One of the reasons this happens is if you have very
>> long GC cycles, longer than the Zookeeper "keep alive"
>> timeout. During a full GC pause, Solr is unresponsive and
>> if the ZK ping times out, ZK assumes the machine is
>> gone and you get into this recovery state.
>>
>> So I'd collect GC logs and see if you have any
>> stop-the-world GC pauses that take longer than the ZK
>> timeout.
>>
>> see Mark Millers primer on GC here:
>> https://lucidworks.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/
>>
>> Best,
>> Erick
>>
>> On Tue, Apr 26, 2016 at 2:13 PM, Li Ding  wrote:
>> > Thank you all for your help!
>> >
>> > The zookeeper log rolled over, thisis from Solr.log:
>> >
>> > Looks like the solr and zk connection is gone for some reason
>> >
>> > INFO  - 2016-04-21 12:37:57.536;
>> > org.apache.solr.common.cloud.ConnectionManager; Watcher
>> > org.apache.solr.common.cloud.ConnectionManager@19789a96
>> > name:ZooKeeperConnection Watcher:{ZK HOSTS here} got event WatchedEvent
>> > state:Disconnected type:None path:null path:null type:None
>> >
>> > INFO  - 2016-04-21 12:37:57.536;
>> > org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected
>> >
>> > INFO  - 2016-04-21 12:38:24.248;
>> > org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection
>> expired
>> > - starting a new one...
>> >
>> > INFO  - 2016-04-21 12:38:24.262;
>> > org.apache.solr.common.cloud.ConnectionManager; Waiting for client to
>> > connect to ZooKeeper
>> >
>> > INFO  - 2016-04-21 12:38:24.269;
>> > org.apache.solr.common.cloud.ConnectionManager; Connected:true
>> >
>> >
>> > Then it publishes all cores on the hosts are down.  I just list three
>> cores
>> > here:
>> >
>> > INFO  - 2016-04-21 12:38:24.269; org.apache.solr.cloud.ZkController;
>> > publishing core=product1_shard1_replica1 state=down
>> >
>> > INFO  - 2016-04-21 12:38:24.271; org.apache.solr.cloud.ZkController;
>> > publishing core=collection1 state=down
>> >
>> > INFO  - 2016-04-21 12:38:24.272; org.apache.solr.cloud.ZkController;
>> > numShards not found on descriptor - reading it from system property
>> >
>> > INFO  - 2016-04-21 12:38:24.289; org.apache.solr.cloud.ZkController;
>> > publishing core=product2_shard5_replica1 state=down
>> >
>> > INFO  - 2016-04-21 12:38:24.292; org.apache.solr.cloud.ZkController;
>> > publishing core=product2_shard13_replica1 state=down
>> >
>> >
>> > product1 has only one shard one replica and it's able to be active
>> > successfully:
>> >
>> > INFO  - 2016-04-21 12:38:26.383; org.apache.solr.cloud.ZkController;
>> > Register replica - core:product1_shard1_replica1 address:http://
>> > {internalIp}:8983/solr collection:product1 shard:shard1
>> >
>> > WARN  - 2016-04-21 12:38:26.385; org.apache.solr.cloud.ElectionContext;
>> > cancelElection did not find election node to remove
>> >
>> > INFO  - 2016-04-21 12:38:26.393;
>> > org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader
>> > process for shard shard1
>> >
>> > INFO  - 2016-04-21 12:38:26.399;
>> > org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found
>> to
>> > continue.
>> >
>> > INFO  - 2016-04-21 12:38:26.399;
>> > org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new
>> leader -
>> > try and sync
>> >
>> > INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync
>> > replicas to http://{internalIp}:8983/solr/product1_shard1_replica1/
>> >
>> > INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync
>> > Success - now sync replicas to me
>> >
>> > INFO  - 2016-04

Phrases and edismax

2016-04-29 Thread Mark Robinson
Hi,

q=productType:(two piece bathtub white)
&defType=edismax&pf=productType^20.0&qf=productType^15.0

In the debug section this is what I see:-

(+(productType:two productType:piec productType:bathtub productType:white)
DisjunctionMaxQuery((productType:"piec bathtub white"^20.0)))/no_coord


My question is related to the "pf" (phrase fields) section of edismax.
As shown in the debug section, why is the phrase taken as "piec bathtub
white"? Why is the first word, "two", not considered in the phrase fields
section?
I want matches where the words "two piece bathtub white" appear together
to be boosted, not just matches of "piece bathtub white".

Could someone help me understand what I am missing?

Thanks!
Mark


RE: dataimport db-data-config.xml

2016-04-29 Thread Davis, Daniel (NIH/NLM) [C]
Kishor,

Data Import Handler doesn't know how to randomly access rows from the CSV to 
"JOIN" them to rows from the MySQL table at indexing time.
However, both MySQL and Solr know how to JOIN rows/documents from multiple 
tables/collections/cores.

Data Import Handler could read the CSV first, and query MySQL within that, but 
I don't think that's a great architecture because it depends on the business 
requirements in a rather brittle way (more on this below).

So, I see three basic architectures:

Use MySQL to do the JOIN:
--
- Your indexing isn't just DIH, but a script that first imports the CSV into a
MySQL table, validating that each id in the CSV is found in the MySQL table.
- Your DIH has either an <entity> for one SQL query that contains an <entity>
for the other SQL query, or it has a JOIN query/query on a MySQL view.

This is ideal if:
- Your resources (including you) are more familiar with RDBMS technology than 
Solr.
- You have no business requirement to return rows from just the MySQL table or 
just the CSV as search results.
- The data is small enough that the processing time to import into MySQL each 
time you index is acceptable.

Use Solr to do the JOIN:
--
- Index all the rows from the CSV as documents within Solr, 
- Index all the rows from the MySQL table as documents within Solr,
- Use JOIN queries to query them together.

This is ideal if:
- You don't control the MySQL database, and have no way at all to add a table 
to it.
- You have a business requirement to return either or both results from the 
MySQL table or the CSV.
- You want Solr JOIN queries on your Solr resume ;)   Not a terribly good 
reason, I guess.
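
As a sketch of such a Solr join (all core and field names here are
hypothetical), a query for CSV documents restricted by matching MySQL
documents could look like:

http://localhost:8983/solr/csvCore/select?q=*:*&fq={!join from=id to=csv_id fromIndex=mysqlCore}type:book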


Use Data Import Handler to do the JOIN:
---
If you absolutely want to join the data using Data Import Handler, then:
- Have DIH loop through the CSV *first*, and then make queries based on the id 
into the MySQL table.
- In this case, the <entity> for the MySQL query will appear within the 
<entity> for the CSV row, which will appear within an <entity> for the CSV file 
within the filesystem.
- The <entity> for the CSV row would be the primary document entity.

This is only appropriate if:
- There is no business requirement to search for results directly from the 
MySQL table on its own.
- Your business requirements suggest one result for each row from the CSV, 
rather than from the MySQL table or either way.
- The CSV contains every id in the MySQL table, or the entries within the MySQL 
table that don't have anything from the CSV shouldn't appear in the results 
anyway.
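
A minimal db-data-config.xml along those lines might look like the sketch
below. Every path, column name, and the "id,value" CSV layout here is an
assumption; LineEntityProcessor reads one CSV line at a time and the nested
SQL entity looks up the matching MySQL row by id:

<dataConfig>
  <dataSource name="db" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb" user="user" password="pass"/>
  <dataSource name="fs" type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- one Solr document per CSV line -->
    <entity name="csvLine" processor="LineEntityProcessor"
            url="/path/to/data.csv" dataSource="fs"
            transformer="RegexTransformer">
      <!-- split "id,value" lines; the target column names are hypothetical -->
      <field column="rawLine" regex="^(.*?),(.*)$" groupNames="id,csvValue"/>
      <!-- look up the matching MySQL row for this id -->
      <entity name="dbRow" dataSource="db"
              query="SELECT name, price FROM products WHERE id = '${csvLine.id}'"/>
    </entity>
  </document>
</dataConfig>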


-Original Message-
From: kishor [mailto:krajus...@gmail.com] 
Sent: Friday, April 29, 2016 4:58 AM
To: solr-user@lucene.apache.org
Subject: dataimport db-data-config.xml

I want to import data from a MySQL table and a CSV file at the same time, because 
some data is in MySQL tables and some is in a CSV file. I want to match a 
specific id from the MySQL table in the CSV file, then add the data to Solr.

What I think or want to do:

(the db-data-config.xml sample was stripped from the archived message)

Is this possible in Solr?

Please suggest how to import data from a CSV and a MySQL table at the same time.











Re: Facet ignoring repeated word

2016-04-29 Thread Erick Erickson
That's the way faceting is designed to work. It counts the _documents_
that a term appears in that satisfy your query; if a word appears
multiple times in a doc, it'll only be counted once.

For the general use-case it'd be unsettling for a user to see a facet
count of 500, then click on it and discover that the number of docs in
the corpus was really 345 or something.

Ahmet's hints might help, but I'd really ask if counting words
multiple times really satisfies the use case.
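
That said, if per-document counts really are needed, one option is the
termfreq() function query, which returns the within-document term frequency;
a sketch using the endpoint and field from this thread:

http://localhost:8182/solr/dev/select?q=questionid:3956&fl=id,termfreq(comments,'work')&wt=json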

Best,
Erick

On Fri, Apr 29, 2016 at 7:10 AM, Ahmet Arslan  wrote:
> Hi,
>
> Depending on your requirements; StatsComponent, TermsComponent, 
> LukeRequestHandler can also be used.
>
>
> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
> https://wiki.apache.org/solr/LukeRequestHandler
> https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
> Ahmet
>
>
>
> On Friday, April 29, 2016 11:56 AM, "G, Rajesh"  wrote:
> Hi,
>
> I am trying to implement a word cloud using Solr. The problem I have is
> that a Solr facet query ignores repeated words in a document, e.g.
>
> I have indexed the text :
> It seems that the harder I work, the more work I get for the same 
> compensation and reward. The more work I take on gets absorbed into my 
> "normal" workload and I'm not recognized for working harder than my peers, 
> which makes me not want to work to my potential. I am very underwhelmed by 
> the evaluation process and bonus structure. I don't believe the current 
> structure rewards strong performers. I am confident that the company could 
> not hire someone with my talent to replace me if I left, but I don't think 
> the company realizes that.
>
> The indexed content has the word "my" with a count of 3, but when I run the
> query
> http://localhost:8182/solr/dev/select?facet=true&facet.field=comments&rows=0&indent=on&q=questionid:3956&wt=json
> the count of the word "my" is 1 and not 3. Can you please help?
>
> Also, please suggest if there is a better way to implement a word cloud in Solr
> other than using facets.
>
> "facet_fields":{
>   "comments":[
> "absorbed",1,
> "am",1,
> "believe",1,
> "bonus",1,
> "company",1,
> "compensation",1,
> "confident",1,
> "could",1,
> "current",1,
> "don't",1,
> "evaluation",1,
> "get",1,
> "gets",1,
> "harder",1,
> "hire",1,
> "i",1,
> "i'm",1,
> "left",1,
> "makes",1,
> "me",1,
> "more",1,
> "my",1,
> "normal",1,
> "peers",1,
> "performers",1,
> "potential",1,
> "process",1,
> "realizes",1,
> "recognized",1,
> "replace",1,
> "reward",1,
> "rewards",1,
> "same",1,
> "seems",1,
> "someone",1,
> "strong",1,
> "structure",1,
> "take",1,
> "talent",1,
> "than",1,
> "think",1,
> "underwhelmed",1,
> "very",1,
> "want",1,
> "which",1,
> "work",1,
> "working",1,
> "workload",1]
> }
>
>
>
>


Re: Decide on facets from results

2016-04-29 Thread Joel Bernstein
Check out the new docs for the gatherNodes streaming expression. It allows
you to aggregate and then use those aggregates as input for another
expression. You can even do this across collections.

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62693238

This is slated for Solr 6.1



Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Apr 29, 2016 at 10:38 AM, Mark Robinson 
wrote:

> Thanks much everyone!
> Appreciate your responses.
>
> Best,
> Mark
>
> On Thu, Apr 28, 2016 at 10:52 AM, Jay Potharaju 
> wrote:
>
> > On the same lines as Erik suggested but using facet stats instead. you
> can
> > get stats on your facet fields in the first pass and then include the
> > facets that you need in the second pass.
> >
> >
> > > On Apr 27, 2016, at 1:21 PM, Mark Robinson 
> > wrote:
> > >
> > > Thanks Eric!
> > > So that will mean another call will be definitely required to SOLR with
> > the
> > > facets,  before the results can be send back (with the facet fields
> being
> > > derived traversing through the response).
> > >
> > > I was basically checking on whether in the "process" method (I believe
> > > results will be accessed in the process method), we can dynamically
> > > generate facets after traversing through the results and identifying
> the
> > > fields for faceting, using some aggregation function or so, without
> > having
> > > to make another call using facet=on&facet.field=, before
> the
> > > response is send back to the user.
> > >
> > > Cheers!
> > >
> > > On Wed, Apr 27, 2016 at 2:27 PM, Erik Hatcher 
> > > wrote:
> > >
> > >> Results will vary based on how you indexed those fields, but sure…
> > >> &facet=on&facet.field= - with sufficient RAM, lots of fun
> > to be
> > >> had!
> > >>
> > >> —
> > >> Erik Hatcher, Senior Solutions Architect
> > >> http://www.lucidworks.com 
> > >>
> > >>
> > >>
> >  On Apr 27, 2016, at 12:13 PM, Mark Robinson <
> mark123lea...@gmail.com>
> > >>> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> If I don't have my facet list at query time, from the results can I
> > >> select
> > >>> some fields and by any means create a facet on them? ie after I get
> the
> > >>> results I want to identify some fields as facets and send back facets
> > for
> > >>> them in the response.
> > >>>
> > >>> A kind of very dynamic faceting based on the results!
> > >>>
> > >>> Cld some one pls share their idea.
> > >>>
> > >>> Thanks!
> > >>> Anil.
> > >>
> > >>
> >
>


Re: Set router.field in unit tests

2016-04-29 Thread Erick Erickson
I'm pretty sure you can just create a collection after the distributed
stuff is set up.

Take a look at:

CollectionsAPIDistributedZkTest.testNodesUsedByCreate to see creating
a collection
in your test just by a request (you can set any params you want there, including
router.field).
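
The underlying Collections API request looks roughly like this (the
collection and field names here are hypothetical):

http://localhost:8983/solr/admin/collections?action=CREATE&name=testRouted&numShards=2&replicationFactor=1&collection.configName=conf1&router.field=routeField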

Or CollectionsAPISolrJTest.testCreateAndDeleteCollection for a niftier
builder pattern
SolrJ way.

Best,
Erick

On Fri, Apr 29, 2016 at 5:34 AM, GW  wrote:
> Not exactly sure what you mean, but I think you are wanting to change your
> schema.xml
>
> <field name="..." ... multiValued="false" />
>
> to
>
> <field name="..." ... required="true" multiValued="false" />
>
>
> restart solr
>
>
> On 29 April 2016 at 06:04, Markus Jelsma  wrote:
>
>> Hi - any hints to share?
>>
>> Thanks!
>> Markus
>>
>>
>>
>> -Original message-
>> > From:Markus Jelsma 
>> > Sent: Thursday 28th April 2016 13:30
>> > To: solr-user 
>> > Subject: Set router.field in unit tests
>> >
>> > Hi - i'm working on a unit test that requires the cluster's router.field
>> to be set to a field different than ID. But i can't find it?! How can i set
>> router.field with AbstractFullDistribZkTestBase?
>> >
>> > Thanks!
>> > Markus
>> >
>>


Re: Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

2016-04-29 Thread Don Bosco Durai
Hi Li

I got into a very similar situation to yours. The GC was taking much longer than 
the configured zookeeper timeout. I had 3 nodes in the SolrCloud and very often 
my entire cluster would be totally messed up. Increasing the zookeeper 
timeout eventually helped. But before that, I was able to do a temporary 
workaround with "rmr /solr/overseer/queue" in ZooKeeper (not sure whether I 
restarted Solr after that). I am not even sure this is the right thing to 
do, but it seemed to unblock me at the time. At least, there were no negative 
effects.
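
For reference, that clean-up with the ZooKeeper CLI looks roughly like this
(host, port, and the /solr chroot are assumptions from the message; use with
caution, since it deletes the overseer's pending work queue):

zkCli.sh -server localhost:2181 rmr /solr/overseer/queue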

Thanks

Bosco





On 4/29/16, 7:52 AM, "Erick Erickson"  wrote:

>Well, there have been lots of improvements since 4.6. You're right,
>logically when things come back up and are all reachable, it seems
>like it is theoretically possible to bring a node back up. There
>have been situations where that doesn't happen, and various fixes
>have been implemented to fix them as they're identified
>
>You might try reloading the core from the core admin (that's
>about the only thing you should try in SolrCloud from the
>core admin screen)
>
>Best,
>Erick
>
>On Wed, Apr 27, 2016 at 10:58 AM, Li Ding  wrote:
>> Hi Erick,
>>
>> I don't have the GC log.  But after the GC finished.  Isn't zk ping
>> succeeds and the core should be back to normal state?  From the log I
>> posted.  The sequence is:
>>
>> 1) Solr Detects itself can't connect to ZK and reconnect to ZK
>> 2) Solr marked all cores are down
>> 3) Solr recovery each cores, some succeeds, some failed.
>> 4) After 30 minutes, the cores that are failed still marked as down.
>>
>> So my questions is, during the 30 minutes interval, if GC takes too long,
>> all cores should failed.  And GC doesn't take longer than a minute since
>> all serving requests to other calls succeeds and the next zk ping should
>> bring the core back to normal? right?  We have an active monitor running at
>> the same time querying every core in distrib=false mode and every query
>> succeeds.
>>
>> Thanks,
>>
>> Li
>>
>> On Tue, Apr 26, 2016 at 6:20 PM, Erick Erickson 
>> wrote:
>>
>>> One of the reasons this happens is if you have very
>>> long GC cycles, longer than the Zookeeper "keep alive"
>>> timeout. During a full GC pause, Solr is unresponsive and
>>> if the ZK ping times out, ZK assumes the machine is
>>> gone and you get into this recovery state.
>>>
>>> So I'd collect GC logs and see if you have any
>>> stop-the-world GC pauses that take longer than the ZK
>>> timeout.
>>>
>>> see Mark Millers primer on GC here:
>>> https://lucidworks.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/
>>>
>>> Best,
>>> Erick
>>>
>>> On Tue, Apr 26, 2016 at 2:13 PM, Li Ding  wrote:
>>> > Thank you all for your help!
>>> >
>>> > The zookeeper log rolled over, thisis from Solr.log:
>>> >
>>> > Looks like the solr and zk connection is gone for some reason
>>> >
>>> > INFO  - 2016-04-21 12:37:57.536;
>>> > org.apache.solr.common.cloud.ConnectionManager; Watcher
>>> > org.apache.solr.common.cloud.ConnectionManager@19789a96
>>> > name:ZooKeeperConnection Watcher:{ZK HOSTS here} got event WatchedEvent
>>> > state:Disconnected type:None path:null path:null type:None
>>> >
>>> > INFO  - 2016-04-21 12:37:57.536;
>>> > org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected
>>> >
>>> > INFO  - 2016-04-21 12:38:24.248;
>>> > org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection
>>> expired
>>> > - starting a new one...
>>> >
>>> > INFO  - 2016-04-21 12:38:24.262;
>>> > org.apache.solr.common.cloud.ConnectionManager; Waiting for client to
>>> > connect to ZooKeeper
>>> >
>>> > INFO  - 2016-04-21 12:38:24.269;
>>> > org.apache.solr.common.cloud.ConnectionManager; Connected:true
>>> >
>>> >
>>> > Then it publishes all cores on the hosts are down.  I just list three
>>> cores
>>> > here:
>>> >
>>> > INFO  - 2016-04-21 12:38:24.269; org.apache.solr.cloud.ZkController;
>>> > publishing core=product1_shard1_replica1 state=down
>>> >
>>> > INFO  - 2016-04-21 12:38:24.271; org.apache.solr.cloud.ZkController;
>>> > publishing core=collection1 state=down
>>> >
>>> > INFO  - 2016-04-21 12:38:24.272; org.apache.solr.cloud.ZkController;
>>> > numShards not found on descriptor - reading it from system property
>>> >
>>> > INFO  - 2016-04-21 12:38:24.289; org.apache.solr.cloud.ZkController;
>>> > publishing core=product2_shard5_replica1 state=down
>>> >
>>> > INFO  - 2016-04-21 12:38:24.292; org.apache.solr.cloud.ZkController;
>>> > publishing core=product2_shard13_replica1 state=down
>>> >
>>> >
>>> > product1 has only one shard one replica and it's able to be active
>>> > successfully:
>>> >
>>> > INFO  - 2016-04-21 12:38:26.383; org.apache.solr.cloud.ZkController;
>>> > Register replica - core:product1_shard1_replica1 address:http://
>>> > {internalIp}:8983/solr collection:product1 shard:shard1
>>> >
>>> > WARN  - 2016-04-21 12:38:26.385; org.apache.solr.cloud.ElectionContext;
>>> > cancelElection did not find election node to remo

Re: Set router.field in unit tests

2016-04-29 Thread Alan Woodward
It's almost certainly worth using SolrCloudTestBase rather than 
AbstractDistribZkTestBase as well - normally makes the test five or six times 
faster.

Alan Woodward
www.flax.co.uk


On 29 Apr 2016, at 17:11, Erick Erickson wrote:

> I'm pretty sure you can just create a collection after the distributed
> stuff is set up.
> 
> Take a look at:
> 
> CollectionsAPIDistributedZkTest.testNodesUsedByCreate to see creating
> a collection
> in your test just by a request (you can set any params you want there, 
> including
> router.field).
> 
> Or CollectionsAPISolrJTest.testCreateAndDeleteCollection for a niftier
> builder pattern
> SolrJ way.
> 
> Best,
> Erick
> 
> On Fri, Apr 29, 2016 at 5:34 AM, GW  wrote:
>> Not exactly sure what you mean, but I think you are wanting to change your
>> schema.xml
>>
>> <field name="..." ... multiValued="false" />
>>
>> to
>>
>> <field name="..." ... required="true" multiValued="false" />
>> 
>> 
>> restart solr
>> 
>> 
>> On 29 April 2016 at 06:04, Markus Jelsma  wrote:
>> 
>>> Hi - any hints to share?
>>> 
>>> Thanks!
>>> Markus
>>> 
>>> 
>>> 
>>> -Original message-
 From:Markus Jelsma 
 Sent: Thursday 28th April 2016 13:30
 To: solr-user 
 Subject: Set router.field in unit tests
 
 Hi - i'm working on a unit test that requires the cluster's router.field
>>> to be set to a field different than ID. But i can't find it?! How can i set
>>> router.field with AbstractFullDistribZkTestBase?
 
 Thanks!
 Markus
 
>>> 



Re: issues doing a spatial query

2016-04-29 Thread Erick Erickson
Where is the doc that's "somewhere online"? One of the
issues I face constantly is knowing what _current_
information is. Lots of posts out there are perfectly
correct at the time they were written, but haven't been
updated.

Best,
Erick

On Fri, Apr 29, 2016 at 6:16 AM, GW  wrote:
> I realise the world-wrap thing, but it is correct ~ they are coordinates
> taken from Google Maps. It does not really matter though. I switched the
> query to use geofilt and everything is fine.
>
> Here's the kicker.
>
> There is a post somewhere online that says you cannot use geofilt with
> multivalued location_RPT. I lost months because I did not try it.
>
> If I use geofilt with the coordinates in question (the last in the multivalue)
> with a distance of 1km I get a perfect result. In fact I can get a perfect
> single direct hit on any of the values with geofilt + distance +
> multivalued.
>
>
> People that don't know what they are talking about should not post.
>
> Many thanks for your response.
>
> GW
>
>
> On 29 April 2016 at 00:40, David Smiley  wrote:
>
>> Hi.
>> This makes sense to me.  The point 49.8,-97.1 is in your query box.  The
>> box is lower-left to upper-right, so your box is actually an almost
>> world-wrapping one grabbing all longitudes except  -93 to -92.  Maybe you
>> mean to switch your left & right.
>>
>> On Sun, Apr 24, 2016 at 8:03 PM GW  wrote:
>>
>> > I was not getting the results I expected so I started testing with the
>> solr
>> > webclient
>> >
>> > Maybe I don't understand things.
>> >
>> > simple test query
>> >
>> > q=*:*&fq=locations:[49,-92 TO 50,-93]
>> >
>> > I don't understand why I get a result set for longitude range -92 to -93
>> > but should be zero results as far as I understand.
>> >
>> >
>> >
>> 
>> >
>> > {
>> >   "responseHeader": {
>> > "status": 0,
>> > "QTime": 2,
>> > "params": {
>> >   "q": "*:*",
>> >   "indent": "true",
>> >   "fq": "locations:[49,-92 TO 50,-93]",
>> >   "wt": "json",
>> >   "_": "1461541195102"
>> > }
>> >   },
>> >   "response": {
>> > "numFound": 85,
>> > "start": 0,
>> > "docs": [
>> >   {
>> > "id": "data.spidersilk.co!337",
>> > "entity_id": "337",
>> > "type_id": "simple",
>> > "gender": "Male",
>> > "name": "Aviator Sunglasses",
>> > "short_description": "A timeless accessory staple, the
>> > unmistakable teardrop lenses of our Aviator sunglasses appeal to
>> > everyone from suits to rock stars to citizens of the world.",
>> > "description": "Gunmetal frame with crystal gradient
>> > polycarbonate lenses in grey. ",
>> > "size": "",
>> > "color": "",
>> > "zdomain": "magento.spidersilk.co",
>> > "zurl":
>> > "
>> >
>> http://magento.spidersilk.co/index.php/catalog/product/view/id/337/s/aviator-sunglasses/
>> > ",
>> > "main_image_url":
>> > "
>> >
>> http://magento.spidersilk.co/media/catalog/product/cache/0/image/9df78eab33525d08d6e5fb8d27136e95/a/c/ace000a_1.jpg
>> > ",
>> > "keywords": "Eyewear  ",
>> > "data_size": "851,564",
>> > "category": "Eyewear",
>> > "final_price_without_tax": "295,USD",
>> > "image_url": [
>> >   "
>> > http://magento.spidersilk.co/media/catalog/product/a/c/ace000a_1.jpg";,
>> >   "
>> > http://magento.spidersilk.co/media/catalog/product/a/c/ace000b_1.jpg";
>> > ],
>> > "locations": [
>> >   "37.4463603,-122.1591775",
>> >   "42.5857514,-82.8873787",
>> >   "41.6942622,-86.2697108",
>> >   "49.8522263,-97.1390697"
>> > ],
>> > "_version_": 1532418847465799700
>> >   },
>> >
>> >
>> >
>> > Thanks,
>> >
>> > GW
>> >
>> --
>> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
>> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
>> http://www.solrenterprisesearchserver.com
>>
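
For reference, a minimal SolrJ sketch of the geofilt approach GW settled
on. The field name and point are taken from the example document above;
the core URL is hypothetical, and the sketch assumes "locations" is a
multivalued location_rpt field:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GeofiltExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");
        SolrQuery q = new SolrQuery("*:*");
        // {!geofilt} matches documents with at least one indexed point
        // within d kilometers of pt; this works with multivalued fields
        q.addFilterQuery("{!geofilt sfield=locations pt=49.8522263,-97.1390697 d=1}");
        QueryResponse rsp = client.query(q);
        System.out.println(rsp.getResults().getNumFound());
        client.close();
    }
}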


Re: Idle timeout expired: 50000/50000 ms

2016-04-29 Thread Shawn Heisey
On 4/28/2016 3:13 PM, Robert Brown wrote:
> I operate several collections (about 7-8) all using the same 5-node
> ZooKeeper cluster.  They've been in production for 3 months, with only
> 2 previous issues where a Solr node went down.
>
> Tonight, during several updates to the various collections, a handful
> failed due to the below error.
>
> Could this be related to ZooKeeper in any way?  If so, what could I
> check to ensure everything is running smoothly?
>
> The collections are a mix of 1 and 2 shards, all with 1 replica.
>
> Updates are performed in batches of 1000 in JSON files.
>
> Are there any other things I could/should be checking?
>
>
> $VAR1 = {
>  'error' => {
>   'code' => 500,
>   'msg' => 'java.util.concurrent.TimeoutException:
> Idle timeout expired: 50000/50000 ms', 

This idle timeout is configured in Jetty.  The default setting in the
jetty config provided with Solr 5.x is 50 seconds.

If your update requests are taking too long for the Jetty idle timeout,
then I think you're having a general performance problem with Solr. 
Increasing the timeout might help in the short term, but unless you fix
the underlying performance issue, you'd probably just run into the new
timeout at some point in the future.

Most severe performance problems like this are memory related, and are
solved by adding more memory.  Sometimes that is Java heap memory,
sometimes that is memory that is not allocated to a program.  Sometimes
both are required.

Thanks,
Shawn



Schema API

2016-04-29 Thread Hendrik Haddorp
Hi,

I have a Solr Cloud 6 setup with a managed schema. It seems that when I
create multiple collections from the same config set, they still
share the same schema. That was rather unexpected, as in the REST and
SolrJ API I do specify a collection when doing the schema change.
Looking into what is stored in ZooKeeper, however, I only see a config
name stored for my collections, so I guess this is by design. Or am I
missing something? Do I really need to upload a new config set when I
want to create a collection with unique fields? If so, I would need to
make sure to delete it once I delete my collection. That seems a bit odd
and complicated to me.

SolrJ is also behaving strangely. When I try to add multiple fields using
a MultiUpdate request where two fields already exist and one is new, I
get no error back (getStatus() == 0), while the response object contains
error messages for the fields that already existed, and the new field does
not get added.

regards,
Hendrik
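
For reference, a minimal SolrJ sketch of the MultiUpdate behaviour Hendrik
describes. The ZK host, field, and collection names are hypothetical, and
the sketch assumes the Solr 6.x SchemaRequest API; the point is that
getStatus() == 0 does not guarantee every individual update succeeded, so
the response body has to be inspected:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;
import org.apache.solr.client.solrj.response.schema.SchemaResponse;

public class MultiUpdateCheck {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("localhost:2181");
        Map<String, Object> attrs = new HashMap<>();
        attrs.put("name", "myNewField"); // hypothetical field
        attrs.put("type", "string");
        List<SchemaRequest.Update> updates = new ArrayList<>();
        updates.add(new SchemaRequest.AddField(attrs));
        SchemaResponse.UpdateResponse rsp =
            new SchemaRequest.MultiUpdate(updates).process(client, "myCollection");
        // getStatus() can be 0 even when individual updates failed,
        // so inspect the raw response for per-update error entries
        System.out.println(rsp.getStatus() + " " + rsp.getResponse());
        client.close();
    }
}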


Re: Decide on facets from results

2016-04-29 Thread Mark Robinson
Thanks for the suggestion, Joel.
I will check on it.

Thanks!
Mark.

On Fri, Apr 29, 2016 at 11:56 AM, Joel Bernstein  wrote:

> Check out the new docs for the gatherNodes streaming expression. It allows
> you to aggregate and then use those aggregates as input for another
> expression. You can even do this across collections.
>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62693238
>
> This is slated for Solr 6.1
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Apr 29, 2016 at 10:38 AM, Mark Robinson 
> wrote:
>
> > Thanks much everyone!
> > Appreciate your responses.
> >
> > Best,
> > Mark
> >
> > On Thu, Apr 28, 2016 at 10:52 AM, Jay Potharaju 
> > wrote:
> >
> > > Along the same lines as Erik suggested, but using facet stats instead: you
> > > can
> > > get stats on your facet fields in the first pass and then include the
> > > facets that you need in the second pass.
> > >
> > >
> > > > On Apr 27, 2016, at 1:21 PM, Mark Robinson 
> > > wrote:
> > > >
> > > > Thanks Eric!
> > > > So that will mean another call will definitely be required to Solr
> > > > with the
> > > > facets, before the results can be sent back (with the facet fields
> > > > being
> > > > derived by traversing the response).
> > > >
> > > > I was basically checking on whether in the "process" method (I
> > > > believe
> > > > results will be accessed in the process method), we can dynamically
> > > > generate facets after traversing through the results and identifying
> > > > the
> > > > fields for faceting, using some aggregation function or so, without
> > > > having
> > > > to make another call using facet=on&facet.field=, before
> > > > the
> > > > response is sent back to the user.
> > > >
> > > > Cheers!
> > > >
> > > > On Wed, Apr 27, 2016 at 2:27 PM, Erik Hatcher <
> erik.hatc...@gmail.com>
> > > > wrote:
> > > >
> > > >> Results will vary based on how you indexed those fields, but sure…
> > > >> &facet=on&facet.field= - with sufficient RAM, lots of fun
> > > >> to be had!
> > > >>
> > > >> —
> > > >> Erik Hatcher, Senior Solutions Architect
> > > >> http://www.lucidworks.com 
> > > >>
> > > >>
> > > >>
> > >  On Apr 27, 2016, at 12:13 PM, Mark Robinson <
> > mark123lea...@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>> If I don't have my facet list at query time, from the results can I
> > > >> select
> > > >>> some fields and by any means create a facet on them? ie after I get
> > the
> > > >>> results I want to identify some fields as facets and send back
> facets
> > > for
> > > >>> them in the response.
> > > >>>
> > > >>> A kind of very dynamic faceting based on the results!
> > > >>>
> > > >>> Could someone please share their ideas.
> > > >>>
> > > >>> Thanks!
> > > >>> Anil.
> > > >>
> > > >>
> > >
> >
>
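
For reference, a minimal SolrJ sketch of the two-pass approach Jay
suggests: pull field stats in a first query, decide which fields are
worth faceting, then facet on them in a second query. The core URL and
field name are hypothetical:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TwoPassFacets {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");
        // pass 1: rows=0, just collect field stats
        SolrQuery first = new SolrQuery("*:*");
        first.setRows(0);
        first.setGetFieldStatistics("price"); // stats=true&stats.field=price
        QueryResponse r1 = client.query(first);
        // inspect r1.getFieldStatsInfo() to decide which fields are worth faceting
        // pass 2: facet on the chosen fields
        SolrQuery second = new SolrQuery("*:*");
        second.setFacet(true);
        second.addFacetField("price");
        QueryResponse r2 = client.query(second);
        System.out.println(r2.getFacetField("price").getValues());
        client.close();
    }
}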


Re: Idle timeout expired: 50000/50000 ms

2016-04-29 Thread Robert Brown

Thanks Shawn,

I'm definitely not looking to just up the timeout; like you say, 
there's a bigger issue to be resolved.


My indexes range from 1m up to 60m docs (30m per shard, ~70GB on 
disk each).


All of these collections get completely refreshed at least once a day; 
the data may not actually be changing, but the JSON files are currently 
re-uploaded regardless.


I've used a 3GB heap for all of the nodes, perhaps that needs upping a bit.

Strange that I've not seen this issue before though.

Any good advice/guidance for analysing and tweaking GC?




On 29/04/16 18:52, Shawn Heisey wrote:

On 4/28/2016 3:13 PM, Robert Brown wrote:

I operate several collections (about 7-8) all using the same 5-node
ZooKeeper cluster.  They've been in production for 3 months, with only
2 previous issues where a Solr node went down.

Tonight, during several updates to the various collections, a handful
failed due to the below error.

Could this be related to ZooKeeper in any way?  If so, what could I
check to ensure everything is running smoothly?

The collections are a mix of 1 and 2 shards, all with 1 replica.

Updates are performed in batches of 1000 in JSON files.

Are there any other things I could/should be checking?


$VAR1 = {
  'error' => {
   'code' => 500,
   'msg' => 'java.util.concurrent.TimeoutException:
Idle timeout expired: 50000/50000 ms',

This idle timeout is configured in Jetty.  The default setting in the
jetty config provided with Solr 5.x is 50 seconds.

If your update requests are taking too long for the Jetty idle timeout,
then I think you're having a general performance problem with Solr.
Increasing the timeout might help in the short term, but unless you fix
the underlying performance issue, you'd probably just run into the new
timeout at some point in the future.

Most severe performance problems like this are memory related, and are
solved by adding more memory.  Sometimes that is Java heap memory,
sometimes that is memory that is not allocated to a program.  Sometimes
both are required.

Thanks,
Shawn





Re: deactivate coord scoring factor in pf2 pf3

2016-04-29 Thread Doug Turnbull
I was wrong, Elisabeth. I thought you could disable coord at query time in
Solr; it turns out you can't (I was thinking of Lucene's BooleanQuery
disableCoord param).

https://issues.apache.org/jira/browse/SOLR-3931

I definitely know you can disable coord with a custom Similarity and just
return 1.0 for the coordination factor.
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/4.4.0/org/apache/lucene/search/similarities/TFIDFSimilarity.java#TFIDFSimilarity.coord%28int%2Cint%29
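
A minimal sketch of such a Similarity, assuming Lucene/Solr 4.x's
DefaultSimilarity; the class name is hypothetical, and it would be wired
in via a <similarity> element in schema.xml:

import org.apache.lucene.search.similarities.DefaultSimilarity;

public class NoCoordSimilarity extends DefaultSimilarity {
    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1.0f; // neutralize the coordination factor
    }
}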

I was also trying to read the tea leaves of edismax's source to try to
figure out when coord is disabled and when it's not (and what the underlying
justification is). Sadly I don't see an explicit rhyme or reason to its
coord behavior and what you're seeing. Do you have the full query you're
sending?
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java

-Doug

On Thu, Apr 28, 2016 at 2:05 PM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Glad to see you're using http://splainer.io! I recognize those explains!
> (I created the thing, so let me know if you have any
> ideas/thoughts/questions/criticisms).
>
> Some thoughts
> - You might consider using ps2 or ps3 to add a slop to the two-word and
> three-word phrase searches. Slop adds a less strict positional tolerance.
> This would help get RER paired with Saint in your other document, and
> effectively eliminate the coord penalty, though at a lower score (1 / position
> difference, IIRC)
> - Have you tried sending "disableCoord" to Solr? I usually leave coord on,
> as I consider it useful to bias towards more matches. But that option
> exists.
> - Using pf2 and pf3 together means that three-word phrase matches will get
> counted twice: once as a three-word phrase match, and again as multiple
> two-word phrase matches. I usually just stick with pf2.
>
> Best!
>
> -Doug
>
> On Thu, Apr 28, 2016 at 11:32 AM elisabeth benoit <
> elisaelisael...@gmail.com> wrote:
>
>> Hello all,
>>
>> I am using Solr 4.10.1. I use edismax, with pf2 to boost documents
>> that start
>> with the query. I use a start-with token (b), automatically added at index
>> time, and added to the request at query time.
>>
>> I have a problem at this point.
>>
>> request is *q=b saint denis rer*
>>
>> the start with field is name_sw
>>
>> first document *name_sw: Saint-Denis-Université*
>> second document *name_sw: RER Saint-Denis*
>>
>> So one will have the pf2 starts with boost and not the other. The problem
>> is that it has an effect on the scoring of pf2 for all other words.
>>
>> In other words, my problem is that the proximity between "saint" and "denis"
>> is not scored the same for those two documents.
>>
>> From what I can tell, this is because of the coord scoring factor used for pf2.
>>
>> In explain output, for first document
>>
>> 0.52612317 Matches Punished by 0.667 (not all query terms matched)
>>   0.78918475 Sum of the following:
>>     0.39459237 names_sw:"b saint"^0.21
>>     0.39459237 Dismax (take winner of below)
>>       0.39459237 names_sw:"saint denis"^0.21
>>       0.37580228 catchall:"saint den"^0.2
>>
>>
>> *So here, matches punished by 0.66*, which corresponds to coord(2/3)
>>
>> and final score pf2 for proximity between saint and denis
>>
>> 0.263061593153079 names_sw:"saint denis"^0.21
>>
>>
>> In explain output, for second document
>>
>>
>> 0.13153079 Matches Punished by 0.3334 (not all query terms matched)
>>   0.39459237 Dismax (take winner of below)
>>     0.39459237 names_sw:"saint denis"^0.21
>>     0.37580228 catchall:"saint den"^0.2
>>
>>
>> *So here matches punished by 0.33*, which corresponds to coord(1/3)
>>
>> and final score pf2 for proximity between saint and denis
>>
>> 0.1315307926306158 names_sw:"saint denis"^0.21
>>
>>
>> I would like to deactivate coord for pf2 pf3. Does anyone know how I
>> could do this?
>>
>>
>> Best regards,
>>
>> Elisabeth
>>
>
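
For reference, a minimal SolrJ sketch of the ps2 suggestion from earlier
in the thread (adding slop to the pf2 bigram boosts). The core URL and qf
value are illustrative; the pf2 boosts are taken from the explain output
above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class Pf2WithSlop {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");
        SolrQuery q = new SolrQuery("b saint denis rer");
        q.set("defType", "edismax");
        q.set("qf", "names_sw catchall");           // illustrative
        q.set("pf2", "names_sw^0.21 catchall^0.2"); // boosts from the explain output
        q.set("ps2", "1"); // allow one position of slop in the bigram phrase boosts
        QueryResponse rsp = client.query(q);
        System.out.println(rsp.getResults().getNumFound());
        client.close();
    }
}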


Error - Too many close [count:-1]

2016-04-29 Thread Vipul Gupta
Solr team - any pointers on fixing this issue?

[10:29:08] ERROR 0-thread-7 o.a.s.c.SolrCore <> Too many close [count:-1] on
org.apache.solr.core.SolrCore@3d6f8ad3. Please report this exception to
solr-user@lucene.apache.org
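
For anyone hitting this while embedding Solr: SolrCore is
reference-counted, and "Too many close" generally means close() was
called more times than the core was acquired, though that is not
confirmed for this particular case. A minimal fragment of the expected
usage, assuming an ambient CoreContainer (the core name is hypothetical):

SolrCore core = coreContainer.getCore("myCore"); // increments the core's reference count
try {
    // ... use the core ...
} finally {
    core.close(); // decrements the count; call exactly once per getCore()
}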


Re: Missed update on replica

2016-04-29 Thread Mike Wartes
I should add that this is on Solr 5.1.0.

On Thu, Apr 28, 2016 at 2:42 PM, Mike Wartes  wrote:

> I have a three node, one shard SolrCloud cluster.
>
> Last week one of the nodes went out of sync with the other two and I'm
> trying to understand why that happened.
>
> After poking through my logs and the solr code here's what I've pieced
> together:
>
> 1. Leader gets an update request for a batch delete of 306 items. It sends
> this update along to Replica A and Replica B.
> 2. On Replica A all is well. It receives the update request and logs that
> 306 documents were deleted.
> 3. Replica B also receives the update request but at some point during the
> request something kills the connection. Leader logs a "connection reset"
> socket error. Replica B doesn't have any errors but it does log that it
> only deleted 95 documents as a result of the update call.
> 4. Because of the socket error, Leader starts leader-initiated-recovery
> for Replica B. It sets Replica B to the "down" state in ZK.
> 5. Replica B gets the leader-initiated-recovery request, updates its ZK
> state to "recovering", and starts the PeerSync process.
> 6. Replica B's PeerSync reports that it has gotten "100 versions" from the
> leader but then declares that "Our versions are newer" and finishes
> successfully.
> 7. Replica B puts itself back in the active state, but it is now out of
> sync with the Leader and Replica A. It is left with 211 documents in it
> that should have been deleted.
>
> I am curious if anyone has any thoughts on why Replica B failed to detect
> that it was behind the leader in this scenario.
>
> I'm not really clear on how the update version numbers are assigned, but
> is it possible that the 95 documents that did make it to Replica B had a
> later version number than the 211 that didn't? I don't have a perfect
> understanding of the PeerSync code, but looking through it, in particular at
> the logic that prints the "Our versions are newer" message, it seems like
> if 95 of the 100 documents fetched from the leader during PeerSync did
> match what the replica already has, it might declare itself up-to-date
> without looking at the last few.
>