Re: better stemming engine than Porter?

2008-04-22 Thread Jay

Hi Wagner,

Thanks for the intro to KStem! I quickly scanned the original paper on 
KStem by Robert Krovetz but could not find any timing comparison between
KStem and the Porter stemmer. How slow/fast is KStem compared to 
Porter, based on your use of it in your application?


Jay

Wagner,Harry wrote:

Mathieu,
It's not my KStem. It was written by someone at UMass Amherst. More info here: 
http://ciir.cs.umass.edu/cgi-bin/downloads/downloads.cgi 


Someone else had already ported it to Lucene. I simply modified that wrapper to 
work with Solr. I'll open an issue for it so that it can (hopefully) be 
integrated into the project.

Cheers... harry

-Original Message-
From: Mathieu Lecarme [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, April 22, 2008 3:57 AM

To: solr-user@lucene.apache.org
Subject: Re: better stemming engine than Porter?

The Porter stemmer is not only aggressive, it is ugly, too. The generated 
code is old, barely object-oriented, and probably slow as well.
If your KStem compiles with Java 1.4, why don't you suggest it to Lucene 
core?


M.

Wagner, Harry wrote:

Hi HH,
Here's a note I sent Solr-dev a while back:

---
I've implemented a Solr plug-in that wraps KStem for Solr use (someone
else had already written a Lucene wrapper for it).  KStem is considered
to be more appropriate for library usage since it is much less
aggressive than Porter (i.e., searches for organization do NOT match on
organ!). If there is any interest in feeding this back into Solr I would
be happy to contribute it.
---

I believe there was interest in it, but I never opened an issue for it
and I don't know if it was ever followed-up on. I'd be happy to do that
now. Can someone on the Solr-dev team point me in the right direction
for opening an issue?

Thanks... harry
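
For later reference: recent Lucene/Solr releases ship a KStem filter factory, so 
a minimal sketch of an analysis chain using it might look like the following (the 
field type name is illustrative, not from the original plugin):

  <fieldType name="text_kstem" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- KStem: much less aggressive than Porter (organization will not match organ) -->
      <filter class="solr.KStemFilterFactory"/>
    </analyzer>
  </fieldType>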


-Original Message-
From: Hung Huynh [mailto:[EMAIL PROTECTED] 
Sent: Monday, April 21, 2008 11:59 AM

To: solr-user@lucene.apache.org
Subject: better stemming engine than Porter?

I recall reading somewhere in one of the mailing-list archives that
someone had developed a better stemming algorithm for Solr than the built-in
Porter stemmer. Does anyone have a link to that stemming module? 


Thanks,

HH 



facet by update date

2016-01-24 Thread Jay Potharaju
Hi,
I am trying to facet on the update_date field of my documents, and would
like to get the following bucket values:
-  < 24 Hrs
-  < 3 days
- < 1 week
- < 1 month
- < 6 months
- <1 year



The above facet values should change every time someone queries; therefore,
a document that was updated today will be in the "< 24 Hrs" facet,
and when the same query runs 2 weeks from today, the document will be
marked as "< 1 month".

How can I set my facets to get the above values?

-- 
Thanks
Jay


Re: facet by update date

2016-01-24 Thread Jay Potharaju
Thanks Pavel,
I was trying it using range faceting instead of facet.interval. Can
someone comment on the performance of facet.interval with a sharded index
and a high number of documents?
Thanks
J

On Sun, Jan 24, 2016 at 1:09 PM, Pavel Polívka 
wrote:

> Hi,
> We are doing this via interval facet:
>
> Something like this:
> facet=on&
> facet.interval=update_date&
> facet.interval.set=[NOW-1DAY,NOW]&
> facet.interval.set=[NOW-3DAY,NOW-1DAY)&
> facet.interval.set=[NOW-7DAY,NOW-3DAY)&
> facet.interval.set=[NOW-1MONTH,NOW-7DAY)&
> facet.interval.set=[NOW-1YEAR,NOW-1MONTH)
>
> I do not know if this is a correct way of doing this, but I did not find
> anything better.
> Here is link for interval faceting in wiki:
>
> https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-IntervalFaceting
>
> Hope this helps.
>
> Pavel
>
>
> On Sun, 24 Jan 2016 at 17:20, Jay Potharaju 
> wrote:
>
> > Hi,
> > I am trying to calculate facet for update_date of the document. And would
> > like to get the following values
> > -  < 24 Hrs
> > -  < 3 days
> > - < 1 week
> > - < 1 month
> > - < 6 months
> > - <1 year
> >
> >  > required="true" multiValued="false" docValues="true"/>
> >
> > The above facet values should change every time someone queries,
> therefore
> > a document that was updated today will show will be in the facet  "24
> hrs"
> > and when the same query runs 2 weeks from today, the document will be
> > marked as "< 1 month".
> >
> > How can I  set my facets to get the above values?
> >
> > --
> > Thanks
> > Jay
> >
>



-- 
Thanks
Jay Potharaju


Re: facet by update date

2016-01-24 Thread Jay Potharaju
Which is a better option facet.interval or facet.query in terms of
performance?
Thanks


On Sun, Jan 24, 2016 at 5:04 PM, Erik Hatcher 
wrote:

> I suggest facet.query is the way to go for a handful of buckets/ranges.
>
> I'm mobile so apologies for not providing some examples but something like
> a few of these kinds of things:
>
>facet.query={!lucene key=under_24_hours}update_date:[NOW-24HOURS TO NOW]
>
> Things get interesting if you want < 3 days to not include under 24 hour
> ones too, but just some considered Lucene query clauses will do the trick.
>
>    Erik
>
> > On Jan 24, 2016, at 18:38, Jay Potharaju  wrote:
> >
> > Thanks Pavel,
> > I was trying it using the range faceting instead of facet.interval. Can
> > someone comment on performance of using facet.interval with  sharded
> index
> > and high number of documents.
> > Thanks
> > J
> >
> > On Sun, Jan 24, 2016 at 1:09 PM, Pavel Polívka 
> > wrote:
> >
> >> Hi,
> >> We are doing this via interval facet:
> >>
> >> Something like this:
> >> facet=on&
> >> facet.interval=update_date&
> >> facet.interval.set=[NOW-1DAY,NOW]&
> >> facet.interval.set=[NOW-3DAY,NOW-1DAY)&
> >> facet.interval.set=[NOW-7DAY,NOW-3DAY)&
> >> facet.interval.set=[NOW-1MONTH,NOW-7DAY)&
> >> facet.interval.set=[NOW-1YEAR,NOW-1MONTH)
> >>
> >> I do not know if this is a correct way of doing this, but I did not find
> >> anything better.
> >> Here is link for interval faceting in wiki:
> >>
> >>
> https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-IntervalFaceting
> >>
> >> Hope this helps.
> >>
> >> Pavel
> >>
> >>
> >> On Sun, 24 Jan 2016 at 17:20, Jay Potharaju 
> >> wrote:
> >>
> >>> Hi,
> >>> I am trying to calculate facet for update_date of the document. And
> would
> >>> like to get the following values
> >>> -  < 24 Hrs
> >>> -  < 3 days
> >>> - < 1 week
> >>> - < 1 month
> >>> - < 6 months
> >>> - <1 year
> >>>
> >>>  >>> required="true" multiValued="false" docValues="true"/>
> >>>
> >>> The above facet values should change every time someone queries,
> >> therefore
> >>> a document that was updated today will show will be in the facet  "24
> >> hrs"
> >>> and when the same query runs 2 weeks from today, the document will be
> >>> marked as "< 1 month".
> >>>
> >>> How can I  set my facets to get the above values?
> >>>
> >>> --
> >>> Thanks
> >>> Jay
> >
> >
> >
> > --
> > Thanks
> > Jay Potharaju
>



-- 
Thanks
Jay Potharaju
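
Spelling out Erik's suggestion, a full set of non-overlapping buckets might look 
like the sketch below; the keys are assumptions, and the mixed [inclusive TO 
exclusive} bounds keep each document in exactly one bucket:

  facet=true&
  facet.query={!lucene key=under_24_hours}update_date:[NOW-24HOURS TO NOW]&
  facet.query={!lucene key=under_3_days}update_date:[NOW-3DAYS TO NOW-24HOURS}&
  facet.query={!lucene key=under_1_week}update_date:[NOW-7DAYS TO NOW-3DAYS}&
  facet.query={!lucene key=under_1_month}update_date:[NOW-1MONTH TO NOW-7DAYS}&
  facet.query={!lucene key=under_6_months}update_date:[NOW-6MONTHS TO NOW-1MONTH}&
  facet.query={!lucene key=under_1_year}update_date:[NOW-1YEAR TO NOW-6MONTHS}

Since NOW is evaluated at query time, the buckets shift automatically as 
documents age, which is exactly the behavior asked for above.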


Handling fields used for both display & index

2016-01-31 Thread Jay Potharaju
Hi,
I am trying to decide whether I should use text_en or string as my field type.
The fields have to be both indexed and stored for display. One solution is
to duplicate fields: one for indexing, the other for display. One of the fields
happens to be a description field, which I would like to avoid duplicating.
Solr should return results when someone searches for John or john. Is
storing a copy of the field the best way to go about this problem?


Thanks


Solr-8496

2016-01-31 Thread Jay Potharaju
Hi
I am just starting off on a project (using Solr 5.4) which uses multi-select
faceting as one of its key features. Should I wait till Solr 5.5 is
released and resolves the issue outlined in SOLR-8496? Is there a
recommended version I should be using?

Is there an ETA on when 5.5 will be released?
Thanks


On Wed, Jan 27, 2016 at 8:28 AM, Shawn Heisey  wrote:

> On 1/27/2016 8:59 AM, David Smith wrote:
> > So we definitely don’t have CP yet — our very first network outage
> resulted in multiple overlapped lost updates.  As a result, I can’t pick
> one replica and make it the new “master”.  I must rebuild this collection
> from scratch, which I can do, but that requires downtime which is a problem
> in our app (24/7 High Availability with few maintenance windows).
>
> I don't have anything to add regarding the initial problem or how you
> can prevent it from happening again.  You're already on the latest minor
> Solr version (5.4), though I think you could probably upgrade to 5.4.1.
> The list of bugfixes for 5.4.1 does not seem to include anything that
> could explain the problem you have described.  There is SOLR-8496, but
> that only affects multi-select faceting, not the numDoc counts on the
> core summary screen.
>
> For the specific issue mentioned above (rebuilding without downtime),
> what I would recommend that you do is build an entirely new collection,
> then delete the old one and create an alias to the new collection using
> the old collection's name.  Deleting the old collection and making the
> alias should happen very quickly.
>
> I don't think any documentation states this, but it seems like a good
> idea to me use an alias from day one, so that you always have the option
> of swapping the "real" collection that you are using without needing to
> change anything else.  I'll need to ask some people if they think this
> is a good documentation addition, and think of a good place to mention
> it in the reference guide.
>
> Thanks,
> Shawn
>
>


-- 
Thanks
Jay Potharaju
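
The alias swap Shawn describes maps onto three Collections API calls, roughly as 
below (collection names and parameters are illustrative):

  # build the replacement collection and reindex into it
  curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=myindex_v2&numShards=2&replicationFactor=2&collection.configName=myconf'
  # once reindexing is verified, retire the old one
  curl 'http://localhost:8983/solr/admin/collections?action=DELETE&name=myindex_v1'
  # point the stable name at the new collection
  curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=myindex&collections=myindex_v2'

Clients keep querying "myindex" throughout and never see the swap.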


Re: Handling fields used for both display & index

2016-02-03 Thread Jay Potharaju
Thanks for the response Sameer & Binoy.
Jay

On Sun, Jan 31, 2016 at 6:13 PM, Binoy Dalal  wrote:

> Adding to sameer's answer, use string types when you want exact matches,
> both in terms of query and case.
> In case you want to perform additional operations on the input, like
> tokenization and applying filters, then you're better off with one of the
> other text types.
> You should take a look at the field type definition in the schema file to
> see if a predefined type fits your need, else create a custom type based on
> your requirements.
>
> On Mon, 1 Feb 2016, 07:36 Sameer Maggon  wrote:
>
> > Hi Jay,
> >
> > You could use one field for both unless there is a specific requirement
> you
> > are looking for that is not being met by that one field (e.g. faceting,
> > etc). Typically, if you have a field that is marked as both "indexed" and
> > "stored", the value that is passed while indexing to that field is stored
> > as is. However, it's indexed based on the field type that you've
> specified
> > for that field.
> >
> > e.g. a description field with the field type of "text_en" would be
> indexed
> > per the pipeline in the text_en fieldtype and the text as is will be
> stored
> > (which is what is returned in your response in the results).
> >
> > Thanks,
> > --
> > *Sameer Maggon*
> > Measured Search | Solr-as-a-Service | Solr Monitoring | Search Analytics
> > www.measuredsearch.com
> >
> > On Sun, Jan 31, 2016 at 5:56 PM, Jay Potharaju 
> > wrote:
> >
> > > Hi,
> > > I am trying to decide if I should use text_en or string as my field
> type.
> > > The fields have to be both indexed and stored for display. One solution
> > is
> > > to duplicate fields, one for indexing other for display.One of the
> field
> > > happens to be a description field which I would like to avoid
> > duplicating.
> > > Solr should return results when someone searches for John or john.Is
> > > storing a copy of the field the best way to go about this problem?
> > >
> > >
> > > Thanks
> > >
> >
> --
> Regards,
> Binoy Dalal
>



-- 
Thanks
Jay Potharaju
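
Putting the two answers together as a schema sketch (field names are 
assumptions): a single stored text_en field handles both search and display, and 
an index-only string twin is needed only if exact, case-sensitive matching is 
also required:

  <field name="description" type="text_en" indexed="true" stored="true"/>
  <!-- optional: exact-match twin; stored="false" avoids duplicating the large stored value -->
  <field name="description_exact" type="string" indexed="true" stored="false"/>
  <copyField source="description" dest="description_exact"/>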


Custom field using PatternCaptureGroupFilterFactory

2016-03-06 Thread Jay Potharaju
Hi,
I have a custom field for getting the first letter of a first name. For
this I am using PatternCaptureGroupFilterFactory.
It is not working as expected: it is not able to parse the data and get the
first character of the string. Any suggestions on how to fix this?

[fieldType definition stripped by the list archive; per the analysis output
below, the chain was KeywordTokenizerFactory -> UpperCaseFilterFactory ->
PatternCaptureGroupFilterFactory with pattern "^[a-zA-Z0-9]{0,1}" and
preserve_original="false"]



-- 
Thanks
Jay


Re: Custom field using PatternCaptureGroupFilterFactory

2016-03-06 Thread Jay Potharaju
On the analysis screen I see the following. Not sure why the regex didn't
work. Any suggestions?
Thanks

KT (KeywordTokenizer)
  text: test | raw_bytes: [74 65 73 74] | start: 0 | end: 4 | positionLength: 1 | type: word | position: 1

UCF (UpperCaseFilter)
  text: TEST | raw_bytes: [54 45 53 54] | start: 0 | end: 4 | positionLength: 1 | type: word | position: 1

PCGTF (PatternCaptureGroupTokenFilter)
  text: TEST | raw_bytes: [54 45 53 54] | start: 0 | end: 4 | positionLength: 1 | type: word | position: 1

On Sun, Mar 6, 2016 at 9:56 AM, Binoy Dalal  wrote:

> What do you see under the analysis screen in the solr admin UI?
>
> On Sun, Mar 6, 2016 at 10:55 PM Jay Potharaju 
> wrote:
>
> > Hi,
> > I have a custom field for getting the first letter of an firstname. For
> > this I am using PatternCaptureGroupFilterFactory.
> > This is not working as expected, not able to parse the data and get the
> > first character for the string. Any suggestions on how to fix this?
> >
> >  
> >
> >   
> >
> > 
> >
> > 
> >
> >  > "^[a-zA-Z0-9]{0,1}" preserve_original="false"/>
> >
> >
> >
> > 
> >
> > --
> > Thanks
> > Jay
> >
> --
> Regards,
> Binoy Dalal
>



-- 
Thanks
Jay Potharaju


Re: Custom field using PatternCaptureGroupFilterFactory

2016-03-07 Thread Jay Potharaju
Thanks Jack, the problem was my regex; the corrected regex worked.

Jay

On Sun, Mar 6, 2016 at 7:43 PM, Jack Krupansky 
wrote:

> The filter name, "Capture Group", says it all - only pattern groups are
> captured and you have not specified even a single group. See the example:
>
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/pattern/PatternCaptureGroupFilterFactory.html
>
> Groups are each enclosed within parentheses, as shown in the Javadoc
> example above.
>
> Since no groups were found, the filter doc applied this rule:
> "If none of the patterns match, or if preserveOriginal is true, the
> original token will be preserved."
>
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.html
>
> That should probably also say "or if no pattern groups match".
>
> To test regular expressions, try an interactive online tool, such as:
> https://regex101.com/
>
> -- Jack Krupansky
>
> On Sun, Mar 6, 2016 at 7:51 PM, Alexandre Rafalovitch 
> wrote:
>
> > I don't see the brackets that mark the group you actually want to
> > capture. As per:
> >
> >
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.html
> >
> > I am also not sure if you actually need "{0,1}" part.
> >
> > Regards,
> >Alex.
> > 
> > Newsletter and resources for Solr beginners and intermediates:
> > http://www.solr-start.com/
> >
> >
> > On 7 March 2016 at 04:25, Jay Potharaju  wrote:
> > > Hi,
> > > I have a custom field for getting the first letter of an firstname. For
> > > this I am using PatternCaptureGroupFilterFactory.
> > > This is not working as expected, not able to parse the data and get the
> > > first character for the string. Any suggestions on how to fix this?
> > >
> > >  
> > >
> > >   
> > >
> > > 
> > >
> > > 
> > >
> > >  > > "^[a-zA-Z0-9]{0,1}" preserve_original="false"/>
> > >
> > >
> > >
> > > 
> > >
> > > --
> > > Thanks
> > > Jay
> >
>



-- 
Thanks
Jay Potharaju
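
For the record, Jack's fix amounts to wrapping the character class in 
parentheses so there is a group to capture. Reconstructed from the analysis 
output earlier in the thread (the field type name is an assumption), the working 
chain would look roughly like:

  <fieldType name="firstLetter" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.UpperCaseFilterFactory"/>
      <!-- parentheses create capture group 1: the first character -->
      <filter class="solr.PatternCaptureGroupFilterFactory"
              pattern="^([a-zA-Z0-9])" preserve_original="false"/>
    </analyzer>
  </fieldType>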


JSON FACET API - multiselect

2016-03-09 Thread Jay Potharaju
Hi,
I am using Solr 5.4 and testing the multi-select JSON facet feature.
When I select 1 value, the number of results matches the count for that
facet. But when I select more than 1 facet value, the number of results
returned is not correct.

*Single Facet selected*
fq: [ "{!tag=FIRSTLETTER}facet_firstLetter_lastname:(Q)" ]
json.facet.name: "{type:terms,field:facet_firstLetter_lastname, sort:{count:desc}}"

response: {
  numFound: 540,
  start: 0,
  docs: [ ]
},
facets: {
  count: 5246,
  name: {
    buckets: [
      { val: "Q", count: 540 },
      { val: "X", count: 1302 },
      { val: "J", count: 4718 },
      { val: "Z", count: 7242 },
      { val: "V", count: 9089 },
      { val: "F", count: 10053 },
      { val: "P", count: 14966 },
      { val: "Y", count: 18520 },
      { val: "W", count: 20781 },
      { val: "G", count: 21935 }
    ]
  }
}

*Multi-select facet*
fq: [ "{!tag=FIRSTLETTER}facet_firstLetter_lastname:(Q J)" ]

response: {
  numFound: 5246,
  start: 0,
  docs: [ ]
}

I was expecting the response count to be 540 + 4718 = 5258, but the response
is 5246.

Can someone comment on this?

-- 
Thanks
Jay


Re: JSON FACET API - multiselect

2016-03-09 Thread Jay Potharaju
Actually there was a problem with my data; I found my error.
Thanks


On Wed, Mar 9, 2016 at 9:24 AM, Jay Potharaju  wrote:

> Hi,
> I am using solr 5.4 and testing the multi select JSON facet feature.
> When I select 1 value the results are the same as number of counts for the
> facet. But when I select more than 1 facet the number of results returned
> are not correct.
>
> *Single Facet selected*
> fq: [
> "{!tag=FIRSTLETTER}facet_firstLetter_lastname:(Q)"
> ],
> json.facet.name: "{type:terms,field:facet_firstLetter_lastname,
> sort:{count:desc}}"
> response: {
> numFound: 540,
> start: 0,
> docs: [ ]
> },
> facets: {
> count: 5246,
> name: {
> buckets: [
> {
> val: "Q",
> count: 540
> },
> {
> val: "X",
> count: 1302
> },
> {
> val: "J",
> count: 4718
> },
> {
> val: "Z",
> count: 7242
> },
> {
> val: "V",
> count: 9089
> },
> {
> val: "F",
> count: 10053
> },
> {
> val: "P",
> count: 14966
> },
> {
> val: "Y",
> count: 18520
> },
> {
> val: "W",
> count: 20781
> },
> {
> val: "G",
> count: 21935
> }
> ]
> }
> }
>
> *Multi-select facet*
> fq: [
> "{!tag=FIRSTLETTER}facet_firstLetter_lastname:(Q J)"
> ],
>
> response: {
> numFound: 5246,
> start: 0,
> docs: [ ]
> }
>
> I was expecting the response count to be 540 + 4718 = 5258 but the
> response is 5246.
>
> Can someone comment on regarding this?
>
> --
> Thanks
> Jay
>
>



-- 
Thanks
Jay Potharaju
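
For readers hitting the same wall: multi-select with the JSON Facet API normally 
also excludes the tagged filter from the facet's domain, along these lines (a 
sketch mirroring the request above; support for excludeTags in JSON facets 
varies by version, which is what SOLR-8496 is about):

  fq={!tag=FIRSTLETTER}facet_firstLetter_lastname:(Q J)
  json.facet={
    name: {
      type: terms,
      field: facet_firstLetter_lastname,
      sort: { count: desc },
      domain: { excludeTags: FIRSTLETTER }
    }
  }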


solr & docker in production

2016-03-14 Thread Jay Potharaju
Hi,
I was wondering about running Solr inside a Docker container. Are there any
recommendations for this?


-- 
Thanks
Jay


Re: solr & docker in production

2016-03-14 Thread Jay Potharaju
Upayavira,
Thanks for the feedback. I plan to deploy Solr on its own instance rather
than on an instance running multiple applications.

Jay

On Mon, Mar 14, 2016 at 3:19 PM, Upayavira  wrote:

> There is a default Docker image for Solr on the Docker Registry. I've
> used it to great effect in creating a custom Solr install.
>
> The main thing I'd say is that Docker generally encourages you to run
> many apps on the same host, whereas Solr benefits hugely from a host of
> its own - so don't be misled into installing Solr alongside lots of
> other things.
>
> Even if the only thing that gets put onto a node is a Docker install,
> then a Solr Docker image, it is *still* way easier to do than anything
> else I've tried and still very worth it.
>
> Upayavira (who doesn't, yet, have Dockerised Solr in production, but
> will soon)
>
> On Mon, 14 Mar 2016, at 07:53 PM, Jay Potharaju wrote:
> > Hi,
> > I was wondering is running solr inside a  docker container. Are there any
> > recommendations for this?
> >
> >
> > --
> > Thanks
> > Jay
>



-- 
Thanks
Jay Potharaju
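
With the official image Upayavira mentions, a minimal single-host trial can be 
as short as the sketch below (container and core names are illustrative):

  docker pull solr
  docker run -d --name my_solr -p 8983:8983 solr
  # create a core inside the running container
  docker exec -it my_solr bin/solr create_core -c mycore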


Re: solr & docker in production

2016-03-15 Thread Jay Potharaju
I have not tried it in production yet; I will post my findings.
Thanks
Jay

> On Mar 14, 2016, at 11:42 PM, Georg Sorst  wrote:
> 
> Hi,
> 
> sounds great!
> 
> Did you run any benchmarks? What's the IO penalty?
> 
> Best,
> Georg
> 
> Jay Potharaju  wrote on Tue, 15 Mar 2016 at 04:25:
> 
>> Upayavira,
>> Thanks for the feedback.  I plan to deploy solr on its own instance rather
>> than on instance running multiple applications.
>> 
>> Jay
>> 
>>> On Mon, Mar 14, 2016 at 3:19 PM, Upayavira  wrote:
>>> 
>>> There is a default Docker image for Solr on the Docker Registry. I've
>>> used it to great effect in creating a custom Solr install.
>>> 
>>> The main thing I'd say is that Docker generally encourages you to run
>>> many apps on the same host, whereas Solr benefits hugely from a host of
>>> its own - so don't be misled into installing Solr alongside lots of
>>> other things.
>>> 
>>> Even if the only thing that gets put onto a node is a Docker install,
>>> then a Solr Docker image, it is *still* way easier to do than anything
>>> else I've tried and still very worth it.
>>> 
>>> Upayavira (who doesn't, yet, have Dockerised Solr in production, but
>>> will soon)
>>> 
>>>> On Mon, 14 Mar 2016, at 07:53 PM, Jay Potharaju wrote:
>>>> Hi,
>>>> I was wondering is running solr inside a  docker container. Are there
>> any
>>>> recommendations for this?
>>>> 
>>>> 
>>>> --
>>>> Thanks
>>>> Jay
>> 
>> 
>> 
>> --
>> Thanks
>> Jay Potharaju
> -- 
> *Georg M. Sorst I CTO*
> FINDOLOGIC GmbH
> 
> 
> 
> Jakob-Haringer-Str. 5a | 5020 Salzburg I T.: +43 662 456708
> E.: g.so...@findologic.com
> www.findologic.com


Re: Making managed schema unmutable correctly?

2016-03-18 Thread Jay Potharaju
Does using the Schema API mean no upconfig to ZooKeeper and no reloading
of all the nodes in my SolrCloud? In which scenario should I not use the Schema
API, if any?
Thanks
Jay

On Wed, Mar 16, 2016 at 6:22 PM, Shawn Heisey  wrote:

> On 3/16/2016 1:14 AM, Alexandre Rafalovitch wrote:
> > So, I am looking at the Solr 5.5 examples with their all-in by-default
> > managed schemas. And I am scratching my head on the workflow users are
> > expected to follow.
> >
> > One example is straight from documentation:
> > "With the above configuration, you can use the Schema API to modify
> > the schema as much as you want, and then later change the value of
> > mutable to false if you wish to "lock" the schema in place and prevent
> > future changes."
> >
> > Which sounds great, except right above the definition in the
> > solrconfig.xml, it says:
> > "Do NOT hand edit the managed schema - external modifications will be
> > ignored and overwritten as a result of schema modification REST API
> > calls."
> >
> > And the Config API does not seem to provide any API to switch that
> > property (schemaFactory.mutable is not an editable property).
> >
> > So, how is one supposed to lock the schema after modifying it? In the
> > default, non-cloud, example!
> >
> > So far, the nearest I get is to unload the core (losing
> > core.properties), manually modify solrconfig.xml in violation of
> > instructions and add the core back. What am I missing?
>
> Note that you *can* hand edit the managed-schema file.  It's *strongly*
> not recommended if you actually plan to use the Schema API, because any
> changes you make manually will be lost if you subsequently use the
> Schema API before restart/reload ... but if you always hand edit the
> file, or you are *very* careful to make sure that the core/collection
> has been reloaded before using the Schema API, then that won't matter.
> This is difficult to explain in a concise config comment though, so
> hand-editing is simply discouraged.
>
> Thanks,
> Shawn
>
>


-- 
Thanks
Jay Potharaju


Re: Making managed schema unmutable correctly?

2016-03-20 Thread Jay Potharaju
Thanks  appreciate the feedback.

On Wed, Mar 16, 2016 at 8:23 PM, Shawn Heisey  wrote:

> On 3/16/2016 7:51 PM, Jay Potharaju wrote:
> > Does using schema API mean that no upconfig to zookeeper and no reloading
> > of all the nodes in my solrcloud? In which scenario should I not use
> schema
> > API, if any?
>
> The documentation says that a reload occurs automatically after the
> schema modification.  You will not need to upconfig and reload.
>
> https://cwiki.apache.org/confluence/display/solr/Schema+API
>
> I can't really tell you when you should or shouldn't use the API.
> That's something you'll have to decide.  If the API will do everything
> you need with regard to schema changes, then you could use it
> exclusively.  Or you could never use it, and the only thing that would
> change is the name of the file that you upload -- managed-schema instead
> of schema.xml.
>
> You can also reconfigure Solr to use the classic schema instead of the
> managed schema, and rename managed-schema to schema.xml.
>
> Thanks,
> Shawn
>
>


-- 
Thanks
Jay Potharaju
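
As a concrete illustration of the API Shawn describes (collection and field 
names are assumptions), adding a field is a single POST, and Solr reloads the 
collection for you:

  curl -X POST -H 'Content-type:application/json' --data-binary '{
    "add-field": { "name": "genre", "type": "string", "stored": true }
  }' 'http://localhost:8983/solr/mycollection/schema'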


Indexing using CSV

2016-03-20 Thread Jay Potharaju
Hi,
I am trying to index some data using CSV files. The data contains a
description column, which can include quotes, commas, LF/CR, and other special
characters.

I have it working but ran into an issue with the following error:

line=5,can't read line: 5 values={NO LINES AVAILABLE}.

What is the best way to debug this issue, and secondly, how do other people
handle indexing CSV data?

-- 
Thanks
Jay
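
One common cause of that error is a mismatch between the file's quoting and what 
the CSV loader expects. The update handler lets you state the rules explicitly, 
e.g. (file and collection names are assumptions; an escape parameter also exists 
as an alternative to encapsulation):

  curl 'http://localhost:8983/solr/mycollection/update?commit=true&separator=%2C&encapsulator=%22' \
    --data-binary @data.csv -H 'Content-type:text/csv; charset=utf-8'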


understanding phonetic matching

2016-03-22 Thread Jay Potharaju
Hi,
I am trying to do name matching using the phonetic filter factory. As part
of that I was analyzing the data using the analysis screen in the Solr UI. If I
search for john, any documents containing john or jon should be found.

Following is my definition of the custom field that I use for indexing the
data. When I look at my Solr data I don't see any similar-sounding names,
even though I have set inject="true". Is that not how it is supposed to work?
Can someone explain how phonetic matching works?

[fieldType definition stripped by the list archive]
-- 
Thanks
Jay
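
One detail worth stating: analysis only rewrites the *indexed* terms; the stored 
values returned in results are always the original text, so injected phonetic 
codes never appear in the documents you browse. Matching works because john and 
jon reduce to the same code at both index and query time. A sketch of such a 
type (the name is an assumption):

  <fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- inject="true" keeps the original token alongside the phonetic code -->
      <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
    </analyzer>
  </fieldType>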


Indexing multiple pdf's and partial update of pdf

2016-03-23 Thread Jay Parashar
Hi,

I have a couple of questions regarding indexing files (say PDF).

1)  Is there any way to index more than one file to one document with a 
unique id?

One way I think is to do a “extractOnly” of all the documents and then index 
that extract separately. Is there an easier way?

2)  If my Solr document has existing fields populated and then I index a 
pdf, it seems it overwrites the document with the end result being just the 
contents of the pdf. I know we can do partial updates using SolrJ but is it 
possible to do partial updates of pdf using curl?


Thanks
Jay


RE: Indexing multiple pdf's and partial update of pdf

2016-03-24 Thread Jay Parashar


Thanks Reth,

Yes, I am using Apache Tika and went by the instructions given in
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

Here I see we can index a pdf "solr-word.pdf" to a document with unique key = 
"doc1" as below:

curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true' -F "myfile=@example/exampledocs/solr-word.pdf"

My requirement is to index another separate pdf to this document with key = 
doc1. Basically I need the contents of both pdfs to be searchable and related 
to the id=doc1.

What comes to my mind is to perform an 'extractOnly' as below on both pdfs and 
then index the concatenation of the contents. Is there another less invasive 
way?

curl "http://localhost:8983/solr/techproducts/update/extract?&extractOnly=true" --data-binary @example/exampledocs/sample.html -H 'Content-type:text/html'

Thanks
Jay



-Original Message-
From: Reth RM [mailto:reth.ik...@gmail.com]
Sent: Thursday, March 24, 2016 12:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing multiple pdf's and partial update of pdf



Are you using the Apache Tika parser to parse PDF files?

1) Solr supports parent-child block join, using which you can index more than 
one file's data within a document object (if that is what you are looking for): 
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers

2) If the unique key of the document that exists in the index is equal to the 
new document that you are reindexing, it will be overwritten. If you'd like to 
do partial updates via curl, here are some examples: 
http://yonik.com/solr/atomic-updates/











On Thu, Mar 24, 2016 at 3:43 AM, Jay Parashar  wrote:

> Hi,
>
> I have a couple of questions regarding indexing files (say PDF).
>
> 1)  Is there any way to index more than one file to one document with
> a unique id?
>
> One way I think is to do a “extractOnly” of all the documents and then
> index that extract separately. Is there an easier way?
>
> 2)  If my Solr document has existing fields populated and then I index
> a pdf, it seems it overwrites the document with the end result being
> just the contents of the pdf. I know we can do partial updates using
> SolrJ but is it possible to do partial updates of pdf using curl?
>
> Thanks
> Jay
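
The atomic-update route Reth links to would let a second PDF's text be appended 
without clobbering the rest of the document, roughly as below; it assumes the 
content field is stored (a requirement for atomic updates) and multiValued, and 
that the text was already obtained via extractOnly:

  curl -X POST 'http://localhost:8983/solr/techproducts/update?commit=true' \
    -H 'Content-type:application/json' --data-binary '[
    { "id": "doc1",
      "content": { "add": "text extracted from the second PDF ..." } }
  ]'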


number of zookeeper & aws instances

2016-04-13 Thread Jay Potharaju
Hi,

In my current setup I have about 30 million docs which will grow to 100
million by the end of the year. In order to accommodate scaling and query
load, i am planning to have atleast 2 shards and 2/3 replicas to begin
with. With the above solrcloud setup I plan to have 3 zookeepers in the
quorum.

If the number of replicas and shards increases, the number of solr
instances will also go up. With keeping that in mind I was wondering if
there are any guidelines on the number of zk instances to solr instances.

Secondly are there any recommendations for setting up solr in AWS?

-- 
Thanks
Jay


Re: number of zookeeper & aws instances

2016-04-13 Thread Jay Potharaju
Thanks for the feedback Eric.
I am assuming the number of replicas help in load balancing and reliability. 
That being said are there any recommendation for that, or is it dependent on 
query load and performance sla's.

Any suggestions on aws setup?
Thanks


> On Apr 13, 2016, at 7:12 AM, Erick Erickson  wrote:
> 
> For collections with this few nodes, 3 zookeepers are plenty. From
> what I've seen people don't go to 5 zookeepers until they have
> hundreds and hundreds of nodes.
> 
> 100M docs can fit on 2 shards, I've actually seen many more. That
> said, if the docs are very large and/or the searchers are complex
> performance may not be what you need. Here's a long blog on
> testing a configuration to destruction to be _sure_ you can scale
> as you need:
> 
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> 
> Best,
> Erick
> 
>> On Wed, Apr 13, 2016 at 6:47 AM, Jay Potharaju  wrote:
>> Hi,
>> 
>> In my current setup I have about 30 million docs which will grow to 100
>> million by the end of the year. In order to accommodate scaling and query
>> load, i am planning to have atleast 2 shards and 2/3 replicas to begin
>> with. With the above solrcloud setup I plan to have 3 zookeepers in the
>> quorum.
>> 
>> If the number of replicas and shards increases, the number of solr
>> instances will also go up. With keeping that in mind I was wondering if
>> there are any guidelines on the number of zk instances to solr instances.
>> 
>> Secondly are there any recommendations for setting up solr in AWS?
>> 
>> --
>> Thanks
>> Jay


RE: Multiple data-config.xml in one collection?

2016-04-14 Thread Jay Parashar
You have to specify which one to run. Each DIH will run only one XML (e.g. 
health-topics-conf.xml)

One thing, and please correct me if wrong: I have noticed that running a 
DataImport for a particular config overwrites the existing data for a 
document... that is, there is no way to preserve the existing data.
For example, if you have a schema of 5 fields, running the 
health-topics-conf.xml DIH loads 3 of those fields of a document (id=XYZ),
and then running the encyclopedia-conf.xml DIH will overwrite those 3 fields 
for the same document id = XYZ.

-Original Message-
From: Yangrui Guo [mailto:guoyang...@gmail.com] 
Sent: Tuesday, April 05, 2016 2:16 PM
To: solr-user@lucene.apache.org
Subject: Re: Multiple data-config.xml in one collection?

Hi Daniel,

So if I implement multiple DataImportHandlers and do a full import, does Solr 
perform the import with all handlers at once, or can I just specify which 
handler to import with? Thank you

Yangrui

On Tuesday, April 5, 2016, Davis, Daniel (NIH/NLM) [C] 
wrote:

> If Shawn is correct, and you are using DIH, then I have done this by 
> implementing multiple requestHandlers each of them using Data Import 
> Handler, and have each specify a different XML file for the data config.
> Instead of using data-config.xml, I've used a large number of files such as:
> health-topics-conf.xml
> encyclopedia-conf.xml
> ...
> I tend to index a single valued, required field named "source" that I 
> can use in the delete query, and I use the TemplateTranformer to make this 
> easy:
>
>  ...
>transformer="TemplateTransformer">
>
>...
>
> Hope this helps,
>
> -Dan
>
> -Original Message-
> From: Shawn Heisey [mailto:apa...@elyograg.org ]
> Sent: Tuesday, April 05, 2016 10:50 AM
> To: solr-user@lucene.apache.org 
> Subject: Re: Multiple data-config.xml in one collection?
>
> On 4/5/2016 8:12 AM, Yangrui Guo wrote:
> > I'm using Solr Cloud to index a number of databases. The problem is 
> > there is unknown number of databases and each database has its own
> configuration.
> > If I create a single collection for every database the query would 
> > eventually become insanely long. Is it possible to upload different 
> > config to zookeeper for each node in a single collection?
>
> Every shard replica (core) in a collection shares the same 
> configuration, which it gets from zookeeper.  This is one of 
> SolrCloud's guarantees, to prevent problems found with old-style 
> sharding when the configuration is different on each machine.
>
> If you're using the dataimport handler, which you probably are since 
> you mentioned databases, you can parameterize pretty much everything 
> in the DIH config file so it comes from URL parameters on the 
> full-import or delta-import command.
>
> Below is a link to the DIH config that I'm using, redacted slightly.
> I'm not running SolrCloud, but the same thing should work in cloud.  
> It should give you some idea of how to use variables in your config, 
> set by parameters on the URL.
>
> http://apaste.info/jtq
>
> Thanks,
> Shawn
>
>


RE: Multiple data-config.xml in one collection?

2016-04-14 Thread Jay Parashar
Thanks a lot Daniel.


-Original Message-
From: Davis, Daniel (NIH/NLM) [C] [mailto:daniel.da...@nih.gov] 
Sent: Thursday, April 14, 2016 11:41 AM
To: solr-user@lucene.apache.org
Subject: RE: Multiple data-config.xml in one collection?

Jay Parashar wrote:
> One thing, and please correct if wrong, I have noticed running 
> DataImport for a particular config overwrites the existing data  for a 
> document...that is, there is no way to preserve the existing data.
> 
> For example if you have a schema of 5 fields and running the 
> health-topics-conf.xml DIH  loads 3 of those fields of a document 
> (id=XYZ) And then running the encyclopedia-conf.xml DIH will overwrite those 
> 3 fields for the same  document id = XYZ.

Not quite so. You're right that each RequestHandler has a *default* data config, 
specified in solrconfig.xml. As with most things in Solr, this can be overridden. 
But it is still a good best practice. You are right that if one DataImport 
imports the same ID as another, it will overwrite the older copy completely. 
However, you can control the overlap so that indexing is independent even into 
the same collection.

Suppose you have two configured request handlers:

/dataimport/healthtopics - this uses health-topics-conf.xml
/dataimport/encyclopedia - this uses encyclopedia-conf.xml

These two files can load *completely separate records* with different ids, and 
they can 
have different delete queries configured.   An excerpt from my 
health-topics-conf.xml:

[config excerpt stripped by the list archive]
Hope this helps,

Dan Davis, Systems/Applications Architect (Contractor), Office of Computer and 
Communications Systems, National Library of Medicine, NIH
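
A typical pair of handler definitions in solrconfig.xml would look roughly like 
this (Dan's actual excerpt was stripped by the archive, so the details below are 
assumptions):

  <requestHandler name="/dataimport/healthtopics"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">health-topics-conf.xml</str>
    </lst>
  </requestHandler>
  <requestHandler name="/dataimport/encyclopedia"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">encyclopedia-conf.xml</str>
    </lst>
  </requestHandler>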



-Original Message-
From: Jay Parashar [mailto:bparas...@slb.com]
Sent: Thursday, April 14, 2016 11:43 AM
To: solr-user@lucene.apache.org
Subject: RE: Multiple data-config.xml in one collection?

You have to specify which one to run. Each DIH will run only one XML (e.g. 
health-topics-conf.xml)


-Original Message-
From: Yangrui Guo [mailto:guoyang...@gmail.com]
Sent: Tuesday, April 05, 2016 2:16 PM
To: solr-user@lucene.apache.org
Subject: Re: Multiple data-config.xml in one collection?

Hi Daniel,

So if I implement multiple dataimporthandler and do a full import, does Solr 
perform import of all handlers at once or can just specify which handler to 
import? Thank you

Yangrui

On Tuesday, April 5, 2016, Davis, Daniel (NIH/NLM) [C] 
wrote:

> If Shawn is correct, and you are using DIH, then I have done this by 
> implementing multiple requestHandlers each of them using Data Import 
> Handler, and have each specify a different XML file for the data config.
> Instead of using data-config.xml, I've used a large number of files such as:
> health-topics-conf.xml
> encyclopedia-conf.xml
> ...
> I tend to index a single valued, required field named "source" that I 
> can use in the delete query, and I use the TemplateTranformer to make this 
> easy:
>
>  ...
>transformer="TemplateTransformer">
>
>...
>
> Hope this helps,
>
> -Dan
>
> -Original Message-
> From: Shawn Heisey [mailto:apa...@elyograg.org ]
> Sent: Tuesday, April 05, 2016 10:50 AM
> To: solr-user@lucene.apache.org 
> Subject: Re: Multiple data-config.xml in one collection?
>
> On 4/5/2016 8:12 AM, Yangrui Guo wrote:
> > I'm using Solr Cloud to index a number of databases. The problem is 
> > there is unknown number of databases and each database has its own
> configuration.
> > If I create a single collection for every database the query would 
> > eventually become insanely long. Is it possible to upload different 
> > config to zookeeper for each node in a single collection?
>
> Every shard replica (core) in a collection shares the same 
> configuration, which it gets from zookeeper.  This is one of 
> SolrCloud's guarantees, to prevent problems found with old-style 
> sharding when the configuration is different on each machine.
>
> If you're using the dataimport handler, which you probably are since 
> you mentioned databases, you can parameterize pretty much everything 
> in the DIH config file so it comes from URL parameters on the 
> full-import or delta-import command.
>
> Below is a link to the DIH config that I'm using, redacted slightly.
> I'm not running SolrCloud, but the same thing should work in cloud.  
> It should give you some idea of how to use variables in your config, 
> set by parameters on the URL.
>
> http://apaste.info/jtq
>
> Thanks,
> Shawn
>
>


RE: Solr Support for BM25F

2016-04-14 Thread Jay Parashar
To use per-field similarity you have to add a <similarity 
class="solr.SchemaSimilarityFactory"/> element to your schema.xml file.
And then in individual field types you can use BM25 with different k1 and b.

-Original Message-
From: David Cawley [mailto:david.cawl...@mail.dcu.ie] 
Sent: Thursday, April 14, 2016 11:42 AM
To: solr-user@lucene.apache.org
Subject: Solr Support for BM25F

Hello,
I am developing an enterprise search engine for a project and I was hoping to 
implement BM25F ranking algorithm to configure the tuning parameters on a per 
field basis. I understand BM25 similarity is now supported in Solr but I was 
hoping to be able to configure k1 and b for different fields such as title, 
description, anchor etc, as they are structured documents.
I am fairly new to Solr so any help would be appreciated. If this is possible 
or any steps as to how I can go about implementing this it would be greatly 
appreciated.

Regards,

David

Current Solr Version 5.4.1
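
Concretely, per-field-type BM25 tuning looks roughly like this in schema.xml 
(the type name and parameter values are illustrative):

  <!-- global similarity must delegate to the per-field-type settings -->
  <similarity class="solr.SchemaSimilarityFactory"/>

  <fieldType name="text_title" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <similarity class="solr.BM25SimilarityFactory">
      <float name="k1">1.2</float>
      <float name="b">0.75</float>
    </similarity>
  </fieldType>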


Adding replica on solr - 5.50

2016-04-14 Thread Jay Potharaju
Hi,
I am using Solr 5.5 and testing adding a new replica when a Solr instance
comes up. When I run the following command I get an error. I have 1 replica
and am trying to add another replica.

http://x.x.x.x:8984/solr/admin/collections?action=ADDREPLICA&collection=test2&shard=shard1&node=x.x.x.x:9001_solr

Error:
> org.apache.solr.common.SolrException: org.apache.solr.common.SolrException:
> At least one of the node(s) specified are not currently active, no action
> taken. (status 400)


But when I create a new collection with 2 replicas, it works fine.
As a side note, my clusterstate.json is not updating correctly. Not sure if
that is causing an issue.

Any suggestions why the ADDREPLICA command is not working? And is it
related to the clusterstate.json? If yes, how can I fix it?

-- 
Thanks
Jay


Re: Adding replica on solr - 5.50

2016-04-14 Thread Jay Potharaju
Curious what command did you use?

On Thu, Apr 14, 2016 at 3:48 PM, John Bickerstaff 
wrote:

> I had a hard time getting replicas made via the API, once I had created the
> collection for the first time although that may have been ignorance on
> my part.
>
> I was able to get it done fairly easily on the Linux command line.  If
> that's an option and you're interested, let me know - I have a rough but
> accurate document. But perhaps others on the list will have the specific
> answer you're looking for.
>
> On Thu, Apr 14, 2016 at 4:19 PM, Jay Potharaju 
> wrote:
>
> > Hi,
> > I am using solr 5.5 and testing adding a new replica when a solr instance
> > comes up. When I run the following command I get an error. I have 1
> replica
> > and trying to add another replica.
> >
> >
> >
> http://x.x.x.x:8984/solr/admin/collections?action=ADDREPLICA&collection=test2&shard=shard1&node=x.x.x.x:9001_solr
> >
> > Error:
> > > org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
> > > At least one of the node(s) specified are not currently active, no
> action
> > > taken.
> > > 
> > > At least one of the node(s) specified are not currently
> > > active, no action taken.
> > > 400
> > > 
> > > 
> > > 
> > > org.apache.solr.common.SolrException
> > > org.apache.solr.common.SolrException
> > > 
> > > At least one of the node(s) specified are not currently
> > > active, no action taken.
> > > 400
> > > 
> > > 
> >
> >
> > But when i create a new collection with 2 replicas it works fine.
> > As a side note my clusterstate.json is not updating correctly. Not sure
> if
> > that is causing an issue.
> >
> >  Any suggestions why the Addreplica command is not working. And is it
> > related to the clusterstate.json? If yes, how can i fix it?
> >
> > --
> > Thanks
> > Jay
> >
>



-- 
Thanks
Jay Potharaju


Re: Adding replica on solr - 5.50

2016-04-14 Thread Jay Potharaju
Thanks John, which version of solr are you using?

On Thu, Apr 14, 2016 at 3:59 PM, John Bickerstaff 
wrote:

> su - solr -c "/opt/solr/bin/solr create -c statdx -d /home/john/conf
> -shards 1 -replicationFactor 2"
>
> However, this won't work by itself.  There is some preparation
> necessary...  I'll send you the doc.
>
> On Thu, Apr 14, 2016 at 4:55 PM, Jay Potharaju 
> wrote:
>
> > Curious what command did you use?
> >
> > On Thu, Apr 14, 2016 at 3:48 PM, John Bickerstaff <
> > j...@johnbickerstaff.com>
> > wrote:
> >
> > > I had a hard time getting replicas made via the API, once I had created
> > the
> > > collection for the first time although that may have been ignorance
> > on
> > > my part.
> > >
> > > I was able to get it done fairly easily on the Linux command line.  If
> > > that's an option and you're interested, let me know - I have a rough
> but
> > > accurate document. But perhaps others on the list will have the
> specific
> > > answer you're looking for.
> > >
> > > On Thu, Apr 14, 2016 at 4:19 PM, Jay Potharaju 
> > > wrote:
> > >
> > > > Hi,
> > > > I am using solr 5.5 and testing adding a new replica when a solr
> > instance
> > > > comes up. When I run the following command I get an error. I have 1
> > > replica
> > > > and trying to add another replica.
> > > >
> > > >
> > > >
> > >
> >
> http://x.x.x.x:8984/solr/admin/collections?action=ADDREPLICA&collection=test2&shard=shard1&node=x.x.x.x:9001_solr
> > > >
> > > > Error:
> > > > > org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
> > > > > At least one of the node(s) specified are not currently active, no
> > > action
> > > > > taken.
> > > > > 
> > > > > At least one of the node(s) specified are not
> > currently
> > > > > active, no action taken.
> > > > > 400
> > > > > 
> > > > > 
> > > > > 
> > > > > org.apache.solr.common.SolrException
> > > > >  > name="root-error-class">org.apache.solr.common.SolrException
> > > > > 
> > > > > At least one of the node(s) specified are not
> > currently
> > > > > active, no action taken.
> > > > > 400
> > > > > 
> > > > > 
> > > >
> > > >
> > > > But when i create a new collection with 2 replicas it works fine.
> > > > As a side note my clusterstate.json is not updating correctly. Not
> sure
> > > if
> > > > that is causing an issue.
> > > >
> > > >  Any suggestions why the Addreplica command is not working. And is it
> > > > related to the clusterstate.json? If yes, how can i fix it?
> > > >
> > > > --
> > > > Thanks
> > > > Jay
> > > >
> > >
> >
> >
> >
> > --
> > Thanks
> > Jay Potharaju
> >
>



-- 
Thanks
Jay Potharaju


Re: Adding replica on solr - 5.50

2016-04-14 Thread Jay Potharaju
Jeff, I couldn't agree more with you. I think the reason it is not working is a 
screwed-up clusterstate.json; not sure how to fix it. I have already restarted 
my ZK servers. Any more suggestions regarding this?

> On Apr 14, 2016, at 5:21 PM, Jeff Wartes  wrote:
> 
> I’m all for finding another way to make something work, but I feel like this 
> is the wrong advice. 
> 
> There are two options:
> 1) You are doing something wrong. In which case, you should probably invest 
> in figuring out what.
> 2) Solr is doing something wrong. In which case, you should probably invest 
> in figuring out what, and then file a bug so it doesn’t happen to anyone else.
> 
> Adding a replica is a pretty basic operation, so whichever option is the 
> case, I feel like you’ll just encounter other problems down the road if you 
> don’t figure out what’s going on.
> 
> I’d probably start by creating the single-replica collection, and then 
> inspecting the live_nodes list in Zookeeper to confirm that the (live) node 
> list is actually what you think it is.
> 
> 
> 
> 
> 
>> On 4/14/16, 4:04 PM, "John Bickerstaff"  wrote:
>> 
>> 5.4
>> 
>> This problem drove me insane for about a month...
>> 
>> I'll send you the doc.
>> 
>> On Thu, Apr 14, 2016 at 5:02 PM, Jay Potharaju 
>> wrote:
>> 
>>> Thanks John, which version of solr are you using?
>>> 
>>> On Thu, Apr 14, 2016 at 3:59 PM, John Bickerstaff <
>>> j...@johnbickerstaff.com>
>>> wrote:
>>> 
>>>> su - solr -c "/opt/solr/bin/solr create -c statdx -d /home/john/conf
>>>> -shards 1 -replicationFactor 2"
>>>> 
>>>> However, this won't work by itself.  There is some preparation
>>>> necessary...  I'll send you the doc.
>>>> 
>>>> On Thu, Apr 14, 2016 at 4:55 PM, Jay Potharaju 
>>>> wrote:
>>>> 
>>>>> Curious what command did you use?
>>>>> 
>>>>> On Thu, Apr 14, 2016 at 3:48 PM, John Bickerstaff <
>>>>> j...@johnbickerstaff.com>
>>>>> wrote:
>>>>> 
>>>>>> I had a hard time getting replicas made via the API, once I had
>>> created
>>>>> the
>>>>>> collection for the first time although that may have been
>>> ignorance
>>>>> on
>>>>>> my part.
>>>>>> 
>>>>>> I was able to get it done fairly easily on the Linux command line.
>>> If
>>>>>> that's an option and you're interested, let me know - I have a rough
>>>> but
>>>>>> accurate document. But perhaps others on the list will have the
>>>> specific
>>>>>> answer you're looking for.
>>>>>> 
>>>>>> On Thu, Apr 14, 2016 at 4:19 PM, Jay Potharaju <
>>> jspothar...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> I am using solr 5.5 and testing adding a new replica when a solr
>>>>> instance
>>>>>>> comes up. When I run the following command I get an error. I have 1
>>>>>> replica
>>>>>>> and trying to add another replica.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> http://x.x.x.x:8984/solr/admin/collections?action=ADDREPLICA&collection=test2&shard=shard1&node=x.x.x.x:9001_solr
>>>>>>> 
>>>>>>> Error:
>>>>>>>> org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
>>>>>>>> At least one of the node(s) specified are not currently active,
>>> no
>>>>>> action
>>>>>>>> taken.
>>>>>>>> 
>>>>>>>> At least one of the node(s) specified are not
>>>>> currently
>>>>>>>> active, no action taken.
>>>>>>>> 400
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> >> name="error-class">org.apache.solr.common.SolrException
>>>>>>>> >>>> name="root-error-class">org.apache.solr.common.SolrException
>>>>>>>> 
>>>>>>>> At least one of the node(s) specified are not
>>>>> currently
>>>>>>>> active, no action taken.
>>>>>>>> 400
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> But when i create a new collection with 2 replicas it works fine.
>>>>>>> As a side note my clusterstate.json is not updating correctly. Not
>>>> sure
>>>>>> if
>>>>>>> that is causing an issue.
>>>>>>> 
>>>>>>> Any suggestions why the Addreplica command is not working. And is
>>> it
>>>>>>> related to the clusterstate.json? If yes, how can i fix it?
>>>>>>> 
>>>>>>> --
>>>>>>> Thanks
>>>>>>> Jay
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Thanks
>>>>> Jay Potharaju
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Thanks
>>> Jay Potharaju
>>> 


Re: Adding replica on solr - 5.50

2016-04-14 Thread Jay Potharaju
Thanks for the help John.

> On Apr 14, 2016, at 6:22 PM, John Bickerstaff  
> wrote:
> 
> Sure - couldn't agree more.
> 
> I couldn't find any good documentation on the Solr site about how to add a
> replica to a Solr cloud.  The Admin UI appears to require that the
> directories be created anyway.
> 
> There is probably a way to do it through the UI, once Solr is installed on
> a new machine - and IIRC, I did manage that, but my IT guy wanted
> scriptable command lines.
> 
> Also, IIRC, the stuff I did on the command line actually showed the API URL
> as part of the output so Jay could try that and see what the difference
> is...
> 
> Jay - I'm going offline now, but if you're still stuck tomorrow, I'll try
> to recreate... I have a VM snapshot just before I issued the command...
> 
> Keep in mind everything I did was in a Solr Cloud...
> 
>> On Thu, Apr 14, 2016 at 6:21 PM, Jeff Wartes  wrote:
>> 
>> I’m all for finding another way to make something work, but I feel like
>> this is the wrong advice.
>> 
>> There are two options:
>> 1) You are doing something wrong. In which case, you should probably
>> invest in figuring out what.
>> 2) Solr is doing something wrong. In which case, you should probably
>> invest in figuring out what, and then file a bug so it doesn’t happen to
>> anyone else.
>> 
>> Adding a replica is a pretty basic operation, so whichever option is the
>> case, I feel like you’ll just encounter other problems down the road if you
>> don’t figure out what’s going on.
>> 
>> I’d probably start by creating the single-replica collection, and then
>> inspecting the live_nodes list in Zookeeper to confirm that the (live) node
>> list is actually what you think it is.
>> 
>> 
>> 
>> 
>> 
>>> On 4/14/16, 4:04 PM, "John Bickerstaff"  wrote:
>>> 
>>> 5.4
>>> 
>>> This problem drove me insane for about a month...
>>> 
>>> I'll send you the doc.
>>> 
>>> On Thu, Apr 14, 2016 at 5:02 PM, Jay Potharaju 
>>> wrote:
>>> 
>>>> Thanks John, which version of solr are you using?
>>>> 
>>>> On Thu, Apr 14, 2016 at 3:59 PM, John Bickerstaff <
>>>> j...@johnbickerstaff.com>
>>>> wrote:
>>>> 
>>>>> su - solr -c "/opt/solr/bin/solr create -c statdx -d /home/john/conf
>>>>> -shards 1 -replicationFactor 2"
>>>>> 
>>>>> However, this won't work by itself.  There is some preparation
>>>>> necessary...  I'll send you the doc.
>>>>> 
>>>>> On Thu, Apr 14, 2016 at 4:55 PM, Jay Potharaju >> 
>>>>> wrote:
>>>>> 
>>>>>> Curious what command did you use?
>>>>>> 
>>>>>> On Thu, Apr 14, 2016 at 3:48 PM, John Bickerstaff <
>>>>>> j...@johnbickerstaff.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> I had a hard time getting replicas made via the API, once I had
>>>> created
>>>>>> the
>>>>>>> collection for the first time although that may have been
>>>> ignorance
>>>>>> on
>>>>>>> my part.
>>>>>>> 
>>>>>>> I was able to get it done fairly easily on the Linux command line.
>>>> If
>>>>>>> that's an option and you're interested, let me know - I have a
>> rough
>>>>> but
>>>>>>> accurate document. But perhaps others on the list will have the
>>>>> specific
>>>>>>> answer you're looking for.
>>>>>>> 
>>>>>>> On Thu, Apr 14, 2016 at 4:19 PM, Jay Potharaju <
>>>> jspothar...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> I am using solr 5.5 and testing adding a new replica when a solr
>>>>>> instance
>>>>>>>> comes up. When I run the following command I get an error. I
>> have 1
>>>>>>> replica
>>>>>>>> and trying to add another replica.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> http://x.x.x.x:8984/solr/admin/collections?action=ADDREPLICA&collection=test2&shard=shard1&node=x.x.x.x:9001_solr
>>>>>>>> 
>>>>>>>> Error:
>>>>>>>>> org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
>>>>>>>>> At least one of the node(s) specified are not currently
>> active,
>>>> no
>>>>>>> action
>>>>>>>>> taken.
>>>>>>>>> 
>>>>>>>>> At least one of the node(s) specified are not
>>>>>> currently
>>>>>>>>> active, no action taken.
>>>>>>>>> 400
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> >>> name="error-class">org.apache.solr.common.SolrException
>>>>>>>>> >>>>> name="root-error-class">org.apache.solr.common.SolrException
>>>>>>>>> 
>>>>>>>>> At least one of the node(s) specified are not
>>>>>> currently
>>>>>>>>> active, no action taken.
>>>>>>>>> 400
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> But when i create a new collection with 2 replicas it works
>> fine.
>>>>>>>> As a side note my clusterstate.json is not updating correctly.
>> Not
>>>>> sure
>>>>>>> if
>>>>>>>> that is causing an issue.
>>>>>>>> 
>>>>>>>> Any suggestions why the Addreplica command is not working. And
>> is
>>>> it
>>>>>>>> related to the clusterstate.json? If yes, how can i fix it?
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Thanks
>>>>>>>> Jay
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Thanks
>>>>>> Jay Potharaju
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Thanks
>>>> Jay Potharaju
>>>> 
>> 


Re: Adding replica on solr - 5.50

2016-04-15 Thread Jay Potharaju
I have multiple Solr instances running in my dev sandbox. When adding a
replica I was passing the host IP instead of 127.0.1.1, which is what is
recorded in the live_nodes section.
Thanks, Erick, for pointing that out.

Working URL:
http://x.x.x.x:9000/solr/admin/collections?action=ADDREPLICA&collection=test4&shard=shard2&node=127.0.1.1:9000_solr

Thanks
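
The underlying gotcha: ADDREPLICA's node parameter has to match a name in 
ZooKeeper's live_nodes exactly. An easy way to check before calling the API 
(the host is illustrative):

  curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS'

The live_nodes section of the response lists the names in the 
127.0.1.1:9000_solr form that the node parameter expects.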


On Fri, Apr 15, 2016 at 10:19 AM, John Bickerstaff  wrote:

> Oh, and what, if any directories need to exist for the ADDREPLICA command
> to work?
>
> Hopefully nothing past the already existing /var/solr/data created by the
> Solr install script?
>
> On Fri, Apr 15, 2016 at 11:18 AM, John Bickerstaff <
> j...@johnbickerstaff.com
> > wrote:
>
> > Oh, and what, if any directories need to exist for the ADDREPLICA
> >
> > On Fri, Apr 15, 2016 at 11:09 AM, John Bickerstaff <
> > j...@johnbickerstaff.com> wrote:
> >
> >> Thanks again Eric - I'm going to be trying the ADDREPLICA again today or
> >> Monday.  I much prefer that to hand-edit hackery...
> >>
> >> Thanks also for pointing out that cURL makes it "scriptable"...
> >>
> >> On Fri, Apr 15, 2016 at 10:50 AM, Erick Erickson <
> erickerick...@gmail.com
> >> > wrote:
> >>
> >>> bq: Shouldn't this: &node=x.x.x.x:9001_solr
> >>> <
> >>>
> http://x.x.x.x:8984/solr/admin/collections?action=ADDREPLICA&collection=test2&shard=shard1&node=x.x.x.x:9001_solr
> >>> >
> >>>
> >>> Actually be this?  &node=x.x.x.x:9001/solr
> >>> <
> >>>
> http://x.x.x.x:8984/solr/admin/collections?action=ADDREPLICA&collection=test2&shard=shard1&node=x.x.x.x:9001_solr
> >>> >
> >>>
> >>> (Note the / instead of _ )
> >>>
> >>> Good thing you added the note, 'cause I was having trouble seeing the
> >>> difference.
> >>>
> >>> No. The underscore is correct. The "node" in this case is the name
> >>> registered
> >>> in Zookeeper in the "live nodes" znode, _not_ a URL or whatever...
> >>>
> >>> As to your two methods of moving a shard around. Either one is fine,
> >>> although the first one (copying the directory and "doing the right
> thing"
> >>> to edit core.properties) is a little dicier in that you're doing hand
> >>> edits.
> >>>
> >>> Personally I prefer the ADDREPLICA solution. In fact I've moved
> replicas
> >>> around by ADDREPLICA, wait, DELETEREPLICA...
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Fri, Apr 15, 2016 at 3:10 AM, Jaroslaw Rozanski
> >>>  wrote:
> >>> > Hi,
> >>> >
> >>> > Does the `&name=...` actually work for you? When attempting similar
> >>> with
> >>> > Solr 5.3.1, despite what documentation said, I had to use
> >>> > `node_name=...`.
> >>> >
> >>> >
> >>> > Thanks,
> >>> > Jarek
> >>> >
> >>> > On Fri, 15 Apr 2016, at 05:48, John Bickerstaff wrote:
> >>> >> Another thought - again probably not it, but just in case...
> >>> >>
> >>> >> Shouldn't this: &node=x.x.x.x:9001_solr
> >>> >> <
> >>>
> http://x.x.x.x:8984/solr/admin/collections?action=ADDREPLICA&collection=test2&shard=shard1&node=x.x.x.x:9001_solr
> >>> >
> >>> >>
> >>> >> Actually be this?  &node=x.x.x.x:9001/solr
> >>> >> <
> >>>
> http://x.x.x.x:8984/solr/admin/collections?action=ADDREPLICA&collection=test2&shard=shard1&node=x.x.x.x:9001_solr
> >>> >
> >>> >>
> >>> >> (Note the / instead of _ )
> >>> >>
> >>> >> On Thu, Apr 14, 2016 at 10:45 PM, John Bickerstaff
> >>> >>  >>> >> > wrote:
> >>> >>
> >>> >> > Jay - it's probably too simple, but the error says "not currently
> >>> active"
> >>> >> > which could, of course, mean that although it's up and running,
> >>> it's not
> >>> >> > listening on the port you have in the command line...  Or that the
> >>> port is
> >>> >> > blocked by a firewall or other network problem.

SOLR-3666

2016-04-15 Thread Jay Potharaju
Hi,
I am using solrCloud with DIH for indexing my data. Is it possible to get the
status of all my DIH instances across all nodes in the cloud? I saw this jira
ticket from a couple of years ago.
https://issues.apache.org/jira/browse/SOLR-3666

Can any of the contributors comment on whether this will be resolved? The only
alternative I know of is to get a list of all nodes in the cloud and poll each
one of them to check the DIH status. Not the most effective way, but it will
work (rough sketch below).
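
A rough sketch of that polling approach, assuming the stock CLUSTERSTATUS
response and the default /dataimport handler name (host, port and collection
name are placeholders):

# list live node names (host:port_solr) from the cluster status
nodes=$(curl -s "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json" \
  | grep -o '[0-9.]*:[0-9]*_solr' | sort -u)
for node in $nodes; do
  hostport=${node%_solr}
  echo "== $hostport =="
  # each node answers for the DIH cores it actually hosts
  curl -s "http://${hostport}/solr/mycollection/dataimport?command=status&wt=json"
done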



-- 
Thanks
Jay


Adding a new shard

2016-04-15 Thread Jay Potharaju
Hi,
I have an existing collection which has 2 shards, one on each node in the
cloud. Now I want to split the existing collection into 3 shards because of an
increase in the volume of data, and create this new shard on a new node in the
solrCloud.

 I read about splitting a shard & creating a shard, but not sure it will
work.

Any suggestions how are others handling this scenario in production.
-- 
Thanks
Jay


Re: Adding a new shard

2016-04-15 Thread Jay Potharaju
I found ticket https://issues.apache.org/jira/browse/SOLR-5025 which talks
about sharding in solrcloud. Are there any plans to address this issue in the
near future?
Can any of the users on the forum comment on how they are handling this
scenario in production?
Thanks

On Fri, Apr 15, 2016 at 4:28 PM, Jay Potharaju 
wrote:

> Hi,
> I have an existing collection which has 2 shards, one on each node in the
> cloud. Now I want to split the existing collection into 3 shards because of
> increase in volume of data. And create this new shard  on a new node in the
> solrCloud.
>
>  I read about splitting a shard & creating a shard, but not sure it will
> work.
>
> Any suggestions how are others handling this scenario in production.
> --
> Thanks
> Jay
>
>



-- 
Thanks
Jay Potharaju


Re: Adding a new shard

2016-04-17 Thread Jay Potharaju
Erick, thanks for the reply. In my current prod setup I anticipate the number
of documents to grow almost 5 times by the end of the year, and am therefore
planning on how to scale when required. We have high query volume and a
growing dataset, which is why I would like to scale by sharding & replication.

In my dev sandbox, I have 2 replicas & 2 shards created using compositeId
as my routing option. If I split the shard, it will create 2 new shards on
each of the solr instances, including replicas, and my requests will start
going to the new shards.
So in order for me to move the shards to their own instances, I will have to
take downtime and move the newly created shards & replicas to their own
instances. Is that a correct interpretation of how shard splitting would work?

I was hoping that solr would automagically split the existing shard &
create replicas on the new instances rather than the existing nodes. That
is why I said the current shard splitting will not work for me.
Thanks

On Sat, Apr 16, 2016 at 8:08 PM, Erick Erickson 
wrote:

> Why don't you think splitting the shards will do what you need?
> Admittedly it will have to be applied to each shard and will
> double the number of shards you have, that's the current
> limitation. At the end, though, you will have 4 shards when
> you used to have 2 and you can move them around to whatever
> hardware you can scrape up.
>
> This assumes you're using the default compositeId routing
> scheme and not implicit routing. If you are using compositeId
> there is no provision to add another shard.
>
> As far as SOLR-5025 is concerned, nobody's working on that
> that I know of.
>
> I have to ask though whether you've tuned your existing
> machines. How many docs are on each? Why do you think
> you need more shards? Query speed? OOMs? Java heaps
> getting too big?
>
> Best,
> Erick
>
> On Fri, Apr 15, 2016 at 10:50 PM, Jay Potharaju 
> wrote:
> > I found ticket https://issues.apache.org/jira/browse/SOLR-5025 which
> talks
> > about sharding in solrcloud. Are there any plans to address this issue in
> > near future?
> > Can any of the users on the forum comment how they are handling this
> > scenario in production?
> > Thanks
> >
> > On Fri, Apr 15, 2016 at 4:28 PM, Jay Potharaju 
> > wrote:
> >
> >> Hi,
> >> I have an existing collection which has 2 shards, one on each node in
> the
> >> cloud. Now I want to split the existing collection into 3 shards
> because of
> >> increase in volume of data. And create this new shard  on a new node in
> the
> >> solrCloud.
> >>
> >>  I read about splitting a shard & creating a shard, but not sure it will
> >> work.
> >>
> >> Any suggestions how are others handling this scenario in production.
> >> --
> >> Thanks
> >> Jay
> >>
> >>
> >
> >
> >
> > --
> > Thanks
> > Jay Potharaju
>



-- 
Thanks
Jay Potharaju


Re: Adding a new shard

2016-04-18 Thread Jay Potharaju
Thanks for the explanation, Erick! I will try out your recommendation.
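
For anyone finding this thread later, the sequence maps onto collections API
calls roughly like this (collection, shard and replica names below are
placeholders for whatever CLUSTERSTATUS reports on your cluster):

# 1> split shard1 of "mycoll"; the subshards come back as shard1_0 and shard1_1
curl "http://m1:8983/solr/admin/collections?action=SPLITSHARD&collection=mycoll&shard=shard1"
# 2> add a replica of one subshard on M2 and wait for it to go active
curl "http://m1:8983/solr/admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1_1&node=m2:8983_solr"
# 3> once it is active, drop the copy of that subshard still on M1
curl "http://m1:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycoll&shard=shard1_1&replica=core_node3"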


On Sun, Apr 17, 2016 at 3:34 PM, Erick Erickson 
wrote:

> bq: So inorder for me  move the shards to its own instances, I will have to
> take a down time and move the newly created shards & replicas to its own
> instances.
>
> No, this is not true.
>
> The easiest way to move things around is use the collections API
> ADDREPLICA command after splitting.
>
> Let's call this particular shard S1 on machine M1, and the results of
> the SPLITSHARD command S1.1 and S1.2 Further, let's say that your goal
> is to move _one_ of the subshards from machine M1 to M2
>
> So the sequence is:
>
> 1> issue SPLITSHARD and wait for it to complete. This requires no
> downtime and after the split the old shard becomes inactive and the
> two new subshards are servicing all requests. I'd probably stop
> indexing during this operation just to be on the safe side, although
> that's not necessary. So now you have both S1.1 and S1.2 running on M1
>
> 2> Use the ADDREPLICA command to add a replica of S1.2 to M2. Again,
> no downtime required. Wait until the new replica is "active", at which
> point it's fully operational. So now we have S1.1 and S1.2 running on
> M1 and S1.2.2 running on M2.
>
> 3> Use the DELETEREPLICA command to remove S1.2 from M1. Now you have
> S1.1 running on M1 and S1.2.1 running on M2. No downtime during any of
> this.
>
> 4> You should be able to delete S1 now from M1 just to tidy up.
>
> 5> Repeat for the other shards.
>
> Best,
> Erick
>
>
> On Sun, Apr 17, 2016 at 3:09 PM, Jay Potharaju 
> wrote:
> > Erik thanks for the reply. In my current prod setup I anticipate the
> number
> > of documents to grow almost 5 times by the end of the year and therefore
> > planning on how to scale when required. We have high query volume and
> > growing dataset, that is why would like to scale by sharding &
> replication.
> >
> > In my dev sandbox, I have 2 replicas & 2 shards created using compositeId
> > as my routing option. If I split the shard, it will create 2 new shards
> on
> > each the solr instances including replicas and my request will start
> going
> > to the new shards.
> > So inorder for me  move the shards to its own instances, I will have to
> > take a down time and move the newly created shards & replicas to its own
> > instances. Is that a correct interpretation of the how shard splitting
> > would work
> >
> >  I was hoping that solr will automagically split the existing shard &
> > create replicas on the new instances  rather than the existing nodes.
> That
> > is why I said the current shard splitting will not work for me.
> > Thanks
> >
> > On Sat, Apr 16, 2016 at 8:08 PM, Erick Erickson  >
> > wrote:
> >
> >> Why don't you think splitting the shards will do what you need?
> >> Admittedly it will have to be applied to each shard and will
> >> double the number of shards you have, that's the current
> >> limitation. At the end, though, you will have 4 shards when
> >> you used to have 2 and you can move them around to whatever
> >> hardware you can scrape up.
> >>
> >> This assumes you're using the default compositeId routing
> >> scheme and not implicit routing. If you are using compositeId
> >> there is no provision to add another shard.
> >>
> >> As far as SOLR-5025 is concerned, nobody's working on that
> >> that I know of.
> >>
> >> I have to ask though whether you've tuned your existing
> >> machines. How many docs are on each? Why do you think
> >> you need more shards? Query speed? OOMs? Java heaps
> >> getting too big?
> >>
> >> Best,
> >> Erick
> >>
> >> On Fri, Apr 15, 2016 at 10:50 PM, Jay Potharaju 
> >> wrote:
> >> > I found ticket https://issues.apache.org/jira/browse/SOLR-5025 which
> >> talks
> >> > about sharding in solrcloud. Are there any plans to address this
> issue in
> >> > near future?
> >> > Can any of the users on the forum comment how they are handling this
> >> > scenario in production?
> >> > Thanks
> >> >
> >> > On Fri, Apr 15, 2016 at 4:28 PM, Jay Potharaju  >
> >> > wrote:
> >> >
> >> >> Hi,
> >> >> I have an existing collection which has 2 shards, one on each node in
> >> the
> >> >> cloud. Now I want to split the existing collection into 3 shards
> >> because of
> >> >> increase in volume of data. And create this new shard  on a new node
> in
> >> the
> >> >> solrCloud.
> >> >>
> >> >>  I read about splitting a shard & creating a shard, but not sure it
> will
> >> >> work.
> >> >>
> >> >> Any suggestions how are others handling this scenario in production.
> >> >> --
> >> >> Thanks
> >> >> Jay
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Thanks
> >> > Jay Potharaju
> >>
> >
> >
> >
> > --
> > Thanks
> > Jay Potharaju
>



-- 
Thanks
Jay Potharaju


Re: NoSuchFileException errors common on version 5.5.0

2016-04-21 Thread Jay Potharaju
Hi.
I am seeing a lot of these errors in my current 5.5.0 dev install. Would it
make sense to use 5.5 in production, or is a different version recommended?
I am using DIH; not sure if that matters in this case.

Thanks


On Fri, Mar 11, 2016 at 3:57 AM, Shai Erera  wrote:

> Hey Shawn,
>
> I added segments file information (name and size) to Core admin status API.
> Turns out that you might get into NoSuchFileException if indexing happens
> and the commit point has changed, but the IndexReader LukeRequestHandler
> receives hasn't picked up the new commit yet, in which case the old
> segments_N file was deleted and computing its size resulted in that
> exception.
>
> I pushed a fix for it which will be released in any one of future releases,
> including 5.5.1 if we'll have any. The fix includes logging the exception
> and returning -1 as the file size.
>
> Shai
>
> On Fri, Mar 11, 2016 at 12:21 AM Shawn Heisey  wrote:
>
> > On 3/10/2016 12:18 PM, Shawn Heisey wrote:
> > > I pulled down branch_5_5 and installed a 5.5.1 snapshot.  Had to edit
> > > lucene/version.properties to get it to be 5.5.1.  I also had to edit
> the
> > > SolrIdentifierValidator class to allow hyphens, since I have them in
> > > some of my core names.  The NoSuchFileException errors are gone now.
> >
> > Spoke too soon.
> >
> > The log message did change a little bit.  Now it's only one log entry on
> > LukeRequestHandler instead of two separate log entries, and it's a WARN
> > instead of ERROR.
> >
> > 2016-03-10 14:35:00.038 WARN  (qtp1012570586-11405) [   x:spark3live]
> > org.apache.solr.handler.admin.LukeRequestHandler Error getting file
> > length for [segments_c5t]
> > java.nio.file.NoSuchFileException:
> > /index/solr5/data/data/spark3_0/index/segments_c5t
> > at
> > sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> > at
> > sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> > at
> > sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> > at
> >
> >
> sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
> > at
> >
> >
> sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
> > at
> >
> >
> sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
> > at java.nio.file.Files.readAttributes(Files.java:1737)
> > at java.nio.file.Files.size(Files.java:2332)
> > at
> > org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:210)
> >
> > Something else to note:  It wasn't 5.5.0 that I had installed, it was
> > 5.5.0-SNAPSHOT -- I installed it some time before 5.5.0 was released.
> > Looks like I did the install of that version on January 29th.
> >
> > Thanks,
> > Shawn
> >
> >
>



-- 
Thanks
Jay Potharaju


measuring query performance & qps per node

2016-04-25 Thread Jay Potharaju
Hi,
I am trying to measure how well the queries are performing, i.e. how long they
are taking. In order to measure query speed I am using solrmeter with 50k
unique filter queries, and then checking if any of the queries are slower
than 50ms. Is this a good approach to measuring query performance?

Are there any guidelines on how to measure whether a given instance can handle
a given number of qps (queries per sec)? For example, if my index holds 30
million docs taking 40 GB of disk and the RAM on the instance is 60 GB, then
how many qps can it handle? Or is this a hard question to answer because it
depends on the load and the type of query running at a given time?
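
For reference, a minimal version of that 50ms check, outside solrmeter, could
be something like this (host, collection and query file are placeholders; the
queries in the file are assumed to be URL-encoded already):

# print the QTime Solr reports for each query in the file
while read q; do
  curl -s "http://localhost:8983/solr/mycollection/select?q=${q}&wt=json" \
    | grep -o '"QTime":[0-9]*'
done < queries.txt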

-- 
Thanks
Jay


Re: measuring query performance & qps per node

2016-04-25 Thread Jay Potharaju
Thanks for the response Erick. I knew that it would depend on a number of
factors like you mentioned. I just wanted to know whether a good
combination of queries, facets & filters would give a reasonable estimate of
how solr might behave.

What did you mean by "Add stats to pivots in Cloud mode."?

Thanks

On Mon, Apr 25, 2016 at 5:05 PM, Erick Erickson 
wrote:

>  Impossible to answer. For instance, a facet query can be very
> heavy-duty. Add stats
> to pivots in Cloud mode.
>
> As for using a bunch of fq clauses, It Depends (tm). If your expected usage
> pattern is all queries like 'q=*:*&fq=clause1&fq=clause2" then it's
> fine. It totally
> falls down if, for instance, you have a bunch of facets. Or grouping.
> Or.
>
> Best,
> Erick
>
> On Mon, Apr 25, 2016 at 3:48 PM, Jay Potharaju 
> wrote:
> > Hi,
> > I am trying to measure how will are queries performing ie how long are
> they
> > taking. In order to measure query speed I am using solrmeter with 50k
> > unique filter queries. And then checking if any of the queries are slower
> > than 50ms. Is this a good approach to measure query performance?
> >
> > Are there any guidelines on how to measure if a given instance can
> handle a
> > given number of qps(query per sec)? For example if my doc size is 30
> > million docs and index size is 40 GB of data and the RAM on the instance
> is
> > 60 GB, then how many qps can it handle? Or is this a hard question to
> > answer and it depends on the load and type of query running at a given
> time.
> >
> > --
> > Thanks
> > Jay
>



-- 
Thanks
Jay Potharaju


Re: Decide on facets from results

2016-04-28 Thread Jay Potharaju
Along the same lines as Erik suggested, but using facet stats instead: you can
get stats on your candidate facet fields in the first pass and then include
the facets that you need in the second pass (rough sketch below).
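
A rough sketch of the two passes (host, collection and field names are
placeholders):

# pass 1: rows=0, just collect stats on the candidate fields
curl "http://localhost:8983/solr/mycollection/select?q=foo&rows=0&stats=true&stats.field=fieldA&stats.field=fieldB&wt=json"
# pass 2: facet only on the fields the stats made look useful
curl "http://localhost:8983/solr/mycollection/select?q=foo&facet=on&facet.field=fieldA&wt=json"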


> On Apr 27, 2016, at 1:21 PM, Mark Robinson  wrote:
> 
> Thanks Eric!
> So that will mean another call will be definitely required to SOLR with the
> facets,  before the results can be send back (with the facet fields being
> derived traversing through the response).
> 
> I was basically checking on whether in the "process" method (I believe
> results will be accessed in the process method), we can dynamically
> generate facets after traversing through the results and identifying the
> fields for faceting, using some aggregation function or so, without having
> to make another call using facet=on&facet.field=, before the
> response is send back to the user.
> 
> Cheers!
> 
> On Wed, Apr 27, 2016 at 2:27 PM, Erik Hatcher 
> wrote:
> 
>> Results will vary based on how you indexed those fields, but sure…
>> &facet=on&facet.field= - with sufficient RAM, lots of fun to be
>> had!
>> 
>> —
>> Erik Hatcher, Senior Solutions Architect
>> http://www.lucidworks.com 
>> 
>> 
>> 
 On Apr 27, 2016, at 12:13 PM, Mark Robinson 
>>> wrote:
>>> 
>>> Hi,
>>> 
>>> If I don't have my facet list at query time, from the results can I
>> select
>>> some fields and by any means create a facet on them? ie after I get the
>>> results I want to identify some fields as facets and send back facets for
>>> them in the response.
>>> 
>>> A kind of very dynamic faceting based on the results!
>>> 
>>> Cld some one pls share their idea.
>>> 
>>> Thanks!
>>> Anil.
>> 
>> 


Using updateRequest Processor with DIH

2016-05-01 Thread Jay Potharaju
Hi,
I was wondering if it is possible to use Update Request Processor with DIH.
I would like to update an index_time field whenever documents are
added/updated in the collection.
I know that I could easily pass a timestamp which would update the field
in my collection, but I was trying to do it using a request processor.

I tried the following but got an error. Any recommendations on how to use
this correctly?



<updateRequestProcessorChain name="update_indextime">
  <processor class="solr.TimestampUpdateProcessorFactory">
    <str name="fieldName">index_time</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">update_indextime</str>
  </lst>
</requestHandler>

Error:
Error from server at unknown UpdateRequestProcessorChain: update_indextime

-- 
Thanks
Jay


Re: query action with wrong result size zero

2016-05-05 Thread Jay Potharaju
Can you check if the field you are searching on is case sensitive? You can
quickly test it by copying the exact contents of the brand field into your
query and comparing it against the query you have posted above.

On Thu, May 5, 2016 at 8:57 AM, mixiangliu <852262...@qq.com> wrote:

>
> i found a strange thing  with solr query,when i set the value of query
> field like "brand:amd",the  size of query result is zero,but the real data
> is not zero,can  some body tell me why,thank you very much!!
> my english is not very good,wish some body understand my words!
>



-- 
Thanks
Jay Potharaju


Filter queries & caching

2016-05-05 Thread Jay Potharaju
Hi,
I have a filter query that gets documents based on date ranges, from the last
n days to anytime in the future.

The objective is to get documents within a date range, but the start date
and end date values are stored in different fields, and that is why I wrote
the filter query as below:

fq=fromfield:[* TO NOW/DAY+1DAY]&& tofield:[NOW/DAY-7DAY TO *] && type:"abc"

The way these queries are currently written, I think they won't leverage the
filter cache because of the "*". Is there a better way to write this query so
that I can leverage the cache?



-- 
Thanks
Jay


Re: Filter queries & caching

2016-05-05 Thread Jay Potharaju
Are you suggesting rewriting it like this ?
fq=filter(fromfield:[* TO NOW/DAY+1DAY]&& tofield:[NOW/DAY-7DAY TO *] )
fq=filter(type:abc)

Is this a better use of the cache as supposed to fq=fromfield:[* TO
NOW/DAY+1DAY]&& tofield:[NOW/DAY-7DAY TO *] && type:"abc"

Thanks

On Thu, May 5, 2016 at 12:50 PM, Ahmet Arslan 
wrote:

> Hi,
>
> Cache enemy is not * but NOW. Since you round it to DAY, cache will work
> within-day.
> I would use separate filer queries, especially fq=type:abc for the
> structured query so it will be cached independently.
>
> Also consider disabling caching (using cost) in expensive queries:
> http://yonik.com/advanced-filter-caching-in-solr/
>
> Ahmet
>
>
>
> On Thursday, May 5, 2016 8:25 PM, Jay Potharaju 
> wrote:
> Hi,
> I have a filter query that gets  documents based on date ranges from last n
> days to anytime in future.
>
> The objective is to get documents between a date range, but the start date
> and end date values are stored in different fields and that is why I wrote
> the filter query as below
>
> fq=fromfield:[* TO NOW/DAY+1DAY]&& tofield:[NOW/DAY-7DAY TO *] &&
> type:"abc"
>
> The way these queries are currently written I think wont leverage the
> filter cache because of "*". Is there a better way to write this query so
> that I can leverage the cache.
>
>
>
> --
> Thanks
> Jay
>



-- 
Thanks
Jay Potharaju


Re: Filter queries & caching

2016-05-05 Thread Jay Potharaju
I have almost 50 million docs and growing... That being said, in a high
query volume case does it make sense to use

 fq=filter(fromfield:[* TO NOW/DAY+1DAY]&& tofield:[NOW/DAY-7DAY TO *]  &&
type:"abc")

OR
fq=filter(fromfield:[* TO NOW/DAY+1DAY]&& tofield:[NOW/DAY-7DAY TO *] )
fq=filter(type:abc)

Is this something that I would need to determine by running some tests?
Thanks

On Thu, May 5, 2016 at 1:44 PM, Jay Potharaju  wrote:

> Are you suggesting rewriting it like this ?
> fq=filter(fromfield:[* TO NOW/DAY+1DAY]&& tofield:[NOW/DAY-7DAY TO *] )
> fq=filter(type:abc)
>
> Is this a better use of the cache as supposed to fq=fromfield:[* TO
> NOW/DAY+1DAY]&& tofield:[NOW/DAY-7DAY TO *] && type:"abc"
>
> Thanks
>
> On Thu, May 5, 2016 at 12:50 PM, Ahmet Arslan 
> wrote:
>
>> Hi,
>>
>> Cache enemy is not * but NOW. Since you round it to DAY, cache will work
>> within-day.
>> I would use separate filer queries, especially fq=type:abc for the
>> structured query so it will be cached independently.
>>
>> Also consider disabling caching (using cost) in expensive queries:
>> http://yonik.com/advanced-filter-caching-in-solr/
>>
>> Ahmet
>>
>>
>>
>> On Thursday, May 5, 2016 8:25 PM, Jay Potharaju 
>> wrote:
>> Hi,
>> I have a filter query that gets  documents based on date ranges from last
>> n
>> days to anytime in future.
>>
>> The objective is to get documents between a date range, but the start date
>> and end date values are stored in different fields and that is why I wrote
>> the filter query as below
>>
>> fq=fromfield:[* TO NOW/DAY+1DAY]&& tofield:[NOW/DAY-7DAY TO *] &&
>> type:"abc"
>>
>> The way these queries are currently written I think wont leverage the
>> filter cache because of "*". Is there a better way to write this query so
>> that I can leverage the cache.
>>
>>
>>
>> --
>> Thanks
>> Jay
>>
>
>
>
> --
> Thanks
> Jay Potharaju
>
>



-- 
Thanks
Jay Potharaju


Re: Filter queries & caching

2016-05-06 Thread Jay Potharaju
Thanks Shawn, Erick & Ahmet, this was very helpful.

> On May 6, 2016, at 6:19 AM, Shawn Heisey  wrote:
> 
>> On 5/5/2016 2:44 PM, Jay Potharaju wrote:
>> Are you suggesting rewriting it like this ?
>> fq=filter(fromfield:[* TO NOW/DAY+1DAY]&& tofield:[NOW/DAY-7DAY TO *] )
>> fq=filter(type:abc)
>> 
>> Is this a better use of the cache as supposed to fq=fromfield:[* TO
>> NOW/DAY+1DAY]&& tofield:[NOW/DAY-7DAY TO *] && type:"abc"
> 
> I keep writing emails and forgetting to send them.  Supplementing the
> excellent information you've already gotten:
> 
> Because all three clauses are ANDed together, what I would suggest doing
> is three filter queries:
> 
> fq=fromfield:[* TO NOW/DAY+1DAY]
> fq=tofield:[NOW/DAY-7DAY TO *]
> fq=type:abc
> 
> Whether or not to split your fq like this will depend on how you use
> filters, and how much memory you can let them use.  With three separate
> fq parameters, you'll get three cache entries in filterCache from the
> one query.  If the next query changes only one of those filters to
> something that's not in the cache yet, but leaves the other two alone,
> then Solr can get the results from the cache for two of them, and then
> will only need to run the query for one of them, saving time and system
> resources.
> 
> I removed the quotes from "abc" because for that specific example,
> quotes are not necessary.  For more complex information than abc, quotes
> might be important.  Experiment, and use what gets you the results you want.
> 
> Thanks,
> Shawn
> 


Re: Filter queries & caching

2016-05-06 Thread Jay Potharaju
We have high query load and considering that I think the suggestions made
above will help with performance.
Thanks
Jay

On Fri, May 6, 2016 at 7:26 AM, Shawn Heisey  wrote:

> On 5/6/2016 7:19 AM, Shawn Heisey wrote:
> > With three separate
> > fq parameters, you'll get three cache entries in filterCache from the
> > one query.
>
> One more tidbit of information related to this:
>
> When you have multiple filters and they aren't cached, I am reasonably
> certain that they run in parallel.  Instead of one complex filter, you
> would have three simple filters running simultaneously.  For low to
> medium query loads on a server with a whole bunch of CPUs, where there
> is plenty of spare CPU power, this can be a real gain in performance ...
> but if the query load is really high, it might be a bad thing.
>
> Thanks,
> Shawn
>
>


-- 
Thanks
Jay Potharaju


Re: understanding phonetic matching

2016-05-07 Thread Jay Potharaju
Thanks for the feedback, I was getting correct results when searching for
jon & john. But when I tried other names like 'khloe', it matched on
'collier' because the phonetic filter generated KL as the token.
Is the phonetic filter the best way to find similar-sounding names?


On Wed, Mar 23, 2016 at 12:01 AM, davidphilip cherian <
davidphilipcher...@gmail.com> wrote:

> The "phonetic_en" analyzer definition available in solr-schema does return
> documents having "Jon", "JN", "John" when search term is "John". Checkout
> screen shot here : http://imgur.com/0R6SvX2
>
> This wiki page explains how phonetic matching works :
>
> https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching#PhoneticMatching-DoubleMetaphone
>
>
> Hope that helps.
>
>
>
> On Wed, Mar 23, 2016 at 11:18 AM, Alexandre Rafalovitch <
> arafa...@gmail.com>
> wrote:
>
> > I'd start by putting LowerCaseFF before the PhoneticFilter.
> >
> > But then, you say you were using Analysis screen and what? Do you get
> > the matches when you put your sample text and the query text in the
> > two boxes in the UI? I am not sure what "look at my solr data" means
> > in this particular context.
> >
> > Regards,
> >Alex.
> > 
> > Newsletter and resources for Solr beginners and intermediates:
> > http://www.solr-start.com/
> >
> >
> > On 23 March 2016 at 16:27, Jay Potharaju  wrote:
> > > Hi,
> > > I am trying to do name matching using the phonetic filter factory. As
> > part
> > > of that I was analyzing the data using analysis screen in solr UI. If i
> > > search for john, any documents containing john or jon should be found.
> > >
> > > Following is my definition of the custom field that I use for indexing
> > the
> > > data. When I look at my solr data I dont see any similar sounding names
> > in
> > > my solr data, even though I have set inject="true". Is that not how it
> is
> > > supposed to work?
> > > Can someone explain how phonetic matching works?
> > >
> > > <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
> > >   <analyzer>
> > >     <tokenizer class="..."/>
> > >     <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"
> > >             inject="true" maxCodeLength="5"/>
> > >     <filter class="solr.LowerCaseFilterFactory"/>
> > >   </analyzer>
> > > </fieldType>
> > > 
> > >
> > > --
> > > Thanks
> > > Jay
> >
>



-- 
Thanks
Jay Potharaju


Re: understanding phonetic matching

2016-05-07 Thread Jay Potharaju
Thanks, will check it out.
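
For the archives, the Beider-Morse chain I will be starting from looks
roughly like this (the attribute values are the documented defaults, not a
tuned recommendation):

<filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC"
        ruleType="APPROX" concat="true" languageSet="auto"/>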


On Sat, May 7, 2016 at 7:05 PM, Susheel Kumar  wrote:

> Jay,
>
> There are mainly three phonetics algorithms available in Solr i.e.
> RefinedSoundex, DoubleMetaphone & BeiderMorse.  We did extensive comparison
> considering various tests cases and found BeiderMorse to be the best among
> those for finding sound like matches and it also supports multiple
> languages.  We also customized Beider Morse extensively for our use case.
>
> So please take a closer look at Beider Morse and i am sure it will help you
> out.
>
> Thanks,
> Susheel
>
> On Sat, May 7, 2016 at 2:13 PM, Jay Potharaju 
> wrote:
>
> > Thanks for the feedback, I was getting correct results when searching for
> > jon & john. But when I tried other names like 'khloe' it matched on
> > 'collier' because the phonetic filter generated KL as the token.
> > Is phonetic filter the best way to find similar sounding names?
> >
> >
> > On Wed, Mar 23, 2016 at 12:01 AM, davidphilip cherian <
> > davidphilipcher...@gmail.com> wrote:
> >
> > > The "phonetic_en" analyzer definition available in solr-schema does
> > return
> > > documents having "Jon", "JN", "John" when search term is "John".
> Checkout
> > > screen shot here : http://imgur.com/0R6SvX2
> > >
> > > This wiki page explains how phonetic matching works :
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching#PhoneticMatching-DoubleMetaphone
> > >
> > >
> > > Hope that helps.
> > >
> > >
> > >
> > > On Wed, Mar 23, 2016 at 11:18 AM, Alexandre Rafalovitch <
> > > arafa...@gmail.com>
> > > wrote:
> > >
> > > > I'd start by putting LowerCaseFF before the PhoneticFilter.
> > > >
> > > > But then, you say you were using Analysis screen and what? Do you get
> > > > the matches when you put your sample text and the query text in the
> > > > two boxes in the UI? I am not sure what "look at my solr data" means
> > > > in this particular context.
> > > >
> > > > Regards,
> > > >Alex.
> > > > 
> > > > Newsletter and resources for Solr beginners and intermediates:
> > > > http://www.solr-start.com/
> > > >
> > > >
> > > > On 23 March 2016 at 16:27, Jay Potharaju 
> > wrote:
> > > > > Hi,
> > > > > I am trying to do name matching using the phonetic filter factory.
> As
> > > > part
> > > > > of that I was analyzing the data using analysis screen in solr UI.
> > If i
> > > > > search for john, any documents containing john or jon should be
> > found.
> > > > >
> > > > > Following is my definition of the custom field that I use for
> > indexing
> > > > the
> > > > > data. When I look at my solr data I dont see any similar sounding
> > names
> > > > in
> > > > > my solr data, even though I have set inject="true". Is that not how
> > it
> > > is
> > > > > supposed to work?
> > > > > Can someone explain how phonetic matching works?
> > > > >
> > > > > <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
> > > > >   <analyzer>
> > > > >     <tokenizer class="..."/>
> > > > >     <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"
> > > > >             inject="true" maxCodeLength="5"/>
> > > > >     <filter class="solr.LowerCaseFilterFactory"/>
> > > > >   </analyzer>
> > > > > </fieldType>
> > > > > 
> > > > >
> > > > > --
> > > > > Thanks
> > > > > Jay
> > > >
> > >
> >
> >
> >
> > --
> > Thanks
> > Jay Potharaju
> >
>



-- 
Thanks
Jay Potharaju


Re: Filter queries & caching

2016-05-08 Thread Jay Potharaju
As mentioned above, adding filter() will add the filter query to the cache.
This would mean that results are fetched from the cache instead of running n
filter queries in parallel.
Is it necessary to use the filter() option? I was under the impression that
all filter queries get added to the "filterCache". What is the
advantage of using filter()?

*From
doc: 
https://cwiki.apache.org/confluence/display/solr/Query+Settings+in+SolrConfig
<https://cwiki.apache.org/confluence/display/solr/Query+Settings+in+SolrConfig>*
This cache is used by SolrIndexSearcher for filters (DocSets) for unordered
sets of all documents that match a query. The numeric attributes control
the number of entries in the cache.
Solr uses the filterCache to cache results of queries that use the fq
search parameter. Subsequent queries using the same parameter setting
result in cache hits and rapid returns of results. See Searching for a
detailed discussion of the fq parameter.

*From Yonik's site: http://yonik.com/solr/query-syntax/#FilterQuery
<http://yonik.com/solr/query-syntax/#FilterQuery>*

(Since Solr 5.4)

A filter query retrieves a set of documents matching a query from the
filter cache. Since scores are not cached, all documents that match the
filter produce the same score (0 by default). Cached filters will be
extremely fast when they are used again in another query.


Thanks


On Fri, May 6, 2016 at 9:46 AM, Jay Potharaju  wrote:

> We have high query load and considering that I think the suggestions made
> above will help with performance.
> Thanks
> Jay
>
> On Fri, May 6, 2016 at 7:26 AM, Shawn Heisey  wrote:
>
>> On 5/6/2016 7:19 AM, Shawn Heisey wrote:
>> > With three separate
>> > fq parameters, you'll get three cache entries in filterCache from the
>> > one query.
>>
>> One more tidbit of information related to this:
>>
>> When you have multiple filters and they aren't cached, I am reasonably
>> certain that they run in parallel.  Instead of one complex filter, you
>> would have three simple filters running simultaneously.  For low to
>> medium query loads on a server with a whole bunch of CPUs, where there
>> is plenty of spare CPU power, this can be a real gain in performance ...
>> but if the query load is really high, it might be a bad thing.
>>
>> Thanks,
>> Shawn
>>
>>
>
>
> --
> Thanks
> Jay Potharaju
>
>



-- 
Thanks
Jay Potharaju


Re: Filter queries & caching

2016-05-09 Thread Jay Potharaju
Thanks Ahmet... but I am still not clear on how adding the filter() option
is better, or whether it is the same as the filterCache.

My question is below.

"As mentioned above adding filter() will add the filter query to the cache.
This would mean that results are fetched from cache instead of running n
number of filter queries  in parallel.
Is it necessary to use the filter() option? I was under the impression that
all filter queries will get added to the "filtercache". What is the
advantage of using filter()?"

Thanks

On Sun, May 8, 2016 at 6:30 PM, Ahmet Arslan 
wrote:

> Hi,
>
> As I understand it useful incase you use an OR operator between two
> restricting clauses.
> Recall that multiple fq means implicit AND.
>
> ahmet
>
>
>
> On Monday, May 9, 2016 4:02 AM, Jay Potharaju 
> wrote:
> As mentioned above adding filter() will add the filter query to the cache.
> This would mean that results are fetched from cache instead of running n
> number of filter queries  in parallel.
> Is it necessary to use the filter() option? I was under the impression that
> all filter queries will get added to the "filtercache". What is the
> advantage of using filter()?
>
> *From
> doc:
> https://cwiki.apache.org/confluence/display/solr/Query+Settings+in+SolrConfig
> <
> https://cwiki.apache.org/confluence/display/solr/Query+Settings+in+SolrConfig
> >*
> This cache is used by SolrIndexSearcher for filters (DocSets) for unordered
> sets of all documents that match a query. The numeric attributes control
> the number of entries in the cache.
> Solr uses the filterCache to cache results of queries that use the fq
> search parameter. Subsequent queries using the same parameter setting
> result in cache hits and rapid returns of results. See Searching for a
> detailed discussion of the fq parameter.
>
> *From Yonik's site: http://yonik.com/solr/query-syntax/#FilterQuery
> <http://yonik.com/solr/query-syntax/#FilterQuery>*
>
> (Since Solr 5.4)
>
> A filter query retrieves a set of documents matching a query from the
> filter cache. Since scores are not cached, all documents that match the
> filter produce the same score (0 by default). Cached filters will be
> extremely fast when they are used again in another query.
>
>
> Thanks
>
>
> On Fri, May 6, 2016 at 9:46 AM, Jay Potharaju 
> wrote:
>
> > We have high query load and considering that I think the suggestions made
> > above will help with performance.
> > Thanks
> > Jay
> >
> > On Fri, May 6, 2016 at 7:26 AM, Shawn Heisey 
> wrote:
> >
> >> On 5/6/2016 7:19 AM, Shawn Heisey wrote:
> >> > With three separate
> >> > fq parameters, you'll get three cache entries in filterCache from the
> >> > one query.
> >>
> >> One more tidbit of information related to this:
> >>
> >> When you have multiple filters and they aren't cached, I am reasonably
> >> certain that they run in parallel.  Instead of one complex filter, you
> >> would have three simple filters running simultaneously.  For low to
> >> medium query loads on a server with a whole bunch of CPUs, where there
> >> is plenty of spare CPU power, this can be a real gain in performance ...
> >> but if the query load is really high, it might be a bad thing.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
> >
> >
> > --
> > Thanks
> > Jay Potharaju
>
> >
> >
>
>
>
> --
> Thanks
> Jay Potharaju
>



-- 
Thanks
Jay Potharaju


Error on creating new collection with existing configs

2016-05-09 Thread Jay Potharaju
Hi,
I created a new config and uploaded it to zk with the name test_conf. And
then created a collection which uses this config.
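
For completeness, the stock way to do that upload in 5.5 is the zkcli script
(paths and zk host below are placeholders):

server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 \
  -cmd upconfig -confdir /path/to/conf -confname test_conf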

CREATE COLLECTION:
/solr/admin/collections?action=CREATE&name=test2&numShards=1&replicationFactor=2&collection.configName=test_conf

 When indexing the data using DIH I get an error.

org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode

for /configs/test2/dataimport.properties


When I create the collection using the command line and don't pass the
configName but just the confdir, DIH indexing works.

Using Solr 5.5

Am I missing something??

-- 
Thanks
Jay


Re: Filter queries & caching

2016-05-09 Thread Jay Potharaju
Thanks for the explanation, Erick.

So that I understand this clearly:

1) fq=filter(fromfield:[* TO NOW/DAY+1DAY] && tofield:[NOW/DAY-7DAY TO *])
   &fq=type:abc
2) fq=fromfield:[* TO NOW/DAY+1DAY]&fq=tofield:[NOW/DAY-7DAY TO *]
   &fq=type:abc

Using 1) would benefit from having 2 separate filterCache entries instead of 3
slots in the cache. But in general both would be using the filter cache.
And secondly, it would be more useful to use filter() in a scenario like the
one above (mentioned in your email).
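
Since all of this hinges on filterCache hits, I will also double-check the
cache sizing in solrconfig.xml; the relevant entry looks like this (the sizes
are placeholders to tune against the hit ratio shown in the admin UI):

<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="128"/>
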
Thanks




On Mon, May 9, 2016 at 9:43 PM, Erick Erickson 
wrote:

> You're confusing a query clause with fq when thinking about filter() I
> think.
>
> Essentially they don't need to be used together, i.e.
>
> q=myclause AND filter(field:value)
>
> is identical to
>
> q=myclause&fq=field:value
>
> both in docs returned and filterCache usage.
>
> q=myclause&filter(fq=field:value)
>
> actually uses two filterCache entries, so is probably not what you want to
> use.
>
> the filter() syntax attached to a q clause (not an fq clause) is meant
> to allow you to get speedups
> you want to use compound clauses without having every combination be
> separate filterCache entries.
>
> Consider the following:
> fq=A OR B
> fq=A AND B
> fq=A
> fq=B
>
> These would require 4 filterCache entries.
>
> q=filter(A) OR filter(B)
> q=filter(A) AND filter(B)
> q=filter(A)
> q=filter(B)
>
> would only require two. Yet all of them would be satisfied only by
> looking at the filterCache.
>
> Aside from the example immediately above, which one you use is largely
> a matter of taste.
>
> Best,
> Erick
>
> On Mon, May 9, 2016 at 12:47 PM, Jay Potharaju 
> wrote:
> > Thanks Ahmet...but I am not still clear how is adding filter() option
> > better or is it the same as filtercache?
> >
> > My question is below.
> >
> > "As mentioned above adding filter() will add the filter query to the
> cache.
> > This would mean that results are fetched from cache instead of running n
> > number of filter queries  in parallel.
> > Is it necessary to use the filter() option? I was under the impression
> that
> > all filter queries will get added to the "filtercache". What is the
> > advantage of using filter()?"
> >
> > Thanks
> >
> > On Sun, May 8, 2016 at 6:30 PM, Ahmet Arslan 
> > wrote:
> >
> >> Hi,
> >>
> >> As I understand it useful incase you use an OR operator between two
> >> restricting clauses.
> >> Recall that multiple fq means implicit AND.
> >>
> >> ahmet
> >>
> >>
> >>
> >> On Monday, May 9, 2016 4:02 AM, Jay Potharaju 
> >> wrote:
> >> As mentioned above adding filter() will add the filter query to the
> cache.
> >> This would mean that results are fetched from cache instead of running n
> >> number of filter queries  in parallel.
> >> Is it necessary to use the filter() option? I was under the impression
> that
> >> all filter queries will get added to the "filtercache". What is the
> >> advantage of using filter()?
> >>
> >> *From
> >> doc:
> >>
> https://cwiki.apache.org/confluence/display/solr/Query+Settings+in+SolrConfig
> >> <
> >>
> https://cwiki.apache.org/confluence/display/solr/Query+Settings+in+SolrConfig
> >> >*
> >> This cache is used by SolrIndexSearcher for filters (DocSets) for
> unordered
> >> sets of all documents that match a query. The numeric attributes control
> >> the number of entries in the cache.
> >> Solr uses the filterCache to cache results of queries that use the fq
> >> search parameter. Subsequent queries using the same parameter setting
> >> result in cache hits and rapid returns of results. See Searching for a
> >> detailed discussion of the fq parameter.
> >>
> >> *From Yonik's site: http://yonik.com/solr/query-syntax/#FilterQuery
> >> <http://yonik.com/solr/query-syntax/#FilterQuery>*
> >>
> >> (Since Solr 5.4)
> >>
> >> A filter query retrieves a set of documents matching a query from the
> >> filter cache. Since scores are not cached, all documents that match the
> >> filter produce the same score (0 by default). Cached filters will be
> >> extremely fast when they are used again in another query.
> >>
> >>
> >> Thanks
> >>
> >>
> >> On Fri, May 6, 2016 at 9:46 AM, Jay Potharaju 
> >> wrote:
> >>

solr multicore vs sharding vs 1 big collection

2015-08-01 Thread Jay Potharaju
Hi

I currently have a single collection with 40 million documents and index
size of 25 GB. The collections gets updated every n minutes and as a result
the number of deleted documents is constantly growing. The data in the
collection is an amalgamation of more than 1000+ customer records. The
number of documents per each customer is around 100,000 records on average.

Now that being said, I'm trying to get a handle on the growing deleted
document count. Because of the growing index size, both the disk space and
the memory are being used up, and I would like to reduce it to a manageable
size.

I have been thinking of splitting the data into multiple cores, 1 for each
customer. This would allow me to manage the smaller collections easily and to
create/update a collection quickly. My concern is that the number of
collections might become an issue. Any suggestions on how to address this
problem? What are my alternatives to moving to multicore collections?

Solr: 4.9
Index size:25 GB
Max doc: 40 million
Doc count:29 million

Replication:4

4 servers in solrcloud.

Thanks
Jay


Re: solr multicore vs sharding vs 1 big collection

2015-08-02 Thread Jay Potharaju
The document contains around 30 fields and has stored set to true for
almost 15 of them. And these stored fields are queried and updated all the
time. You will notice that the deleted documents are almost 30% of the
docs, and that has stayed around that percentage and has not come down.
I did try optimize, but that was disruptive as it caused search errors.
I have been playing with the merge factor to see if it helps with deleted
documents or not. It is currently set to 5 (sketch of the setting below).
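
For reference, the knob lives under <indexConfig> in solrconfig.xml; a
variant biased toward reclaiming deletes might look like this (the values
are experiments on my side, not recommendations):

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">5</int>
  <int name="segmentsPerTier">5</int>
  <double name="reclaimDeletesWeight">3.0</double>
</mergePolicy>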

The server has 24 GB of memory out of which memory consumption is around 23
GB normally and the jvm is set to 6 GB. And have noticed that the available
memory on the server goes to 100 MB at times during a day.
All the updates are run through DIH.

Every day at least once i see the following error, which result in search
errors on the front end of the site.

ERROR org.apache.solr.servlet.SolrDispatchFilter -
null:org.eclipse.jetty.io.EofException

From what I have read these are mainly due to timeouts, and my timeout is set
to 30 seconds and can't be set to a higher number. I was thinking that maybe,
due to high memory usage, it sometimes leads to bad performance/errors.

My objective is to stop the errors; adding more memory to the server is not
a good scaling strategy. That is why I was thinking maybe there is an issue
with the way things are set up and it needs to be revisited.

Thanks


On Sat, Aug 1, 2015 at 7:06 PM, Shawn Heisey  wrote:

> On 8/1/2015 6:49 PM, Jay Potharaju wrote:
> > I currently have a single collection with 40 million documents and index
> > size of 25 GB. The collections gets updated every n minutes and as a
> result
> > the number of deleted documents is constantly growing. The data in the
> > collection is an amalgamation of more than 1000+ customer records. The
> > number of documents per each customer is around 100,000 records on
> average.
> >
> > Now that being said, I 'm trying to get an handle on the growing deleted
> > document size. Because of the growing index size both the disk space and
> > memory is being used up. And would like to reduce it to a manageable
> size.
> >
> > I have been thinking of splitting the data into multiple core, 1 for each
> > customer. This would allow me manage the smaller collection easily and
> can
> > create/update the collection also fast. My concern is that number of
> > collections might become an issue. Any suggestions on how to address this
> > problem. What are my other alternatives to moving to a multicore
> > collections.?
> >
> > Solr: 4.9
> > Index size:25 GB
> > Max doc: 40 million
> > Doc count:29 million
> >
> > Replication:4
> >
> > 4 servers in solrcloud.
>
> Creating 1000+ collections in SolrCloud is definitely problematic.  If
> you need to choose between a lot of shards and a lot of collections, I
> would definitely go with a lot of shards.  I would also want a lot of
> servers for an index with that many pieces.
>
> https://issues.apache.org/jira/browse/SOLR-7191
>
> I don't think it would matter how many collections or shards you have
> when it comes to how many deleted documents are in your index.  If you
> want to clean up a large number of deletes in an index, the best option
> is an optimize.  An optimize requires a large amount of disk I/O, so it
> can be extremely disruptive if the query volume is high.  It should be
> done when the query volume is at its lowest.  For the index you
> describe, a nightly or weekly optimize seems like a good option.
>
> Aside from having a lot of deleted documents in your index, what kind of
> problems are you trying to solve?
>
> Thanks,
> Shawn
>
>


-- 
Thanks
Jay Potharaju


Re: solr multicore vs sharding vs 1 big collection

2015-08-02 Thread Jay Potharaju
Shawn,
Thanks for the feedback. I agree that increasing timeout might alleviate
the timeout issue. The main problem with increasing timeout is the
detrimental effect it will have on the user experience, therefore can't
increase it.
I have looked at the queries that threw errors; the next time I try them,
everything seems to work fine. Not sure how to reproduce the error.
My concern with increasing the memory to 32GB is what happens when the
index size grows over the next few months.
One of the other solutions I have been thinking about is to rebuild the
index (weekly), create a new collection, and use it. Are there any good
references for doing that?
Thanks
Jay

On Sun, Aug 2, 2015 at 10:19 AM, Shawn Heisey  wrote:

> On 8/2/2015 8:29 AM, Jay Potharaju wrote:
> > The document contains around 30 fields and have stored set to true for
> > almost 15 of them. And these stored fields are queried and updated all
> the
> > time. You will notice that the deleted documents is almost 30% of the
> > docs.  And it has stayed around that percent and has not come down.
> > I did try optimize but that was disruptive as it caused search errors.
> > I have been playing with merge factor to see if that helps with deleted
> > documents or not. It is currently set to 5.
> >
> > The server has 24 GB of memory out of which memory consumption is around
> 23
> > GB normally and the jvm is set to 6 GB. And have noticed that the
> available
> > memory on the server goes to 100 MB at times during a day.
> > All the updates are run through DIH.
>
> Using all availble memory is completely normal operation for ANY
> operating system.  If you hold up Windows as an example of one that
> doesn't ... it lies to you about "available" memory.  All modern
> operating systems will utilize memory that is not explicitly allocated
> for the OS disk cache.
>
> The disk cache will instantly give up any of the memory it is using for
> programs that request it.  Linux doesn't try to hide the disk cache from
> you, but older versions of Windows do.  In the newer versions of Windows
> that have the Resource Monitor, you can go there to see the actual
> memory usage including the cache.
>
> > Every day at least once i see the following error, which result in search
> > errors on the front end of the site.
> >
> > ERROR org.apache.solr.servlet.SolrDispatchFilter -
> > null:org.eclipse.jetty.io.EofException
> >
> > From what I have read these are mainly due to timeout and my timeout is
> set
> > to 30 seconds and cant set it to a higher number. I was thinking maybe
> due
> > to high memory usage, sometimes it leads to bad performance/errors.
>
> Although this error can be caused by timeouts, it has a specific
> meaning.  It means that the client disconnected before Solr responded to
> the request, so when Solr tried to respond (through jetty), it found a
> closed TCP connection.
>
> Client timeouts need to either be completely removed, or set to a value
> much longer than any request will take.  Five minutes is a good starting
> value.
>
> If all your client timeout is set to 30 seconds and you are seeing
> EofExceptions, that means that your requests are taking longer than 30
> seconds, and you likely have some performance issues.  It's also
> possible that some of your client timeouts are set a lot shorter than 30
> seconds.
>
> > My objective is to stop the errors, adding more memory to the server is
> not
> > a good scaling strategy. That is why i was thinking maybe there is a
> issue
> > with the way things are set up and need to be revisited.
>
> You're right that adding more memory to the servers is not a good
> scaling strategy for the general case ... but in this situation, I think
> it might be prudent.  For your index and heap sizes, I would want the
> company to pay for at least 32GB of RAM.
>
> Having said that ... I've seen Solr installs work well with a LOT less
> memory than the ideal.  I don't know that adding more memory is
> necessary, unless your system (CPU, storage, and memory speeds) is
> particularly slow.  Based on your document count and index size, your
> documents are quite small, so I think your memory size is probably good
> -- if the CPU, memory bus, and storage are very fast.  If one or more of
> those subsystems aren't fast, then make up the difference with lots of
> memory.
>
> Some light reading, where you will learn why I think 32GB is an ideal
> memory size for your system:
>
> https://wiki.apache.org/solr/SolrPerformanceProblems
>
> It is possible that your 6GB heap is not quite big enough for good
> performance, or that your GC is not well-tuned.  These topics are also
> discussed on that wiki page.  If you increase your heap size, then the
> likelihood of needing more memory in the system becomes greater, because
> there will be less memory available for the disk cache.
>
> Thanks,
> Shawn
>
>


-- 
Thanks
Jay Potharaju


Re: solr multicore vs sharding vs 1 big collection

2015-08-04 Thread Jay Potharaju
For the last few days I have been trying to correlate the timeouts with GC.
I noticed in the GC logs that a full GC takes a long time once in a while.
Does this mean that the jvm memory is set too high, or is it set too low?


 [GC 4730643K->3552794K(4890112K), 0.0433146 secs]
1973853.751: [Full GC 3552794K->2926402K(4635136K), 0.3123954 secs]
1973864.170: [GC 4127554K->2972129K(4644864K), 0.0418248 secs]
1973873.341: [GC 4185569K->2990123K(4640256K), 0.0451723 secs]
1973882.452: [GC 4201770K->2999178K(4645888K), 0.0611839 secs]
1973890.684: [GC 4220298K->3010751K(4646400K), 0.0302890 secs]
1973900.539: [GC 4229514K->3015049K(4646912K), 0.0470857 secs]
1973911.179: [GC 4237193K->3040837K(4646912K), 0.0373900 secs]
1973920.822: [GC 4262981K->3072045K(4655104K), 0.0450480 secs]
1973927.136: [GC 4307501K->3129835K(4635648K), 0.0392559 secs]
1973933.057: [GC 4363058K->3178923K(4647936K), 0.0426612 secs]
1973940.981: [GC 4405163K->3210677K(4648960K), 0.0557622 secs]
1973946.680: [GC 4436917K->3239408K(4656128K), 0.0430889 secs]
1973953.560: [GC 4474277K->3300411K(4641280K), 0.0423129 secs]
1973960.674: [GC 4536894K->3371225K(4630016K), 0.0560341 secs]
1973960.731: [Full GC 3371225K->3339436K(5086208K), 15.5285889 secs]
1973990.516: [GC 4548268K->3405111K(5096448K), 0.0657788 secs]
1973998.191: [GC 4613934K->3527257K(5086208K), 0.1304232 secs]
1974006.505: [GC 4723801K->3597899K(5132800K), 0.0899599 secs]
1974014.748: [GC 4793955K->3654280K(5163008K), 0.0989430 secs]
1974025.349: [GC 4880823K->3672457K(5182464K), 0.0683296 secs]
1974037.517: [GC 4899721K->3681560K(5234688K), 0.1028356 secs]
1974050.066: [GC 4938520K->3718901K(5256192K), 0.0796073 secs]
1974061.466: [GC 4974356K->3726357K(5308928K), 0.1324846 secs]
1974071.726: [GC 5003687K->3757516K(5336064K), 0.0734227 secs]
1974081.917: [GC 5036492K->3777662K(5387264K), 0.1475958 secs]
1974091.853: [GC 5074558K->3800799K(5421056K), 0.0799311 secs]
1974101.882: [GC 5097363K->3846378K(5434880K), 0.3011178 secs]
1974109.234: [GC 5121936K->3930457K(5478912K), 0.0956342 secs]
1974116.082: [GC 5206361K->3974011K(5215744K), 0.1967284 secs]

Thanks
Jay

On Mon, Aug 3, 2015 at 1:53 PM, Bill Bell  wrote:

> Yeah a separate by month or year is good and can really help in this case.
>
> Bill Bell
> Sent from mobile
>
>
> > On Aug 2, 2015, at 5:29 PM, Jay Potharaju  wrote:
> >
> > Shawn,
> > Thanks for the feedback. I agree that increasing timeout might alleviate
> > the timeout issue. The main problem with increasing timeout is the
> > detrimental effect it will have on the user experience, therefore can't
> > increase it.
> > I have looked at the queries that threw errors, next time I try it
> > everything seems to work fine. Not sure how to reproduce the error.
> > My concern with increasing the memory to 32GB is what happens when the
> > index size grows over the next few months.
> > One of the other solutions I have been thinking about is to rebuild
> > index(weekly) and create a new collection and use it. Are there any good
> > references for doing that?
> > Thanks
> > Jay
> >
> >> On Sun, Aug 2, 2015 at 10:19 AM, Shawn Heisey 
> wrote:
> >>
> >>> On 8/2/2015 8:29 AM, Jay Potharaju wrote:
> >>> The document contains around 30 fields and have stored set to true for
> >>> almost 15 of them. And these stored fields are queried and updated all
> >> the
> >>> time. You will notice that the deleted documents is almost 30% of the
> >>> docs.  And it has stayed around that percent and has not come down.
> >>> I did try optimize but that was disruptive as it caused search errors.
> >>> I have been playing with merge factor to see if that helps with deleted
> >>> documents or not. It is currently set to 5.
> >>>
> >>> The server has 24 GB of memory out of which memory consumption is
> around
> >> 23
> >>> GB normally and the jvm is set to 6 GB. And have noticed that the
> >> available
> >>> memory on the server goes to 100 MB at times during a day.
> >>> All the updates are run through DIH.
> >>
> >> Using all availble memory is completely normal operation for ANY
> >> operating system.  If you hold up Windows as an example of one that
> >> doesn't ... it lies to you about "available" memory.  All modern
> >> operating systems will utilize memory that is not explicitly allocated
> >> for the OS disk cache.
> >>
> >> The disk cache will instantly give up any of the memory it is using for
> >> programs that request it.  Linux doesn't try to hide the disk cache from
> >

Solr packages in Apache BigTop.

2015-03-07 Thread jay vyas
Hi Solr.

I work on the apache bigtop project, and am interested in integrating it
deeper with Solr, for example, for testing spark / solr integration cases.

Is anyone in the Solr community interested in collaborating on testing
releases with us and maintaining Solr packaging in BigTop (with our help, of
course)?

The advantage here is that we can synergize efforts: when new Solr
releases come out, we can test them in BigTop to guarantee that there are
rpm/deb packages which work well with the Hadoop ecosystem.

For those that don't know, BigTop is the upstream Apache big-data packaging
project; we build Hadoop, Spark, Solr, HBase and so on in rpm/deb format,
and supply Puppet provisioners along with Vagrant recipes for testing.

-- 
jay vyas


debugging solr query

2016-05-24 Thread Jay Potharaju
Hi,
I am trying to debug Solr performance problems on an old version of Solr,
4.3.1.
The queries are taking really long, in the range of 2-5 seconds.
Running a filter query with only one condition also takes about a second.

There is memory available on the box for Solr to use. I have been looking
at the following link, but was looking for some more references that would
tell me why a particular query is slow.

https://wiki.apache.org/solr/SolrPerformanceProblems

Solr version:4.3.1
Index size:128 GB
Heap:65 GB
Index size:75 GB
Memory usage:70 GB

Even though the available memory is high, not all of it is being used. I
would expect the complete index to be in memory, but it doesn't look like
it is. Any recommendations?
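
A minimal example of pulling per-component timings, assuming a localhost
endpoint and a collection named collection1:

http://localhost:8983/solr/collection1/select?q=*:*&debug=timing

The timing section of the response splits QTime across the search components
(query, facet, highlight, etc.), which narrows down where the 2-5 seconds go.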

-- 
Thanks
Jay


Re: debugging solr query

2016-05-25 Thread Jay Potharaju
Hi,
Thanks for the feedback. The queries I run are very basic filter queries
with some sorting.

q:*:*&fq=(dt1:[date1 TO *] && dt2:[* TO NOW/DAY+1]) && fieldA:abc &&
fieldB:(123 OR 456)&sort=dt1 asc,field2 asc, fieldC desc

I noticed that the date fields (dt1, dt2) are using date instead of tdate
fields & there are no docValues set on any of the fields used for sorting.

In order to fix this, I plan to add new fields using tdate & docValues where
required to the schema & update the new fields only for documents that have
fieldA set to abc. Once the fields are updated, I will query on the new
fields to measure query performance (see the schema sketch after the
questions below).


   - Would the newly added fields be used effectively by the Solr index when
   querying & filtering? What I am not sure of is whether only populating a
   small number of documents (fieldA:abc) that are used for the above query
   provides performance benefits.
   - Would there be a performance penalty because the majority of the
   documents (!fieldA:abc) don't have values in the new fields?
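
A minimal sketch of the planned schema additions, assuming hypothetical field
names dt1_tdt and dt2_tdt (the tdate type is the stock TrieDateField
definition from the example schema):

<fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
<field name="dt1_tdt" type="tdate" indexed="true" stored="true" docValues="true"/>
<field name="dt2_tdt" type="tdate" indexed="true" stored="true" docValues="true"/>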


Thanks
Jay

On Tue, May 24, 2016 at 8:06 PM, Erick Erickson 
wrote:

> Try adding debug=timing, that'll give you an idea of what component is
> taking all the time.
> From there, it's "more art than science".
>
> But you haven't given us much to go on. What is the query? Are you
> grouping?
> Faceting on high-cardinality fields? Returning 10,000 rows?
>
> Best,
> Erick
>
> On Tue, May 24, 2016 at 4:52 PM, Ahmet Arslan 
> wrote:
> >
> >
> > Hi,
> >
> > Is it QueryComponent taking time?
> > Ot other components?
> >
> > Also make sure there is plenty of RAM for OS cache.
> >
> > Ahmet
> >
> > On Wednesday, May 25, 2016 1:47 AM, Jay Potharaju 
> wrote:
> >
> >
> >
> > Hi,
> > I am trying to debug solr performance problems on an old version of solr,
> > 4.3.1.
> > The queries are taking really long -in the range of 2-5 seconds!!.
> > Running filter query with only one condition also takes about a second.
> >
> > There is memory available on the box for solr to use. I have been looking
> > at the following link but was looking for some more reference that would
> > tell me why a particular query is slow.
> >
> > https://wiki.apache.org/solr/SolrPerformanceProblems
> >
> > Solr version:4.3.1
> > Index size:128 GB
> > Heap:65 GB
> > Index size:75 GB
> > Memory usage:70 GB
> >
> > Even though there is available memory is high all is not being used ..i
> > would expect the complete index to be in memory but it doesnt look like
> it
> > is. Any recommendations ??
> >
> > --
> > Thanks
> > Jay
>



-- 
Thanks
Jay Potharaju


Re: How to perform a contains query

2016-05-25 Thread Jay Potharaju
> execute
>> : > arbitrary code by uploading and accessing a JSP file.",
>> : >
>> : > "summary": "A certain tomcat7 package for Apache Tomcat 7 in
>> Red Hat
>> : > Enterprise Linux (RHEL) 7 allows remote attackers to cause a denial of
>> : > service (CPU consumption) via a crafted request.  NOTE: this
>> vulnerability
>> : > exists because of an unspecified regression.",
>> : >
>> : > "summary": "Apache Tomcat 7.0.0 through 7.0.3, 6.0.x, and
>> 5.5.x,
>> : > when running within a SecurityManager, does not make the
>> ServletContext
>> : > attribute read-only, which allows local web applications to read or
>> write
>> : > files outside of the intended working directory, as demonstrated
>> using a
>> : > directory traversal attack.",
>> : >
>> : > "summary": "Apache Tomcat 7.0.11, when web.xml has no login
>> : > configuration, does not follow security constraints, which allows
>> remote
>> : > attackers to bypass intended access restrictions via HTTP requests to
>> a
>> : > meta-data complete web application.  NOTE: this vulnerability exists
>> because
>> : > of an incorrect fix for CVE-2011-1088 and CVE-2011-1419.",
>> : >
>> : > "summary": "Apache Tomcat 7.x before 7.0.11, when web.xml has
>> no
>> : > security constraints, does not follow ServletSecurity annotations,
>> which
>> : > allows remote attackers to bypass intended access restrictions via
>> HTTP
>> : > requests to a web application.  NOTE: this vulnerability exists
>> because of
>> : > an incomplete fix for CVE-2011-1088.",
>> : >
>> : > "summary": "The HTTP BIO connector in Apache Tomcat 7.0.x
>> before
>> : > 7.0.12 does not properly handle HTTP pipelining, which allows remote
>> : > attackers to read responses intended for other clients in
>> opportunistic
>> : > circumstances by examining the application data in HTTP packets,
>> related to
>> : > \"a mix-up of responses for requests from different users.\"",
>> : >
>> : > "summary": "Apache Tomcat 7.0.12 and 7.0.13 processes the
>> first
>> : > request to a servlet without following security constraints that have
>> been
>> : > configured through annotations, which allows remote attackers to
>> bypass
>> : > intended access restrictions via HTTP requests. NOTE: this
>> vulnerability
>> : > exists because of an incomplete fix for CVE-2011-1088, CVE-2011-1183,
>> and
>> : > CVE-2011-1419.",
>> : >
>> : > "summary": "Apache Tomcat 7.0.x before 7.0.17 permits web
>> : > applications to replace an XML parser used for other web
>> applications, which
>> : > allows local users to read or modify the (1) web.xml, (2)
>> context.xml, or
>> : > (3) tld files of arbitrary web applications via a crafted application
>> that
>> : > is loaded earlier than the target application.  NOTE: this
>> vulnerability
>> : > exists because of a CVE-2009-0783 regression.",
>> : >
>> : > "summary": "Certain AJP protocol connector implementations in
>> Apache
>> : > Tomcat 7.0.0 through 7.0.20, 6.0.0 through 6.0.33, 5.5.0 through
>> 5.5.33, and
>> : > possibly other versions allow remote attackers to spoof AJP requests,
>> bypass
>> : > authentication, and obtain sensitive information by causing the
>> connector to
>> : > interpret a request body as a new request.",
>> : >
>> : > "summary": "** DISPUTED ** Apache Tomcat 7.x uses
>> world-readable
>> : > permissions for the log directory and its files, which might allow
>> local
>> : > users to obtain sensitive information by reading a file. NOTE: One
>> Tomcat
>> : > distributor has stated \"The tomcat log directory does not contain any
>> : > sensitive information.\"",
>> : >
>> : > "summary":
>> "java/org/apache/catalina/core/AsyncContextImpl.java in
>> : > Apache Tomcat 7.x before 7.0.40 does not properly handle the throwing
>> of a
>> : > RuntimeException in an AsyncListener in an application, which allows
>> : > context-dependent attackers to obtain sensitive request information
>> intended
>> : > for other applications in opportunistic circumstances via an
>> application
>> : > that records the requests that it processes.",
>> : >
>> : > "summary": "Session fixation vulnerability in Apache Tomcat
>> 7.x
>> : > before 7.0.66, 8.x before 8.0.30, and 9.x before 9.0.0.M2, when
>> different
>> : > session settings are used for deployments of multiple versions of the
>> same
>> : > web application, might allow remote attackers to hijack web sessions
>> by
>> : > leveraging use of a requestedSessionSSL field for an unintended
>> request,
>> : > related to CoyoteAdapter.java and Request.java.",
>> : >
>> : > "summary": "The (1) Manager and (2) Host Manager applications
>> in
>> : > Apache Tomcat 7.x before 7.0.68, 8.x before 8.0.31, and 9.x before
>> 9.0.0.M2
>> : > establish sessions and send CSRF tokens for arbitrary new requests,
>> which
>> : > allows remote attackers to bypass a CSRF protection mechanism by
>> using a
>> : > token.",
>> : >
>> : > "summary": "The setGlobalContext method in
>> : > org/apache/naming/factory/ResourceLinkFactory.java in Apache Tomcat
>> 7.x
>> : > before 7.0.68, 8.x before 8.0.31, and 9.x before 9.0.0.M3 does not
>> consider
>> : > whether ResourceLinkFactory.setGlobalContext callers are authorized,
>> which
>> : > allows remote authenticated users to bypass intended SecurityManager
>> : > restrictions and read or write to arbitrary application data, or
>> cause a
>> : > denial of service (application disruption), via a web application
>> that sets
>> : > a crafted global context.",
>> :
>> :
>>
>> -Hoss
>> http://www.lucidworks.com/
>>
>
>


-- 
Thanks
Jay Potharaju


Re: How to save index data to other place? [scottchu]

2016-05-25 Thread Jay Potharaju
use property.*dataDir*=*value*
https://cwiki.apache.org/confluence/display/solr/Defining+core.properties
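
A minimal sketch of the resulting core.properties, using the names and path
from the question below:

name=cugna
dataDir=/var/sc_data/cugna

The same property can also be passed at creation time, e.g.
property.dataDir=/var/sc_data/cugna on the CoreAdmin CREATE call.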

On Wed, May 25, 2016 at 8:20 PM, scott.chu  wrote:

>
> When I create a collection, say named 'cugna'. Solr create a folder with
> same name under server\slolr, e.g. /local/solr-5.4.1/server/solr/cugna.
> Index data is also saved there. But wish to save index data on other
> folder, say /var/sc_data/cugna. How can I dothis?
>
> scott.chu,scott@udngroup.com
> 2016/5/26 (週四)
>



-- 
Thanks
Jay Potharaju


Re: debugging solr query

2016-05-25 Thread Jay Potharaju
Any links that illustrate and explain Solr internals, and how
indexing/querying work, would be a great help.
Thanks
Jay

On Wed, May 25, 2016 at 6:30 PM, Jay Potharaju 
wrote:

> Hi,
> Thanks for the feedback. The queries I run are very basic filter queries
> with some sorting.
>
> q:*:*&fq=(dt1:[date1 TO *] && dt2:[* TO NOW/DAY+1]) && fieldA:abc &&
> fieldB:(123 OR 456)&sort=dt1 asc,field2 asc, fieldC desc
>
> I noticed that the date fields(dt1,dt2) are using date instead of tdate
> fields & there are no docValues set on any of the fields used for sorting.
>
> In order to fix this I plan to add a new field using tdate & docvalues
> where required to the schema & update the new columns only for documents
> that have fieldA set to abc. Once the fields are updated query on the new
> fields to measure query performance .
>
>
>- Would the new added fields be used effectively by the solr index
>when querying & filtering? What I am not sure is whether only populating
>small number of documents(fieldA:abc) that are used for the above query
>provide performance benefits.
>- Would there be a performance penalty because majority of the
>documents(!fieldA:abc) dont have values in the new columns?
>
>
> Thanks
> Jay
>
> On Tue, May 24, 2016 at 8:06 PM, Erick Erickson 
> wrote:
>
>> Try adding debug=timing, that'll give you an idea of what component is
>> taking all the time.
>> From there, it's "more art than science".
>>
>> But you haven't given us much to go on. What is the query? Are you
>> grouping?
>> Faceting on high-cardinality fields? Returning 10,000 rows?
>>
>> Best,
>> Erick
>>
>> On Tue, May 24, 2016 at 4:52 PM, Ahmet Arslan 
>> wrote:
>> >
>> >
>> > Hi,
>> >
>> > Is it QueryComponent taking time?
>> > Ot other components?
>> >
>> > Also make sure there is plenty of RAM for OS cache.
>> >
>> > Ahmet
>> >
>> > On Wednesday, May 25, 2016 1:47 AM, Jay Potharaju <
>> jspothar...@gmail.com> wrote:
>> >
>> >
>> >
>> > Hi,
>> > I am trying to debug solr performance problems on an old version of
>> solr,
>> > 4.3.1.
>> > The queries are taking really long -in the range of 2-5 seconds!!.
>> > Running filter query with only one condition also takes about a second.
>> >
>> > There is memory available on the box for solr to use. I have been
>> looking
>> > at the following link but was looking for some more reference that would
>> > tell me why a particular query is slow.
>> >
>> > https://wiki.apache.org/solr/SolrPerformanceProblems
>> >
>> > Solr version:4.3.1
>> > Index size:128 GB
>> > Heap:65 GB
>> > Index size:75 GB
>> > Memory usage:70 GB
>> >
>> > Even though there is available memory is high all is not being used ..i
>> > would expect the complete index to be in memory but it doesnt look like
>> it
>> > is. Any recommendations ??
>> >
>> > --
>> > Thanks
>> > Jay
>>
>
>
>
> --
> Thanks
> Jay Potharaju
>
>



-- 
Thanks
Jay Potharaju


Re: debugging solr query

2016-05-26 Thread Jay Potharaju
Hi,
Thanks for the feedback. The queries I run are very basic filter queries
with some sorting.

q:*:*&fq=(dt1:[date1 TO *] && dt2:[* TO NOW/DAY+1]) && fieldA:abc &&
fieldB:(123 OR 456)&sort=dt1 asc,field2 asc, fieldC desc

I noticed that the date fields (dt1, dt2) are using date instead of tdate
fields & there are no docValues set on any of the fields used for sorting.

In order to fix this, I plan to add new fields using tdate & docValues where
required to the schema & update the new fields only for documents that have
fieldA set to abc. Once the fields are updated, I will query on the new
fields to measure query performance.


   - Would the newly added fields be used effectively by the Solr index when
   querying & filtering? What I am not sure of is whether only populating a
   small number of documents (fieldA:abc) that are used for the above query
   provides performance benefits.
   - Would there be a performance penalty because the majority of the
   documents (!fieldA:abc) don't have values in the new fields?

Thanks

On Wed, May 25, 2016 at 8:40 PM, Jay Potharaju 
wrote:

> Any links that illustrate and talk about solr internals and how
> indexing/querying works would be a great help.
> Thanks
> Jay
>
> On Wed, May 25, 2016 at 6:30 PM, Jay Potharaju 
> wrote:
>
>> Hi,
>> Thanks for the feedback. The queries I run are very basic filter queries
>> with some sorting.
>>
>> q:*:*&fq=(dt1:[date1 TO *] && dt2:[* TO NOW/DAY+1]) && fieldA:abc &&
>> fieldB:(123 OR 456)&sort=dt1 asc,field2 asc, fieldC desc
>>
>> I noticed that the date fields(dt1,dt2) are using date instead of tdate
>> fields & there are no docValues set on any of the fields used for sorting.
>>
>> In order to fix this I plan to add a new field using tdate & docvalues
>> where required to the schema & update the new columns only for documents
>> that have fieldA set to abc. Once the fields are updated query on the new
>> fields to measure query performance .
>>
>>
>>- Would the new added fields be used effectively by the solr index
>>when querying & filtering? What I am not sure is whether only populating
>>small number of documents(fieldA:abc) that are used for the above query
>>provide performance benefits.
>>- Would there be a performance penalty because majority of the
>>documents(!fieldA:abc) dont have values in the new columns?
>>
>>
>> Thanks
>> Jay
>>
>> On Tue, May 24, 2016 at 8:06 PM, Erick Erickson 
>> wrote:
>>
>>> Try adding debug=timing, that'll give you an idea of what component is
>>> taking all the time.
>>> From there, it's "more art than science".
>>>
>>> But you haven't given us much to go on. What is the query? Are you
>>> grouping?
>>> Faceting on high-cardinality fields? Returning 10,000 rows?
>>>
>>> Best,
>>> Erick
>>>
>>> On Tue, May 24, 2016 at 4:52 PM, Ahmet Arslan 
>>> wrote:
>>> >
>>> >
>>> > Hi,
>>> >
>>> > Is it QueryComponent taking time?
>>> > Ot other components?
>>> >
>>> > Also make sure there is plenty of RAM for OS cache.
>>> >
>>> > Ahmet
>>> >
>>> > On Wednesday, May 25, 2016 1:47 AM, Jay Potharaju <
>>> jspothar...@gmail.com> wrote:
>>> >
>>> >
>>> >
>>> > Hi,
>>> > I am trying to debug solr performance problems on an old version of
>>> solr,
>>> > 4.3.1.
>>> > The queries are taking really long -in the range of 2-5 seconds!!.
>>> > Running filter query with only one condition also takes about a second.
>>> >
>>> > There is memory available on the box for solr to use. I have been
>>> looking
>>> > at the following link but was looking for some more reference that
>>> would
>>> > tell me why a particular query is slow.
>>> >
>>> > https://wiki.apache.org/solr/SolrPerformanceProblems
>>> >
>>> > Solr version:4.3.1
>>> > Index size:128 GB
>>> > Heap:65 GB
>>> > Index size:75 GB
>>> > Memory usage:70 GB
>>> >
>>> > Even though there is available memory is high all is not being used ..i
>>> > would expect the complete index to be in memory but it doesnt look
>>> like it
>>> > is. Any recommendations ??
>>> >
>>> > --
>>> > Thanks
>>> > Jay
>>>
>>
>>
>>
>> --
>> Thanks
>> Jay Potharaju
>>
>>
>
>
>
> --
> Thanks
> Jay Potharaju
>
>



-- 
Thanks
Jay Potharaju


Re: debugging solr query

2016-05-27 Thread Jay Potharaju
I updated almost 1/3 of the data and ran my queries with the new fields as
mentioned earlier. The query returns data in almost half the time compared
to before. I am thinking that if I update all the documents there would not
be much difference in query response time.

 <field name="dt1_tdt" type="tdate" indexed="true" stored="true" docValues="true"/>

 <field name="dt2_tdt" type="tdate" indexed="true" stored="true" docValues="true" default=""/>

Are there any suggestions on how to handle filtering/querying/sorting on
high-cardinality date fields?

Index size: 30 million docs
Solr: 4.3.1

Thanks

On Thu, May 26, 2016 at 6:04 AM, Jay Potharaju 
wrote:

> Hi,
> Thanks for the feedback. The queries I run are very basic filter queries
> with some sorting.
>
> q:*:*&fq=(dt1:[date1 TO *] && dt2:[* TO NOW/DAY+1]) && fieldA:abc &&
> fieldB:(123 OR 456)&sort=dt1 asc,field2 asc, fieldC desc
>
> I noticed that the date fields(dt1,dt2) are using date instead of tdate
> fields & there are no docValues set on any of the fields used for sorting.
>
> In order to fix this I plan to add a new field using tdate & docvalues
> where required to the schema & update the new columns only for documents
> that have fieldA set to abc. Once the fields are updated query on the new
> fields to measure query performance .
>
>
>- Would the new added fields be used effectively by the solr index
>when querying & filtering? What I am not sure is whether only populating
>small number of documents(fieldA:abc) that are used for the above query
>provide performance benefits.
>- Would there be a performance penalty because majority of the
>    documents(!fieldA:abc) dont have values in the new columns?
>
> Thanks
>
> On Wed, May 25, 2016 at 8:40 PM, Jay Potharaju 
> wrote:
>
>> Any links that illustrate and talk about solr internals and how
>> indexing/querying works would be a great help.
>> Thanks
>> Jay
>>
>> On Wed, May 25, 2016 at 6:30 PM, Jay Potharaju 
>> wrote:
>>
>>> Hi,
>>> Thanks for the feedback. The queries I run are very basic filter queries
>>> with some sorting.
>>>
>>> q:*:*&fq=(dt1:[date1 TO *] && dt2:[* TO NOW/DAY+1]) && fieldA:abc &&
>>> fieldB:(123 OR 456)&sort=dt1 asc,field2 asc, fieldC desc
>>>
>>> I noticed that the date fields(dt1,dt2) are using date instead of tdate
>>> fields & there are no docValues set on any of the fields used for sorting.
>>>
>>> In order to fix this I plan to add a new field using tdate & docvalues
>>> where required to the schema & update the new columns only for documents
>>> that have fieldA set to abc. Once the fields are updated query on the new
>>> fields to measure query performance .
>>>
>>>
>>>- Would the new added fields be used effectively by the solr index
>>>when querying & filtering? What I am not sure is whether only populating
>>>small number of documents(fieldA:abc) that are used for the above query
>>>provide performance benefits.
>>>- Would there be a performance penalty because majority of the
>>>documents(!fieldA:abc) dont have values in the new columns?
>>>
>>>
>>> Thanks
>>> Jay
>>>
>>> On Tue, May 24, 2016 at 8:06 PM, Erick Erickson >> > wrote:
>>>
>>>> Try adding debug=timing, that'll give you an idea of what component is
>>>> taking all the time.
>>>> From there, it's "more art than science".
>>>>
>>>> But you haven't given us much to go on. What is the query? Are you
>>>> grouping?
>>>> Faceting on high-cardinality fields? Returning 10,000 rows?
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Tue, May 24, 2016 at 4:52 PM, Ahmet Arslan 
>>>> wrote:
>>>> >
>>>> >
>>>> > Hi,
>>>> >
>>>> > Is it QueryComponent taking time?
>>>> > Ot other components?
>>>> >
>>>> > Also make sure there is plenty of RAM for OS cache.
>>>> >
>>>> > Ahmet
>>>> >
>>>> > On Wednesday, May 25, 2016 1:47 AM, Jay Potharaju <
>>>> jspothar...@gmail.com> wrote:
>>>> >
>>>> >
>>>> >
>>>> > Hi,
>>>> > I am trying to debug solr performance problems on an old version of
>>>> solr,
>>>> > 4.3.1.
>>>> > The queries are taking really long -in the range of 2-5 seconds!!.
>>>> > Running filter query with only one condition also takes about a
>>>> second.
>>>> >
>>>> > There is memory available on the box for solr to use. I have been
>>>> looking
>>>> > at the following link but was looking for some more reference that
>>>> would
>>>> > tell me why a particular query is slow.
>>>> >
>>>> > https://wiki.apache.org/solr/SolrPerformanceProblems
>>>> >
>>>> > Solr version:4.3.1
>>>> > Index size:128 GB
>>>> > Heap:65 GB
>>>> > Index size:75 GB
>>>> > Memory usage:70 GB
>>>> >
>>>> > Even though there is available memory is high all is not being used
>>>> ..i
>>>> > would expect the complete index to be in memory but it doesnt look
>>>> like it
>>>> > is. Any recommendations ??
>>>> >
>>>> > --
>>>> > Thanks
>>>> > Jay
>>>>
>>>
>>>
>>>
>>> --
>>> Thanks
>>> Jay Potharaju
>>>
>>>
>>
>>
>>
>> --
>> Thanks
>> Jay Potharaju
>>
>>
>
>
>
> --
> Thanks
> Jay Potharaju
>
>



-- 
Thanks
Jay Potharaju


Re: debugging solr query

2016-05-27 Thread Jay Potharaju
Thanks for the suggestion. At this time I won't be able to change any code
in the API; my options are limited to changing things at the Solr level.
Any suggestions regarding Solr settings in config, or schema changes, are
within my control.



On Fri, May 27, 2016 at 7:03 AM, Ahmet Arslan  wrote:

> Hi Jay,
>
> Please separate the clauses. Feed one of them to the main q parameter with
> content score operator =^ since you are sorting on a structured field(e.g.
> date)
>
> q:fieldB:(123 OR 456)^=1.0
> &fq=dt1:[date1 TO *]
> &fq=dt2:[* TO NOW/DAY+1]
> &fq=fieldA:abc
> &sort=dt1 asc,field2 asc, fieldC desc
>
> Play with the caches.
> Also consider disabling caching, and/or supplying execution order for the
> filer queries.
> Please see :
> https://lucidworks.com/blog/2012/02/10/advanced-filter-caching-in-solr/
>
> Ahmet
>
>
>
> On Friday, May 27, 2016 4:01 PM, Jay Potharaju 
> wrote:
> I updated almost 1/3 of the data and ran my queries with new columns as
> mentioned earlier. The query returns data in  almost half the time as
> compared to before.
> I am thinking that if I update all the columns there would not be much
> difference in query response time.
>
>  <field name="dt1_tdt" type="tdate" indexed="true" stored="true" docValues="true"/>
>
>  <field name="dt2_tdt" type="tdate" indexed="true" stored="true" docValues="true"
>   default=""/>
>
> Are there any suggestions on how handle filtering/querying/sorting on high
> cardinality date fields?
>
> Index size: 30Million
> Solr: 4.3.1
>
> Thanks
>
> On Thu, May 26, 2016 at 6:04 AM, Jay Potharaju 
> wrote:
>
> > Hi,
> > Thanks for the feedback. The queries I run are very basic filter queries
> > with some sorting.
> >
> > q:*:*&fq=(dt1:[date1 TO *] && dt2:[* TO NOW/DAY+1]) && fieldA:abc &&
> > fieldB:(123 OR 456)&sort=dt1 asc,field2 asc, fieldC desc
> >
> > I noticed that the date fields(dt1,dt2) are using date instead of tdate
> > fields & there are no docValues set on any of the fields used for
> sorting.
> >
> > In order to fix this I plan to add a new field using tdate & docvalues
> > where required to the schema & update the new columns only for documents
> > that have fieldA set to abc. Once the fields are updated query on the new
> > fields to measure query performance .
> >
> >
> >- Would the new added fields be used effectively by the solr index
> >when querying & filtering? What I am not sure is whether only
> populating
> >small number of documents(fieldA:abc) that are used for the above
> query
> >provide performance benefits.
> >- Would there be a performance penalty because majority of the
> >documents(!fieldA:abc) dont have values in the new columns?
> >
> > Thanks
> >
> > On Wed, May 25, 2016 at 8:40 PM, Jay Potharaju 
> > wrote:
> >
> >> Any links that illustrate and talk about solr internals and how
> >> indexing/querying works would be a great help.
> >> Thanks
> >> Jay
> >>
> >> On Wed, May 25, 2016 at 6:30 PM, Jay Potharaju 
> >> wrote:
> >>
> >>> Hi,
> >>> Thanks for the feedback. The queries I run are very basic filter
> queries
> >>> with some sorting.
> >>>
> >>> q:*:*&fq=(dt1:[date1 TO *] && dt2:[* TO NOW/DAY+1]) && fieldA:abc &&
> >>> fieldB:(123 OR 456)&sort=dt1 asc,field2 asc, fieldC desc
> >>>
> >>> I noticed that the date fields(dt1,dt2) are using date instead of tdate
> >>> fields & there are no docValues set on any of the fields used for
> sorting.
> >>>
> >>> In order to fix this I plan to add a new field using tdate & docvalues
> >>> where required to the schema & update the new columns only for
> documents
> >>> that have fieldA set to abc. Once the fields are updated query on the
> new
> >>> fields to measure query performance .
> >>>
> >>>
> >>>- Would the new added fields be used effectively by the solr index
> >>>when querying & filtering? What I am not sure is whether only
> populating
> >>>small number of documents(fieldA:abc) that are used for the above
> query
> >>>provide performance benefits.
> >>>- Would there be a performance penalty because majority of the
> >>>documents(!fieldA:abc) dont have values in the new columns?
> >>>
> >>>
> >>> Thanks
> >>> Jay
> >>>
> >>> On Tue, May 24, 2016 at 8:06 PM, Erick Erickson <
> erickerick...@gmail.com
> >>> > w

Slow date filter query

2016-05-27 Thread Jay Potharaju
Hi,
I am running a filter query (range query) on date fields (high cardinality)
and the performance is really bad; it takes about 2-5 seconds to come back
with a response. I am rebuilding the index to have docValues & tdate
instead of "date" fields, but I'm not sure if that will alleviate the
problem because of the high cardinality.

Can I store the dates as YYYYMMDD integers and run range queries on them
instead of date fields?
Is that a good option?
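
A rough sketch of that alternative, with hypothetical names (tint is the
stock TrieIntField type from the example schema):

<field name="dt1_i" type="tint" indexed="true" stored="false" docValues="true"/>

and then, for example:

fq=dt1_i:[20160101 TO 20160527]

Day granularity is all this representation can express, but the cardinality
drops from one value per millisecond to one per day.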

-- 
Thanks
Jay


Re: Slow date filter query

2016-05-30 Thread Jay Potharaju
There are about 30 million docs and the index size is 75 GB. I am using a
full timestamp value when querying and not using NOW. The fq queries cover
almost all the docs (20+ million) in the index.
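
Given that the range endpoints change on every request, a minimal sketch of
the {!cache=false} form Erick mentions below, with a made-up cutoff:

fq={!cache=false}dt1:[2016-01-01T00:00:00Z TO *]

cache=false keeps these one-off ranges from churning the filterCache.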
Thanks


On Mon, May 30, 2016 at 8:17 PM, Erick Erickson 
wrote:

> Oops, fat fingers.
>
> see:
> searchhub.org/2012/02/23/date-math-now-and-filter-queries/
>
> If you're not re-using the _same_ filter query, you'll be better
> off using fq={!cache=false}range_query
>
> Best,
> Erick
>
> On Mon, May 30, 2016 at 8:16 PM, Erick Erickson 
> wrote:
> > That does seem long, but you haven't provided many details
> > about the fields. Are there 100 docs in your index? 100M docs? 500M docs?
> >
> > Are you using NOW in appropriately? See:
> >
> > On Fri, May 27, 2016 at 1:32 PM, Jay Potharaju 
> wrote:
> >> Hi,
> >> I am running filter query(range query) on date fields(high cardinality)
> and
> >> the performance is really bad ...it takes about 2-5 seconds for it to
> come
> >> back with response. I am rebuilding the index to have docvalues & tdates
> >> instead of "date" field. But not sure if that will alleviate the problem
> >> because of high cardinality.
> >>
> >> Can I store the date as MMDD and run range queries on them instead
> of
> >> date fields?
> >> Is that a good option?
> >>
> >> --
> >> Thanks
> >> Jay
>



-- 
Thanks
Jay Potharaju


result grouping in sharded index

2016-06-13 Thread Jay Potharaju
Hi,
I am working on functionality that would require me to group documents by
an id field. I read that the ngroups feature does not work in a sharded
index.
Can someone recommend how to handle this in a sharded index?


Solr Version: 5.5

https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats

-- 
Thanks
Jay


Re: result grouping in sharded index

2016-06-14 Thread Jay Potharaju
Any suggestions on how to handle result grouping in sharded index?


On Mon, Jun 13, 2016 at 1:15 PM, Jay Potharaju 
wrote:

> Hi,
> I am working on a functionality that would require me to group documents
> by a id field. I read that the ngroups feature would not work in a sharded
> index.
> Can someone recommend how to handle this in a sharded index?
>
>
> Solr Version: 5.5
>
>
> https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats
>
> --
> Thanks
> Jay
>
>



-- 
Thanks
Jay Potharaju


Re: result grouping in sharded index

2016-06-15 Thread Jay Potharaju
Collapse would also not work since it requires all the data to be on the
same shard.
"In order to use these features with SolrCloud, the documents must be
located on the same shard. To ensure document co-location, you can define
the router.name parameter as compositeId when creating the collection. "
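
For reference, a rough sketch of that co-location setup, with made-up
collection and config names: create the collection with the compositeId
router and prefix each document id with the grouping key, so every doc in a
group hashes to the same shard:

http://localhost:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=2&replicationFactor=2&router.name=compositeId&collection.configName=myconf

with document ids of the form group123!doc456. Grouping and collapsing are
then correct across the collection, at the cost of re-indexing with the new
ids.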

On Wed, Jun 15, 2016 at 3:03 AM, Tom Evans  wrote:

> Do you have to group, or can you collapse instead?
>
>
> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
>
> Cheers
>
> Tom
>
> On Tue, Jun 14, 2016 at 4:57 PM, Jay Potharaju 
> wrote:
> > Any suggestions on how to handle result grouping in sharded index?
> >
> >
> > On Mon, Jun 13, 2016 at 1:15 PM, Jay Potharaju 
> > wrote:
> >
> >> Hi,
> >> I am working on a functionality that would require me to group documents
> >> by a id field. I read that the ngroups feature would not work in a
> sharded
> >> index.
> >> Can someone recommend how to handle this in a sharded index?
> >>
> >>
> >> Solr Version: 5.5
> >>
> >>
> >>
> https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats
> >>
> >> --
> >> Thanks
> >> Jay
> >>
> >>
> >
> >
> >
> > --
> > Thanks
> > Jay Potharaju
>



-- 
Thanks
Jay Potharaju


Sorting & searching on the same field

2016-06-23 Thread Jay Potharaju
Hi,
I would like to have one field that can be used for both searching and
case-insensitive sorting. As far as I know the only way to do this is to
have two fields: one for searching (text_en) and one for sorting (lowercase
& string). Any ideas how the two can be combined into one field? (The
two-field setup is sketched below.)
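
For reference, a minimal sketch of the two-field workaround described above;
the field and type names are assumptions, where string_ci would be a
KeywordTokenizer + LowerCaseFilter text type:

<field name="title" type="text_en" indexed="true" stored="true"/>
<field name="title_sort" type="string_ci" indexed="true" stored="false"/>
<copyField source="title" dest="title_sort"/>

The whole value is then indexed as a single lowercased token, which is what
makes it sortable.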


-- 
Thanks
Jay Potharaju


Re: Sorting & searching on the same field

2016-06-23 Thread Jay Potharaju
Yes, that is what I thought, but was checking to see if there was something
I was missing.
Thanks

On Thu, Jun 23, 2016 at 12:55 PM, Ahmet Arslan 
wrote:

> Hi Jay,
>
> I don't think it can be combined.
> Mainly because: searching requires a tokenized field.
> Sorting requires a single value (token) to be meaningful.
>
> Ahmet
>
>
>
> On Thursday, June 23, 2016 7:43 PM, Jay Potharaju 
> wrote:
> Hi,
> I would like to have 1 field that can used for both searching and case
> insensitive sorting. As far as i know the only way to do is to have two
> fields one for searching (text_en) and one for sorting(lowercase & string).
> Any ideas how the two can be combined into 1 field.
>
>
> --
> Thanks
> Jay Potharaju
>



-- 
Thanks
Jay Potharaju


Slow facet range performance

2016-06-23 Thread Jay Potharaju
Hi,
I am running a facet query on a date field and the results are coming back
in 500ms on average. The field is set to use docValues & the field type is
tdate.

SOLR - 5.5

&facet=true&facet.interval=date_field&f.date_field.facet.interval=date_field&f.date_field.facet.interval.set=[NOW-7DAY,NOW]&f.date_field.facet.interval.set=[NOW-30DAY,NOW-7DAY]&f.date_field.facet.interval.set=[NOW-1MONTH,NOW-7DAY]&f.date_field.facet.interval.set=[NOW-1YEAR,NOW-1MONTH]

Any suggestions on how to speed this up?
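
For what it's worth, a slightly trimmed form of the same request, assuming
the f.date_field.facet.interval=date_field parameter is redundant
(facet.interval already names the field) and rounding NOW down to NOW/DAY so
the interval sets stay constant within a day:

facet=true&facet.interval=date_field
&f.date_field.facet.interval.set=[NOW/DAY-7DAY,NOW/DAY]
&f.date_field.facet.interval.set=[NOW/DAY-30DAY,NOW/DAY-7DAY]
&f.date_field.facet.interval.set=[NOW/DAY-1YEAR,NOW/DAY-1MONTH]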
-- 
Thanks
Jay


clarification on using docvalues for sorting

2016-06-23 Thread Jay Potharaju
Hi,
I am trying to do case-insensitive sorting on a couple of fields.
For this I am doing the following:

<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The above does not allow using this field type with docValues; docValues
can only be used with string & trie fields.
And docValues are the recommended approach for sorting & faceting.

How can I accomplish using docValues with case-insensitive field types?
Or is what I am trying to do not possible?

-- 
Thanks
Jay


Re: Sorting & searching on the same field

2016-06-23 Thread Jay Potharaju
Any ideas on how to handle case-insensitive search, string fields, and
docValues in one field?

On Thu, Jun 23, 2016 at 8:14 PM, Alexandre Rafalovitch 
wrote:

> At least you don't need to store the sort field. Or even index, if it is
> docvalues (good for sort).
>
> Regards,
> Alex
> On 24 Jun 2016 9:01 AM, "Jay Potharaju"  wrote:
>
> > yes, that is what i thought. but was checking to see if there was
> something
> > I was missing.
> > Thanks
> >
> > On Thu, Jun 23, 2016 at 12:55 PM, Ahmet Arslan  >
> > wrote:
> >
> > > Hi Jay,
> > >
> > > I don't think it can be combined.
> > > Mainly because: searching requires a tokenized field.
> > > Sorting requires a single value (token) to be meaningful.
> > >
> > > Ahmet
> > >
> > >
> > >
> > > On Thursday, June 23, 2016 7:43 PM, Jay Potharaju <
> jspothar...@gmail.com
> > >
> > > wrote:
> > > Hi,
> > > I would like to have 1 field that can used for both searching and case
> > > insensitive sorting. As far as i know the only way to do is to have two
> > > fields one for searching (text_en) and one for sorting(lowercase &
> > string).
> > > Any ideas how the two can be combined into 1 field.
> > >
> > >
> > > --
> > > Thanks
> > > Jay Potharaju
> > >
> >
> >
> >
> > --
> > Thanks
> > Jay Potharaju
> >
>



-- 
Thanks
Jay Potharaju


Re: Sorting & searching on the same field

2016-06-24 Thread Jay Potharaju
Thanks Alex, I will check this out.
 Is it possible to do something at query time, using a function query to
lowercase the field and then sort on it?
Jay

> On Jun 24, 2016, at 12:03 AM, Alexandre Rafalovitch  
> wrote:
> 
> Keep voting for SOLR-8362?
> 
> You could do your preprocessing in UpdateRequestProcessor chain. There
> is nothing specifically for Lower/Upper case, but there is a generic
> scripting one: 
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html
> 
> Regards,
>   Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
> 
> 
>> On 24 June 2016 at 13:42, Jay Potharaju  wrote:
>> Any ideas on how to handle case insensitive search, string fields and
>> docvalues in 1 field?
>> 
>> On Thu, Jun 23, 2016 at 8:14 PM, Alexandre Rafalovitch 
>> wrote:
>> 
>>> At least you don't need to store the sort field. Or even index, if it is
>>> docvalues (good for sort).
>>> 
>>> Regards,
>>>Alex
>>>> On 24 Jun 2016 9:01 AM, "Jay Potharaju"  wrote:
>>>> 
>>>> yes, that is what i thought. but was checking to see if there was
>>> something
>>>> I was missing.
>>>> Thanks
>>>> 
>>>> On Thu, Jun 23, 2016 at 12:55 PM, Ahmet Arslan >>> 
>>>> wrote:
>>>> 
>>>>> Hi Jay,
>>>>> 
>>>>> I don't think it can be combined.
>>>>> Mainly because: searching requires a tokenized field.
>>>>> Sorting requires a single value (token) to be meaningful.
>>>>> 
>>>>> Ahmet
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thursday, June 23, 2016 7:43 PM, Jay Potharaju <
>>> jspothar...@gmail.com
>>>>> 
>>>>> wrote:
>>>>> Hi,
>>>>> I would like to have 1 field that can used for both searching and case
>>>>> insensitive sorting. As far as i know the only way to do is to have two
>>>>> fields one for searching (text_en) and one for sorting(lowercase &
>>>> string).
>>>>> Any ideas how the two can be combined into 1 field.
>>>>> 
>>>>> 
>>>>> --
>>>>> Thanks
>>>>> Jay Potharaju
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Thanks
>>>> Jay Potharaju
>> 
>> 
>> 
>> --
>> Thanks
>> Jay Potharaju


json facet - date range & interval

2016-06-27 Thread Jay Potharaju
Hi,
I am trying to use the JSON range facet with a tdate field. I tried the
following but get an error. Any suggestions on how to fix the following
error, or examples of date range facets?

json.facet={daterange : {type : range, field : datefield, start
:"NOW-10DAYS", end : "NOW/DAY", gap : "+1DAY" } }

 msg": "Can't add gap 1DAY to value Fri Jun 17 15:49:36 UTC 2016 for field:
datefield", "code": 400

-- 
Thanks
Jay


Re: json facet - date range & interval

2016-06-28 Thread Jay Potharaju
json.facet={daterange : {type : range, field : datefield, start :
"NOW/DAY-10DAYS", end : "NOW/DAY",gap:"\+1DAY"} }

Escaping the plus sign also gives the same error. Any other suggestions on
how I can make this work?
Thanks
Jay

On Mon, Jun 27, 2016 at 10:23 PM, Erick Erickson 
wrote:

> First thing I'd do is escape the plus. It's probably being interpreted
> as a space.
>
> Best,
> Erick
>
> On Mon, Jun 27, 2016 at 9:24 AM, Jay Potharaju 
> wrote:
> > Hi,
> > I am trying to use the json range facet with a tdate field. I tried the
> > following but get an error. Any suggestions on how to fix the following
> > error /examples for date range facets.
> >
> > json.facet={daterange : {type : range, field : datefield, start
> > :"NOW-10DAYS", end : "NOW/DAY", gap : "+1DAY" } }
> >
> >  msg": "Can't add gap 1DAY to value Fri Jun 17 15:49:36 UTC 2016 for
> field:
> > datefield", "code": 400
> >
> > --
> > Thanks
> > Jay
>



-- 
Thanks
Jay Potharaju


Re: json facet - date range & interval

2016-06-28 Thread Jay Potharaju
that worked ...thanks David!

On Tue, Jun 28, 2016 at 11:22 AM, David Santamauro <
david.santama...@gmail.com> wrote:

>
> Have you tried %-escaping?
>
> json.facet = {
>   daterange : { type  : range,
> field : datefield,
> start : "NOW/DAY%2D10DAYS",
> end   : "NOW/DAY",
> gap   : "%2B1DAY"
>
>   }
> }
>
>
> On 06/28/2016 01:19 PM, Jay Potharaju wrote:
>
>> json.facet={daterange : {type : range, field : datefield, start :
>> "NOW/DAY-10DAYS", end : "NOW/DAY",gap:"\+1DAY"} }
>>
>> Escaping the plus sign also gives the same error. Any other suggestions
>> how
>> can i make this work?
>> Thanks
>> Jay
>>
>> On Mon, Jun 27, 2016 at 10:23 PM, Erick Erickson > >
>> wrote:
>>
>> First thing I'd do is escape the plus. It's probably being interpreted
>>> as a space.
>>>
>>> Best,
>>> Erick
>>>
>>> On Mon, Jun 27, 2016 at 9:24 AM, Jay Potharaju 
>>> wrote:
>>>
>>>> Hi,
>>>> I am trying to use the json range facet with a tdate field. I tried the
>>>> following but get an error. Any suggestions on how to fix the following
>>>> error /examples for date range facets.
>>>>
>>>> json.facet={daterange : {type : range, field : datefield, start
>>>> :"NOW-10DAYS", end : "NOW/DAY", gap : "+1DAY" } }
>>>>
>>>>   msg": "Can't add gap 1DAY to value Fri Jun 17 15:49:36 UTC 2016 for
>>>>
>>> field:
>>>
>>>> datefield", "code": 400
>>>>
>>>> --
>>>> Thanks
>>>> Jay
>>>>
>>>
>>>
>>
>>
>>


-- 
Thanks
Jay Potharaju


Re: Integrating Stanford NLP or any other NLP for Natural Language Query

2016-07-07 Thread Jay Urbain
I use Stanford NLP and cTakes (based on OpenNLP) while indexing with a
SOLRJ application.

Best,
Jay

On Thu, Jul 7, 2016 at 12:09 PM, Puneet Pawaia 
wrote:

> Hi
>
> I am currently using Solr 5.5.x to test but can upgrade to Solr 6.x if
> required.
> I am working on a POC for natural language query using Solr. Should I use
> the Stanford libraries or are there any other libraries having integration
> with Solr already available.
> Any direction in how to do this would be most appreciated. How should I
> process the query to give relevant results.
>
> Regards
> Puneet
>


Re: Integrating Stanford NLP or any other NLP for Natural Language Query

2016-07-08 Thread Jay Urbain
I've added multivalued fields within my SOLR schema for indexing entities
extracted using NLP methods applied to the text I'm indexing, along with
fields for other discrete data extracted from relational databases.

A Java application reads data out of multiple relational databases, uses
NLP on the text and indexes each document (de-normalized) using SOLRJ.

I initially tried doing this with content handlers, but found it much
easier to just write a Java application.
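
A stripped-down sketch of that pattern, with made-up ids, field names, URLs,
and sample text (error handling and the database reads are omitted); it
assumes Stanford CoreNLP 3.x and SolrJ 5.x on the classpath:

import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class NlpIndexer {
  public static void main(String[] args) throws Exception {
    // CoreNLP pipeline that tags named entities
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // SolrJ 5.x client pointed at a single core/collection
    HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycore");

    String text = "Patient was seen at General Hospital in Milwaukee.";
    Annotation ann = new Annotation(text);
    pipeline.annotate(ann);

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc1");
    doc.addField("text", text);
    // copy each recognized entity into a multivalued field (entities_ss)
    List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);
    for (CoreLabel token : tokens) {
      String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
      if (ne != null && !"O".equals(ne)) {
        doc.addField("entities_ss", token.word() + "/" + ne);
      }
    }
    solr.add(doc);
    solr.commit();
    solr.close();
  }
}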

SOLRJ Java API reference:
https://cwiki.apache.org/confluence/display/solr/Using+SolrJ

Stanford NLP:
http://stanfordnlp.github.io/CoreNLP/

Best,
Jay


On Thu, Jul 7, 2016 at 9:52 PM, Puneet Pawaia 
wrote:

> Hi Jay
> Any place I can learn more on this method of integration?
> Thanks
> Puneet
>
> On 8 Jul 2016 02:58, "Jay Urbain"  wrote:
>
> > I use Stanford NLP and cTakes (based on OpenNLP) while indexing with a
> > SOLRJ application.
> >
> > Best,
> > Jay
> >
> > On Thu, Jul 7, 2016 at 12:09 PM, Puneet Pawaia 
> > wrote:
> >
> > > Hi
> > >
> > > I am currently using Solr 5.5.x to test but can upgrade to Solr 6.x if
> > > required.
> > > I am working on a POC for natural language query using Solr. Should I
> use
> > > the Stanford libraries or are there any other libraries having
> > integration
> > > with Solr already available.
> > > Any direction in how to do this would be most appreciated. How should I
> > > process the query to give relevant results.
> > >
> > > Regards
> > > Puneet
> > >
> >
>


RE: [Ext] Influence ranking based on document committed date

2016-08-17 Thread Jay Parashar
This is correct: " I index it and feed it the timestamp at index time".
You can sort desc on that field (can be a TrieDateField)


-Original Message-
From: Steven White [mailto:swhite4...@gmail.com] 
Sent: Wednesday, August 17, 2016 9:01 AM
To: solr-user@lucene.apache.org
Subject: [Ext] Influence ranking based on document committed date

Hi everyone

Let's say I search for the word "Olympic" and I get a hit on 10 documents that 
have similar content (let us assume the content is at least 80%
identical) how can I have Solr rank them so that the ones with most recently 
updated doc gets ranked higher?  Is this something I have to do at index time 
or search time?

Is the trick to have a field that holds the committed timestamp and boost on 
that field during search?  If so, is this field something I can configure in 
Solr's schema.xml or must I index it and feed it the timestamp at index time?  
If I'm on the right track, does this mean I have to always append this field 
base boost to each query a user issues?

If there is a wiki or article written on this topic, that would be a good start.

In case it matters, I'm using Solr 5.2 and my searches are utilizing edismax.

Thanks in advanced!

Steve


Solr on GCE

2016-09-22 Thread Jay Parashar
Hi,

Is it possible to have a SolrJ client running on Google App Engine talk to a
Solr instance hosted on a Compute Engine VM? The Solr version is 6.2.0.

There is also a similar question on Stack Overflow but no answers
http://stackoverflow.com/questions/37390072/httpsolrclient-on-google-app-engine


I am getting the following error

at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1003) 
~[zookeeper-3.4.6.jar:3.4.6-1569965]
[INFO] 09:46:56.419 [main-SendThread(nlxs5139.best-nl0114.slb.com:2181)] INFO  
org.apache.zookeeper.ClientCnxn - Opening socket connection to server 
nlxs5139.best-nl0114.slb.com/199.6.212.77:2181. Will not attempt to 
authenticate using SASL (unknown error)
[INFO] 09:46:56.419 [main-SendThread(nlxs5139.best-nl0114.slb.com:2181)] WARN  
org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected 
error, closing socket connection and attempting reconnect
[INFO] java.lang.NoClassDefFoundError: java.nio.channels.SocketChannel is a 
restricted class. Please see the Google  App Engine developer's guide for more 
details.
[INFO]  at 
com.google.appengine.tools.development.agent.runtime.Runtime.reject(Runtime.java:52)
 ~[appengine-agentruntime.jar:na]


Thanks
Jay


SolrJ App Engine Client

2016-09-22 Thread Jay Parashar
I sent a similar message earlier but do not see it. Apologies if it's duplicated.

I am unable to connect to Solr Cloud zkhost (using CloudSolrClient) from a 
SolrJ client running on Google App Engine.
The error message is "java.nio.channels.SocketChannel is a restricted class. 
Please see the Google  App Engine developer's guide for more details."

Is there a workaround? It's required that the client be SolrJ, running on App
Engine.

Any feedback is much appreciated. Thanks


RE: [Ext] Re: SolrJ App Engine Client

2016-09-22 Thread Jay Parashar
No, it does not.

The error (instead of SocketChannel) is now

Caused by: java.lang.NoClassDefFoundError: java.net.ProxySelector is a 
restricted class

And it happens during an actual query (solrClient.query(query);).


-Original Message-
From: Mikhail Khludnev [mailto:m...@apache.org] 
Sent: Thursday, September 22, 2016 2:59 PM
To: solr-user 
Subject: [Ext] Re: SolrJ App Engine Client

Does it work with plain HttpSolrClient?

On Thu, Sep 22, 2016 at 10:50 PM, John Bickerstaff  wrote:

> Two possibilities from a quick search on the error message - both 
> point to GAE NOT fully supporting Java 8
>
> https://urldefense.proofpoint.com/v2/url?u=http-3A__stackoverflow.com_
> questions_29528580_how-2Dto-2Ddeal-2Dwith-2Dapp-2Dengine-2D&d=CwIBaQ&c
> =uGuXJ43KPkPWEl2imVFDmZQlhQUET7pVRA2PDIOxgqw&r=bRfqJEeedEKG5nkp5748Yxb
> NMFrUYT3YiNl0Ni2vUBQ&m=HDJS4ElFF2X939U2LWfIfRIdBJNLvm9q4mvpNmZp7kU&s=i
> 8WIpnKStYPvIRJTBTjBtqguv_nriuZMnLdBlB7pUWo&e=
> devserver-exception-due-to-formatstyle-restricted-cl
> https://urldefense.proofpoint.com/v2/url?u=http-3A__stackoverflow.com_
> questions_29543131_beancreationexception-2Dthrowed-2D&d=CwIBaQ&c=uGuXJ
> 43KPkPWEl2imVFDmZQlhQUET7pVRA2PDIOxgqw&r=bRfqJEeedEKG5nkp5748YxbNMFrUY
> T3YiNl0Ni2vUBQ&m=HDJS4ElFF2X939U2LWfIfRIdBJNLvm9q4mvpNmZp7kU&s=kGg4rdS
> 7eJoNjVzljzxek-nIUeMnjxRhjETSDJzdaXY&e=
> when-trying-to-run-my-project
>
>
> On Thu, Sep 22, 2016 at 1:38 PM, Jay Parashar  wrote:
>
> > I sent a similar message earlier but do not see it. Apologize if its 
> > duplicated.
> >
> > I am unable to connect to Solr Cloud zkhost (using CloudSolrClient) 
> > from
> a
> > SolrJ client running on Google App Engine.
> > The error message is "java.nio.channels.SocketChannel is a 
> > restricted class. Please see the Google  App Engine developer's 
> > guide for more details."
> >
> > Is there a workaround? Its required that the client is SolrJ and 
> > running on App Engine.
> >
> > Any feedback is much appreciated. Thanks
> >
>



--
Sincerely yours
Mikhail Khludnev


RE: [Ext] Re: SolrJ App Engine Client

2016-09-22 Thread Jay Parashar
I am on Java 7. As the GAE docs state, SocketChannel is not on Google's
whitelist.

Stack Overflow (the 2nd link you sent) suggests re-implementing the class. I
will see if I come up with anything.
Thanks John.

-Original Message-
From: John Bickerstaff [mailto:j...@johnbickerstaff.com] 
Sent: Thursday, September 22, 2016 2:51 PM
To: solr-user@lucene.apache.org
Subject: [Ext] Re: SolrJ App Engine Client

Two possibilities from a quick search on the error message - both point to GAE 
NOT fully supporting Java 8

https://urldefense.proofpoint.com/v2/url?u=http-3A__stackoverflow.com_questions_29528580_how-2Dto-2Ddeal-2Dwith-2Dapp-2Dengine-2Ddevserver-2Dexception-2Ddue-2Dto-2Dformatstyle-2Drestricted-2Dcl&d=CwIBaQ&c=uGuXJ43KPkPWEl2imVFDmZQlhQUET7pVRA2PDIOxgqw&r=bRfqJEeedEKG5nkp5748YxbNMFrUYT3YiNl0Ni2vUBQ&m=FjaUoU-i-tiL8deMoKceLKxX-kgXBObYvgMAjZnac8A&s=5lMIyl1JJEfNqZSe80DnJ4PwWt_tpBoq3l6ZjM2EQBM&e=
https://urldefense.proofpoint.com/v2/url?u=http-3A__stackoverflow.com_questions_29543131_beancreationexception-2Dthrowed-2Dwhen-2Dtrying-2Dto-2Drun-2Dmy-2Dproject&d=CwIBaQ&c=uGuXJ43KPkPWEl2imVFDmZQlhQUET7pVRA2PDIOxgqw&r=bRfqJEeedEKG5nkp5748YxbNMFrUYT3YiNl0Ni2vUBQ&m=FjaUoU-i-tiL8deMoKceLKxX-kgXBObYvgMAjZnac8A&s=EkfJOFmbVi4fwdp1mBAnpIXC1XHnT8_eN6Jsz1PvDhw&e=
 


On Thu, Sep 22, 2016 at 1:38 PM, Jay Parashar  wrote:

> I sent a similar message earlier but do not see it. Apologize if its
> duplicated.
>
> I am unable to connect to Solr Cloud zkhost (using CloudSolrClient) from a
> SolrJ client running on Google App Engine.
> The error message is "java.nio.channels.SocketChannel is a restricted
> class. Please see the Google  App Engine developer's guide for more
> details."
>
> Is there a workaround? Its required that the client is SolrJ and running
> on App Engine.
>
> Any feedback is much appreciated. Thanks
>



solrcloud load balancing

2016-10-22 Thread Jay Potharaju
Hi,
I am trying to understand how load balancing works in SolrCloud.

As per my understanding, SolrCloud provides load balancing when querying
using an HTTP endpoint. When a query is sent to any of the nodes, Solr
will intelligently decide which server can fulfill the request, and it will
be processed by one of the nodes in the cluster.

1) Does the logic change when there is only 1 shard vs. multiple shards?

2) Is the QTime displayed the sum of the processing time for the query
request + latency (if processed by another node) + the time to decide which
node will process the request (which I am guessing is minimal and can be
ignored)?

3) In my Solr logs I display the "slow" queries; does the QTime displayed
take all of the above into account and show the correct time taken?

Solr version: 5.5.0


-- 
Thanks
Jay


Re: solrcloud load balancing

2016-10-22 Thread Jay Potharaju
Thanks Erick for the response.
I am currently using a load balancer in front of my SolrCloud cluster, but
was particularly interested to know whether SolrCloud does load balancing
internally in the case of a single shard.
All the documentation that I have seen assumes multi-shard scenarios, not a
single shard. Can you please point me to some code/documentation that can
help me understand this better?
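
For comparison, a minimal SolrJ sketch (the ZooKeeper hosts and collection
name are made up); CloudSolrClient watches cluster state in ZooKeeper and
distributes requests across live replicas itself, even with a single shard:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class LbCheck {
  public static void main(String[] args) throws Exception {
    // CloudSolrClient reads cluster state from ZooKeeper and load-balances
    // across live replicas, even for a single-shard collection
    CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");
    client.setDefaultCollection("mycoll");

    QueryResponse rsp = client.query(new SolrQuery("*:*"));
    // QTime is server-side processing time; elapsed time also includes
    // network transfer and writing out the stored fields
    System.out.println("QTime=" + rsp.getQTime()
        + "ms, elapsed=" + rsp.getElapsedTime() + "ms");
    client.close();
  }
}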

Thanks
Jay

On Sat, Oct 22, 2016 at 6:00 PM, Erick Erickson 
wrote:

> 1) Single shards have some short circuiting in them. And anyway it's
> best to have some kind of load balancer in front or use SolrJ with
> CloudSolrClient. If you just use an HTTP end-point, you have a single
> point of failure if that node goes down.
>
> 2) yes. What it does _not_ include is the time taken to assemble the
> final document list, i.e. get the "fl" parameters. And also note that
> there's "the laggard problem" here. The time will be something close
> to the _longest_ time it takes any replica to respond. Say you have 4
> shards and the replica for one of them happens to hit a 5 second
> stop-the-world GC collection. Your QTime will be 5 seconds+. I really
> have no idea whether the QTime includes the decision process for
> selecting nodes, but I've also never heard of it being significant.
>
> 3) I guess, although I'm not quite sure I understand the question.
> Slow queries will include (roughly) the max of the sub-request QTimes.
>
> Best,
> Erick
>
> On Sat, Oct 22, 2016 at 5:19 PM, Jay Potharaju 
> wrote:
> > Hi,
> > I am trying to understand how load balancing works in solrcloud.
> >
> > As per my understanding solrcloud provides load balancing when querying
> > using an http endpoint.  When a query is sent to any of the nodes , solr
> > will intelligently decide which server can fulfill the request and will
> be
> > processed by one of the nodes in the cluster.
> >
> > 1) Does the logic change when there is only 1 shard vs multiple shards?
> >
> > 2) Is the QTime displayed the sum of the processing time for the query
> > request + latency (if processed by another node) + the time to decide
> > which node will process the request (which I am guessing is minimal and
> > can be ignored)?
> >
> > 3) In my Solr logs I display the "slow" queries; does the QTime displayed
> > take all of the above into account and show the correct time taken?
> >
> > Solr version: 5.5.0
> >
> >
> > --
> > Thanks
> > Jay
>



-- 
Thanks
Jay Potharaju
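
For reference, here is a minimal SolrJ sketch of the CloudSolrClient
approach Erick suggests (the single-string constructor is SolrJ 5.x style;
the zkHost string and collection name below are placeholders, not values
from this thread):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CloudQueryExample {
  public static void main(String[] args) throws Exception {
    // CloudSolrClient reads cluster state from ZooKeeper, so it knows which
    // replicas are live and spreads queries across them - no external load
    // balancer or single HTTP entry point is needed.
    try (CloudSolrClient client =
             new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr")) {
      client.setDefaultCollection("mycollection");
      QueryResponse rsp = client.query(new SolrQuery("*:*"));
      System.out.println("QTime=" + rsp.getQTime() + "ms, hits="
          + rsp.getResults().getNumFound());
    }
  }
}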


Re: solrcloud load balancing

2016-10-22 Thread Jay Potharaju
Thanks Erick & Shawn for the response.

In the case of non-distributed queries (single shard with replicas), is
there a way for me to determine how long it takes to retrieve the documents
and send the response?

In my load test, I see that the response time at the client API is in
seconds, but I am not able to see any high response times in the Solr logs.
Is it possible that under high load it takes a long time to retrieve
and send the documents?
If I run the same query in the browser individually, it comes back quickly.

Thanks
Jay

On Sat, Oct 22, 2016 at 6:14 PM, Shawn Heisey  wrote:

> On 10/22/2016 6:19 PM, Jay Potharaju wrote:
> > I am trying to understand how load balancing works in solrcloud.
> >
> > As per my understanding, SolrCloud provides load balancing when querying
> > using an HTTP endpoint. When a query is sent to any of the nodes, Solr
> > will intelligently decide which server can fulfill the request, and it
> > will be processed by one of the nodes in the cluster.
>
> Erick already responded, but I had this mostly written before I saw his
> response.  I decided to send it anyway.
>
> > 1) Does the logic change when there is only 1 shard vs multiple shards?
>
> The way I understand it, each shard is independently load balanced.  You
> might have a situation where one shard has more replicas than another
> shard, and I believe that even in that situation, all replicas should
> be used.
>
> > 2) Is the QTime displayed the sum of the processing time for the query
> > request + latency (if processed by another node) + the time to decide
> > which node will process the request (which I am guessing is minimal and
> > can be ignored)?
>
> There are three phases in a distributed (multi-shard) query.
>
> 1) Each shard is sent the query, with the field list set to include the
> score, the unique key field, and if there is a sort parameter, whichever
> fields are used for sorting.  These requests happen in parallel.
> Whichever request takes the longest will determine the total time for
> this phase.
>
> 2) The responses from the subqueries are combined to determine which
> documents will make up the final result.
>
> 3) Additional queries are sent to the individual shards to retrieve the
> matching documents.  These requests are also in parallel, so the slowest
> such request will determine the time for this whole phase.
>
> > 3) In my Solr logs I display the "slow" queries; does the QTime displayed
> > take all of the above into account and show the correct time taken?
>
> For non-distributed queries, QTime includes the time required to process
> the query, but not the time to retrieve the documents and send the
> response.  I *think* that when the query is distributed, QTime will be
> the sum of the first two phases that I mentioned above, but I'm not 100%
> sure.
>
> Thanks,
> Shawn
>
>


-- 
Thanks
Jay Potharaju
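
One way to get a rough handle on where that time goes from the client side
is to compare the QTime Solr reports with the wall-clock round trip SolrJ
measures; the difference approximates document retrieval, response writing,
and transfer. A sketch (SolrJ 5.x method names; treat it as illustrative):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

public class QueryTimer {
  static void timeQuery(SolrClient client) throws Exception {
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(100);
    q.set("shards.info", "true"); // per-shard timing, for multi-shard setups

    QueryResponse rsp = client.query(q);
    long qtime = rsp.getQTime();           // server-side query execution
    long roundTrip = rsp.getElapsedTime(); // total time seen by the client
    System.out.println("QTime=" + qtime + "ms, roundTrip=" + roundTrip
        + "ms, retrieve+transfer~=" + (roundTrip - qtime) + "ms");
  }
}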


book on solr

2017-10-12 Thread Jay Potharaju
Hi,
I am looking for a book that covers some basic principles on how to scale
Solr. Are there any suggestions?
For example, how to scale by adding shards or replicas in the case of high
RPS and high indexing rates.

Any blog or documentation that provides some basic rules or guidelines for
scaling would also be great.

Thanks
Jay Potharaju


Managed schema used with Cloudera MapreduceIndexerTool and morphlines?

2017-03-17 Thread Jay Hill
I've got a very difficult project to tackle. I've been tasked with using
schemaless mode to index JSON files that we receive. The structure of the
JSON files will always be very different, as we're receiving files from
different customers totally unrelated to one another. We are attempting to
build a "one size fits all" approach to receiving documents from a wide
variety of sources and then indexing them into Solr.

We're running Solr 5.3. The schemaless approach works well enough -
until it doesn't. It seems to fail on type guessing and also gets confused
indexing to different shards. If it were reliable it would be the perfect
solution for our task. But the larger the JSON file, the more likely it is
to fail. At a certain size it just doesn't work.

I've been advised by some experts and committers that schemaless is a good
tool for prototyping but risky to run in production. Even so, we thought we
would try it by doing offline indexing using the Cloudera
MapReduceIndexerTool to build offline indexes - but still using managed
schemas. This map reduce tool uses morphlines, a nifty ETL tool that pipes
together a series of commands to transform data. For example, a JSON or CSV
file can be processed and loaded into a Solr index with a "readJson"
command piped to a "loadSolr" command (see the sketch below).

But the kite-sdk that manages the morphlines only seems to offer, as its
latest version, solr *4.10.3*-cdh5.10.0 (their customized version of
4.10.3).

So I can't see any way to integrate schemaless (which has dependencies
newer than 4.10.3) with the morphlines.
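
For illustration, the shape of morphline I have in mind is roughly this (a
sketch based on the kite-sdk docs; the collection name and zkHost are
placeholders):

morphlines : [
  {
    id : jsonToSolr
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      # parse each input stream as JSON
      { readJson {} }
      # copy selected JSON paths into record fields (paths are examples)
      { extractJsonPaths { flatten : true, paths : { id : /id } } }
      # send the records to Solr
      { loadSolr { solrLocator : { collection : mycollection, zkHost : "zk1:2181/solr" } } }
    ]
  }
]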

But I thought I would ask here: has anybody had ANY experience with using
morphlines to index to Solr? Any info would help me make sense of this.

Cheers to all!


Best practices for backup & restore

2017-05-16 Thread Jay Potharaju
Hi,
I was wondering if there are any best practices for doing Solr backup &
restore. In the past, when running a backup, I stopped indexing during the
backup process.

I am looking at this documentation, and it says that indexing can continue
while a backup is in progress.
https://cwiki.apache.org/confluence/display/solr/Making+and+Restoring+Backups

Any recommendations?
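
For reference, the per-core replication handler commands that page describes
look roughly like this (host, core name, snapshot name, and location are
placeholders):

http://localhost:8983/solr/mycore/replication?command=backup&name=snapshot1&location=/backups
http://localhost:8983/solr/mycore/replication?command=restore&name=snapshot1&location=/backups
http://localhost:8983/solr/mycore/replication?command=restorestatus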

-- 
Thanks
Jay


search multiple cores

2014-05-13 Thread Jay Potharaju
Hi,
I am trying to join across multiple cores using query-time join. Following
is my setup:
3 cores - Solr 4.7
core1:  0.5 million documents
core2: 4 million documents and growing. This contains the child documents
for documents in core1.
core3: 2 million documents and growing. Contains records from all users.

core2 contains documents that are accessible to each user based on their
permissions. The number of documents accessible to a user ranges from a
couple of thousand to 100,000.

I would like to get results by combining all three cores. For each search I
get documents from core3, then query core1 to get parent documents, and
then core2 to get the appropriate child documents, depending on user
permissions.

I'm referring to this link to join across cores:
http://stackoverflow.com/questions/12665797/is-solr-4-0-capable-of-using-join-for-multiple-core

{!join from=fromField to=toField fromIndex=fromCoreName}fromQuery

This is not working for me. Can anyone suggest why it is not working? Any
pointers on how to search across multiple cores would be appreciated.
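
For reference, a concrete instance of that syntax with made-up names, run
against core1, might look like:

q={!join from=parent_id to=id fromIndex=core2}user_id:12345

This selects core1 documents whose id appears in the parent_id field of the
core2 documents matching user_id:12345. Note that the fromIndex core must
live in the same Solr instance as the core being queried for the join to
work.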

thanks



J


highlighting on hl.alternateField (copyField target) doesnt highlight

2014-06-03 Thread jay list

Hello,
 
I'm trying to implement a user-friendly search for phone numbers. These
numbers consist of two digit tokens like "12345 67890".

Finally, I want the highlighting for the phone number in the search result,
without any concern about whether this search result was hit by field  tel
or copyField  tel2.

The field tel is split by a StandardTokenizer into two tokens, "12345" AND
"67890".
And I want to catch those people who enter "1234567890" without any space.
I use copyField  tel2  with a solr.PatternReplaceCharFilterFactory to
eliminate non-digits, followed by a solr.KeywordTokenizerFactory.
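
For clarity, the tel2 field type looks roughly like this (a sketch; the
type name is made up, and the pattern simply strips non-digits):

<fieldType name="phone_digits" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^0-9]" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>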
 
In both cases the search hits as expected.
 
The highlighter works well for  tel  or  tel2,  but I want the highlight
always on field  tel!
Using  f.tel.hl.alternateField=tel2  is returning the field value without
any highlighting.
 

 tel2:1234567890
 tel2
 true
 true
 
 
 tel,tel2
 tel,tel2
 xml
 typ:person


...


 
  user1
  12345 67890
  12345 67890


...


 
  
   123456 67890 
  
  
   123456 67890
  
 


Any idea? Or do I have to change my velocity macros, always looking for a 
different highlighted field?
Best Regards


Fw: highlighting on hl.alternateField (copyField target) doesnt highlight

2014-06-05 Thread jay list
Is anybody aware of this issue?

> Sent: Tuesday, June 3, 2014 at 09:11
> From: "jay list" 
> To: solr-user@lucene.apache.org
> Subject: highlighting on hl.alternateField (copyField target) doesnt highlight
>
> 
> Hello,
>  
> I'm trying to implement a user-friendly search for phone numbers. These
> numbers consist of two digit tokens like "12345 67890".
>  
> Finally, I want the highlighting for the phone number in the search result,
> without any concern about whether this search result was hit by field  tel
> or copyField  tel2.
>  
> The field tel is split by a StandardTokenizer into two tokens, "12345" AND
> "67890".
> And I want to catch those people who enter "1234567890" without any space.
> I use copyField  tel2  with a solr.PatternReplaceCharFilterFactory to
> eliminate non-digits, followed by a solr.KeywordTokenizerFactory.
>  
> In both cases the search hits as expected.
>  
> The highlighter works well for  tel  or  tel2,  but I want the highlight
> always on field  tel!
> Using  f.tel.hl.alternateField=tel2  is returning the field value without
> any highlighting.
>  
> 
>  tel2:1234567890
>  tel2
>  true
>  true
>  
>  
>  tel,tel2
>  tel,tel2
>  xml
>  typ:person
> 
> 
> ...
> 
> 
>  
>   user1
>   12345 67890
>   12345 67890
> 
> 
> ...
> 
> 
>  
>   
>123456 67890 
>   
>   
>123456 67890
>   
>  
> 
> 
> Any idea? Or do I have to change my velocity macros, always looking for a 
> different highlighted field?
> Best Regards

