Schema/Index design for disparate data sources (Federated / Google like search)

2015-12-22 Thread Susheel Kumar
Hello,

I am going through a few use cases where we have multiple disparate data
sources that, in general, don't have many fields in common. I was thinking
of designing a different schema/index/collection for each of them, querying
each separately, and providing separate result sets to the client.

I have seen one implementation where all the different fields from these
disparate data sources are put together in a single schema/index/collection
so that it can be searched easily using a catch-all field, but it ended up
with 200+ fields including copy fields. The problem I see with this design
is that ingestion (and scaling) will be slower, since many of the fields for
one data source will not be applicable when ingesting another data source.
Basically everything is being dumped into one huge schema/index/collection.

Having seen the above, I am wondering how we can design this better in
another implementation where we have a requirement to search across
disparate sources (each having 10-15 searchable fields and 10-15 stored
fields) with only one common field, like description, in each of the data
sources. Most of the time users will search on description, and the rest of
the time on combinations of different fields, similar to a Google-like
search where you search for "coffee" and it searches various data sources
(websites, maps, images, places, etc.).

My thought is to make separate indexes for each search scenario. For
example, for the single search box, we index the description, the other key
fields that can be searched together, and the data source type into one
index/schema, so that we don't end up with a huge index/schema, and we use a
catch-all field for search.
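
As a rough sketch of that combined single-search-box collection (the
collection and field names below, like unified_search, source_type and
description, are hypothetical), indexing a document via SolrJ could look
something like this:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // one small collection holding only the fields shared across sources
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/unified_search");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "websites-123");           // id prefixed with the source it came from
    doc.addField("source_type", "websites");      // lets you group/filter results per source
    doc.addField("description", "Best coffee shops near downtown");
    client.add(doc);
    client.commit();
    client.close();

Each source would push only its shared/common fields into this collection.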

And for the advanced (field-specific) search scenario, we create a separate
index/schema for each data source.

Any suggestions/guidelines on how we can better address this in terms of
responsiveness and scaling? Each data source may have 50-100+ million
documents.

Thanks,
Susheel


Re: Schema/Index design for disparate data sources (Federated / Google like search)

2015-12-22 Thread Susheel Kumar
Thanks, Jack, for the various points. A question: when you have hundreds of
fields from different sources and also a lot of copyField instructions for
facets, sorting, a catch-all field, etc., you take some performance hit
during ingestion, since many of the copy instructions execute but do nothing
because they have no data. Do you agree?

Assuming keyword search is required across the different data sources, with
results presented from each data source as the user types (instant /
autocomplete) in a single search box, and very field-specific advanced
search is required in the advanced search option, how do you suggest
designing the index/schema?

Let me know if I am missing any other info needed to get your thoughts.

On Tue, Dec 22, 2015 at 11:53 AM, Jack Krupansky 
wrote:

> Step one is to refine and more clearly state the requirements. Sure,
> sometimes (most of the time?) the end user really doesn't know exactly what
> they expect or want other than "Gee, I want to search for everything, isn't
> that obvious??!!", but that simply means that an analyst is needed to
> intervene before you leap to implementation. An analyst is someone who
> knows how to interview all relevant parties (not just the approving
> manager) to understand their true needs. I mean, who knows, maybe all they
> really need is basic keyword search. Or... maybe they actually need a
> full-blown data warehouse with precise access to each specific field of
> each data source. Without knowing how refined user queries need to get,
> there is little to go on here.
>
> My other advice is to be careful not to overthink the problem - to imagine
> that some complex solution is needed when the end users really only need to
> do super basic queries. In general, managers are very poor when it comes to
> analysis and requirement specification.
>
> Do they need to do date searches on a variety of date fields?
>
> Do they need to do numeric or range queries on specific numeric fields?
>
> Do they need to do any exact match queries on raw character fields (as
> opposed to tokenized text)?
>
> Do they have fields like product names or numbers in addition to free-form
> text?
>
> Do they need to distinguish or weight titles from detailed descriptions?
>
> You could have catchall fields for categories of field types like titles,
> bodies, authors/names, locations, dates, numeric values. But... who
> knows... this may be more than what an average user really needs.
>
> As far as the concern about fields from different sources that are not
> used, Lucene only stores and indexes fields which have values, so no
> storage or performance is consumed when you have a lot of fields which are
> not present for a particular data source.
>
> -- Jack Krupansky
>
> On Tue, Dec 22, 2015 at 11:25 AM, Susheel Kumar 
> wrote:
>
> > Hello,
> >
> > I am going thru few use cases where we have kind of multiple disparate
> data
> > sources which in general doesn't have much common fields and i was
> thinking
> > to design different schema/index/collection for each of them and query
> each
> > of them separately and provide different result sets to the client.
> >
> > I have seen one implementation where all different fields from these
> > disparate data sources are put together in single
> schema/design/collection
> > that it can be searched easily using catch all field but this was having
> > 200+ fields including copy fields. The problem i see with this design is
> > ingestion will be slower (and scaling) as many of the fields for one data
> > source will not be applicable when ingesting for other data source.
> > Basically everything is being dumped into one huge
> schema/index/collection.
> >
> > After looking above, I am wondering how we can design this better in
> > another implementation where we have the requirement to search across
> > disparate source (each having multiple fields 10-15 fields searchable &
> > 10-15 fields stored) with only 1 common field like description in each of
> > the data sources.  Most of the time user may perform search on
> description
> > and rest of the time combination of different fields. Similar to google
> > like search where you search for "coffee" and it searches in various data
> > sources (websites, maps, images, places etc.)
> >
> > My thought is to make separate indexes for each search scenario.  For
> > example for single search box, we index description, other key fields
> which
> > can be searched together  and their data source type into one
> index/schema
> > that we don't make a huge index/schema and use the catch all field for
> > search.
> >
> > And for other Advance search (field specific) scenario we create separate
> > index/schema for each data sources.
> >
> > Any suggestions/guidelines on how we can better address this in terms of
> > responsiveness and scaling? Each data source may have documents in
> 50-100+
> > millions.
> >
> > Thanks,
> > Susheel
> >
>


Re: Newbie: Searching across 2 collections ?

2016-01-06 Thread Susheel Kumar
Hi Bruno,

I just tested this scenario on my local Solr 5.3.1 and it returned results
from two identical collections. I doubt it is broken in 5.4; just
double-check that you are not missing anything else.

Thanks,
Susheel

http://localhost:8983/solr/c1/select?q=id_type%3Ahello&wt=json&indent=true&collection=c1,c2

responseHeader": {"status": 0,"QTime": 98,"params": {"q": "id_type:hello","
indent": "true","collection": "c1,c2","wt": "json"}},
response": {"numFound": 2,"start": 0,"maxScore": 1,"docs": [{"id": "1","
id_type": "hello","_version_": 1522623395043213300},{"id": "3","id_type": "
hello","_version_": 1522623422397415400}]}

On Wed, Jan 6, 2016 at 6:13 AM, Bruno Mannina  wrote:

> yes id value is unique in C1 and unique in C2.
> id in C1 is never present in C2
> id in C2 is never present in C1
>
>
> Le 06/01/2016 11:12, Binoy Dalal a écrit :
>
>> Are Id values for docs in both the collections exactly same?
>> To get proper results, the ids should be unique across both the cores.
>>
>> On Wed, 6 Jan 2016, 15:11 Bruno Mannina  wrote:
>>
>> Hi All,
>>>
>>> Solr 5.4, Ubuntu
>>>
>>> I thought it was simple to request across two collections with the same
>>> schema but not.
>>> I have one solr instance launch. 300 000 records in each collection.
>>>
>>> I try to use this request without having both results:
>>>
>>> http://my_adress:my_port
>>> /solr/C1/select?collection=C1,C2&q=fid:34520196&wt=json
>>>
>>> this request returns only C1 results and if I do:
>>>
>>> http://my_adress:my_port
>>> /solr/C2/select?collection=C1,C2&q=fid:34520196&wt=json
>>>
>>> it returns only C2 results.
>>>
>>> I have 5 identical fields on both collection
>>> id, fid, st, cc, timestamp
>>> where id is the unique key field.
>>>
>>> Can someone could explain me why it doesn't work ?
>>>
>>> Thanks a lot !
>>> Bruno
>>>
>>> ---
>>> L'absence de virus dans ce courrier électronique a été vérifiée par le
>>> logiciel antivirus Avast.
>>> http://www.avast.com
>>>
>>> --
>>>
>> Regards,
>> Binoy Dalal
>>
>>
>
> ---
> L'absence de virus dans ce courrier électronique a été vérifiée par le
> logiciel antivirus Avast.
> http://www.avast.com
>
>


Re: Newbie: Searching across 2 collections ?

2016-01-06 Thread Susheel Kumar
I'd suggest setting up some test data locally and trying this out. That
will confirm your understanding.

Thanks,
Susheel

On Wed, Jan 6, 2016 at 10:39 AM, Bruno Mannina  wrote:

> Hi Susheel, Emir,
>
> yes I check, and I have one result in c1 and in c2 with the same query
> fid:34520196
>
> http://xxx.xxx.xxx.xxx:
> /solr/c1/select?q=fid:34520196&wt=json&indent=true&fl=id,fid,cc*,st&collection=c1,c2
>
> { "responseHeader":{ "status":0, "QTime":1, "params":{ "fl":"fid,cc*,st",
> "indent":"true", "q":"fid:34520196", "collection":"c1,c2", "wt":"json"}},
> "response":{"numFound":1,"start":0,"docs":[ {
>
> "id":"EP1680447",
> "st":"LAPSED",
> "fid":"34520196"}]
>   }
> }
>
>
> http://xxx.xxx.xxx.xxx:
> /solr/c2/select?q=fid:34520196&wt=json&indent=true&fl=id,fid,cc*,st&collection=c1,c2
>
> {
>   "responseHeader":{
> "status":0,
> "QTime":0,
> "params":{
>   "fl":"id,fid,cc*,st",
>   "indent":"true",
>   "q":"fid:34520196",
>   "collection":"c1,c2",
>   "wt":"json"}},
>   "response":{"numFound":1,"start":0,"docs":[
>   {
> "id":"WO2005040212",
> "st":"PENDING",
> "cc_CA":"LAPSED",
> "cc_EP":"LAPSED",
> "cc_JP":"PENDING",
> "cc_US":"LAPSED",
> "fid":"34520196"}]
>   }}
>
>
> I have the same xxx.xxx.xxx.xxx: (server:port).
> unique key field C1, C2 : id
>
> id data in C1 is different of id data in C2
>
> Must I config/set something in solr ?
>
> thanks,
> Bruno
>
>
> Le 06/01/2016 14:56, Emir Arnautovic a écrit :
>
>> Hi Bruno,
>> Can you check counts? Is it possible that first page is only with results
>> from collection that you sent request to so you assumed it returns only
>> results from single collection?
>>
>> Thanks,
>> Emir
>>
>> On 06.01.2016 14:33, Susheel Kumar wrote:
>>
>>> Hi Bruno,
>>>
>>> I just tested this scenario in my local solr 5.3.1 and it returned
>>> results
>>> from two identical collections. I doubt if it is broken in 5.4 just
>>> double
>>> check if you are not missing anything else.
>>>
>>> Thanks,
>>> Susheel
>>>
>>>
>>> http://localhost:8983/solr/c1/select?q=id_type%3Ahello&wt=json&indent=true&collection=c1,c2
>>>
>>> responseHeader": {"status": 0,"QTime": 98,"params": {"q":
>>> "id_type:hello","
>>> indent": "true","collection": "c1,c2","wt": "json"}},
>>> response": {"numFound": 2,"start": 0,"maxScore": 1,"docs": [{"id": "1","
>>> id_type": "hello","_version_": 1522623395043213300},{"id":
>>> "3","id_type": "
>>> hello","_version_": 1522623422397415400}]}
>>>
>>> On Wed, Jan 6, 2016 at 6:13 AM, Bruno Mannina  wrote:
>>>
>>> yes id value is unique in C1 and unique in C2.
>>>> id in C1 is never present in C2
>>>> id in C2 is never present in C1
>>>>
>>>>
>>>> Le 06/01/2016 11:12, Binoy Dalal a écrit :
>>>>
>>>> Are Id values for docs in both the collections exactly same?
>>>>> To get proper results, the ids should be unique across both the cores.
>>>>>
>>>>> On Wed, 6 Jan 2016, 15:11 Bruno Mannina  wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>>> Solr 5.4, Ubuntu
>>>>>>
>>>>>> I thought it was simple to request across two collections with the
>>>>>> same
>>>>>> schema but not.
>>>>>> I have one solr instance launch. 300 000 records in each collection.
>>>>>>
>>>>>> I try to use this request without having both results:
>>>>>>
>>>>>> http://my_adress:my_port
>>>>>> /solr/C1/select?collection=C1,C2&q=fid:34520196&wt=json
>>>>>>
>>>>>> this request returns only C1 results and if I do:
>>>>>>
>>>>>> http://my_adress:my_port
>>>>>> /solr/C2/select?collection=C1,C2&q=fid:34520196&wt=json
>>>>>>
>>>>>> it returns only C2 results.
>>>>>>
>>>>>> I have 5 identical fields on both collection
>>>>>> id, fid, st, cc, timestamp
>>>>>> where id is the unique key field.
>>>>>>
>>>>>> Can someone could explain me why it doesn't work ?
>>>>>>
>>>>>> Thanks a lot !
>>>>>> Bruno
>>>>>>
>>>>>> ---
>>>>>> L'absence de virus dans ce courrier électronique a été vérifiée par le
>>>>>> logiciel antivirus Avast.
>>>>>> http://www.avast.com
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Regards,
>>>>> Binoy Dalal
>>>>>
>>>>>
>>>>> ---
>>>> L'absence de virus dans ce courrier électronique a été vérifiée par le
>>>> logiciel antivirus Avast.
>>>> http://www.avast.com
>>>>
>>>>
>>>>
>>
>
> ---
> L'absence de virus dans ce courrier électronique a été vérifiée par le
> logiciel antivirus Avast.
> http://www.avast.com
>
>


Re: Newbie: Searching across 2 collections ?

2016-01-06 Thread Susheel Kumar
Hi Bruno, I just tested this on 5.4 for your sake and it works fine, so
something must be off in your setup. Please create a new, simple schema,
separate from your use case, with 2-3 fields and 2-3 documents, and test
this independently of your current problem. That's the suggestion I can
make, and it's what I did to confirm this.

On Wed, Jan 6, 2016 at 11:48 AM, Bruno Mannina  wrote:

> Same result on my dev' server, it seems that collection param haven't
> effect on the query...
>
> Q: I don't see on the solr 5.4 doc, the "collection" param for select
> handler, is it always present in 5.4 version ?
>
>
> Le 06/01/2016 17:38, Bruno Mannina a écrit :
>
>> I have a dev' server, I will do some test on it...
>>
>> Le 06/01/2016 17:31, Susheel Kumar a écrit :
>>
>>> I'll suggest if you can setup some some test data locally and try this
>>> out.  This will confirm your understanding.
>>>
>>> Thanks,
>>> Susheel
>>>
>>> On Wed, Jan 6, 2016 at 10:39 AM, Bruno Mannina  wrote:
>>>
>>> Hi Susheel, Emir,
>>>>
>>>> yes I check, and I have one result in c1 and in c2 with the same query
>>>> fid:34520196
>>>>
>>>> http://xxx.xxx.xxx.xxx:
>>>> /solr/c1/select?q=fid:34520196&wt=json&indent=true&fl=id,fid,cc*,st&collection=c1,c2
>>>>
>>>>
>>>> { "responseHeader":{ "status":0, "QTime":1, "params":{
>>>> "fl":"fid,cc*,st",
>>>> "indent":"true", "q":"fid:34520196", "collection":"c1,c2",
>>>> "wt":"json"}},
>>>> "response":{"numFound":1,"start":0,"docs":[ {
>>>>
>>>>  "id":"EP1680447",
>>>>  "st":"LAPSED",
>>>>  "fid":"34520196"}]
>>>>}
>>>> }
>>>>
>>>>
>>>> http://xxx.xxx.xxx.xxx:
>>>> /solr/c2/select?q=fid:34520196&wt=json&indent=true&fl=id,fid,cc*,st&collection=c1,c2
>>>>
>>>>
>>>> {
>>>>"responseHeader":{
>>>>  "status":0,
>>>>  "QTime":0,
>>>>  "params":{
>>>>"fl":"id,fid,cc*,st",
>>>>"indent":"true",
>>>>"q":"fid:34520196",
>>>>    "collection":"c1,c2",
>>>>"wt":"json"}},
>>>>"response":{"numFound":1,"start":0,"docs":[
>>>>{
>>>>  "id":"WO2005040212",
>>>>  "st":"PENDING",
>>>>  "cc_CA":"LAPSED",
>>>>  "cc_EP":"LAPSED",
>>>>  "cc_JP":"PENDING",
>>>>  "cc_US":"LAPSED",
>>>>  "fid":"34520196"}]
>>>>}}
>>>>
>>>>
>>>> I have the same xxx.xxx.xxx.xxx: (server:port).
>>>> unique key field C1, C2 : id
>>>>
>>>> id data in C1 is different of id data in C2
>>>>
>>>> Must I config/set something in solr ?
>>>>
>>>> thanks,
>>>> Bruno
>>>>
>>>>
>>>> Le 06/01/2016 14:56, Emir Arnautovic a écrit :
>>>>
>>>> Hi Bruno,
>>>>> Can you check counts? Is it possible that first page is only with
>>>>> results
>>>>> from collection that you sent request to so you assumed it returns only
>>>>> results from single collection?
>>>>>
>>>>> Thanks,
>>>>> Emir
>>>>>
>>>>> On 06.01.2016 14:33, Susheel Kumar wrote:
>>>>>
>>>>> Hi Bruno,
>>>>>>
>>>>>> I just tested this scenario in my local solr 5.3.1 and it returned
>>>>>> results
>>>>>> from two identical collections. I doubt if it is broken in 5.4 just
>>>>>> double
>>>>>> check if you are not missing anything else.
>>>>>>
>>>>>> Thanks,
>>>>>> Susheel
>&

Re: collapse filter query

2016-01-11 Thread Susheel Kumar
You can go to https://issues.apache.org/jira/browse/SOLR/ and create a Jira
ticket after signing in.

Thanks,
Susheel

On Mon, Jan 11, 2016 at 2:15 PM, sara hajili  wrote:

> Tnx.How I can create a jira ticket?
> On Jan 11, 2016 10:42 PM, "Joel Bernstein"  wrote:
>
> > I believe this is a bug. I think the reason this is occurring is that you
> > have an index segment with no values at all in the collapse field. If you
> > could create a jira ticket for this I will look at resolving the issue.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Mon, Jan 11, 2016 at 2:03 PM, sara hajili 
> > wrote:
> >
> > > I am using solr 5.3.1
> > > On Jan 11, 2016 10:30 PM, "Joel Bernstein"  wrote:
> > >
> > > > Which version of Solr are you using?
> > > >
> > > > Joel Bernstein
> > > > http://joelsolr.blogspot.com/
> > > >
> > > > On Mon, Jan 11, 2016 at 6:39 AM, sara hajili 
> > > > wrote:
> > > >
> > > > > hi all
> > > > > i have a MLT query and i wanna to use collapse filter query.
> > > > > and i wanna to use collapse expand nullPolicy.
> > > > > in this way when i used it :
> > > > > {!collapse field=original_post_id nullPolicy=expand}
> > > > > i got my appropriate result .
> > > > > (in solr web UI)
> > > > >
> > > > > but in regular search handler "/select",when i used
> > > > > {!collapse field=original_post_id nullPolicy=expand}
> > > > > i got error:
> > > > >
> > > > > {
> > > > >   "responseHeader":{
> > > > > "status":500,
> > > > > "QTime":2,
> > > > > "params":{
> > > > >   "q":"*:*",
> > > > >   "indent":"true",
> > > > >   "fq":"{!collapse field=original_post_id nullPolicy=expand}",
> > > > >   "wt":"json"}},
> > > > >   "error":{
> > > > > "trace":"java.lang.NullPointerException\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.search.CollapsingQParserPlugin$IntScoreCollector.finish(CollapsingQParserPlugin.java:763)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:211)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1678)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1497)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:555)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:522)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:277)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)\n\tat
> > > > > org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)\n\tat
> > > > >
> > >
> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:669)\n\tat
> > > > >
> > org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:462)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)\n\tat
> > > > >
> > > > >
> > > >
> > >
> >
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)\n\tat
>

Re: Returning all documents in a collection

2016-01-20 Thread Susheel Kumar
Hello Salman,

Please check out the export functionality:
https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
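
A minimal example of the export handler (collection and field names are
placeholders; the fields you export and sort on need docValues enabled):

    http://localhost:8983/solr/<collection>/export?q=*:*&sort=id+asc&fl=id,description

And if you go with the cursor approach Emir mentions below, a rough SolrJ
sketch (collection name hypothetical) would be:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.CursorMarkParams;

    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");
    SolrQuery query = new SolrQuery("*:*");
    query.setRows(1000);
    query.setSort(SolrQuery.SortClause.asc("id"));   // cursors require a sort on the uniqueKey
    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    boolean done = false;
    while (!done) {
        query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
        QueryResponse rsp = client.query(query);
        // process rsp.getResults() here
        String nextCursorMark = rsp.getNextCursorMark();
        done = cursorMark.equals(nextCursorMark);    // no change means everything has been read
        cursorMark = nextCursorMark;
    }
    client.close();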

Thanks,
Susheel

On Wed, Jan 20, 2016 at 6:57 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Salman,
> You should use cursors in order to avoid "deep paging issues". Take a look
> at https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results.
>
> Regards,
> Emir
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
> On 20.01.2016 12:55, Salman Ansari wrote:
>
>> Hi,
>>
>> I am looking for a way to return all documents from a collection.
>> Currently, I am restricted to specifying the number of rows using Solr.NET
>> but I am looking for a better approach to actually return all documents.
>> If
>> I specify a huge number such as 1M, the processing takes a long time.
>>
>> Any feedback/comment will be appreciated.
>>
>> Regards,
>> Salman
>>
>>
>


Re: collection aliasing

2016-01-22 Thread Susheel Kumar
Hi Vidya, if I understood your question correctly, you can simply use the
original collection name(s) to point to the individual collections. Isn't
that the case?

Thanks,
Susheel

On Fri, Jan 22, 2016 at 8:10 AM, vidya  wrote:

> Hi
>
> I wanted to mainatain two sets of indexes or collections for maintaing my
> large input data for indexing for which i found collection aliasing is
> helpful. I have created alais for 2 collections. but my problem is , how
> can
> i point out my alias to 2 different colletions at 2 different times.
>
> Thanks in advance.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/collection-aliasing-tp4252527.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Mix Solr 4 and 5?

2016-01-23 Thread Susheel Kumar
Just to share: one of our developers noticed issues when trying to use
SolrJ 4.10.x against Solr 5.4.0 with chroot enabled. After she upgraded to
SolrJ 5.3.1, it worked.

Thanks,
Susheel

On Fri, Jan 22, 2016 at 11:20 AM, Jack Krupansky 
wrote:

> To be clear, having separate Solr servers on different versions should
> definitely not be a problem. The only potential difficulty here is the
> SolrJ vs. server back-compat issue.
>
> -- Jack Krupansky
>
> On Fri, Jan 22, 2016 at 10:57 AM, 
> wrote:
>
> > Shawn wrote:
> > >
> > > If you are NOT running SolrCloud, then that should work with no
> problem.
> > > The HTTP API is fairly static and has not seen any major upheaval
> > recently.
> > > If you're NOT running SolrCloud, you may even be able to replace the
> > > SolrJ jar in your existing system with the 5.4.1 version (and update
> > > SolrJ's dependent jars) and have everything continue to work.
> > >
> > > If you ARE running SolrCloud, I would not try mixing 4.x and 5.x,
> > > in either direction.  SolrCloud is evolving very quickly ... I wouldn't
> > > even mix *minor* versions, much less *major* versions.
> > > There are differences in how the zookeeper database is laid out,
> > > and mixing versions is not guaranteed to work, especially if SolrJ
> > > is older than Solr.  If the version difference is small and SolrJ is
> > newer
> > > than Solr, there's a chance of success, but with the situation you
> > > have described, SolrCloud would likely not work.
> >
> > When you talk about not mixing 4.x and 5.x when using SolrCloud, you mean
> > between the client and the server that talk to each other, right? Or
> would
> > it be a problem keeping our existing non cloud solr 4.x server, upgrading
> > the client solrj jar to 5.x (assuming this works, like you and others
> here
> > seem to think it should/could), and then adding a new solr cloud 5.x
> > server? That way, there the two separate communication "channels" are
> solrj
> > 5.x <--> solr 4.x server, and, solrj 5.x  <--> solrcloud 5.x.
> >
> > Or does the mere presense of a solr 4.x server and a solr cloud 5.x
> server
> > on the same network cause problems, even when they don't know about
> > eachother?
> >
> > Regards
> > /Jimi
> >
>


Re: collection aliasing

2016-01-24 Thread Susheel Kumar
As Jens mentioned, you use aliasing to refer to a group of collections.
E.g., with the command below you can create an alias called Quarterly for 3
separate collections (Jan, Feb & Mar) and then use the alias Quarterly to
refer to all of them in a single query:

http://
:8983/solr/admin/collections?action=CREATEALIAS&name=Quarterly&collections=Jan,Feb,Mar

http://
:8983/solr/Quarterly/select?q=fire&wt=json&indent=true&collection=
Quarterly&facet=true&facet.field=docType&rows=300

On Sun, Jan 24, 2016 at 7:44 AM, vidya  wrote:

> Yeah, while querying and indexing also, we can directly use our collection
> names. Then what is the use of aliasing ?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/collection-aliasing-tp4252527p4252885.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solrcloud error on finding active nodes.

2016-01-27 Thread Susheel Kumar
Hi,

I haven't seen this error before, but which version of Solr are you using,
and is ZooKeeper configured correctly? Do you see nodes down/active/leader
etc. under Cloud in the Admin UI?
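
You can also dump the full cluster state with the Collections API
(CLUSTERSTATUS, available in recent Solr versions), which makes it easier
to spot down or misconfigured replicas, e.g.:

    http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json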

Thanks,
Susheel

On Wed, Jan 27, 2016 at 11:51 AM, Pranaya Behera 
wrote:

> Hi,
>  I have created one solrcloud collection with this
> `curl "
> http://localhost:8983/solr/admin/collections?action=CREATE&name=card&numShards=2&replicationFactor=2&maxShardsPerNode=2&createNodeSet=localhost:8983,localhost:8984,localhost:8985&collection.configName=igp
> "
>
> It gave me success. And when I saw in solr admin ui: i got  to see the
> collection name as card and pointing to two shards in the radial graph but
> nothing on the graph tab.  Both shards are in leader color.
>
> When I tried to index data to this collection it gave me this error:
>
> Indexing cardERROR StatusLogger No log4j2 configuration file found.
> Using default configuration: logging only errors to the console.
> 16:49:21.899 [main] ERROR
> org.apache.solr.client.solrj.impl.CloudSolrClient - Request to collection
> card failed due to (510) org.apache.solr.common.SolrException: Could not
> find a healthy node to handle the request., retry? 0
> 16:49:21.911 [main] ERROR
> org.apache.solr.client.solrj.impl.CloudSolrClient - Request to collection
> card failed due to (510) org.apache.solr.common.SolrException: Could not
> find a healthy node to handle the request., retry? 1
> 16:49:21.915 [main] ERROR
> org.apache.solr.client.solrj.impl.CloudSolrClient - Request to collection
> card failed due to (510) org.apache.solr.common.SolrException: Could not
> find a healthy node to handle the request., retry? 2
> 16:49:21.925 [main] ERROR
> org.apache.solr.client.solrj.impl.CloudSolrClient - Request to collection
> card failed due to (510) org.apache.solr.common.SolrException: Could not
> find a healthy node to handle the request., retry? 3
> 16:49:21.928 [main] ERROR
> org.apache.solr.client.solrj.impl.CloudSolrClient - Request to collection
> card failed due to (510) org.apache.solr.common.SolrException: Could not
> find a healthy node to handle the request., retry? 4
> 16:49:21.931 [main] ERROR
> org.apache.solr.client.solrj.impl.CloudSolrClient - Request to collection
> card failed due to (510) org.apache.solr.common.SolrException: Could not
> find a healthy node to handle the request., retry? 5
> org.apache.solr.common.SolrException: Could not find a healthy node to
> handle the request.
> at
> org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1085)
> at
> org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:871)
> at
> org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:954)
> at
> org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:954)
> at
> org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:954)
> at
> org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:954)
> at
> org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:954)
> at
> org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:807)
> at
> org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:150)
> at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107)
> at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72)
> at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86)
> at com.igp.solrindex.CardIndex.index(CardIndex.java:75)
> at com.igp.solrindex.App.main(App.java:19)
>
>
> Why I am getting error ?
>
> --
> Thanks & Regards
> Pranaya Behera
>
>


Re: Memory leak defect or misssuse of SolrJ API?

2016-01-30 Thread Susheel Kumar
Hi Steve,

Can you please elaborate on what error you are getting? Also, I didn't
understand the code above, that is, why the Solr client object is created
inside the loop. In general, creating the client instance should happen
outside the loop, as a one-time activity for the entire run of the program.
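
A minimal sketch of what I mean, reusing the URL from your example:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    // create the client once, up front, and reuse it for every request
    HttpSolrClient client = new HttpSolrClient("http://192.168.202.129:8983/solr/core1");

    for (int i = 0; i < 1000; i++) {
        // issue queries/updates with the same client instance here
    }

    // close it only once, when the application shuts down
    client.close();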

Thanks,
Susheel

On Sat, Jan 30, 2016 at 8:15 AM, Steven White  wrote:

> Hi folks,
>
> I'm getting memory leak in my code.  I narrowed the code to the following
> minimal to cause the leak.
>
> while (true) {
>     HttpSolrClient client = new HttpSolrClient("http://192.168.202.129:8983/solr/core1");
>     client.close();
> }
>
> Is this a defect or an issue in the way I'm using HttpSolrClient?
>
> I'm on Solr 5.2.1
>
> Thanks.
>
> Steve
>


Re: URI is too long

2016-02-01 Thread Susheel Kumar
POST works much like GET. You can use any REST client to try it: use the
same select URL, pass the header below, and put the query parameters into
the body.

POST:  http://localhost:8983/solr/techproducts/select

Header
==
Content-Type:application/x-www-form-urlencoded

payload/body:
==
q=*:*&rows=2
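
As a SolrJ sketch of the same request sent via POST (Solr.NET should have an
equivalent option, but I can't speak to its API):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrRequest;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/techproducts");
    SolrQuery query = new SolrQuery("*:*");
    query.setRows(2);
    // METHOD.POST puts the parameters in the request body, so a very long
    // query never runs into the URI length limit
    QueryResponse rsp = client.query(query, SolrRequest.METHOD.POST);
    client.close();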


Thanks,
Susheel

On Mon, Feb 1, 2016 at 2:38 AM, Salman Ansari 
wrote:

> Cool. I would give POST a try. Any samples of using Post while passing the
> query string values (such as ORing between Solr field values) using
> Solr.NET?
>
> Regards,
> Salman
>
> On Sun, Jan 31, 2016 at 10:21 PM, Shawn Heisey 
> wrote:
>
> > On 1/31/2016 7:20 AM, Salman Ansari wrote:
> > > I am building a long query containing multiple ORs between query
> terms. I
> > > started to receive the following exception:
> > >
> > > The remote server returned an error: (414) Request-URI Too Long. Any
> idea
> > > what is the limit of the URL in Solr? Moreover, as a solution I was
> > > thinking of chunking the query into multiple requests but I was
> wondering
> > > if anyone has a better approach?
> >
> > The default HTTP header size limit on most webservers and containers
> > (including the Jetty that ships with Solr) is 8192 bytes.  A typical
> > request like this will start with "GET " and end with " HTTP/1.1", which
> > count against that 8192 bytes.  The max header size can be increased.
> >
> > If you place the parameters into a POST request instead of on the URL,
> > then the default size limit of that POST request in Solr is 2MB.  This
> > can also be increased.
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: sorry, no dataimport-handler defined!

2016-02-01 Thread Susheel Kumar
Please register the Data Import Handler (in solrconfig.xml) so you can work with it:
https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler


On Mon, Feb 1, 2016 at 2:31 PM, Jean-Jacques MONOT 
wrote:

> Hello
>
> I am using SOLR 5.4.1 and the graphical admin UI.
>
> I successfully created multiples cores and indexed various documents,
> using in line commands : (create -c) and (post.jar) on W10.
>
> But in the GUI, when I click on "Dataimport", I get the following message
> : "sorry, no dataimport-handler defined!"
>
> I get the same message even on 5.3.1 or for different cores.
>
> What is wrong ?
>
> JJM
>
> ---
> L'absence de virus dans ce courrier électronique a été vérifiée par le
> logiciel antivirus Avast.
> https://www.avast.com/antivirus
>
>


Re: catch alls and nuances

2016-02-02 Thread Susheel Kumar
Hi John - You can take a closer look at the different options of
WordDelimiterFilterFactory at
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
to see if they meet your requirements, and use the Analysis tab in the Solr
Admin UI. If you still have questions, share the exact search
requirement(s) you are trying to meet and what your current analysis chain
for text_general looks like; then perhaps someone can suggest/help you
further.

Thnx
Susheel

On Tue, Feb 2, 2016 at 5:21 PM, Erick Erickson 
wrote:

> bq: Have now begun writing my own.
>
> I hope by that you mean defining your own ,
> at least until you're sure that none of the zillion things
> you can do with an analysis chain don't suit your needs.
>
> If you haven't already looked _seriously_ at the admin/analysis
> page (you have to choose a core to have that available). Fuzzy
> matching won't help you with the 1234-LT example at all.
>
> BTW, you (perhaps unintentionally) changed the problem
> 1234LT as input is vastly different from 1234-LT. The latter
> will be made into two tokens by some tokenizers. Whereas
> 1234LT is always passed through the tokenizers as a single
> "word", _then_ broken up by WordDelimiterFilterFactory if
> its a filter in the analysis chain.
>
> Do note that when I use "tokenizer" I'm referring to the
> specific class that breaks the incoming stream up. The
> simplest example is WhitespaceTokenizer, which.. you
> guessed it, breaks up the stream on whitespace.
>
> Once something gets through the one and only tokenizer
> in an analysis chain, each token passes through 0
> or more "Filters", and WordDelimiterFilterFactory is
> one of these.
>
> Pardon me for being somewhat pedantic here but unless the
> analysis chain is understood, you'll go through endless
> thrashing. This is where the admin/analysis page is
> invaluable.
>
> Best,
> Erick
>
> On Tue, Feb 2, 2016 at 12:49 PM, John Blythe  wrote:
> > I had been using text_general at the time of my email's writing. Have
> tried
> > a couple of the other stock ones (text_en, text_en_splitting, _tight).
> Have
> > now begun writing my own. I began to wonder if simply doing one of the
> > above, such as text_general, with a fuzzy distance (probably just ~1)
> would
> > be best suited. Another example would be an indexed value of "Phasaix"
> > (which is a typo in the original data) being searched for with the
> correct
> > spelling of "Phasix" and returning nothing. Adding ~1 in that case works.
> > For some reason it doesn't in the case of the 1234-L and 1234-LT example.
> >
> > Thanks for any insight-
> >
> > --
> > *John Blythe*
> > Product Manager & Lead Developer
> >
> > 251.605.3071 | j...@curvolabs.com
> > www.curvolabs.com
> >
> > 58 Adams Ave
> > Evansville, IN 47713
> >
> > On Mon, Feb 1, 2016 at 3:30 PM, Erick Erickson 
> > wrote:
> >
> >> Likely you also have WordDelimiterFilterFactory in
> >> your fieldType, that's what will split on alphanumeric
> >> transitions.
> >>
> >> So you should be able to use wildcards here, i.e. 1234L*
> >>
> >> However, that'll only work if you have preserveOriginal set in
> >> WordDelimiterFilterFactory in your indexing chain.
> >>
> >> And just to make life "interesting", there are some peculiarities
> >> with parsing wildcards at query time, so be sure to see the
> >> admin/analysis page
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Feb 1, 2016 at 12:20 PM, John Blythe 
> wrote:
> >> > Hi there
> >> >
> >> > I have a a catch all field called 'text' that I copy my item
> description,
> >> > manufacturer name, and the item's catalog number into. I'm having an
> >> issue
> >> > with keeping the broadness of the tokenizers in place whilst still
> >> allowing
> >> > some good precision in the case of very specific queries.
> >> >
> >> > The results are generally good. But, for instance, the products named
> >> 1234L
> >> > and 1234LT aren't behaving how i would like. If I search 1234 they
> both
> >> > show. If I search 1234L only the first one is returned. I'm guessing
> this
> >> > is due to the splitting of the numeric and string portions. The "1234"
> >> and
> >> > the "L" both hit in the first case ("1234" and "L") but the L is of no
> >> > value in the "1234" and "LT" indexed item.
> >> >
> >> > What is the best way around this so that a small levenstein distance,
> for
> >> > instance, is picked up?
> >>
>


Re: Solr for real time analytics system

2016-02-04 Thread Susheel Kumar
Hi Rohit,

Please take a look at Streaming Expressions & the Parallel SQL Interface.
Those should meet many of your analytics requirements (aggregation queries
like sum/average/group-by, etc.):
https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface

Thanks,
Susheel

On Thu, Feb 4, 2016 at 3:17 AM, Arkadiusz Robiński <
arkadiusz.robin...@otodom.pl> wrote:

> A few people did a real time analytics system with solr and talked about it
> at conferences. Maybe you'll find their presentations useful:
>
> https://www.youtube.com/results?search_query=solr%20real%20time%20analytics&oq=&gs_l=
> (esp. the first one: https://www.youtube.com/watch?v=PkoyCxBXAiA )
>
> On Thu, Feb 4, 2016 at 8:25 AM, Rohit Kumar  >
> wrote:
>
> > Thanks Bhimavarapu for the information.
> >
> > We are creating our own dashboard, so probably wont need kibana/banana. I
> > was more curious about Solr support for fast aggregation query over very
> > large data set. As suggested, I guess elasticsearch  has this capability.
> > Is there any published metrics or data regarding elasticsearch/solr
> > performance in this area that I can refer to?
> >
> > Thanks
> > Rohit
> >
> >
> >
> > On Thu, Feb 4, 2016 at 11:48 AM, CKReddy Bhimavarapu <
> chaitu...@gmail.com>
> > wrote:
> >
> > > Hello Rohit,
> > >
> > > You can use the Banana project which was forked from Kibana
> > > , and works with all kinds of time
> > > series (and non-time series) data stored in Apache Solr
> > > . It uses Kibana's powerful dashboard
> > > configuration capabilities, ports key panels to work with Solr, and
> > > provides significant additional capabilities, including new panels that
> > > leverage D3.js 
> > >
> > >  would need mostly aggregation queries like sum/average/groupby etc,
> but
> > > > data set is quite huge. The aggregation queries should be very fast.
> > >
> > >
> > > all your requirement can be served by this banana but I'm not sure
> about
> > > how fast solr compare to ELK 
> > >
> > > On Thu, Feb 4, 2016 at 10:51 AM, Rohit Kumar <
> > > rohitkumarbhagat...@gmail.com>
> > > wrote:
> > >
> > > > Hi
> > > >
> > > > I am quite new to Solr. I have to build a real time analytics system
> > > which
> > > > displays metrics based on multiple filters over a huge data set
> > > (~50million
> > > > documents with ~100 fileds ).  I would need mostly aggregation
> queries
> > > like
> > > > sum/average/groupby etc, but data set is quite huge. The aggregation
> > > > queries should be very fast.
> > > >
> > > > Is Solr suitable for such use cases?
> > > >
> > > > Thanks
> > > > Rohit
> > > >
> > >
> > >
> > >
> > > --
> > > ckreddybh. 
> > >
> >
>
>
>
> --
> Arkadiusz Robiński
> Software Developer
> Otodom.pl
>


Re: solr performance issue

2016-02-08 Thread Susheel Kumar
1 million documents shouldn't cause any issues at all. Something else must
be wrong with your hardware/system configuration.

Thanks,
Susheel

On Mon, Feb 8, 2016 at 6:45 AM, sara hajili  wrote:

> On Mon, Feb 8, 2016 at 3:04 AM, sara hajili  wrote:
>
> > sorry i made a mistake i have a bout 1000 K doc.
> > i mean about 100 doc.
> >
> > On Mon, Feb 8, 2016 at 1:35 AM, Emir Arnautovic <
> > emir.arnauto...@sematext.com> wrote:
> >
> >> Hi Sara,
> >> Not sure if I am reading this right, but I read it as you have 1000 doc
> >> index and issues? Can you tell us bit more about your setup: number of
> >> servers, hw, index size, number of shards, queries that you run, do you
> >> index at the same time...
> >>
> >> It seems to me that you are running Solr on server with limited RAM and
> >> probably small heap. Swapping for sure will slow things down and GC is
> most
> >> likely reason for high CPU.
> >>
> >> You can use http://sematext.com/spm to collect Solr and host metrics
> and
> >> see where the issue is.
> >>
> >> Thanks,
> >> Emir
> >>
> >> --
> >> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> >> Solr & Elasticsearch Support * http://sematext.com/
> >>
> >>
> >>
> >> On 08.02.2016 10:27, sara hajili wrote:
> >>
> >>> hi all.
> >>> i have a problem with my solr performance and usage hardware like a
> >>> ram,cup...
> >>> i have a lot of document and so indexed file about 1000 doc in solr
> that
> >>> every doc has about 8 field in average.
> >>> and each field has about 60 char.
> >>> i set my field as a storedfield = "false" except of  1 field. // i read
> >>> that this help performance.
> >>> i used copy field and dynamic field if it was necessary . // i read
> that
> >>> this help performance.
> >>> and now my question is that when i run a lot of query on solr i faced
> >>> with
> >>> a problem solr use more cpu and ram and after that filled ,it use a lot
> >>>   swapped storage and then use hard,but doesn't create a system file!
> >>> solr
> >>> fill hard until i forced to restart server to release hard disk.
> >>> and now my question is why solr treat in this way? and how i can avoid
> >>> solr
> >>> to use huge cpu space?
> >>> any config need?!
> >>>
> >>>
> >>
> >
>


Re: Data Import Handler - autoSoftCommit and autoCommit

2016-02-08 Thread Susheel Kumar
You can start with one of the suggestions from this link based on your
indexing and query load.


https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/


Thanks,
Susheel

On Mon, Feb 8, 2016 at 10:15 AM, Troy Edwards 
wrote:

> We are running the data import handler to retrieve about 10 million records
> during work hours every day of the week. We are using Clean = true, Commit
> = true and Optimize = true. The entire process takes about 1 hour.
>
> What would be a good setting for autoCommit and autoSoftCommit?
>
> Thanks
>


Re: Solr architecture

2016-02-08 Thread Susheel Kumar
Also, consider whether you expect to index the 2 billion docs in near real
time or offline (during off hours, etc.). For more accurate sizing you may
also want to index, say, 10 million documents, which will give you an idea
of your index size, and then extrapolate from that to come up with the
memory requirements.

Thanks,
Susheel

On Mon, Feb 8, 2016 at 11:00 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Mark,
> Can you give us bit more details: size of docs, query types, are docs
> grouped somehow, are they time sensitive, will they update or it is rebuild
> every time, etc.
>
> Thanks,
> Emir
>
>
> On 08.02.2016 16:56, Mark Robinson wrote:
>
>> Hi,
>> We have a requirement where we would need to index around 2 Billion docs
>> in
>> a day.
>> The queries against this indexed data set can be around 80K queries per
>> second during peak time and during non peak hours around 12K queries per
>> second.
>>
>> Can Solr realize this huge volumes.
>>
>> If so, assuming we have no constraints for budget what would be a
>> recommended Solr set up (number of shards, number of Solr instances
>> etc...)
>>
>> Thanks!
>> Mark
>>
>>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>


Re: Bulk delete of Solr documents

2016-02-08 Thread Susheel Kumar
Yes, use the URL below (the delete query goes in stream.body):

http://localhost:8983/solr/<collection>/update?stream.body=<delete><query>*:*</query></delete>&commit=true
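
If you are on SolrJ, a minimal sketch of the same thing (collection name is
hypothetical):

    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");
    client.deleteByQuery("*:*");   // or any query matching just the docs you want removed
    client.commit();
    client.close();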

On Mon, Feb 8, 2016 at 11:33 AM, Anil  wrote:

> Hi ,
>
> Can we delete solr documents from a collection in a bulk ?
>
> Regards,
> Anil
>


Re: Solr 4.10 with Jetty 8.1.10 & Tomcat 7

2016-02-09 Thread Susheel Kumar
Shahzad - I am curious which features of distributed search stop you from
running SolrCloud. Using distributed search, you would be able to search
across cores or collections:
https://cwiki.apache.org/confluence/display/solr/Advanced+Distributed+Request+Options

Thanks,
Susheel

On Tue, Feb 9, 2016 at 12:10 AM, Shahzad Masud <
shahzad.ma...@northbaysolutions.net> wrote:

> Thank you Shawn for your response. I would be running some performance
> tests lately on this structure (one JVM with multiple cores), and would
> share feedback on this thread.
>
> >There IS a way to specify the solr home for a specific context, but keep
> >in mind that I definitely DO NOT recommend doing this.  There is
> >resource and administrative overhead to running multiple copies of Solr
> >in one JVM.  Simply run one context and let it handle multiple shards,
> >whether you choose SolrCloud or not.
> Due to distributed search feature, I might not be able to run SolrCloud. I
> would appreciate, if you please share that way of setting solr home for a
> specific context in Jetty-Solr. Its good to seek more information for
> comparison purposes. Do you think having multiple JVMs would increase or
> decrease performance. My document base is around 20 million rows (in 24
> shards), with document size ranging from 100KB - 400 MB.
>
> SM
>
> On Mon, Feb 8, 2016 at 8:09 PM, Shawn Heisey  wrote:
>
> > On 2/8/2016 1:14 AM, Shahzad Masud wrote:
> > > Thank you Shawn for your reply. Here is my structure of cores and
> shards
> > >
> > > Shard 1 = localhost:8983/solr_2014 [3 Core  - Employee, Service
> Tickets,
> > > Departments]
> > > Shard 2 = localhost:8983/solr_2015 [3 Core  - Employee, Service
> Tickets,
> > > Departments]
> > > Shard 3 = localhost:8983/solr_2016 [3 Core  - Employee, Service
> Tickets,
> > > Departments]
> > >
> > > While searching, I use distributed search feature to search data from
> all
> > > three shards in respective cores e.g. If I want to search from Employee
> > > data for all three years, I search from Employee core of three
> contexts.
> > > This is legacy design, do you think this is okay, or this require
> > immediate
> > > restructure / design? I am going to try this,
> > >
> > > Context = localhost:8982/solr (9 cores - Employee-2014, Employee-2015,
> > > Employee-2016, ServiceTickets-2014, ServiceTickets-2015,
> > > ServiceTickets-2016, Department-2014, Department-2015, Department-2016]
> > > distributed search would be from all three cores of same data category
> > > (i.e. For Employee search, it would be from Employee-2014,
> Employee-2015,
> > > Employee-2016).
> >
> > With SolrCloud, you can have multiple collections for each of these
> > types and alias them together.  Or you can simply have one collection
> > for employee, one for servicetickets, and one for department, with
> > SolrCloud automatically handling splitting those documents into the
> > number of shardsthat you specify when you create the collection.  You
> > can also do manual sharding and split each collection on a time basis
> > like you have been doing, but then you lose some of the automation that
> > SolrCloud provides, so I do not recommend handling it that way.
> >
> > > Regarding one Solr context per jetty; I cannot run two solr contexts
> > > pointing to different data in Jetty, as while starting jetty I have to
> > > provide -Dsolr.solr.home variable - which ends up pointing to one data
> > > folder (2014 data) only.
> >
> > You do not need multiple contexts to have multiple indexes.
> >
> > My dev Solr server has exactly one Solr JVM, with exactly one context --
> > /solr.  That instance of Solr has 45 indexes (cores) on it.  These 45
> > cores are various shards for three larger indexes.  I am not running
> > SolrCloud, but I certainly could.
> >
> > You can see 25 of the 45 cores in my Solr instance in this screenshot of
> > the admin UI for this server:
> >
> > https://www.dropbox.com/s/v87mxvkdejvd92h/solr-with-45-cores.png?dl=0
> >
> > There IS a way to specify the solr home for a specific context, but keep
> > in mind that I definitely DO NOT recommend doing this.  There is
> > resource and administrative overhead to running multiple copies of Solr
> > in one JVM.  Simply run one context and let it handle multiple shards,
> > whether you choose SolrCloud or not.
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: Solr 4.10 with Jetty 8.1.10 & Tomcat 7

2016-02-09 Thread Susheel Kumar
Shahzad - As Shawn mentioned, you can get a lot of input from folks who are
using joins in SolrCloud if you start a new thread, and I would suggest
taking a look at Solr Streaming Expressions and the Parallel SQL Interface,
which cover joining use cases as well.

Thanks,
Susheel

On Tue, Feb 9, 2016 at 9:17 AM, Shawn Heisey  wrote:

> On 2/8/2016 10:10 PM, Shahzad Masud wrote:
> > Due to distributed search feature, I might not be able to run
> > SolrCloud. I would appreciate, if you please share that way of setting
> > solr home for a specific context in Jetty-Solr. Its good to seek more
> > information for comparison purposes. Do you think having multiple JVMs
> > would increase or decrease performance. My document base is around 20
> > million rows (in 24 shards), with document size ranging from 100KB -
> > 400 MB. SM
>
> For most people, the *entire point* of running SolrCloud is to do
> distributed search, so to hear that you can't run SolrCloud because of
> distributed search is very confusing to me.
>
> I admit to ignorance when it comes to the join feature in Solr ... but
> it is my understanding that all you need to make joins work properly is
> to have both of the indexes that you are joining running in the same JVM
> and the same Solr instance.  If you arrange your SolrCloud replicas so a
> copy of every index is loaded on every server, I think that would
> satisfy this requirement.  I may be wrong, but I believe there are
> SolrCloud users that use the join feature.
>
> When you create a config file for a Solr context, whether it's Jetty,
> Tomcat, or some other container, you can set the solr/home JNDI variable
> in the context fragment to set the solr home for that context.  I found
> a specific example for Tomcat.  I know Jetty can do the same, but I do
> not know how to actually create the context fragment.
>
>
> https://wiki.apache.org/solr/SolrTomcat#Installing_Solr_instances_under_Tomcat
>
> I need to reiterate one point again.  You should only run one Solr
> container per server, with exactly one Solr context installed in that
> server.  This is recommended whether you're running SolrCloud or not,
> and whether you're using distributed search or not.  One Solr context
> can handle a LOT of indexes.
>
> Running multiple Solr instances per server is only recommended in one
> case:  Extremely large indexes where you would need a very large heap.
> Running two JVMs with smaller heaps *might* be more efficient ... but in
> that case, it is usually better to split those indexes between two
> separate servers, each one running only one instance of Solr.
>
> Thanks,
> Shawn
>
>


Re: Adding nodes

2016-02-14 Thread Susheel Kumar
Hi Paul,

Shawn is referring to using the Collections API
(https://cwiki.apache.org/confluence/display/solr/Collections+API) rather
than the Core Admin API
(https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API) for
SolrCloud.

Hope that clarifies it. You mentioned ADDREPLICA, which is part of the
Collections API, so you are on the right track.
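
For reference, adding a replica for a specific shard with the Collections
API looks roughly like this (host, collection and shard names are
placeholders):

    http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&node=host2:8983_solr

The optional node parameter lets you place the new replica on the node you
just added.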

Thanks,
Susheel



On Sun, Feb 14, 2016 at 10:51 AM, McCallick, Paul <
paul.e.mccall...@nordstrom.com> wrote:

> Then what is the suggested way to add a new node to a collection via the
> apis?  I  am specifically thinking of autoscale scenarios where a node has
> gone down or more nodes are needed to handle load.
>
> Note that the ADDREPLICA endpoint requires a shard name, which puts the
> onus of how to scale out on the user. This can be challenging in an
> autoscale scenario.
>
> Thanks,
> Paul
>
> > On Feb 14, 2016, at 12:25 AM, Shawn Heisey  wrote:
> >
> >> On 2/13/2016 6:01 PM, McCallick, Paul wrote:
> >> - When creating a new collection, SOLRCloud will use all available
> nodes for the collection, adding cores to each.  This assumes that you do
> not specify a replicationFactor.
> >
> > The number of nodes that will be used is numShards multipled by
> > replicationFactor.  The default value for replicationFactor is 1.  If
> > you do not specify numShards, there is no default -- the CREATE call
> > will fail.  The value of maxShardsPerNode can also affect the overall
> > result.
> >
> >> - When adding new nodes to the cluster AFTER the collection is created,
> one must use the core admin api to add the node to the collection.
> >
> > Using the CoreAdmin API is strongly discouraged when running SolrCloud.
> > It works, but it is an expert API when in cloud mode, and can cause
> > serious problems if not used correctly.  Instead, use the Collections
> > API.  It can handle all normal maintenance needs.
> >
> >> I would really like to see the second case behave more like the first.
> If I add a node to the cluster, it is automatically used as a replica for
> existing clusters without my having to do so.  This would really simplify
> things.
> >
> > I've added a FAQ entry to address why this is a bad idea.
> >
> >
> https://wiki.apache.org/solr/FAQ#Why_doesn.27t_SolrCloud_automatically_create_replicas_when_I_add_nodes.3F
> >
> > Thanks,
> > Shawn
> >
>


Re: Adding nodes

2016-02-14 Thread Susheel Kumar
Hi Paul,


For auto-scaling, it depends on how you are thinking of designing it and
what/how you want to scale. Which scenario do you think makes the CoreAdmin
API easier to use in a sharded SolrCloud environment?

Isn't it the case that in a sharded environment (assume 3 shards A, B & C),
if shard B is under a higher load, you would want to add a replica
specifically for shard B to distribute that load, or if a particular
shard's replica goes down, you would want to add another replica back for
that same shard, in which case ADDREPLICA naturally requires a shard name?

Can you describe your scenario / provide more detail?

Thanks,
Susheel



On Sun, Feb 14, 2016 at 11:51 AM, McCallick, Paul <
paul.e.mccall...@nordstrom.com> wrote:

> Hi all,
>
>
> This doesn’t really answer the following question:
>
> What is the suggested way to add a new node to a collection via the
> apis?  I  am specifically thinking of autoscale scenarios where a node has
> gone down or more nodes are needed to handle load.
>
>
> The coreadmin api makes this easy.  The collections api (ADDREPLICA),
> makes this very difficult.
>
>
> On 2/14/16, 8:19 AM, "Susheel Kumar"  wrote:
>
> >Hi Paul,
> >
> >Shawn is referring to use Collections API
> >https://cwiki.apache.org/confluence/display/solr/Collections+API  than
> Core
> >Admin API https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API
> >for SolrCloud.
> >
> >Hope that clarifies and you mentioned about ADDREPLICA which is the
> >collections API, so you are on right track.
> >
> >Thanks,
> >Susheel
> >
> >
> >
> >On Sun, Feb 14, 2016 at 10:51 AM, McCallick, Paul <
> >paul.e.mccall...@nordstrom.com> wrote:
> >
> >> Then what is the suggested way to add a new node to a collection via the
> >> apis?  I  am specifically thinking of autoscale scenarios where a node
> has
> >> gone down or more nodes are needed to handle load.
> >>
> >> Note that the ADDREPLICA endpoint requires a shard name, which puts the
> >> onus of how to scale out on the user. This can be challenging in an
> >> autoscale scenario.
> >>
> >> Thanks,
> >> Paul
> >>
> >> > On Feb 14, 2016, at 12:25 AM, Shawn Heisey 
> wrote:
> >> >
> >> >> On 2/13/2016 6:01 PM, McCallick, Paul wrote:
> >> >> - When creating a new collection, SOLRCloud will use all available
> >> nodes for the collection, adding cores to each.  This assumes that you
> do
> >> not specify a replicationFactor.
> >> >
> >> > The number of nodes that will be used is numShards multipled by
> >> > replicationFactor.  The default value for replicationFactor is 1.  If
> >> > you do not specify numShards, there is no default -- the CREATE call
> >> > will fail.  The value of maxShardsPerNode can also affect the overall
> >> > result.
> >> >
> >> >> - When adding new nodes to the cluster AFTER the collection is
> created,
> >> one must use the core admin api to add the node to the collection.
> >> >
> >> > Using the CoreAdmin API is strongly discouraged when running
> SolrCloud.
> >> > It works, but it is an expert API when in cloud mode, and can cause
> >> > serious problems if not used correctly.  Instead, use the Collections
> >> > API.  It can handle all normal maintenance needs.
> >> >
> >> >> I would really like to see the second case behave more like the
> first.
> >> If I add a node to the cluster, it is automatically used as a replica
> for
> >> existing clusters without my having to do so.  This would really
> simplify
> >> things.
> >> >
> >> > I've added a FAQ entry to address why this is a bad idea.
> >> >
> >> >
> >>
> https://wiki.apache.org/solr/FAQ#Why_doesn.27t_SolrCloud_automatically_create_replicas_when_I_add_nodes.3F
> >> >
> >> > Thanks,
> >> > Shawn
> >> >
> >>
>


Re: Need to move on SOlr cloud (help required)

2016-02-15 Thread Susheel Kumar
In SolrJ, you would use CloudSolrClient, which talks to ZooKeeper (where the
cluster state is maintained). See the CloudSolrClient API. That is how SolrJ
knows which nodes are up or down.
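
For illustration, a minimal SolrJ sketch (SolrJ 5.x assumed; the ZooKeeper
hosts and collection name are placeholders, exception handling omitted):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;

  // CloudSolrClient watches the cluster state in ZooKeeper, so it only sends
  // requests to live replicas and fails over when a node goes down.
  CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");
  client.setDefaultCollection("collection1");
  QueryResponse rsp = client.query(new SolrQuery("*:*"));
  System.out.println("hits: " + rsp.getResults().getNumFound());
  client.close();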


Thanks,
Susheel

On Mon, Feb 15, 2016 at 12:07 AM, Midas A  wrote:

> Erick,
>
> We are using  php for our application so client would you suggest .
> currently we are using pecl solr client .
>
>
> but i want to understand that  suppose we sent a request to a node and that
> node is down that time how solrj  figure out where request should go.
>
> On Fri, Feb 12, 2016 at 9:44 PM, Erick Erickson 
> wrote:
>
> > bq: in case of solrcloud architecture we need not to have load balancer
> >
> > First, my comment about a load balancer was for the master/slave
> > architecture where the load balancer points to the slaves.
> >
> > Second, for SolrCloud you don't necessarily need a load balancer as
> > if you're using a SolrJ client requests are distributed across the
> replicas
> > via an internal load balancer.
> >
> > Best,
> > Erick
> >
> > On Thu, Feb 11, 2016 at 9:19 PM, Midas A  wrote:
> > > Erick ,
> > >
> > > bq: We want the hits on solr servers to be distributed
> > >
> > > True, this happens automatically in SolrCloud, but a simple load
> > > balancer in front of master/slave does the same thing.
> > >
> > > Midas : in case of solrcloud architecture we need not to have load
> > balancer
> > > ? .
> > >
> > > On Thu, Feb 11, 2016 at 11:42 PM, Erick Erickson <
> > erickerick...@gmail.com>
> > > wrote:
> > >
> > >> bq: We want the hits on solr servers to be distributed
> > >>
> > >> True, this happens automatically in SolrCloud, but a simple load
> > >> balancer in front of master/slave does the same thing.
> > >>
> > >> bq: what if master node fail what should be our fail over strategy  ?
> > >>
> > >> This is, indeed one of the advantages for SolrCloud, you don't have
> > >> to worry about this any more.
> > >>
> > >> Another benefit (and you haven't touched on whether this matters)
> > >> is that in SolrCloud you do not have the latency of polling and
> > >> replicating from master to slave, in other words it supports Near Real
> > >> Time.
> > >>
> > >> This comes at some additional complexity however. If you have
> > >> your master node failing often enough to be a problem, you have
> > >> other issues ;)...
> > >>
> > >> And the recovery strategy if the master fails is straightforward:
> > >> 1> pick one of the slaves to be the master.
> > >> 2> update the other nodes to point to the new master
> > >> 3> re-index the docs from before the old master failed to the new
> > master.
> > >>
> > >> You can use system variables to not even have to manually edit all of
> > the
> > >> solrconfig files, just supply different -D parameters on startup.
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >> On Wed, Feb 10, 2016 at 10:39 PM, kshitij tyagi
> > >>  wrote:
> > >> > @Jack
> > >> >
> > >> > Currently we have around 55,00,000 docs
> > >> >
> > >> > Its not about load on one node we have load on different nodes at
> > >> different
> > >> > times as our traffic is huge around 60k users at a given point of
> time
> > >> >
> > >> > We want the hits on solr servers to be distributed so we are
> planning
> > to
> > >> > move on solr cloud as it would be fault tolerant.
> > >> >
> > >> >
> > >> >
> > >> > On Thu, Feb 11, 2016 at 11:10 AM, Midas A 
> > wrote:
> > >> >
> > >> >> hi,
> > >> >> what if master node fail what should be our fail over strategy  ?
> > >> >>
> > >> >> On Wed, Feb 10, 2016 at 9:12 PM, Jack Krupansky <
> > >> jack.krupan...@gmail.com>
> > >> >> wrote:
> > >> >>
> > >> >> > What exactly is your motivation? I mean, the primary benefit of
> > >> SolrCloud
> > >> >> > is better support for sharding, and you have only a single shard.
> > If
> > >> you
> > >> >> > have no need for sharding and your master-slave replicated Solr
> has
> > >> been
> > >> >> > working fine, then stick with it. If only one machine is having a
> > load
> > >> >> > problem, then that one node should be replaced. There are indeed
> > >> plenty
> > >> >> of
> > >> >> > good reasons to prefer SolrCloud over traditional master-slave
> > >> >> replication,
> > >> >> > but so far you haven't touched on any of them.
> > >> >> >
> > >> >> > How much data (number of documents) do you have?
> > >> >> >
> > >> >> > What is your typical query latency?
> > >> >> >
> > >> >> >
> > >> >> > -- Jack Krupansky
> > >> >> >
> > >> >> > On Wed, Feb 10, 2016 at 2:15 AM, kshitij tyagi <
> > >> >> > kshitij.shopcl...@gmail.com>
> > >> >> > wrote:
> > >> >> >
> > >> >> > > Hi,
> > >> >> > >
> > >> >> > > We are currently using solr 5.2 and I need to move on solr
> cloud
> > >> >> > > architecture.
> > >> >> > >
> > >> >> > > As of now we are using 5 machines :
> > >> >> > >
> > >> >> > > 1. I am using 1 master where we are indexing ourdata.
> > >> >> > > 2. I replicate my data on other machines
> > >> >> > >
> > >> >> > > One or the other machine keeps on showing hig

Re: Adding nodes

2016-02-15 Thread Susheel Kumar
Hi Paul,  Thanks for the detail, but I am still not able to understand how the
CoreAdmin API would make it easier for you to create replicas. I understand
that using the CoreAdmin API you can add more cores, but would that also
populate the data so the core can serve queries / act like a replica?

Second, as Shawn mentioned in the link above, adding replicas for auto-scaling
in near real time is not a good idea, since it puts more load on the system
and causes delays.

The exception would be if you have a copy of the index (assuming the index is
static) and can create more cores dynamically, in which case the CoreAdmin API
may work for your case.

Thanks,
Susheel



On Sun, Feb 14, 2016 at 2:17 PM, McCallick, Paul <
paul.e.mccall...@nordstrom.com> wrote:

> These are excellent questions and give me a good sense of why you suggest
> using the collections api.
>
> In our case we have 8 shards of product data with a even distribution of
> data per shard, no hot spots. We have very different load at different
> points in the year (cyber monday), and we tend to have very little traffic
> at night. I'm thinking of two use cases:
>
> 1) we are seeing increased latency due to load and want to add 8 more
> replicas to handle the query volume.  Once the volume subsides, we'd remove
> the nodes.
>
> 2) we lose a node due to some unexpected failure (ec2 tends to do this).
> We want auto scaling to detect the failure and add a node to replace the
> failed one.
>
> In both cases the core api makes it easy. It adds nodes to the shards
> evenly. Otherwise we have to write a fairly involved script that is subject
> to race conditions to determine which shard to add nodes to.
>
> Let me know if I'm making dangerous or uninformed assumptions, as I'm new
> to solr.
>
> Thanks,
> Paul
>
> > On Feb 14, 2016, at 10:35 AM, Susheel Kumar 
> wrote:
> >
> > Hi Pual,
> >
> >
> > For Auto-scaling, it depends on how you are thinking to design and
> what/how
> > do you want to scale. Which scenario you think makes coreadmin API easy
> to
> > use for a sharded SolrCloud environment?
> >
> > Isn't if in a sharded environment (assume 3 shards A,B & C) and shard B
> has
> > having higher or more load,  then you want to add Replica for shard B to
> > distribute the load or if a particular shard replica goes down then you
> > want to add another Replica back for the shard in which case ADDREPLICA
> > requires a shard name?
> >
> > Can you describe your scenario / provide more detail?
> >
> > Thanks,
> > Susheel
> >
> >
> >
> > On Sun, Feb 14, 2016 at 11:51 AM, McCallick, Paul <
> > paul.e.mccall...@nordstrom.com> wrote:
> >
> >> Hi all,
> >>
> >>
> >> This doesn’t really answer the following question:
> >>
> >> What is the suggested way to add a new node to a collection via the
> >> apis?  I  am specifically thinking of autoscale scenarios where a node
> has
> >> gone down or more nodes are needed to handle load.
> >>
> >>
> >> The coreadmin api makes this easy.  The collections api (ADDREPLICA),
> >> makes this very difficult.
> >>
> >>
> >>> On 2/14/16, 8:19 AM, "Susheel Kumar"  wrote:
> >>>
> >>> Hi Paul,
> >>>
> >>> Shawn is referring to use Collections API
> >>> https://cwiki.apache.org/confluence/display/solr/Collections+API  than
> >> Core
> >>> Admin API
> https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API
> >>> for SolrCloud.
> >>>
> >>> Hope that clarifies and you mentioned about ADDREPLICA which is the
> >>> collections API, so you are on right track.
> >>>
> >>> Thanks,
> >>> Susheel
> >>>
> >>>
> >>>
> >>> On Sun, Feb 14, 2016 at 10:51 AM, McCallick, Paul <
> >>> paul.e.mccall...@nordstrom.com> wrote:
> >>>
> >>>> Then what is the suggested way to add a new node to a collection via
> the
> >>>> apis?  I  am specifically thinking of autoscale scenarios where a node
> >> has
> >>>> gone down or more nodes are needed to handle load.
> >>>>
> >>>> Note that the ADDREPLICA endpoint requires a shard name, which puts
> the
> >>>> onus of how to scale out on the user. This can be challenging in an
> >>>> autoscale scenario.
> >>>>
> >>>> Thanks,
> >>>> Paul
> >>>>
> >>>>> On Feb 14, 2

Re: Running solr as a service vs. Running it as a process

2016-02-17 Thread Susheel Kumar
In addition, you get advantages such as being able to start/stop/restart Solr
with "service solr start|stop|restart" as mentioned above; you don't need to
launch the solr script directly. The install script also takes care of
installing and setting up Solr properly for a production environment, and you
can even automate the installation and launch of Solr with it.
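
For example, a typical invocation from the "Taking Solr to Production" guide
looks something like this (a sketch; the tarball version and paths are just
placeholders for whatever your environment uses):

  sudo bash ./install_solr_service.sh solr-5.4.1.tgz -i /opt -d /var/solr -u solr -s solr -p 8983
  sudo service solr status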

Thanks,
Susheel


On Wed, Feb 17, 2016 at 1:10 PM, Davis, Daniel (NIH/NLM) [C] <
daniel.da...@nih.gov> wrote:

> So, running solr as a service also runs it as a process.   In typical
> Linux environments, (based on initscripts), a service is a process
> installed to meet additional considerations:
>
> - Putting logs in predictable places where system operators and
> administrators expect to see logs - /var/logs
> - Putting dynamic data that varies again in predictable places where
> system administrators expect to see dynamic data.
> - Putting code for the process in /opt/solr - the /opt filesystem is for
> non-operating system components
> - Putting configuration files for the process again in predictable places.
> - Running the process as a non-root user, but also as a user that is not
> any one user's account - e.g. a "service" account
> - Making sure Solr starts at system startup and stops at system shutdown
> - Making sure only a single copy of the service is running
>
> The options implemented in the install_solr_service.sh command are meant
> to be generic to many Linux environments, e.g. appropriate for RHEL/CentOS,
> Ubuntu, and Amazon Linux.   My organization is large enough (and perhaps
> peculiar enough) to have its own standards for where administrators expect
> to see logs and where dynamic data should go.   However, I still need to
> make sure to run it as a service, and this is part of taking it to
> production.
>
> The command /sbin/service is part of a package called "initscripts" which
> is used on a number of different Linux environments.   Many systems are now
> using both that package and another, "systemd", that starts things somewhat
> differently.
>
> Hope this helps,
>
> Dan Davis, Systems/Applications Architect (Contractor),
> Office of Computer and Communications Systems,
> National Library of Medicine, NIH
>
>
> -Original Message-
> From: Binoy Dalal [mailto:binoydala...@gmail.com]
> Sent: Wednesday, February 17, 2016 2:17 AM
> To: SOLR users group 
> Subject: Running solr as a service vs. Running it as a process
>
> Hello everyone,
> I've read about running solr as a service but I don't understand what it
> really means.
>
> I went through the "Taking solr to production" documentation on the wiki
> which suggests that solr be installed using the script provided and run as
> a service.
> From what I could glean, the script creates a directory structure and sets
> various environment variables and then starts solr using the service
> command.
> How is this different from setting up solr manually and starting solr
> using `./solr start`?
>
> Currently in my project, we start solr as a process using the `./` Is this
> something that should be avoided and if so why?
>
> Additionally, and I know that this is not the right place to ask, yet if
> someone could explain what the service command actually does, that would be
> great. I've read a few articles and they say that it runs the init script
> in as predictable an environment as possible, but what does that mean?
>
> Thanks
> --
> Regards,
> Binoy Dalal
>


Re: OutOfMemory when batchupdating from SolrJ

2016-02-19 Thread Susheel Kumar
When you run your SolrJ client indexing program, can you increase the heap
size as shown below? I guess it may be on your client side that you are
running into OOM. Please share the exact error if the below doesn't work /
isn't the issue.

 java -Xmx4096m 


Thanks,

Susheel

On Fri, Feb 19, 2016 at 6:25 AM, Clemens Wyss DEV 
wrote:

> Guessing on ;) :
> must I commit after every "batch", in order to force a flushing of
> org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream et al?
>
> OTH it is propagated to NOT "commit" from a (SolrJ) client
>
> https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> 'Be very careful committing from the client! In fact, don’t do it'
>
> I would not want to commit "just to flush a client side buffer" ...
>
> -Ursprüngliche Nachricht-
> Von: Clemens Wyss DEV [mailto:clemens...@mysign.ch]
> Gesendet: Freitag, 19. Februar 2016 11:09
> An: solr-user@lucene.apache.org
> Betreff: AW: OutOfMemory when batchupdating from SolrJ
>
> The char[] which occupies 180MB has the following "path to root"
>
> char[87690841] @ 0x7940ba658   name="_my_id">shopproducts#...
> |-  java.lang.Thread @ 0x7321d9b80  SolrUtil executorService
> |for core 'fust-1-fr_CH_1' -3-thread-1 Thread
> |- value java.lang.String @ 0x79e804110   name="_my_id">shopproducts#...
> |  '- str org.apache.solr.common.util.ContentStreamBase$StringStream @
> 0x77fd84680
> | |-  java.lang.Thread @ 0x7321d9b80  SolrUtil
> executorService for core 'fust-1-fr_CH_1' -3-thread-1
> | |- contentStream
> org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream @
> 0x77fd846a0
> | |  |-  java.lang.Thread @ 0x7321d9b80  SolrUtil
> executorService for core 'fust-1-fr_CH_1' -3-thread-1 Thread
> | |  |- [0] org.apache.solr.common.util.ContentStream[1] @ 0x79e802fb8
> | |  |  '-  java.lang.Thread @ 0x7321d9b80  SolrUtil
> |executorService for core 'fust-1-fr_CH_1' -3-thread-1 Thread
>
> And there is another byte[] with 260MB.
>
> The logic is somewhat this:
>
> SolrClient solrClient = new HttpSolrClient( coreUrl ); while ( got more
> elements to index ) {
>   batch = create 100 SolrInputDocuments
>   solrClient.add( batch )
>  }
>
>
> -Ursprüngliche Nachricht-
> Von: Clemens Wyss DEV [mailto:clemens...@mysign.ch]
> Gesendet: Freitag, 19. Februar 2016 09:07
> An: solr-user@lucene.apache.org
> Betreff: OutOfMemory when batchupdating from SolrJ
>
> Environment: Solr 5.4.1
>
> I am facing OOMs when batchupdating SolrJ. I am seeing approx 30'000(!)
> SolrInputDocument instances, although my batchsize is 100. I.e. I call
> solrClient.add( documents ) for every 100 documents only. So I'd expect to
> see at most 100 SolrInputDocument's in memory at any moment UNLESS
> a) solrClient.add is "asynchronous" in its nature. Then QueryResponse
> would be an async-result?
> or
> b) SolrJ is spooling the documents in client-side
>
> What might be going wrong?
>
> Thx for your advices
> Clemens
>
>


Re: OutOfMemory when batchupdating from SolrJ

2016-02-19 Thread Susheel Kumar
And if it is on Solr side, please increase the heap size on Solr side
https://cwiki.apache.org/confluence/display/solr/JVM+Settings

On Fri, Feb 19, 2016 at 8:42 AM, Susheel Kumar 
wrote:

> When you run your SolrJ Client Indexing program, can you increase heap
> size similar below.  I guess it may be on your client side you are running
> int OOM... or please share the exact error if below doesn't work/is the
> issue.
>
>  java -Xmx4096m 
>
>
> Thanks,
>
> Susheel
>
> On Fri, Feb 19, 2016 at 6:25 AM, Clemens Wyss DEV 
> wrote:
>
>> Guessing on ;) :
>> must I commit after every "batch", in order to force a flushing of
>> org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream et al?
>>
>> OTH it is propagated to NOT "commit" from a (SolrJ) client
>>
>> https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>> 'Be very careful committing from the client! In fact, don’t do it'
>>
>> I would not want to commit "just to flush a client side buffer" ...
>>
>> -Ursprüngliche Nachricht-
>> Von: Clemens Wyss DEV [mailto:clemens...@mysign.ch]
>> Gesendet: Freitag, 19. Februar 2016 11:09
>> An: solr-user@lucene.apache.org
>> Betreff: AW: OutOfMemory when batchupdating from SolrJ
>>
>> The char[] which occupies 180MB has the following "path to root"
>>
>> char[87690841] @ 0x7940ba658  > name="_my_id">shopproducts#...
>> |-  java.lang.Thread @ 0x7321d9b80  SolrUtil executorService
>> |for core 'fust-1-fr_CH_1' -3-thread-1 Thread
>> |- value java.lang.String @ 0x79e804110  > name="_my_id">shopproducts#...
>> |  '- str org.apache.solr.common.util.ContentStreamBase$StringStream @
>> 0x77fd84680
>> | |-  java.lang.Thread @ 0x7321d9b80  SolrUtil
>> executorService for core 'fust-1-fr_CH_1' -3-thread-1
>> | |- contentStream
>> org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream @
>> 0x77fd846a0
>> | |  |-  java.lang.Thread @ 0x7321d9b80  SolrUtil
>> executorService for core 'fust-1-fr_CH_1' -3-thread-1 Thread
>> | |  |- [0] org.apache.solr.common.util.ContentStream[1] @ 0x79e802fb8
>> | |  |  '-  java.lang.Thread @ 0x7321d9b80  SolrUtil
>> |executorService for core 'fust-1-fr_CH_1' -3-thread-1 Thread
>>
>> And there is another byte[] with 260MB.
>>
>> The logic is somewhat this:
>>
>> SolrClient solrClient = new HttpSolrClient( coreUrl ); while ( got more
>> elements to index ) {
>>   batch = create 100 SolrInputDocuments
>>   solrClient.add( batch )
>>  }
>>
>>
>> -Ursprüngliche Nachricht-
>> Von: Clemens Wyss DEV [mailto:clemens...@mysign.ch]
>> Gesendet: Freitag, 19. Februar 2016 09:07
>> An: solr-user@lucene.apache.org
>> Betreff: OutOfMemory when batchupdating from SolrJ
>>
>> Environment: Solr 5.4.1
>>
>> I am facing OOMs when batchupdating SolrJ. I am seeing approx 30'000(!)
>> SolrInputDocument instances, although my batchsize is 100. I.e. I call
>> solrClient.add( documents ) for every 100 documents only. So I'd expect to
>> see at most 100 SolrInputDocument's in memory at any moment UNLESS
>> a) solrClient.add is "asynchronous" in its nature. Then QueryResponse
>> would be an async-result?
>> or
>> b) SolrJ is spooling the documents in client-side
>>
>> What might be going wrong?
>>
>> Thx for your advices
>> Clemens
>>
>>
>


Re: OutOfMemory when batchupdating from SolrJ

2016-02-19 Thread Susheel Kumar
Clemens,

First, allocating a higher (or the right) amount of heap memory is not a
workaround but a requirement, depending on how much heap memory your Java
program needs.
Please read about why Solr needs heap memory at
https://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Susheel



On Fri, Feb 19, 2016 at 9:17 AM, Clemens Wyss DEV 
wrote:

> > increase heap size
> this is a "workaround"
>
> Doesn't SolrClient free part of its buffer? At least documents it has sent
> to the Solr-Server?
>
> -----Ursprüngliche Nachricht-
> Von: Susheel Kumar [mailto:susheel2...@gmail.com]
> Gesendet: Freitag, 19. Februar 2016 14:42
> An: solr-user@lucene.apache.org
> Betreff: Re: OutOfMemory when batchupdating from SolrJ
>
> When you run your SolrJ Client Indexing program, can you increase heap
> size similar below.  I guess it may be on your client side you are running
> int OOM... or please share the exact error if below doesn't work/is the
> issue.
>
>  java -Xmx4096m 
>
>
> Thanks,
>
> Susheel
>
> On Fri, Feb 19, 2016 at 6:25 AM, Clemens Wyss DEV 
> wrote:
>
> > Guessing on ;) :
> > must I commit after every "batch", in order to force a flushing of
> > org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream et
> al?
> >
> > OTH it is propagated to NOT "commit" from a (SolrJ) client
> >
> > https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-
> > softcommit-and-commit-in-sorlcloud/
> > 'Be very careful committing from the client! In fact, don’t do it'
> >
> > I would not want to commit "just to flush a client side buffer" ...
> >
> > -Ursprüngliche Nachricht-
> > Von: Clemens Wyss DEV [mailto:clemens...@mysign.ch]
> > Gesendet: Freitag, 19. Februar 2016 11:09
> > An: solr-user@lucene.apache.org
> > Betreff: AW: OutOfMemory when batchupdating from SolrJ
> >
> > The char[] which occupies 180MB has the following "path to root"
> >
> > char[87690841] @ 0x7940ba658   > name="_my_id">shopproducts#...
> > |-  java.lang.Thread @ 0x7321d9b80  SolrUtil
> > |executorService for core 'fust-1-fr_CH_1' -3-thread-1 Thread
> > |- value java.lang.String @ 0x79e804110   > name="_my_id">shopproducts#...
> > |  '- str org.apache.solr.common.util.ContentStreamBase$StringStream @
> > 0x77fd84680
> > | |-  java.lang.Thread @ 0x7321d9b80  SolrUtil
> > executorService for core 'fust-1-fr_CH_1' -3-thread-1
> > | |- contentStream
> > org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream @
> > 0x77fd846a0
> > | |  |-  java.lang.Thread @ 0x7321d9b80  SolrUtil
> > executorService for core 'fust-1-fr_CH_1' -3-thread-1 Thread
> > | |  |- [0] org.apache.solr.common.util.ContentStream[1] @
> 0x79e802fb8
> > | |  |  '-  java.lang.Thread @ 0x7321d9b80  SolrUtil
> > |executorService for core 'fust-1-fr_CH_1' -3-thread-1 Thread
> >
> > And there is another byte[] with 260MB.
> >
> > The logic is somewhat this:
> >
> > SolrClient solrClient = new HttpSolrClient( coreUrl ); while ( got
> > more elements to index ) {
> >   batch = create 100 SolrInputDocuments
> >   solrClient.add( batch )
> >  }
> >
> >
> > -Ursprüngliche Nachricht-
> > Von: Clemens Wyss DEV [mailto:clemens...@mysign.ch]
> > Gesendet: Freitag, 19. Februar 2016 09:07
> > An: solr-user@lucene.apache.org
> > Betreff: OutOfMemory when batchupdating from SolrJ
> >
> > Environment: Solr 5.4.1
> >
> > I am facing OOMs when batchupdating SolrJ. I am seeing approx
> > 30'000(!) SolrInputDocument instances, although my batchsize is 100.
> > I.e. I call solrClient.add( documents ) for every 100 documents only.
> > So I'd expect to see at most 100 SolrInputDocument's in memory at any
> > moment UNLESS
> > a) solrClient.add is "asynchronous" in its nature. Then QueryResponse
> > would be an async-result?
> > or
> > b) SolrJ is spooling the documents in client-side
> >
> > What might be going wrong?
> >
> > Thx for your advices
> > Clemens
> >
> >
>


Re: OutOfMemory when batchupdating from SolrJ

2016-02-19 Thread Susheel Kumar
Clemens,

What I understand from your emails above is that you are creating
SolrInputDocuments in batches inside a loop, and those objects are created on
the heap. SolrJ/SolrClient has no control over removing those objects from the
heap; that is up to garbage collection. So your program may end up in a
situation where there is no heap memory left, or GC is not able to free memory
fast enough, and it hits OOM because the loop keeps creating more and more
objects. By default only a small heap is allocated to a Java program unless
you set -Xmx when you launch it.
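
For illustration, a minimal sketch of a batching loop that only ever holds one
batch of SolrInputDocuments on the client (SolrJ 5.x assumed; coreUrl, source
and toSolrDoc are placeholder names, exception handling omitted):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  HttpSolrClient solrClient = new HttpSolrClient(coreUrl);
  List<SolrInputDocument> batch = new ArrayList<>();
  while (source.hasNext()) {
      batch.add(toSolrDoc(source.next()));  // build one SolrInputDocument
      if (batch.size() == 100) {
          solrClient.add(batch);            // send the batch to Solr
          batch.clear();                    // drop references so GC can reclaim them
      }
  }
  if (!batch.isEmpty()) {
      solrClient.add(batch);                // send the final partial batch
  }
  solrClient.close();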

Hope that clarifies.

On Fri, Feb 19, 2016 at 12:11 PM, Clemens Wyss DEV 
wrote:

> Thanks Susheel,
> but I am having problems in and am talking about SolrJ, i.e. the
> "client-side of Solr" ...
>
> -Ursprüngliche Nachricht-
> Von: Susheel Kumar [mailto:susheel2...@gmail.com]
> Gesendet: Freitag, 19. Februar 2016 17:23
> An: solr-user@lucene.apache.org
> Betreff: Re: OutOfMemory when batchupdating from SolrJ
>
> Clemens,
>
> First allocating higher or right amount of heap memory is not a workaround
> but becomes a requirement depending on how much heap memory your Java
> program needs.
> Please read about why Solr need heap memory at
> https://wiki.apache.org/solr/SolrPerformanceProblems
>
> Thanks,
> Susheel
>
>
>
> On Fri, Feb 19, 2016 at 9:17 AM, Clemens Wyss DEV 
> wrote:
>
> > > increase heap size
> > this is a "workaround"
> >
> > Doesn't SolrClient free part of its buffer? At least documents it has
> > sent to the Solr-Server?
> >
> > -Ursprüngliche Nachricht-
> > Von: Susheel Kumar [mailto:susheel2...@gmail.com]
> > Gesendet: Freitag, 19. Februar 2016 14:42
> > An: solr-user@lucene.apache.org
> > Betreff: Re: OutOfMemory when batchupdating from SolrJ
> >
> > When you run your SolrJ Client Indexing program, can you increase heap
> > size similar below.  I guess it may be on your client side you are
> > running int OOM... or please share the exact error if below doesn't
> > work/is the issue.
> >
> >  java -Xmx4096m 
> >
> >
> > Thanks,
> >
> > Susheel
> >
> > On Fri, Feb 19, 2016 at 6:25 AM, Clemens Wyss DEV
> > 
> > wrote:
> >
> > > Guessing on ;) :
> > > must I commit after every "batch", in order to force a flushing of
> > > org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream
> > > et
> > al?
> > >
> > > OTH it is propagated to NOT "commit" from a (SolrJ) client
> > >
> > > https://lucidworks.com/blog/2013/08/23/understanding-transaction-log
> > > s-
> > > softcommit-and-commit-in-sorlcloud/
> > > 'Be very careful committing from the client! In fact, don’t do it'
> > >
> > > I would not want to commit "just to flush a client side buffer" ...
> > >
> > > -Ursprüngliche Nachricht-
> > > Von: Clemens Wyss DEV [mailto:clemens...@mysign.ch]
> > > Gesendet: Freitag, 19. Februar 2016 11:09
> > > An: solr-user@lucene.apache.org
> > > Betreff: AW: OutOfMemory when batchupdating from SolrJ
> > >
> > > The char[] which occupies 180MB has the following "path to root"
> > >
> > > char[87690841] @ 0x7940ba658   > > name="_my_id">shopproducts#...
> > > |-  java.lang.Thread @ 0x7321d9b80  SolrUtil
> > > |executorService for core 'fust-1-fr_CH_1' -3-thread-1 Thread
> > > |- value java.lang.String @ 0x79e804110   > > |boost="1.0"> > > name="_my_id">shopproducts#...
> > > |  '- str org.apache.solr.common.util.ContentStreamBase$StringStream
> > > | @
> > > 0x77fd84680
> > > | |-  java.lang.Thread @ 0x7321d9b80  SolrUtil
> > > executorService for core 'fust-1-fr_CH_1' -3-thread-1
> > > | |- contentStream
> > > org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream
> > > @
> > > 0x77fd846a0
> > > | |  |-  java.lang.Thread @ 0x7321d9b80  SolrUtil
> > > executorService for core 'fust-1-fr_CH_1' -3-thread-1 Thread
> > > | |  |- [0] org.apache.solr.common.util.ContentStream[1] @
> > 0x79e802fb8
> > > | |  |  '-  java.lang.Thread @ 0x7321d9b80  SolrUtil
> > > |executorService for core 'fust-1-fr_CH_1' -3-thread-1 Thread
> > >
> > > And there is another byte[] with 260MB.
> > >
> > > The logic is somewhat this:
> > >

Re: Slow commits

2016-02-22 Thread Susheel Kumar
Adam - how many documents do you have in your index?

Thanks,
Susheel

On Mon, Feb 22, 2016 at 4:37 AM, Adam Neal [Extranet] 
wrote:

Well, I got the numbers wrong; there are actually around 66000 fields in the
index. I have restructured the index and there are now around 1500 fields.
This has resulted in the commit taking 34 seconds, which is acceptable for my
usage; however, it is still significantly slower than the 4.10.2 commit on the
original 66000 fields, which took around 1 second.
> 
> From: Adam Neal [Extranet] [an...@mass.co.uk]
> Sent: 19 February 2016 17:43
> To: solr-user@lucene.apache.org
> Subject: RE: Slow commits
>
> I'm out of the office now so I don't have the numbers to hand but from
> memory I think there are probably around 800-1000 fields or so. I will
> confirm on Monday.
>
> If i have time over the weekend I will try and recreate the problem at
> home and see if I can post up a sample.
> 
> From: Yonik Seeley [ysee...@gmail.com]
> Sent: 19 February 2016 16:25
> To: solr-user@lucene.apache.org
> Subject: Re: Slow commits
>
> On Fri, Feb 19, 2016 at 8:51 AM, Adam Neal [Extranet] 
> wrote:
> > I've recently upgraded from 4.10.2 to 5.3.1 and I've hit an issue with
> slow commits on one of my cores. The core in question is relatively small
> (56k docs) and the issue only shows when commiting after a number of
> deletes, commiting after additions is fine. As an example commiting after
> deleting approximately 10% of the documents takes around 25mins. The same
> test on the 4.10.2 instance takes around 1 second.
> >
> > I have done some investigation and the problem appears to be caused by
> having dynamic fields, the core in question has a large number, performing
> the same operation on this core with the dynamic fields removed sees a big
> improvement on the performance with the commit taking 11 seconds (still not
> quite on a par with 4.10.2).
>
> Dynamic fields is a Solr schema concept, and does not translate to any
> differences in Lucene.
> You may be hitting something due to a large number of fields (at the
> lucene level, each field name is a different field).  How many
> different fields (i.e. fieldnames) do you have across the entire
> index?
>
> -Yonik
>
>

Re: Slow commits

2016-02-22 Thread Susheel Kumar
Sorry, I see now that you mentioned 56K docs, which is pretty small.

On Mon, Feb 22, 2016 at 8:30 AM, Susheel Kumar 
wrote:

> Adam - how many documents you have in your index?
>
> Thanks,
> Susheel
>
> On Mon, Feb 22, 2016 at 4:37 AM, Adam Neal [Extranet] 
> wrote:
>
>> Well I got the numbers wrong, there are actually around 66000 fields on
>> the index. I have restructured the index and there are now around 1500
>> fiields. This has resulted in the commit taking 34 seconds which is
>> acceptable for my usage however it is still significantly slower than the
>> 4.10.2 commit on the original 66000 fields which was around 1 second.
>> 
>> From: Adam Neal [Extranet] [an...@mass.co.uk]
>> Sent: 19 February 2016 17:43
>> To: solr-user@lucene.apache.org
>> Subject: RE: Slow commits
>>
>> I'm out of the office now so I don't have the numbers to hand but from
>> memory I think there are probably around 800-1000 fields or so. I will
>> confirm on Monday.
>>
>> If i have time over the weekend I will try and recreate the problem at
>> home and see if I can post up a sample.
>> 
>> From: Yonik Seeley [ysee...@gmail.com]
>> Sent: 19 February 2016 16:25
>> To: solr-user@lucene.apache.org
>> Subject: Re: Slow commits
>>
>> On Fri, Feb 19, 2016 at 8:51 AM, Adam Neal [Extranet] 
>> wrote:
>> > I've recently upgraded from 4.10.2 to 5.3.1 and I've hit an issue with
>> slow commits on one of my cores. The core in question is relatively small
>> (56k docs) and the issue only shows when commiting after a number of
>> deletes, commiting after additions is fine. As an example commiting after
>> deleting approximately 10% of the documents takes around 25mins. The same
>> test on the 4.10.2 instance takes around 1 second.
>> >
>> > I have done some investigation and the problem appears to be caused by
>> having dynamic fields, the core in question has a large number, performing
>> the same operation on this core with the dynamic fields removed sees a big
>> improvement on the performance with the commit taking 11 seconds (still not
>> quite on a par with 4.10.2).
>>
>> Dynamic fields is a Solr schema concept, and does not translate to any
>> differences in Lucene.
>> You may be hitting something due to a large number of fields (at the
>> lucene level, each field name is a different field).  How many
>> different fields (i.e. fieldnames) do you have across the entire
>> index?
>>
>> -Yonik
>>
>>

Re: SOLR cloud startup poniting to zookeeper ensemble

2016-02-23 Thread Susheel Kumar
Use this syntax and see if it works.

bin/solr start -e cloud -noprompt -z
localhost:2181,localhost:2182,localhost:2183

On Mon, Feb 22, 2016 at 11:16 PM, bbarani  wrote:

> I downloaded the latest version of SOLR (5.5.0) and also installed
> zookeeper
> on port 2181,2182,2183 and its running fine.
>
> Now when I try to start the SOLR instance using the below command its just
> showing help content rather than executing the command.
>
> bin/solr start -e cloud -z localhost:2181,localhost:2182,localhost:2183
> -noprompt
>
> The below command works with one zookeeper host.
> solr start -e cloud -z localhost:2181 -noprompt
>
> Am I missing anything?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SOLR-cloud-startup-poniting-to-zookeeper-ensemble-tp4259023.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: SOLR cloud startup poniting to zookeeper ensemble

2016-02-24 Thread Susheel Kumar
I see your point. I didn't realize that you are using Windows. If it works
using double quotes, please go ahead and launch it that way.
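
For reference, on Windows that would look something like the following (same
hosts/ports as in your example; a sketch only):

  bin\solr.cmd start -e cloud -noprompt -z "localhost:2181,localhost:2182,localhost:2183"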

Thanks,

Susheel

On Wed, Feb 24, 2016 at 12:44 PM, bbarani  wrote:

> Its still throwing error without quotes.
>
> solr start -e cloud -noprompt -z
> localhost:2181,localhost:2182,localhost:2183
>
> Invalid command-line option: localhost:2182
>
> Usage: solr start [-f] [-c] [-h hostname] [-p port] [-d directory] [-z
> zkHost] [
> -m memory] [-e example] [-s solr.solr.home] [-a "additional-options"] [-V]
>
>   -fStart Solr in foreground; default starts Solr in the
> background
>   and sends stdout / stderr to solr-PORT-console.log
>
>   -c or -cloud  Start Solr in SolrCloud mode; if -z not supplied, an
> embedded Zo
>
> *Info on using double quotes:*
>
>
> http://lucene.472066.n3.nabble.com/Solr-5-2-1-setup-zookeeper-ensemble-problem-td4215823.html#a4215877
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SOLR-cloud-startup-error-zookeeper-ensemble-windows-tp4259023p4259567.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Indexing Twitter - Hypothetical

2016-03-06 Thread Susheel Kumar
Entity recognition means you may want to recognize different entities
(name/person, email, location/city/state/country, etc.) in your
tweets/messages, with the goal of providing more relevant results to users.
NER can be applied at query time or at indexing (data enrichment) time.

Thanks,
Susheel

On Fri, Mar 4, 2016 at 7:55 PM, Joseph Obernberger <
joseph.obernber...@gmail.com> wrote:

> Thank you all very much for all the responses so far.  I've enjoyed reading
> them!  We have noticed that storing data inside of Solr results in
> significantly worse performance (particularly faceting); so we store the
> values of all the fields elsewhere, but index all the data with Solr
> Cloud.  I think the suggestion about splitting the data up into blocks of
> date/time is where we would be headed.  Having two Solr-Cloud clusters -
> one to handle ~30 days of data, and one to handle historical.  Another
> option is to use a single Solr Cloud cluster, but use multiple
> cores/collections.  Either way you'd need a job to come through and clean
> up old data. The historical cluster would have much worse performance,
> particularly for clustering and faceting the data, but that may be
> acceptable.
> I don't know what you mean by 'entity recognition in the queries' - could
> you elaborate?
>
> We would want to index and potentially facet on any of the fields - for
> example entities_media_url, username, even background color, but we do not
> know a-priori what fields will be important to users.
> As to why we would want to make the data searchable; well - I don't make
> the rules!  Tweets is not the only data source, but it's certainly the
> largest that we are currently looking at handling.
>
> I will read up on the Berlin Buzzwords - thank you for the info!
>
> -Joe
>
>
>
> On Fri, Mar 4, 2016 at 9:59 AM, Jack Krupansky 
> wrote:
>
> > As always, the initial question is how you intend to query the data -
> query
> > drives data modeling. How real-time do you need queries to be? How fast
> do
> > you need archive queries to be? How many fields do you need to query on?
> > How much entity recognition do you need in queries?
> >
> >
> > -- Jack Krupansky
> >
> > On Fri, Mar 4, 2016 at 4:19 AM, Charlie Hull  wrote:
> >
> > > On 03/03/2016 19:25, Toke Eskildsen wrote:
> > >
> > >> Joseph Obernberger  wrote:
> > >>
> > >>> Hi All - would it be reasonable to index the Twitter 'firehose'
> > >>> with Solr Cloud - roughly 500-600 million docs per day indexing
> > >>> each of the fields (about 180)?
> > >>>
> > >>
> > >> Possible, yes. Reasonable? It is not going to be cheap.
> > >>
> > >> Twitter index the tweets themselves and have been quite open about
> > >> how they do it. I would suggest looking for their presentations;
> > >> slides or recordings. They have presented at Berlin Buzzwords and
> > >> Lucene/Solr Revolution and probably elsewhere too. The gist is that
> > >> they have done a lot of work and custom coding to handle it.
> > >>
> > >
> > > As I recall they're not using Solr, but rather an in-house layer built
> on
> > > a customised version of Lucene. They're indexing around half a trillion
> > > tweets.
> > >
> > > If the idea is to provide a searchable archive of all tweets, my first
> > > question would be 'why': if the idea is to monitor new tweets for
> > > particular patterns there are better ways to do this (Luwak for
> example).
> > >
> > > Charlie
> > >
> > >
> > >> If I were to guess at a sharded setup to handle such data, and keep
> > >>> 2 years worth, I would guess about 2500 shards.  Is that
> > >>> reasonable?
> > >>>
> > >>
> > >> I think you need to think well beyond standard SolrCloud setups. Even
> > >> if you manage to get 2500 shards running, you will want to do a lot
> > >> of tweaking on the way to issue queries so that each request does not
> > >> require all 2500 shards to be searched. Prioritizing newer material
> > >> and only query the older shards if there is not enough resent results
> > >> is an example.
> > >>
> > >> I highly doubt that a single SolrCloud is the best answer here. Maybe
> > >> one cloud for each month and a lot of external logic?
> > >>
> > >> - Toke Eskildsen
> > >>
> > >>
> > >
> > > --
> > > Charlie Hull
> > > Flax - Open Source Enterprise Search
> > >
> > > tel/fax: +44 (0)8700 118334
> > > mobile:  +44 (0)7767 825828
> > > web: www.flax.co.uk
> > >
> >
>


Re: Solr Queries are very slow - Suggestions needed

2016-03-14 Thread Susheel Kumar
For each of the solr machines/shards you have.  Thanks.

On Mon, Mar 14, 2016 at 10:04 AM, Susheel Kumar 
wrote:

> Hello Anil,
>
> Can you go to Solr Admin Panel -> Dashboard and share all 4 memory
> parameters under System / share the snapshot. ?
>
> Thanks,
> Susheel
>
> On Mon, Mar 14, 2016 at 5:36 AM, Anil  wrote:
>
>> HI Toke and Jack,
>>
>> Please find the details below.
>>
>> * How large are your 3 shards in bytes? (total index across replicas)
>>   --  *146G. i am using CDH (cloudera), not sure how to check the
>> index size of each collection on each shard*
>> * What storage system do you use (local SSD, local spinning drives, remote
>> storage...)? *Local (hdfs) spinning drives*
>> * How much physical memory does your system have? *we have 15 data nodes.
>> multiple services installed on each data node (252 GB RAM for each data
>> node). 25 gb RAM allocated for solr service.*
>> * How much memory is free for disk cache? *i could not find.*
>> * How many concurrent queries do you issue? *very less. i dont see any
>> concurrent queries to this file_collection for now.*
>> * Do you update while you search? *Yes.. its very less.*
>> * What does a full query (rows, faceting, grouping, highlighting,
>> everything) look like? *for the file_collection, rows - 100, highlights =
>> false, no facets, expand = false.*
>> * How many documents does a typical query match (hitcount)? *it varies
>> with
>> each file. i have sort on int field to order commands in the query.*
>>
>> we have two sets of collections on solr cluster ( 17 data nodes)
>>
>> 1. main_collection - collection created per year. each collection uses 8
>> shards 2 replicas ex: main_collection_2016, main_collection_2015 etc
>>
>> 2. file_collection (where files having commands are indexed) - collection
>> created per 2 years. it uses 3 shards and 2 replicas. ex :
>> file_collection_2014, file_collection_2016
>>
>> The slowness is happening for file_collection. though it has 3 shards,
>> documents are available in 2 shards. shard1 - 150M docs and shard2 has
>> 330M
>> docs , shard3 is empty.
>>
>> main_collection is looks good.
>>
>> please let me know if you need any additional details.
>>
>> Regards,
>> Anil
>>
>>
>> On 13 March 2016 at 21:48, Anil  wrote:
>>
>> > Thanks Toke and Jack.
>> >
>> > Jack,
>> >
>> > Yes. it is 480 million :)
>> >
>> > I will share the additional details soon. thanks.
>> >
>> >
>> > Regards,
>> > Anil
>> >
>> >
>> >
>> >
>> >
>> > On 13 March 2016 at 21:06, Jack Krupansky 
>> > wrote:
>> >
>> >> (We should have a wiki/doc page for the "usual list of suspects" when
>> >> queries are/appear slow, rather than need to repeat the same mantra(s)
>> for
>> >> every inquiry on this topic.)
>> >>
>> >>
>> >> -- Jack Krupansky
>> >>
>> >> On Sun, Mar 13, 2016 at 11:29 AM, Toke Eskildsen <
>> t...@statsbiblioteket.dk>
>> >> wrote:
>> >>
>> >> > Anil  wrote:
>> >> > > i have indexed a data (commands from files) with 10 fields and 3 of
>> >> them
>> >> > is
>> >> > > text fields. collection is created with 3 shards and 2 replicas. I
>> >> have
>> >> > > used document routing as well.
>> >> >
>> >> > > Currently collection holds 47,80,01,405 records.
>> >> >
>> >> > ...480 million, right? Funny digit grouping in India.
>> >> >
>> >> > > text search against text field taking around 5 sec. solr is query
>> just
>> >> > and
>> >> > > of two terms with fl as 7 fields
>> >> >
>> >> > > fileId:"file unique id" AND command_text:(system login)
>> >> >
>> >> > While not an impressive response time, it might just be that your
>> >> hardware
>> >> > is not enough to handle that amount of documents. The usual culprit
>> is
>> >> IO
>> >> > speed, so chances are you have a system with spinning drives and not
>> >> enough
>> >> > RAM: Switch to SSD and/or add more RAM.
>> >> >
>> >> > To give better advice, we need more information.
>> >> >
>> >> > * How large are your 3 shards in bytes?
>> >> > * What storage system do you use (local SSD, local spinning drives,
>> >> remote
>> >> > storage...)?
>> >> > * How much physical memory does your system have?
>> >> > * How much memory is free for disk cache?
>> >> > * How many concurrent queries do you issue?
>> >> > * Do you update while you search?
>> >> > * What does a full query (rows, faceting, grouping, highlighting,
>> >> > everything) look like?
>> >> > * How many documents does a typical query match (hitcount)?
>> >> >
>> >> > - Toke Eskildsen
>> >> >
>> >>
>> >
>> >
>>
>
>


Re: Solr Queries are very slow - Suggestions needed

2016-03-14 Thread Susheel Kumar
Hello Anil,

Can you go to the Solr Admin Panel -> Dashboard and share all 4 memory
parameters under System, or share a screenshot of them?

Thanks,
Susheel

On Mon, Mar 14, 2016 at 5:36 AM, Anil  wrote:

> HI Toke and Jack,
>
> Please find the details below.
>
> * How large are your 3 shards in bytes? (total index across replicas)
>   --  *146G. i am using CDH (cloudera), not sure how to check the
> index size of each collection on each shard*
> * What storage system do you use (local SSD, local spinning drives, remote
> storage...)? *Local (hdfs) spinning drives*
> * How much physical memory does your system have? *we have 15 data nodes.
> multiple services installed on each data node (252 GB RAM for each data
> node). 25 gb RAM allocated for solr service.*
> * How much memory is free for disk cache? *i could not find.*
> * How many concurrent queries do you issue? *very less. i dont see any
> concurrent queries to this file_collection for now.*
> * Do you update while you search? *Yes.. its very less.*
> * What does a full query (rows, faceting, grouping, highlighting,
> everything) look like? *for the file_collection, rows - 100, highlights =
> false, no facets, expand = false.*
> * How many documents does a typical query match (hitcount)? *it varies with
> each file. i have sort on int field to order commands in the query.*
>
> we have two sets of collections on solr cluster ( 17 data nodes)
>
> 1. main_collection - collection created per year. each collection uses 8
> shards 2 replicas ex: main_collection_2016, main_collection_2015 etc
>
> 2. file_collection (where files having commands are indexed) - collection
> created per 2 years. it uses 3 shards and 2 replicas. ex :
> file_collection_2014, file_collection_2016
>
> The slowness is happening for file_collection. though it has 3 shards,
> documents are available in 2 shards. shard1 - 150M docs and shard2 has 330M
> docs , shard3 is empty.
>
> main_collection is looks good.
>
> please let me know if you need any additional details.
>
> Regards,
> Anil
>
>
> On 13 March 2016 at 21:48, Anil  wrote:
>
> > Thanks Toke and Jack.
> >
> > Jack,
> >
> > Yes. it is 480 million :)
> >
> > I will share the additional details soon. thanks.
> >
> >
> > Regards,
> > Anil
> >
> >
> >
> >
> >
> > On 13 March 2016 at 21:06, Jack Krupansky 
> > wrote:
> >
> >> (We should have a wiki/doc page for the "usual list of suspects" when
> >> queries are/appear slow, rather than need to repeat the same mantra(s)
> for
> >> every inquiry on this topic.)
> >>
> >>
> >> -- Jack Krupansky
> >>
> >> On Sun, Mar 13, 2016 at 11:29 AM, Toke Eskildsen <
> t...@statsbiblioteket.dk>
> >> wrote:
> >>
> >> > Anil  wrote:
> >> > > i have indexed a data (commands from files) with 10 fields and 3 of
> >> them
> >> > is
> >> > > text fields. collection is created with 3 shards and 2 replicas. I
> >> have
> >> > > used document routing as well.
> >> >
> >> > > Currently collection holds 47,80,01,405 records.
> >> >
> >> > ...480 million, right? Funny digit grouping in India.
> >> >
> >> > > text search against text field taking around 5 sec. solr is query
> just
> >> > and
> >> > > of two terms with fl as 7 fields
> >> >
> >> > > fileId:"file unique id" AND command_text:(system login)
> >> >
> >> > While not an impressive response time, it might just be that your
> >> hardware
> >> > is not enough to handle that amount of documents. The usual culprit is
> >> IO
> >> > speed, so chances are you have a system with spinning drives and not
> >> enough
> >> > RAM: Switch to SSD and/or add more RAM.
> >> >
> >> > To give better advice, we need more information.
> >> >
> >> > * How large are your 3 shards in bytes?
> >> > * What storage system do you use (local SSD, local spinning drives,
> >> remote
> >> > storage...)?
> >> > * How much physical memory does your system have?
> >> > * How much memory is free for disk cache?
> >> > * How many concurrent queries do you issue?
> >> > * Do you update while you search?
> >> > * What does a full query (rows, faceting, grouping, highlighting,
> >> > everything) look like?
> >> > * How many documents does a typical query match (hitcount)?
> >> >
> >> > - Toke Eskildsen
> >> >
> >>
> >
> >
>


Re: Solr Queries are very slow - Suggestions needed

2016-03-14 Thread Susheel Kumar
If you can find/know which field (or combination of fields) in your documents
divides/groups the data together, that would be the field to use for custom
routing. Solr supports up to two levels of routing.

E.g. a field such as documentType or country would help. See the document
routing docs at
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
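
For illustration (made-up values), with the default compositeId router you
prefix the unique key with the routing value(s), e.g.:

  id = US!doc1001              (one-level routing, e.g. on a country field)
  id = US!customer42!doc1001   (two-level routing)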



On Mon, Mar 14, 2016 at 3:14 PM, Erick Erickson 
wrote:

> Usually I just let the compositeId do its thing and only go for custom
> routing when the default proves inadequate.
>
> Note: your 480M documents may very well be too many for three shards!
> You really have to test
>
> Erick
>
>
> On Mon, Mar 14, 2016 at 10:04 AM, Anil  wrote:
> > Hi Erick,
> > In b/w, Do you recommend any effective shard distribution method ?
> >
> > Regards,
> > Anil
> >
> > On 14 March 2016 at 22:30, Erick Erickson 
> wrote:
> >
> >> Try shards.info=true, but pinging the shard directly is the most
> certain.
> >>
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Mar 14, 2016 at 9:48 AM, Anil  wrote:
> >> > HI Erik,
> >> >
> >> > we have used document routing to balance the shards load and for
> >> > expand/collapse. it is mainly used for main_collection which holds
> one to
> >> > many relationship records. In file_collection, it is only for load
> >> > distribution.
> >> >
> >> > 25GB for entire solr service. each machine will act as shard for some
> >> > collections.
> >> >
> >> > we have not stress tested our servers at least for solr service. i
> have
> >> > read the the link you have shared, i will do something on it. thanks
> for
> >> > sharing.
> >> >
> >> > i have checked other collections, where index size is max 90GB and 5
> M as
> >> > max number of documents. but for the particular file_collection_2014
> , i
> >> > see total index size across replicas is 147 GB.
> >> >
> >> > Can we get any hints if we run the query with debugQuery=true ?  what
> is
> >> > the effective way of load distribution ? Please advice.
> >> >
> >> > Regards,
> >> > Anil
> >> >
> >> > On 14 March 2016 at 20:32, Erick Erickson 
> >> wrote:
> >> >
> >> >> bq: The slowness is happening for file_collection. though it has 3
> >> shards,
> >> >> documents are available in 2 shards. shard1 - 150M docs and shard2
> has
> >> 330M
> >> >> docs , shard3 is empty.
> >> >>
> >> >> Well, this collection terribly balanced. Putting 330M docs on a
> single
> >> >> shard is
> >> >> pushing the limits, the only time I've seen that many docs on a
> shard,
> >> >> particularly
> >> >> with 25G of ram, they were very small records. My guess is that you
> will
> >> >> find
> >> >> the queries you send to that shard substantially slower than the 150M
> >> >> shard,
> >> >> although 150M could also be pushing your limits. You can measure this
> >> >> by sending the query to the specific core (something like
> >> >>
> >> >> solr/files_shard1_replica1/query?(your queryhere)&distrib=false
> >> >>
> >> >> My bet is that your QTime will be significantly different with the
> two
> >> >> shards.
> >> >>
> >> >> It also sounds like you're using implicit routing where you control
> >> where
> >> >> the
> >> >> files go, it's easy to have unbalanced shards in that case, why did
> you
> >> >> decide
> >> >> to do it this way? There are valid reasons, but...
> >> >>
> >> >> In short, my guess is that you've simply overloaded your shard with
> >> >> 330M docs. It's
> >> >> not at all clear that even 150 will give you satisfactory
> performance,
> >> >> have you stress
> >> >> tested your servers? Here's the long form of sizing:
> >> >>
> >> >>
> >> >>
> >>
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Mon, Mar 14, 2016 at 7:05 AM, Susheel Kumar <
>

Re: Indexing 700 docs per second

2016-04-19 Thread Susheel Kumar
It sounds achievable with your machine configuration, and I would suggest
trying out atomic updates.  Use SolrJ with multi-threaded indexing for a
higher indexing rate.
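
A rough sketch of what an atomic update looks like in SolrJ (zkHost,
collection and field names are placeholders; for 700 docs/sec you would
run calls like this from a thread pool and send them in batches):

import java.util.Collections;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateSketch {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
      client.setDefaultCollection("mycollection");

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-42");
      // atomic update: send only the two fields that change; note that the
      // remaining fields must be stored for atomic updates to work
      doc.addField("count_i", Collections.singletonMap("inc", 1));
      doc.addField("status_s", Collections.singletonMap("set", "ACTIVE"));

      client.add(doc);
      // rely on autoCommit/autoSoftCommit rather than committing per document
    }
  }
}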

Thanks,
Susheel



On Tue, Apr 19, 2016 at 9:27 AM, Tom Evans  wrote:

> On Tue, Apr 19, 2016 at 10:25 AM, Mark Robinson 
> wrote:
> > Hi,
> >
> > I have a requirement to index (mainly updation) 700 docs per second.
> > Suppose I have a 128GB RAM, 32 CPU machine, with each doc size around 260
> > byes (6 fields out of which only 2 will undergo updation at the above
> > rate). This collection has around 122Million docs and that count is
> pretty
> > much a constant.
> >
> > 1. Can I manage this updation rate with a non-sharded ie single Solr
> > instance set up?
> > 2. Also is atomic update or a full update (the whole doc) of the changed
> > records the better approach in this case.
> >
> > Could some one please share their views/ experience?
>
> Try it and see - everyone's data/schemas are different and can affect
> indexing speed. It certainly sounds achievable enough - presumably you
> can at least produce the documents at that rate?
>
> Cheers
>
> Tom
>


ReversedWildcardFilterFactory question

2016-05-04 Thread Susheel Kumar
Hello,

I wanted to confirm that using the type below for fields where the user
*may also* search with a leading wildcard is a good solution, and that the
edismax query parser would automatically reverse the query string in the
case of a leading-wildcard search, e.g. q=text:*plane would automatically
be rewritten by the query parser to run as a trailing-wildcard query
against the reverse-indexed tokens?

Thanks,
Susheel


  <fieldType name="..." class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="..."/>
      ...
      <filter class="solr.ReversedWildcardFilterFactory" .../>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="..."/>
      ...
    </analyzer>
  </fieldType>


Solr cloud 6.0.0 with ZooKeeper 3.4.8 Errors

2016-05-04 Thread Susheel Kumar
Hello,

I am trying to set up a 2 node Solr Cloud 6 cluster with ZK 3.4.8 and used
the install service to set up Solr.

After launching the Solr Admin Panel on server1, it loses the connection
after a few seconds and then comes back, and the other node server2 is
marked as Down in the cloud graph.  After a few seconds it loses the
connection and comes back again.

Any idea what may be going wrong?  Has anyone used Solr 6 with ZK 3.4.8?
I have never seen this error before with Solr 5.x and ZK 3.4.6.

Below are the logs from server1 & server2.  The ZK ensemble has 3 nodes
with chroot enabled.

Thanks,
Susheel

server1/solr.log




2016-05-04 19:20:53.804 INFO  (qtp1989972246-14) [   ]
o.a.s.c.c.ZkStateReader path=[/collections/collection1]
[configName]=[collection1] specified config exists in ZooKeeper

2016-05-04 19:20:53.806 INFO  (qtp1989972246-14) [   ] o.a.s.s.HttpSolrCall
[admin] webapp=null path=/admin/collections
params={action=CLUSTERSTATUS&wt=json&_=1462389588125} status=0 QTime=25

2016-05-04 19:20:53.859 INFO  (qtp1989972246-19) [   ]
o.a.s.h.a.CollectionsHandler Invoked Collection Action :list with params
action=LIST&wt=json&_=1462389588125 and sendToOCPQueue=true

2016-05-04 19:20:53.861 INFO  (qtp1989972246-19) [   ] o.a.s.s.HttpSolrCall
[admin] webapp=null path=/admin/collections
params={action=LIST&wt=json&_=1462389588125} status=0 QTime=2

2016-05-04 19:20:57.520 INFO  (qtp1989972246-13) [   ] o.a.s.s.HttpSolrCall
[admin] webapp=null path=/admin/cores
params={indexInfo=false&wt=json&_=1462389588124} status=0 QTime=0

2016-05-04 19:20:57.546 INFO  (qtp1989972246-15) [   ] o.a.s.s.HttpSolrCall
[admin] webapp=null path=/admin/info/system
params={wt=json&_=1462389588126} status=0 QTime=25

2016-05-04 19:20:57.610 INFO  (qtp1989972246-13) [   ]
o.a.s.h.a.CollectionsHandler Invoked Collection Action :list with params
action=LIST&wt=json&_=1462389588125 and sendToOCPQueue=true

2016-05-04 19:20:57.613 INFO  (qtp1989972246-13) [   ] o.a.s.s.HttpSolrCall
[admin] webapp=null path=/admin/collections
params={action=LIST&wt=json&_=1462389588125} status=0 QTime=3

2016-05-04 19:21:29.139 INFO  (qtp1989972246-5980) [   ]
o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
when connecting to {}->http://server2:8983: Too many open files

2016-05-04 19:21:29.139 INFO  (qtp1989972246-5983) [   ]
o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
when connecting to {}->http://server2:8983: Too many open files

2016-05-04 19:21:29.139 INFO  (qtp1989972246-5984) [   ]
o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
when connecting to {}->http://server2:8983: Too many open files

2016-05-04 19:21:29.141 INFO  (qtp1989972246-5984) [   ]
o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983

2016-05-04 19:21:29.141 INFO  (qtp1989972246-5984) [   ]
o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
when connecting to {}->http://server2:8983: Too many open files

2016-05-04 19:21:29.142 INFO  (qtp1989972246-5984) [   ]
o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983

2016-05-04 19:21:29.142 INFO  (qtp1989972246-5984) [   ]
o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
when connecting to {}->http://server2:8983: Too many open files

2016-05-04 19:21:29.142 INFO  (qtp1989972246-5984) [   ]
o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983

2016-05-04 19:21:29.140 INFO  (qtp1989972246-5983) [   ]
o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983

2016-05-04 19:21:29.140 INFO  (qtp1989972246-5980) [   ]
o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983

2016-05-04 19:21:29.143 INFO  (qtp1989972246-5983) [   ]
o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
when connecting to {}->http://server2:8983: Too many open files

2016-05-04 19:21:29.144 INFO  (qtp1989972246-5983) [   ]
o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983

2016-05-04 19:21:29.144 INFO  (qtp1989972246-5980) [   ]
o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
when connecting to {}->http://server2:8983: Too many open files

2016-05-04 19:21:29.144 INFO  (qtp1989972246-5983) [   ]
o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
when connecting to {}->http://server2:8983: Too many open files

2016-05-04 19:20:53.806 INFO  (qtp1989972246-14) [   ] o.a.s.s.HttpSolrCall
[admin] webapp=null path=/admin/collections
params={action=CLUSTERSTATUS&wt=json&_=1462389588125} status=0 QTime=25

2016-05-04 19:20:53.859 INFO  (qtp1989972246-19) [   ]
o.a.s.h.a.CollectionsHandler Invoked Collection Action :list with params
action=LIST&wt=json&_=1462389588125 and sendToOCPQueue=true

2016-05-04 19:20:53.861 INFO  (qtp1989972246-19) [   ] o.a.s.s.HttpSolrCall
[admin] webapp=null path=/admin/collections
params={action=LIST&wt=json&_=1462389588125} status=0 QTime=2

2016-05-04 19

Re: Solr cloud 6.0.0 with ZooKeeper 3.4.8 Errors

2016-05-04 Thread Susheel Kumar
Thanks, Nick. Do we know of any suggested # for the file descriptor limit
with Solr 6?  Also wondering why I haven't seen this problem before with
Solr 5.x.

On Wed, May 4, 2016 at 4:54 PM, Nick Vasilyev 
wrote:

> It looks like you have too many open files, try increasing the file
> descriptor limit.
>
> On Wed, May 4, 2016 at 3:48 PM, Susheel Kumar 
> wrote:
>
> > Hello,
> >
> > I am trying to setup 2 node Solr cloud 6 cluster with ZK 3.4.8 and used
> the
> > install service to setup solr.
> >
> > After launching Solr Admin Panel on server1, it looses connections in few
> > seconds and then comes back and other node server2 is marked as Down in
> > cloud graph. After few seconds its loosing the connection and comes back.
> >
> > Any idea what may be going wrong? Has anyone used Solr 6 with ZK 3.4.8.
> > Have never seen this error before with solr 5.x with ZK 3.4.6.
> >
> > Below log from server1 & server2.  The ZK has 3 nodes with chroot
> enabled.
> >
> > Thanks,
> > Susheel
> >
> > server1/solr.log
> >
> > 
> >
> >
> > 2016-05-04 19:20:53.804 INFO  (qtp1989972246-14) [   ]
> > o.a.s.c.c.ZkStateReader path=[/collections/collection1]
> > [configName]=[collection1] specified config exists in ZooKeeper
> >
> > 2016-05-04 19:20:53.806 INFO  (qtp1989972246-14) [   ]
> o.a.s.s.HttpSolrCall
> > [admin] webapp=null path=/admin/collections
> > params={action=CLUSTERSTATUS&wt=json&_=1462389588125} status=0 QTime=25
> >
> > 2016-05-04 19:20:53.859 INFO  (qtp1989972246-19) [   ]
> > o.a.s.h.a.CollectionsHandler Invoked Collection Action :list with params
> > action=LIST&wt=json&_=1462389588125 and sendToOCPQueue=true
> >
> > 2016-05-04 19:20:53.861 INFO  (qtp1989972246-19) [   ]
> o.a.s.s.HttpSolrCall
> > [admin] webapp=null path=/admin/collections
> > params={action=LIST&wt=json&_=1462389588125} status=0 QTime=2
> >
> > 2016-05-04 19:20:57.520 INFO  (qtp1989972246-13) [   ]
> o.a.s.s.HttpSolrCall
> > [admin] webapp=null path=/admin/cores
> > params={indexInfo=false&wt=json&_=1462389588124} status=0 QTime=0
> >
> > 2016-05-04 19:20:57.546 INFO  (qtp1989972246-15) [   ]
> o.a.s.s.HttpSolrCall
> > [admin] webapp=null path=/admin/info/system
> > params={wt=json&_=1462389588126} status=0 QTime=25
> >
> > 2016-05-04 19:20:57.610 INFO  (qtp1989972246-13) [   ]
> > o.a.s.h.a.CollectionsHandler Invoked Collection Action :list with params
> > action=LIST&wt=json&_=1462389588125 and sendToOCPQueue=true
> >
> > 2016-05-04 19:20:57.613 INFO  (qtp1989972246-13) [   ]
> o.a.s.s.HttpSolrCall
> > [admin] webapp=null path=/admin/collections
> > params={action=LIST&wt=json&_=1462389588125} status=0 QTime=3
> >
> > 2016-05-04 19:21:29.139 INFO  (qtp1989972246-5980) [   ]
> > o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException)
> caught
> > when connecting to {}->http://server2:8983: Too many open files
> >
> > 2016-05-04 19:21:29.139 INFO  (qtp1989972246-5983) [   ]
> > o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException)
> caught
> > when connecting to {}->http://server2:8983: Too many open files
> >
> > 2016-05-04 19:21:29.139 INFO  (qtp1989972246-5984) [   ]
> > o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException)
> caught
> > when connecting to {}->http://server2:8983: Too many open files
> >
> > 2016-05-04 19:21:29.141 INFO  (qtp1989972246-5984) [   ]
> > o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983
> >
> > 2016-05-04 19:21:29.141 INFO  (qtp1989972246-5984) [   ]
> > o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException)
> caught
> > when connecting to {}->http://server2:8983: Too many open files
> >
> > 2016-05-04 19:21:29.142 INFO  (qtp1989972246-5984) [   ]
> > o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983
> >
> > 2016-05-04 19:21:29.142 INFO  (qtp1989972246-5984) [   ]
> > o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException)
> caught
> > when connecting to {}->http://server2:8983: Too many open files
> >
> > 2016-05-04 19:21:29.142 INFO  (qtp1989972246-5984) [   ]
> > o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983
> >
> > 2016-05-04 19:21:29.140 INFO  (qtp1989972246-5983) [   ]
> > o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983
> >
> > 2016-05-04 19:21:29.140 INFO  (qtp1989972246-5980) [   ]
> &g

Re: Solr cloud 6.0.0 with ZooKeeper 3.4.8 Errors

2016-05-04 Thread Susheel Kumar
Thanks, Nick & Hoss.  I am using the exact same machine; I have wiped out
Solr 5.5.0 and installed Solr 6.0.0 with an external ZK 3.4.8.  I checked
the file descriptor limit for the solr user, which was 12000, and increased
it to 52000.  I don't see the "Too many open files" error in the Solr log
anymore, but the Solr connection still gets lost in the Admin panel.

Let me do some more tests and install the older version back to confirm,
and I will share the findings.

Thanks,
Susheel

On Wed, May 4, 2016 at 8:11 PM, Chris Hostetter 
wrote:

>
> : Thanks, Nick. Do we know any suggested # for file descriptor limit with
> : Solr6?  Also wondering why i haven't seen this problem before with Solr
> 5.x?
>
> are you running Solr6 on the exact same host OS that you were running
> Solr5 on?
>
> even if you are using the "same OS version" on a diff machine, that could
> explain the discrepency if you (or someone else) increased the file
> descriptor limit on the "old machine" but that neverh appened on the 'new
> machine"
>
>
>
> : On Wed, May 4, 2016 at 4:54 PM, Nick Vasilyev 
> : wrote:
> :
> : > It looks like you have too many open files, try increasing the file
> : > descriptor limit.
> : >
> : > On Wed, May 4, 2016 at 3:48 PM, Susheel Kumar 
> : > wrote:
> : >
> : > > Hello,
> : > >
> : > > I am trying to setup 2 node Solr cloud 6 cluster with ZK 3.4.8 and
> used
> : > the
> : > > install service to setup solr.
> : > >
> : > > After launching Solr Admin Panel on server1, it looses connections
> in few
> : > > seconds and then comes back and other node server2 is marked as Down
> in
> : > > cloud graph. After few seconds its loosing the connection and comes
> back.
> : > >
> : > > Any idea what may be going wrong? Has anyone used Solr 6 with ZK
> 3.4.8.
> : > > Have never seen this error before with solr 5.x with ZK 3.4.6.
> : > >
> : > > Below log from server1 & server2.  The ZK has 3 nodes with chroot
> : > enabled.
> : > >
> : > > Thanks,
> : > > Susheel
> : > >
> : > > server1/solr.log
> : > >
> : > > 
> : > >
> : > >
> : > > 2016-05-04 19:20:53.804 INFO  (qtp1989972246-14) [   ]
> : > > o.a.s.c.c.ZkStateReader path=[/collections/collection1]
> : > > [configName]=[collection1] specified config exists in ZooKeeper
> : > >
> : > > 2016-05-04 19:20:53.806 INFO  (qtp1989972246-14) [   ]
> : > o.a.s.s.HttpSolrCall
> : > > [admin] webapp=null path=/admin/collections
> : > > params={action=CLUSTERSTATUS&wt=json&_=1462389588125} status=0
> QTime=25
> : > >
> : > > 2016-05-04 19:20:53.859 INFO  (qtp1989972246-19) [   ]
> : > > o.a.s.h.a.CollectionsHandler Invoked Collection Action :list with
> params
> : > > action=LIST&wt=json&_=1462389588125 and sendToOCPQueue=true
> : > >
> : > > 2016-05-04 19:20:53.861 INFO  (qtp1989972246-19) [   ]
> : > o.a.s.s.HttpSolrCall
> : > > [admin] webapp=null path=/admin/collections
> : > > params={action=LIST&wt=json&_=1462389588125} status=0 QTime=2
> : > >
> : > > 2016-05-04 19:20:57.520 INFO  (qtp1989972246-13) [   ]
> : > o.a.s.s.HttpSolrCall
> : > > [admin] webapp=null path=/admin/cores
> : > > params={indexInfo=false&wt=json&_=1462389588124} status=0 QTime=0
> : > >
> : > > 2016-05-04 19:20:57.546 INFO  (qtp1989972246-15) [   ]
> : > o.a.s.s.HttpSolrCall
> : > > [admin] webapp=null path=/admin/info/system
> : > > params={wt=json&_=1462389588126} status=0 QTime=25
> : > >
> : > > 2016-05-04 19:20:57.610 INFO  (qtp1989972246-13) [   ]
> : > > o.a.s.h.a.CollectionsHandler Invoked Collection Action :list with
> params
> : > > action=LIST&wt=json&_=1462389588125 and sendToOCPQueue=true
> : > >
> : > > 2016-05-04 19:20:57.613 INFO  (qtp1989972246-13) [   ]
> : > o.a.s.s.HttpSolrCall
> : > > [admin] webapp=null path=/admin/collections
> : > > params={action=LIST&wt=json&_=1462389588125} status=0 QTime=3
> : > >
> : > > 2016-05-04 19:21:29.139 INFO  (qtp1989972246-5980) [   ]
> : > > o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException)
> : > caught
> : > > when connecting to {}->http://server2:8983: Too many open files
> : > >
> : > > 2016-05-04 19:21:29.139 INFO  (qtp1989972246-5983) [   ]
> : > > o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException)
> : > caught
> : > > 

Re: Solr cloud 6.0.0 with ZooKeeper 3.4.8 Errors

2016-05-05 Thread Susheel Kumar
Nick, Hoss -  Things are back to normal with ZK 3.4.8 and Solr 6.0.0.  I
switched to Solr 5.5.0 with ZK 3.4.8, which worked fine, and then installed
6.0.0.  I suspect (not 100% sure) that I had left over ZK dataDir / Solr
collection directory data from the previous ZK/Solr versions, which was
probably putting Solr 6 into an unstable state.

Thanks,
Susheel

On Wed, May 4, 2016 at 9:56 PM, Susheel Kumar  wrote:

> Thanks, Nick & Hoss.  I am using the exact same machine, have wiped out
> solr 5.5.0 and installed solr-6.0.0 with external ZK 3.4.8.  I checked the
> File Description limit for user solr, which is 12000 and increased to
> 52000. Don't see "too many files open..." error now in Solr log but still
> Solr connection getting lost in Admin panel.
>
> Let me do some more tests and install older version back to confirm and
> will share the findings.
>
> Thanks,
> Susheel
>
> On Wed, May 4, 2016 at 8:11 PM, Chris Hostetter 
> wrote:
>
>>
>> : Thanks, Nick. Do we know any suggested # for file descriptor limit with
>> : Solr6?  Also wondering why i haven't seen this problem before with Solr
>> 5.x?
>>
>> are you running Solr6 on the exact same host OS that you were running
>> Solr5 on?
>>
>> even if you are using the "same OS version" on a diff machine, that could
>> explain the discrepency if you (or someone else) increased the file
>> descriptor limit on the "old machine" but that neverh appened on the 'new
>> machine"
>>
>>
>>
>> : On Wed, May 4, 2016 at 4:54 PM, Nick Vasilyev > >
>> : wrote:
>> :
>> : > It looks like you have too many open files, try increasing the file
>> : > descriptor limit.
>> : >
>> : > On Wed, May 4, 2016 at 3:48 PM, Susheel Kumar 
>> : > wrote:
>> : >
>> : > > Hello,
>> : > >
>> : > > I am trying to setup 2 node Solr cloud 6 cluster with ZK 3.4.8 and
>> used
>> : > the
>> : > > install service to setup solr.
>> : > >
>> : > > After launching Solr Admin Panel on server1, it looses connections
>> in few
>> : > > seconds and then comes back and other node server2 is marked as
>> Down in
>> : > > cloud graph. After few seconds its loosing the connection and comes
>> back.
>> : > >
>> : > > Any idea what may be going wrong? Has anyone used Solr 6 with ZK
>> 3.4.8.
>> : > > Have never seen this error before with solr 5.x with ZK 3.4.6.
>> : > >
>> : > > Below log from server1 & server2.  The ZK has 3 nodes with chroot
>> : > enabled.
>> : > >
>> : > > Thanks,
>> : > > Susheel
>> : > >
>> : > > server1/solr.log
>> : > >
>> : > > 
>> : > >
>> : > >
>> : > > 2016-05-04 19:20:53.804 INFO  (qtp1989972246-14) [   ]
>> : > > o.a.s.c.c.ZkStateReader path=[/collections/collection1]
>> : > > [configName]=[collection1] specified config exists in ZooKeeper
>> : > >
>> : > > 2016-05-04 19:20:53.806 INFO  (qtp1989972246-14) [   ]
>> : > o.a.s.s.HttpSolrCall
>> : > > [admin] webapp=null path=/admin/collections
>> : > > params={action=CLUSTERSTATUS&wt=json&_=1462389588125} status=0
>> QTime=25
>> : > >
>> : > > 2016-05-04 19:20:53.859 INFO  (qtp1989972246-19) [   ]
>> : > > o.a.s.h.a.CollectionsHandler Invoked Collection Action :list with
>> params
>> : > > action=LIST&wt=json&_=1462389588125 and sendToOCPQueue=true
>> : > >
>> : > > 2016-05-04 19:20:53.861 INFO  (qtp1989972246-19) [   ]
>> : > o.a.s.s.HttpSolrCall
>> : > > [admin] webapp=null path=/admin/collections
>> : > > params={action=LIST&wt=json&_=1462389588125} status=0 QTime=2
>> : > >
>> : > > 2016-05-04 19:20:57.520 INFO  (qtp1989972246-13) [   ]
>> : > o.a.s.s.HttpSolrCall
>> : > > [admin] webapp=null path=/admin/cores
>> : > > params={indexInfo=false&wt=json&_=1462389588124} status=0 QTime=0
>> : > >
>> : > > 2016-05-04 19:20:57.546 INFO  (qtp1989972246-15) [   ]
>> : > o.a.s.s.HttpSolrCall
>> : > > [admin] webapp=null path=/admin/info/system
>> : > > params={wt=json&_=1462389588126} status=0 QTime=25
>> : > >
>> : > > 2016-05-04 19:20:57.610 INFO  (qtp1989972246-13) [   ]
>> : > > o.a.s.h.a.CollectionsHandler Invoked Collection Action :list with
>> params
>> : > > action=

Re: Solr cloud 6.0.0 with ZooKeeper 3.4.8 Errors

2016-05-05 Thread Susheel Kumar
Yes, Nick, I am using a chroot to share the ZK ensemble between the
different instances.

On Thu, May 5, 2016 at 3:08 PM, Nick Vasilyev 
wrote:

> Just out of curiosity, are you sharing the zookeepers between the
> different versions of Solr? If so, are you specifying a zookeeper chroot?
> On May 5, 2016 2:05 PM, "Susheel Kumar"  wrote:
>
> > Nick, Hoss -  Things are back to normal with ZK 3.4.8 and ZK-6.0.0.  I
> > switched to Solr 5.5.0 with ZK 3.4.8 which worked fine and then installed
> > 6.0.0.  I suspect (not 100% sure) i left ZK dataDir / Solr collection
> > directory data from previous ZK/solr version which probably was making
> Solr
> > 6 in unstable state.
> >
> > Thanks,
> > Susheel
> >
> > On Wed, May 4, 2016 at 9:56 PM, Susheel Kumar 
> > wrote:
> >
> > > Thanks, Nick & Hoss.  I am using the exact same machine, have wiped out
> > > solr 5.5.0 and installed solr-6.0.0 with external ZK 3.4.8.  I checked
> > the
> > > File Description limit for user solr, which is 12000 and increased to
> > > 52000. Don't see "too many files open..." error now in Solr log but
> still
> > > Solr connection getting lost in Admin panel.
> > >
> > > Let me do some more tests and install older version back to confirm and
> > > will share the findings.
> > >
> > > Thanks,
> > > Susheel
> > >
> > > On Wed, May 4, 2016 at 8:11 PM, Chris Hostetter <
> > hossman_luc...@fucit.org>
> > > wrote:
> > >
> > >>
> > >> : Thanks, Nick. Do we know any suggested # for file descriptor limit
> > with
> > >> : Solr6?  Also wondering why i haven't seen this problem before with
> > Solr
> > >> 5.x?
> > >>
> > >> are you running Solr6 on the exact same host OS that you were running
> > >> Solr5 on?
> > >>
> > >> even if you are using the "same OS version" on a diff machine, that
> > could
> > >> explain the discrepency if you (or someone else) increased the file
> > >> descriptor limit on the "old machine" but that neverh appened on the
> > 'new
> > >> machine"
> > >>
> > >>
> > >>
> > >> : On Wed, May 4, 2016 at 4:54 PM, Nick Vasilyev <
> > nick.vasily...@gmail.com
> > >> >
> > >> : wrote:
> > >> :
> > >> : > It looks like you have too many open files, try increasing the
> file
> > >> : > descriptor limit.
> > >> : >
> > >> : > On Wed, May 4, 2016 at 3:48 PM, Susheel Kumar <
> > susheel2...@gmail.com>
> > >> : > wrote:
> > >> : >
> > >> : > > Hello,
> > >> : > >
> > >> : > > I am trying to setup 2 node Solr cloud 6 cluster with ZK 3.4.8
> and
> > >> used
> > >> : > the
> > >> : > > install service to setup solr.
> > >> : > >
> > >> : > > After launching Solr Admin Panel on server1, it looses
> connections
> > >> in few
> > >> : > > seconds and then comes back and other node server2 is marked as
> > >> Down in
> > >> : > > cloud graph. After few seconds its loosing the connection and
> > comes
> > >> back.
> > >> : > >
> > >> : > > Any idea what may be going wrong? Has anyone used Solr 6 with ZK
> > >> 3.4.8.
> > >> : > > Have never seen this error before with solr 5.x with ZK 3.4.6.
> > >> : > >
> > >> : > > Below log from server1 & server2.  The ZK has 3 nodes with
> chroot
> > >> : > enabled.
> > >> : > >
> > >> : > > Thanks,
> > >> : > > Susheel
> > >> : > >
> > >> : > > server1/solr.log
> > >> : > >
> > >> : > > 
> > >> : > >
> > >> : > >
> > >> : > > 2016-05-04 19:20:53.804 INFO  (qtp1989972246-14) [   ]
> > >> : > > o.a.s.c.c.ZkStateReader path=[/collections/collection1]
> > >> : > > [configName]=[collection1] specified config exists in ZooKeeper
> > >> : > >
> > >> : > > 2016-05-04 19:20:53.806 INFO  (qtp1989972246-14) [   ]
> > >> : > o.a.s.s.HttpSolrCall
> > >> : > > [admin] webapp=null path=/admin/collections
> > >> : > > params={acti

Re: understanding phonetic matching

2016-05-07 Thread Susheel Kumar
Jay,

The main phonetic algorithms we looked at in Solr were RefinedSoundex,
DoubleMetaphone & BeiderMorse.  We did an extensive comparison across
various test cases and found BeiderMorse to be the best among those for
finding sounds-like matches, and it also supports multiple languages.  We
also customized BeiderMorse extensively for our use case.

So please take a closer look at BeiderMorse, and I am sure it will help you
out.
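
If you want to see the codes the algorithm generates for a given name, the
Beider-Morse implementation in commons-codec (the library Solr's filter
builds on) can be called directly.  A quick sketch; the names are
arbitrary:

import org.apache.commons.codec.language.bm.BeiderMorseEncoder;

public class BeiderMorseDemo {
  public static void main(String[] args) throws Exception {
    BeiderMorseEncoder bm = new BeiderMorseEncoder();
    // each name expands to a set of phonetic codes (pipe-delimited);
    // two names "sound alike" when their code sets overlap
    System.out.println(bm.encode("khloe"));
    System.out.println(bm.encode("chloe"));
    System.out.println(bm.encode("collier"));
  }
}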

Thanks,
Susheel

On Sat, May 7, 2016 at 2:13 PM, Jay Potharaju  wrote:

> Thanks for the feedback, I was getting correct results when searching for
> jon & john. But when I tried other names like 'khloe' it matched on
> 'collier' because the phonetic filter generated KL as the token.
> Is phonetic filter the best way to find similar sounding names?
>
>
> On Wed, Mar 23, 2016 at 12:01 AM, davidphilip cherian <
> davidphilipcher...@gmail.com> wrote:
>
> > The "phonetic_en" analyzer definition available in solr-schema does
> return
> > documents having "Jon", "JN", "John" when search term is "John". Checkout
> > screen shot here : http://imgur.com/0R6SvX2
> >
> > This wiki page explains how phonetic matching works :
> >
> >
> https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching#PhoneticMatching-DoubleMetaphone
> >
> >
> > Hope that helps.
> >
> >
> >
> > On Wed, Mar 23, 2016 at 11:18 AM, Alexandre Rafalovitch <
> > arafa...@gmail.com>
> > wrote:
> >
> > > I'd start by putting LowerCaseFF before the PhoneticFilter.
> > >
> > > But then, you say you were using Analysis screen and what? Do you get
> > > the matches when you put your sample text and the query text in the
> > > two boxes in the UI? I am not sure what "look at my solr data" means
> > > in this particular context.
> > >
> > > Regards,
> > >Alex.
> > > 
> > > Newsletter and resources for Solr beginners and intermediates:
> > > http://www.solr-start.com/
> > >
> > >
> > > On 23 March 2016 at 16:27, Jay Potharaju 
> wrote:
> > > > Hi,
> > > > I am trying to do name matching using the phonetic filter factory. As
> > > part
> > > > of that I was analyzing the data using analysis screen in solr UI.
> If i
> > > > search for john, any documents containing john or jon should be
> found.
> > > >
> > > > Following is my definition of the custom field that I use for
> indexing
> > > the
> > > > data. When I look at my solr data I dont see any similar sounding
> names
> > > in
> > > > my solr data, even though I have set inject="true". Is that not how
> it
> > is
> > > > supposed to work?
> > > > Can someone explain how phonetic matching works?
> > > >
> > > >  <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
> > > >    <analyzer>
> > > >      <tokenizer class="..."/>
> > > >      <filter class="..." encoder="DoubleMetaphone" inject="true" maxCodeLength="5"/>
> > > >      ...
> > > >    </analyzer>
> > > >  </fieldType>
> > > >
> > > > --
> > > > Thanks
> > > > Jay
> > >
> >
>
>
>
> --
> Thanks
> Jay Potharaju
>


Re: Difficulties in getting Solrcloud running

2015-08-19 Thread Susheel Kumar
Use a command like the one below to create the collection:

http://<host>:<port>/solr/admin/collections?action=CREATE&name=<collection_name>&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=<config_name>
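
If you prefer to do this from SolrJ instead of the HTTP API, a rough
SolrJ 5.x-style equivalent (all names and values are placeholders):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.client.solrj.response.CollectionAdminResponse;

public class CreateCollectionSketch {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
      CollectionAdminRequest.Create create = new CollectionAdminRequest.Create();
      create.setCollectionName("mycollection");
      create.setConfigName("myconfig");   // config set already uploaded to ZooKeeper
      create.setNumShards(2);
      create.setReplicationFactor(2);
      create.setMaxShardsPerNode(2);
      CollectionAdminResponse rsp = create.process(client);
      System.out.println(rsp);
    }
  }
}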

Susheel



On Wed, Aug 19, 2015 at 11:03 AM, Kevin Lee 
wrote:

> Hi,
>
> Have you created a collection yet?  If not, then there won’t be a graph
> yet.  It doesn’t show up until there is at least one collection.
>
> - Kevin
>
> > On Aug 19, 2015, at 5:48 AM, Merlin Morgenstern <
> merlin.morgenst...@gmail.com> wrote:
> >
> > HI everybody,
> >
> > I am trying to setup solrcloud on ubuntu and somehow the graph on the
> admin
> > interface does not show up. It is simply blanck. The tree is available.
> >
> > This is a test installation on one machine.
> >
> > There are 3 zookeepers running.
> >
> > I start two solr nodes like this:
> >
> > solr-5.2.1$ bin/solr start -cloud -s server/solr1 -p 8983 -z
> > zk1:2181,zk1:2182,zk1:2183 -noprompt
> >
> > solr-5.2.1$ bin/solr start -cloud -s server/solr2 -p 8984 -z
> > zk1:2181,zk1:2182,zk1:2183 -noprompt
> >
> > zk1 is a local interface with 10.0.0.120
> >
> > it all looks OK, no error messages.
> >
> > Thank you in advance for any help on this
>
>


Re: Solrcloud node is not comming up

2015-08-19 Thread Susheel Kumar
When you are adding a node, what exactly are you expecting that node to
do?  If you are adding the node to host a new replica, you would call the
ADDREPLICA Collections API.

Thanks,
Susheel

On Wed, Aug 19, 2015 at 3:42 PM, Merlin Morgenstern <
merlin.morgenst...@gmail.com> wrote:

> I have a Solrcloud cluster running with 2 nodes, configured with 1 shard
> and 2 replica. Now I have added a node on a new server, registered with the
> same three zookeepers. The node shows up inside the tree of the Solrcloud
> admin GUI under "live nodes".
>
> Unfortunatelly the new node is not inside the graphical view and it shows 0
> cores available while the other admin interface shows the available core. I
> have also shutdown the second replica server which is now grayed out. But
> still third node not available.
>
> Is there something I have to do in order to add a node, despite registering
> it? This is the startup command I am using:
> bin/solr start -cloud -s server/solr2 -p 8983 -z zk1:2181,zk1:2182,zk1:2183
> -noprompt
>


Re: How to Fast Bulk Inserting documents

2015-08-19 Thread Susheel Kumar
For indexing 3.5 billion documents you will run into bottlenecks not only
with Solr but also at other places (data acquisition, Solr document object
creation, submitting in bulk/batches to Solr).

This will require parallelizing the operations at each of the above steps,
which is what gets you maximum throughput.  A multi-threaded Java SolrJ
based indexer with CloudSolrClient is required, as described by Shawn.  I
have used ConcurrentUpdateSolrClient in the past, but with CloudSolrClient,
setParallelUpdates should be tried out.
Thanks,
Susheel

On Wed, Aug 19, 2015 at 2:41 PM, Erick Erickson 
wrote:

> Ir you're sitting on HDFS anyway, you could use MapReduceIndexerTool. I'm
> not
> sure that'll hit your rate, it spends some time copying things around.
> If you're not on
> HDFS, though, it's not an option.
>
> Best,
> Erick
>
> On Wed, Aug 19, 2015 at 11:36 AM, Upayavira  wrote:
> >
> >
> > On Wed, Aug 19, 2015, at 07:13 PM, Toke Eskildsen wrote:
> >> Troy Edwards  wrote:
> >> > My average document size is 400 bytes
> >> > Number of documents that need to be inserted 25/second
> >> > (for a total of about 3.6 Billion documents)
> >>
> >> > Any ideas/suggestions on how that can be done? (use a client
> >> > or uploadcsv or stream or data import handler)
> >>
> >> Use more than one cloud. Make them fully independent. As I suggested
> when
> >> you asked 4 days ago. That would also make it easy to scale: Just
> measure
> >> how much a single setup can take and do the math.
> >
> > Yes - work out how much each node can handle, then you can work out how
> > many nodes you need.
> >
> > You could consider using implicit routing rather than compositeId, which
> > means that you take on responsibility for hashing your ID to push
> > content to the right node. (Or, if you use compositeId, you could use
> > the same algorithm, and be sure that you send docs directly to the
> > correct shard.
> >
> > At the moment, if you push five documents to a five shard collection,
> > the node you send them to could end up doing four HTTP requests to the
> > other nodes in the collection. This means you don't need to worry about
> > where to post your content - it is just handled for you. However, there
> > is a performance hit there. Push content direct to the correct node
> > (either using implicit routing, or by replicating the compositeId hash
> > calculation in your client) and you'd increase your indexing throughput
> > significantly, I would theorise.
> >
> > Upayavira
>


Re: How to close log when use the solrj api

2015-08-20 Thread Susheel Kumar
You may want to check the logging level using the dashboard URL
http://localhost:8983/solr/#/~logging/level (you can even change it there
for the current session), but otherwise you can look into
server/resources/log4j.properties.  Refer to
https://cwiki.apache.org/confluence/display/solr/Configuring+Logging

On Thu, Aug 20, 2015 at 4:30 AM, fent  wrote:

> when  i use solrj api to add category  data to solr ,
> their will have a lot of DEBUG info ,
> how to close this ,or how to set the log ?
> ths
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-close-log-when-use-the-solrj-api-tp4224142.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Too many updates received since start

2015-08-22 Thread Susheel Kumar
You can try following the suggestions at the link below, which covered a
similar issue, and see if that helps.


http://lucene.472066.n3.nabble.com/ColrCloud-IOException-occured-when-talking-to-server-at-td4061831.html


Thnx

On Sat, Aug 22, 2015 at 9:05 AM, Yago Riveiro 
wrote:

> Hi,
>
> Can anyone explain me the possible causes of this warning?
>
> too many updates received since start - startingUpdates no longer overlaps
> with our currentUpdates
>
> This warning triggers an full recovery for the shard that throw the
> warning.
>
>
>
> -
> Best regards
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Too-many-updates-received-since-start-tp4224617.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Ant Ivy resolve / Authenticated Proxy Issue

2015-09-16 Thread Susheel Kumar
Hi,

Sending it to Solr group in addition to Ivy group.


I have been building Solr trunk
(http://svn.apache.org/repos/asf/lucene/dev/trunk/) using "ant eclipse" for
quite some time, but this week I am on a job where things are behind a
firewall and a proxy is used.

Issue: When not on the company network the build works fine, but inside the
company network Ivy gets stuck during resolve while downloading
https://repo1.maven.org/maven2/org/apache/ant/ant/1.8.2/ant-1.8.2.jar (see
below).  I have set ANT_OPTS=-Dhttp.proxyHost=myproxyhost
-Dhttp.proxyPort=8080 -Dhttp.proxyUser=myproxyusername
-Dhttp.proxyPassword=myproxypassword but that doesn't help.  I ran into a
similar issue with SVN, but there I was able to specify the proxy & auth in
the .subversion/servers file and it worked.  With Ant Ivy I have no idea
what's going wrong.  I also tried -autoproxy on the ant command line but no
luck.  In the meantime, the .ivy2 folder that got populated outside the
network lets me proceed temporarily.

Machine : mac 10.10.3
Ant : 1.9.6
Ivy : 2.4.0

Attached: build.xml & ivysettings.xml

kumar$ ant eclipse

Buildfile: /Users/kumar/sourcecode/trunk/build.xml

resolve:

resolve:

ivy-availability-check:

ivy-fail:

ivy-configure:

[ivy:configure] :: Apache Ivy 2.4.0 - 20141213170938 ::
http://ant.apache.org/ivy/ ::

[ivy:configure] :: loading settings :: file =
/Users/kumar/sourcecode/trunk/lucene/ivy-settings.xml


resolve:


Re: Ant Ivy resolve / Authenticated Proxy Issue

2015-09-16 Thread Susheel Kumar
Not really.  There are no lock files, and even after cleaning up lock
files (to be sure) the problem still persists.  It works outside the
company network but gets stuck inside it.  Let me try to see if jconsole
can show something meaningful.

Thanks,
Susheel

On Wed, Sep 16, 2015 at 12:17 PM, Shawn Heisey  wrote:

> On 9/16/2015 9:32 AM, Mark Miller wrote:
> > Have you used jconsole or visualvm to see what it is actually hanging on
> to
> > there? Perhaps it is lock files that are not cleaned up or something
> else?
> >
> > You might try: find ~/.ivy2 -name "*.lck" -type f -exec rm {} \;
>
> If that does turn out to be the problem and deleting lockfiles fixes it,
> then you may be running into what I believe is a bug.  It is a bug that
> was (in theory) fixed in IVY-1388.
>
> https://issues.apache.org/jira/browse/IVY-1388
>
> I have seen the same problem even in version 2.3.0 which contains a fix
> for IVY-1388, so I filed a new issue:
>
> https://issues.apache.org/jira/browse/IVY-1489
>
> Thanks,
> Shawn
>
>


Re: How to check Zookeeper ensemble status?

2015-09-18 Thread Susheel Kumar
Additionally, you may want to use the four-letter commands like stat etc.
via nc or telnet (e.g. echo stat | nc localhost 2181):
http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html

Thanks,
Susheel



On Fri, Sep 18, 2015 at 11:54 AM, Sameer Maggon 
wrote:

> Have you tried zkServer.sh status?
>
> This will tell you whether zookeeper is running or not and whether it's
> acting as a leader or follower.
>
> Sameer.
>
> On Friday, September 18, 2015, Merlin Morgenstern <
> merlin.morgenst...@gmail.com> wrote:
>
> > I am running a 3 node zookeeper ensemble on 3 machines dedicated to
> > SolrCloud 5.2.x
> >
> > Inside the Solr Admin-UI I can check "live nodes", but how can I check if
> > all three zookeeper nodes are up?
> >
> > I am asking since node2 has 25% CPU usage by zookeeper while beeing idle
> > and I wonder what the cause is. Maybe zookeeper can not connect to the
> > other nodes or whatever it is, which braught me to the question how to
> > check if all 3 nodes are operational.
> >
> > Thank you for any help on this!
> >
>
>
> --
> *Sameer Maggon*
> Measured Search
> c: 310.344.7266
> www.measuredsearch.com 
>


Re: Different ports for search and upload request

2015-09-24 Thread Susheel Kumar
I am not aware of such a feature in Solr, but I do want to understand your
use case / the logic behind wanting different ports.  If it is about
security / what gets exposed to users, Solr usually shouldn't be exposed to
users directly but rather accessed via an application / service / API.

Thanks,
Susheel

On Thu, Sep 24, 2015 at 4:01 PM, Siddhartha Singh Sandhu <
sandhus...@gmail.com> wrote:

> Hi,
>
> I wanted to know if we can configure different ports as end points for
> uploading and searching API. Also, if someone could point me in the right
> direction.
>
> Regards,
>
> Sid.
>


Re: Solr cross core join special condition

2015-10-07 Thread Susheel Kumar
You may want to take a look at new Solr feature of Streaming API &
Expressions https://issues.apache.org/jira/browse/SOLR-7584?filter=12333278
for making joins between collections.

On Wed, Oct 7, 2015 at 9:42 AM, Ryan Josal  wrote:

> I developed a join transformer plugin that did that (although it didn't
> flatten the results like that).  The one thing that was painful about it is
> that the TextResponseWriter has references to both the IndexSchema and
> SolrReturnFields objects for the primary core.  So when you add a
> SolrDocument from another core it returned the wrong fields.  I worked
> around that by transforming the SolrDocument to a NamedList.  Then when it
> gets to processing the IndexableFields it uses the wrong IndexSchema, I
> worked around that by transforming each field to a hard Java object
> (through the IndexSchema and FieldType of the correct core).  I think it
> would be great to patch TextResponseWriter with multi core writing
> abilities, but there is one question, how can it tell which core a
> SolrDocument or IndexableField is from?  Seems we'd have to add an
> attribute for that.
>
> The other possibly simpler thing to do is execute the join at index time
> with an update processor.
>
> Ryan
>
> On Tuesday, October 6, 2015, Mikhail Khludnev 
> wrote:
>
> > On Wed, Oct 7, 2015 at 7:05 AM, Ali Nazemian  > > wrote:
> >
> > > it
> > > seems there is not any way to do that right now and it should be
> > developed
> > > somehow. Am I right?
> > >
> >
> > yep
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> > 
> > >
> >
>


Re: Best Indexing Approaches - To max the throughput

2015-10-08 Thread Susheel Kumar
ConcurrentUpdateSolrClient is not cloud-aware and does not take a zkHost
string as input.  So the only option is to use CloudSolrClient with SolrJ
and a thread pool executor framework.

On Thu, Oct 8, 2015 at 12:50 PM, Alessandro Benedetti <
benedetti.ale...@gmail.com> wrote:

> This depends of the number of active producers, but ideally it's ok.
> Different threads will access the ThreadSafe ConcurrentUpdateSolrClient and
> send the document in batches.
>
> Or you were meaning something different ?
>
>
> On 8 October 2015 at 16:00, Mugeesh Husain  wrote:
>
> > Good way Using SolrJ with Thread pool executor framework, increase number
> > of
> > Thread as per your requirement
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Best-Indexing-Approaches-To-max-the-throughput-tp4232740p4233513.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card - http://about.me/alessandro_benedetti
> Blog - http://alexbenedetti.blogspot.co.uk
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>


Re: Scramble data

2015-10-08 Thread Susheel Kumar
Like Erick said, would something like using a replace function on the
individual sensitive fields in the fl param work, replacing their values
with something like REDACTED, etc.?

On Thu, Oct 8, 2015 at 2:58 PM, Tarala, Magesh  wrote:

> I already have the data ingested and it takes several days to do that. I
> was trying to avoid re-ingesting the data.
>
> Thanks,
> Magesh
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Wednesday, October 07, 2015 9:26 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Scramble data
>
> Probably sanitize the data on the front end? Something simple like put
> "REDACTED" for all of the customer-sensitive fields.
>
> You might also write a DocTransformer plugin, all you have to do is
> implement subclass DocTransformer and override one very simple "transform"
> method,
>
> Best,
> Erick
>
> On Wed, Oct 7, 2015 at 5:09 PM, Tarala, Magesh  wrote:
> > Folks,
> > I have a strange question. We have a Solr implementation that we would
> like to demo to external customers. But we don't want to display the real
> data, which contains our customer information and so is sensitive data.
> What's the best way to scramble the data of the Solr Query results? By best
> I mean the simplest way with least amount of work. BTW, we have a .NET
> front end application.
> >
> > Thanks,
> > Magesh
> >
> >
> >
>


Re: Exclude documents having same data in two fields

2015-10-09 Thread Susheel Kumar
Hi Aman, did the problem get resolved, or are you still seeing errors?

Thnx

On Fri, Oct 9, 2015 at 8:28 AM, Aman Tandon  wrote:

> okay Thanks
>
> With Regards
> Aman Tandon
>
> On Fri, Oct 9, 2015 at 4:25 PM, Upayavira  wrote:
>
> > Just beware of performance here. This is fine for smaller indexes, but
> > for larger ones won't work so well. It will need to do this calculation
> > for every document in your index, thereby undoing all benefits of having
> > an inverted index.
> >
> > If your index (or resultset) is small enough, it can work, but might
> > catch you out later.
> >
> > Upayavira
> >
> > On Fri, Oct 9, 2015, at 10:59 AM, Aman Tandon wrote:
> > > Hi,
> > >
> > > I tried to use the same as mentioned in the url
> > > <
> >
> http://stackoverflow.com/questions/16258605/query-for-document-that-two-fields-are-equal
> > >
> > > .
> > >
> > > And I used the description field to check because mapping field
> > > is multivalued.
> > >
> > > So I add the fq={!frange%20l=0%20u=1}strdist(title,description,edit) in
> > > my
> > > url, but I am getting this error. As mentioned below. Please take a
> look.
> > >
> > > *Solr Version 4.8.1*
> > >
> > > *Url is*
> > >
> >
> http://localhost:8150/solr/core1/select?q.alt=*:*&fl=big*,title,catid&fq={!frange%20l=0%20u=1}strdist(title,description,edit)&defType=edismax
> > >
> > > > 
> > > > 
> > > > 500
> > > > 8
> > > > 
> > > > *:*
> > > > edismax
> > > > big*,title,catid
> > > > {!frange l=0 u=1}strdist(title,description,edit)
> > > > 
> > > > 
> > > > 
> > > > 
> > > > java.lang.RuntimeException at
> > > >
> >
> org.apache.solr.search.ExtendedDismaxQParser$ExtendedDismaxConfiguration.(ExtendedDismaxQParser.java:1455)
> > > > at
> > > >
> >
> org.apache.solr.search.ExtendedDismaxQParser.createConfiguration(ExtendedDismaxQParser.java:239)
> > > > at
> > > >
> >
> org.apache.solr.search.ExtendedDismaxQParser.(ExtendedDismaxQParser.java:108)
> > > > at
> > > >
> >
> org.apache.solr.search.ExtendedDismaxQParserPlugin.createParser(ExtendedDismaxQParserPlugin.java:37)
> > > > at org.apache.solr.search.QParser.getParser(QParser.java:315) at
> > > >
> >
> org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:144)
> > > > at
> > > >
> >
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:197)
> > > > at
> > > >
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> > > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1952) at
> > > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:774)
> > > > at
> > > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
> > > > at
> > > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
> > > > at
> > > >
> >
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
> > > > at
> > > >
> >
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
> > > > at
> > > >
> >
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
> > > > at
> > > >
> >
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
> > > > at
> > > >
> >
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
> > > > at
> > > >
> >
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
> > > > at
> > > >
> > org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
> > > > at
> > > >
> >
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
> > > > at
> > > >
> >
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
> > > > at
> > > >
> >
> org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)
> > > > at
> > > >
> >
> org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
> > > > at
> > > >
> >
> org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)
> > > > at
> > > >
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> > > > at
> > > >
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> > > > at java.lang.Thread.run(Thread.java:745)
> > > > 
> > > > 500
> > > > 
> > > > 
> > > >
> > >
> > > With Regards
> > > Aman Tandon
> > >
> > > On Thu, Oct 8, 2015 at 8:07 PM, Alessandro Benedetti <
> > > benedetti.ale...@gmail.com> wrote:
> > >
> > > > Hi agree with Nutch,
> > > > using the Function Range Query Parser, should do your trick :
> > > >
> > > >
> > > >
> >
> https://lucene.apache.org/solr/5_3_0/solr-core/org/apache/solr/search/FunctionRangeQParserPlugin.html
> > > >
> > > > Cheers
> > > >
> > > > On 8 October 2015 at 13:31, NutchDev 
> wrote:
> > > >
> > > > > Hi Aman,
> > > > >
> > > > > Have a look at this , it has query time approach also

Re: Solr cross core join special condition

2015-10-11 Thread Susheel Kumar
Yes, Ali.  These are targeted for Solr 6, but you have the option to
download the source from trunk, build it, and try out these features if
that helps in the meantime.

Thanks
Susheel

On Sun, Oct 11, 2015 at 10:01 AM, Ali Nazemian 
wrote:

> Dear Susheel,
> Hi,
>
> I did check the jira issue that you mentioned but it seems its target is
> Solr 6! Am I correct? The patch failed for Solr 5.3 due to class not found.
> For Solr 5.x should I try to implement something similar myself?
>
> Sincerely yours.
>
>
> On Wed, Oct 7, 2015 at 7:15 PM, Susheel Kumar 
> wrote:
>
> > You may want to take a look at new Solr feature of Streaming API &
> > Expressions
> > https://issues.apache.org/jira/browse/SOLR-7584?filter=12333278
> > for making joins between collections.
> >
> > On Wed, Oct 7, 2015 at 9:42 AM, Ryan Josal  wrote:
> >
> > > I developed a join transformer plugin that did that (although it didn't
> > > flatten the results like that).  The one thing that was painful about
> it
> > is
> > > that the TextResponseWriter has references to both the IndexSchema and
> > > SolrReturnFields objects for the primary core.  So when you add a
> > > SolrDocument from another core it returned the wrong fields.  I worked
> > > around that by transforming the SolrDocument to a NamedList.  Then when
> > it
> > > gets to processing the IndexableFields it uses the wrong IndexSchema, I
> > > worked around that by transforming each field to a hard Java object
> > > (through the IndexSchema and FieldType of the correct core).  I think
> it
> > > would be great to patch TextResponseWriter with multi core writing
> > > abilities, but there is one question, how can it tell which core a
> > > SolrDocument or IndexableField is from?  Seems we'd have to add an
> > > attribute for that.
> > >
> > > The other possibly simpler thing to do is execute the join at index
> time
> > > with an update processor.
> > >
> > > Ryan
> > >
> > > On Tuesday, October 6, 2015, Mikhail Khludnev <
> > mkhlud...@griddynamics.com>
> > > wrote:
> > >
> > > > On Wed, Oct 7, 2015 at 7:05 AM, Ali Nazemian  > > > > wrote:
> > > >
> > > > > it
> > > > > seems there is not any way to do that right now and it should be
> > > > developed
> > > > > somehow. Am I right?
> > > > >
> > > >
> > > > yep
> > > >
> > > >
> > > > --
> > > > Sincerely yours
> > > > Mikhail Khludnev
> > > > Principal Engineer,
> > > > Grid Dynamics
> > > >
> > > > <http://www.griddynamics.com>
> > > > >
> > > >
> > >
> >
>
>
>
> --
> A.Nazemian
>


Re: Spell Check and Privacy

2015-10-12 Thread Susheel Kumar
Hi Arnon,

I couldn't fully understand your use case regarding privacy.  Are you
concerned that the spell checker may reveal, as suggestions, user names
that belong to other organizations / ACLs, or that after suggestions are
shown a user may be able to click through and view other organizations'
users?

Please provide some details on your privacy concern with the spell checker.

Thanks,
Susheel

On Mon, Oct 12, 2015 at 9:45 AM, Dyer, James 
wrote:

> Arnon,
>
> Use "spellcheck.collate=true" with "spellcheck.maxCollationTries" set to a
> non-zero value.  This will give you re-written queries that are guaranteed
> to return hits, given the original query and filters.  If you are using an
> "mm" value other than 100%, you also will want specify "
> spellcheck.collateParam.mm=100%". (or if using "q.op=OR", then use
> "spellcheck.collateParam.q.op=AND")
>
> Of course, the first section of the spellcheck result will still show
> every possible suggestion, so your client needs to discard these and not
> divulge them to the user.  If you need to know word-by-word how the
> collations were constructed, then specify
> "spellcheck.collateExtendedResults=true".  Use the extended collation
> results for this information and not the first section of the spellcheck
> results.
>
> This is all fairly well-documented on the old solr wiki:
> https://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate
>
> James Dyer
> Ingram Content Group
>
> -Original Message-
> From: Arnon Yogev [mailto:arn...@il.ibm.com]
> Sent: Monday, October 12, 2015 2:33 AM
> To: solr-user@lucene.apache.org
> Subject: Spell Check and Privacy
>
> Hi,
>
> Our system supports many users from different organizations and with
> different ACLs.
> We consider adding a spell check ("did you mean") functionality using
> DirectSolrSpellChecker. However, a privacy concern was raised, as this
> might lead to private information being revealed between users via the
> suggested terms. Using the FileBasedSpellChecker is another option, but
> naturally a static list of terms is not optimal.
>
> Is there a best practice or a suggested method for these kind of cases?
>
> Thanks,
> Arnon
>
>


Re: How to formulate query

2015-10-12 Thread Susheel Kumar
Hi Prasanna, this is a highly custom relevancy/ordering requirement.  One
possible way to approach it is to create multiple fields, build a query
clause for each of the search cases you listed, and boost them accordingly,
as shown in the sketch below.
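
Very roughly, with hypothetical fields such as first_word_exact,
first_word_prefix, any_word_exact and any_word_prefix (each populated via
copyField and analyzed differently, e.g. the prefix fields indexed with
EdgeNGram so plain term queries match prefixes), the boosting could look
like:

import org.apache.solr.client.solrj.SolrQuery;

public class BoostOrderingSketch {
  public static void main(String[] args) {
    SolrQuery q = new SolrQuery("pit");
    q.set("defType", "edismax");
    // catch-all field (e.g. NGram-analyzed) so even substring matches like
    // "Epitome" or "capita" are returned
    q.set("qf", "search_text");
    // descending boosts enforce the desired ordering of the five cases
    q.set("bq", "first_word_exact:pit^100 first_word_prefix:pit^50 "
              + "any_word_exact:pit^20 any_word_prefix:pit^10");
    System.out.println(q);
  }
}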

Thnx

On Mon, Oct 12, 2015 at 12:50 PM, Erick Erickson 
wrote:

> Nothing exists currently that would do this. I would urge you to revisit
> the
> requirements, this kind of super-specific ordering is often not worth the
> effort to try to enforce, how does the _user_ benefit here?
>
> Best,
> Erick
>
> On Mon, Oct 12, 2015 at 12:47 AM, Prasanna S. Dhakephalkar
>  wrote:
> > Hi,
> >
> >
> >
> > I am trying to make a solr search query to get result as under I am
> unable
> > to get do
> >
> >
> >
> > I have a search term say "pit"
> >
> > The result should have (in that order)
> >
> >
> >
> > All docs that have "pit" as first WORD in search field  (pit\ *)+
> >
> > All docs that have first WORD that starts with "pit"  (pit*\  *)+
> >
> > All docs that have "pit" as WORD anywhere in search field  (except first)
> > (*\ pit\ *)+
> >
> > All docs that have  a WORD starting with "pit" anywhere in search field
> > (except first) (*\ pit*\ *)+
> >
> > All docs that have "pit" as string anywhere in the search field except
> cases
> > covered above (*pit*)
> >
> >
> >
> > Example :
> >
> >
> >
> > Pit the pat
> >
> > Pit digger
> >
> > Pitch ball
> >
> > Pitcher man
> >
> > Dig a pit with shovel
> >
> > Why do you want to dig a pit with shovel
> >
> > Cricket pitch is 22 yards
> >
> > What is pithy, I don't know
> >
> > Per capita income
> >
> > Epitome of blah blah
> >
> >
> >
> >
> >
> > How can I achieve this ?
> >
> >
> >
> > Regards,
> >
> >
> >
> > Prasanna.
> >
> >
> >
>


Re: are there any SolrCloud supervisors?

2015-10-13 Thread Susheel Kumar
Sounds interesting...

On Tue, Oct 13, 2015 at 12:58 AM, Trey Grainger  wrote:

> I'd be very interested in taking a look if you post the code.
>
> Trey Grainger
> Co-Author, Solr in Action
> Director of Engineering, Search & Recommendations @ CareerBuilder
>
> On Fri, Oct 2, 2015 at 3:09 PM, r b  wrote:
>
> > I've been working on something that just monitors ZooKeeper to add and
> > remove nodes from collections. the use case being I put SolrCloud in
> > an autoscaling group on EC2 and as instances go up and down, I need
> > them added to the collection. It's something I've built for work and
> > could clean up to share on GitHub if there is much interest.
> >
> > I asked in the IRC about a SolrCloud supervisor utility but wanted to
> > extend that question to this list. are there any more "full featured"
> > supervisors out there?
> >
> >
> > -renning
> >
>


Re: slow queries

2015-10-14 Thread Susheel Kumar
Hi Lorenzo,

Can you provide the Solr version you are using, the index size on disk,
and the hardware config (memory/processor) of each machine?

Thanks,
Susheel

On Wed, Oct 14, 2015 at 6:03 AM, Lorenzo Fundaró <
lorenzo.fund...@dawandamail.com> wrote:

> Hello,
>
> I have following conf for filters and commits :
>
> Concurrent LFU Cache(maxSize=64, initialSize=64, minSize=57,
> acceptableSize=60, cleanupThread=false, timeDecay=true, autowarmCount=8,
> regenerator=org.apache.solr.search.SolrIndexSearcher$2@169ee0fd)
>
>  
>
>${solr.autoCommit.maxTime:15000}
>false
>  
>
>  
>
>${solr.autoSoftCommit.maxTime:60}
>  
>
> and the following stats for filters:
>
> lookups = 3602
> hits  =  3148
> hit ratio = 0.87
> inserts = 455
> evictions = 400
> size = 63
> warmupTime = 770
>
> *Problem: *a lot of slow queries, for example:
>
> {q=*:*&tie=1.0&defType=edismax&qt=standard&json.nl
> =map&qf=&fl=pk_i,​score&start=0&sort=view_counter_i
> desc&fq={!cost=1 cache=true}type_s:Product AND is_valid_b:true&fq={!cost=50
> cache=true}in_languages_t:de&fq={!cost=99
> cache=false}(shipping_country_codes_mt: (DE OR EURO OR EUR OR ALL)) AND
> (cents_ri: [* 3000])&rows=36&wt=json} hits=3768003 status=0 QTime=1378
>
> I could increase the size of the filter so I would decrease the amount of
> evictions, but it seems to me this would not be solving the root problem.
>
> Some ideas on where/how to start for optimisation ? Is it actually normal
> that this query takes this time ?
>
> We have an index of ~14 million docs. 4 replicas with two cores and 1 shard
> each.
>
> thank you.
>
>
> --
>
> --
> Lorenzo Fundaro
> Backend Engineer
> E-Mail: lorenzo.fund...@dawandamail.com
>
> Fax   + 49 - (0)30 - 25 76 08 52
> Tel+ 49 - (0)179 - 51 10 982
>
> DaWanda GmbH
> Windscheidstraße 18
> 10627 Berlin
>
> Geschäftsführer: Claudia Helming, Michael Pütz
> Amtsgericht Charlottenburg HRB 104695 B
>


Re: Lucene Revolution ?

2015-10-18 Thread Susheel Kumar
I couldn't make it either.  Would love to hear more from those who did.

Thanks,
Susheel

On Sun, Oct 18, 2015 at 10:53 AM, Jack Krupansky 
wrote:

> Sorry I missed out this year. I thought it was next month and hadn't seen
> any reminders. Just last Tuesday I finally got around to googling the
> conference and was shocked to read that it was the next day. Oh well.
> Personally I'm less interested in the formal sessions than the informal
> networking.
>
> In any case, keep those user reports flowing. I'm sure there are plenty of
> people who didn't make it to the conference.
>
> -- Jack Krupansky
>
> On Sun, Oct 18, 2015 at 8:52 AM, Erik Hatcher 
> wrote:
>
> > The Revolution was not televised (though heavily tweeted, and videos of
> > sessions to follow eventually).  A great time was had by all.  Much
> > learning!  Much collaboration. Awesome event if I may say so myself.  I'm
> > proud to be a part of the organization that put on the best conference
> I've
> > been to to date (til next years Revolution). Don't miss the next one :)
> >
> > Re ES/AWS: what about it?   Solr is a first class AWS citizen, employing
> > Solr folks, and certainly where many of our customers deploy their
> > infrastructure, Solr, Fusion, etc.
> >
> >Erik
> >
> >
> >
> > > On Oct 18, 2015, at 01:02, William Bell  wrote:
> > >
> > > How did Lucene Revolution 2015 go last week?
> > >
> > > Also, what about Amazon's release of Elastic Search as a managed
> service
> > in
> > > AWS?
> > >
> > > --
> > > Bill Bell
> > > billnb...@gmail.com
> > > cell 720-256-8076
> >
>


auto deployment/setup of Solr & Zookeeper on medium-large clusters

2015-10-19 Thread Susheel Kumar
Hi,

I am trying to find the best practices for setting up Solr on 20+ new
machines & ZK (5+) and repeating the same on other environments.  What's the
best way to download, extract, and set up Solr & ZK in an automated way, along
with other dependencies like Java?  Among shell scripts, Puppet, Docker, or
imaged VMs, what is being used & suggested from a DevOps perspective?

Thanks,
Susheel


Re: Validating idea of architecture for RDB / Import / Solr

2015-10-20 Thread Susheel Kumar
Hello Hangu,


OPTION 1: You can write complex/nested join queries with DIH and use functions
written in JavaScript for transformations in data-config.xml, if that meets
your domain requirements.
OPTION 2: Use a Java program with SolrJ: read the data over JDBC, apply your
domain-specific rules, build the Solr documents, and then write them to Solr
using SolrJ classes such as CloudSolrClient.  This can scale and gives you
more flexibility as you go further.

The solution you are currently suggesting can work, depending on how much data
you are talking about, how many updates you expect, etc., but it has its own
consequences: you maintain a separate copy of the data (if it is large, you
are storing it twice), and it may not scale.
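
For illustration, a bare-bones sketch of OPTION 2 (the ZooKeeper hosts, JDBC
URL, SQL, collection, and field names below are all made up):

  import java.sql.*;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class MysqlToSolr {
    public static void main(String[] args) throws Exception {
      try (CloudSolrClient solr = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
           Connection db = DriverManager.getConnection("jdbc:mysql://localhost/shop", "user", "pass");
           Statement st = db.createStatement();
           ResultSet rs = st.executeQuery("SELECT id, title, price FROM products")) {
        solr.setDefaultCollection("products");
        List<SolrInputDocument> batch = new ArrayList<>();
        while (rs.next()) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", rs.getString("id"));
          doc.addField("title_t", rs.getString("title"));   // apply domain rules here
          doc.addField("price_f", rs.getFloat("price"));
          batch.add(doc);
          if (batch.size() == 1000) { solr.add(batch); batch.clear(); }  // send in batches
        }
        if (!batch.isEmpty()) solr.add(batch);
        solr.commit();
      }
    }
  }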

Thanks,
Susheel



On Tue, Oct 20, 2015 at 2:47 AM, hangu choi  wrote:

> Hi,
>
> I am newbie for solr and I hope to check my idea is good or terrible if
> someone can help.
>
> # background
> * I have mysql as my primary data storage.
> * I want to import data from mysql to solr (solrcloud).
> * I have domain logics to make solr document - (means I can't make solr
> document with plain sql query)
>
> # my idea
> * In mysql I make table for solr document, and insert/update whenever
> any changes made for solr document.
> * run data import handler (mostly delta query) from [solr document
> table] to solr with cron scheduler frequently with soft-commit.
> * hard-commit less frequently.
>
> Is it good or terrible idea?
> for any advice, I will be happy.
>
>
> Regards,
> Hangu
>


DevOps question : auto deployment/setup of Solr & Zookeeper on medium-large clusters

2015-10-20 Thread Susheel Kumar
Hello,

Resending to get opinions from a DevOps perspective on the tools for
installing/deploying Solr & ZK on a large number of machines and maintaining
them. I have heard of BladeLogic, HP OO (commercial tools), etc. being used.
Please share your experience or the pros/cons of such tools.

Thanks,
Susheel

On Mon, Oct 19, 2015 at 3:32 PM, Susheel Kumar 
wrote:

> Hi,
>
> I am trying to find the best practises for setting up Solr on new 20+
> machines  & ZK (5+) and repeating same on other environments.  What's the
> best way to download, extract, setup Solr & ZK in an automated way along
> with other dependencies like java etc.  Among shell scripts or puppet or
> docker or imaged vm's what is being used & suggested from Dev-Ops
> perspective.
>
> Thanks,
> Susheel
>


Re: DevOps question : auto deployment/setup of Solr & Zookeeper on medium-large clusters

2015-10-20 Thread Susheel Kumar
Thanks, Davis, Jeff.

We are not using AWS.  Are there any scripts/frameworks already developed
using Puppet that are available?

On Tue, Oct 20, 2015 at 7:59 PM, Jeff Wartes  wrote:

>
> If you’re using AWS, there’s this:
> https://github.com/LucidWorks/solr-scale-tk
> If you’re using chef, there’s this:
> https://github.com/vkhatri/chef-solrcloud
>
> (There are several other chef cookbooks for Solr out there, but this is
> the only one I’m aware of that supports Solr 5.3.)
>
> For ZK, I’m less familiar, but if you’re using chef there’s this:
> https://github.com/SimpleFinance/chef-zookeeper
> And this might be handy to know about too:
> https://github.com/Netflix/exhibitor/wiki
>
>
> On 10/20/15, 6:37 AM, "Davis, Daniel (NIH/NLM) [C]" 
> wrote:
>
> >Waste of money in my opinion.   I would point you towards other tools -
> >bash scripts and free configuration managers such as puppet, chef, salt,
> >or ansible.Depending on what development you are doing, you may want
> >a continuous integration environment.   For a small company starting out,
> >using a free CI, maybe SaaS, is a good choice.   A professional version
> >such as Bamboo, TeamCity, Jenkins are almost essential in a large
> >enterprise if you are doing diverse builds.
> >
> >When you create a VM, you can generally specify a script to run after the
> >VM is mostly created.   There is a protocol (PXE Boot) that enables this
> >- a PXE server listens and hears that a new server with such-and-such
> >Ethernet Address is starting.   The PXE server makes it boot like a
> >CD-ROM/DVD install, booting from installation media on the network and
> >installing.Once that install is down, a custom script may be invoked.
> >  This script is typically a bash script, because you may not be able to
> >count on too much else being installed.   However, python/perl are also
> >reasonable choices - just be careful that the modules/libraries you are
> >using for the script are present.The same PXE protocol is used in
> >large on-premises installations (vCenter) and in the cloud (AWS/Digital
> >Ocean).  We don't care about the PXE server - the point is that you can
> >generally run a bash script after your install.
> >
> >The bash script can bootstrap other services such as puppet, chef, or
> >salt, and/or setup keys so that push configuration management tools such
> >as ansible can reach the server.   The bash script may even be smart
> >enough to do all of the setup you need, depending on what other servers
> >you need to configure.   Smart bash scripts are good for a small company,
> >but for large setups, I'd use puppet, chef, salt, and/or ansible.
> >
> >What I tend to do is to deploy things in such a way that puppet (because
> >it is what we use here) can setup things so that a "solradm" account can
> >setup everything else, and solr and zookeeper are running as a "solrapp"
> >user using puppet.Then, my continuous integration server, which is
> >Atlassian Bamboo (you can also use tools such as Jenkins, TeamCity,
> >BuildBot), installs solr as "solradm" and sets it up to run as "solrapp".
> >
> >I am not a systems administrator, and I'm not really in "DevOps", my job
> >is to be above all of that and do "systems architecture" which I am lucky
> >still involves coding both in system administration and applications
> >development.   So, that's my 2 cents.
> >
> >Dan Davis, Systems/Applications Architect (Contractor),
> >Office of Computer and Communications Systems,
> >National Library of Medicine, NIH
> >
> >-Original Message-
> >From: Susheel Kumar [mailto:susheel2...@gmail.com]
> >Sent: Tuesday, October 20, 2015 9:19 AM
> >To: solr-user@lucene.apache.org
> >Subject: DevOps question : auto deployment/setup of Solr & Zookeeper on
> >medium-large clusters
> >
> >Hello,
> >
> >Resending to see opinion from Dev-Ops perspective on the tools for
> >installing/deployment of Solr & ZK on large no of machines and
> >maintaining them. I have heard Bladelogic or HP OO (commercial tools)
> >etc. being used.
> >Please share your experience or pros / cons of such tools.
> >
> >Thanks,
> >Susheel
> >
> >On Mon, Oct 19, 2015 at 3:32 PM, Susheel Kumar 
> >wrote:
> >
> >> Hi,
> >>
> >> I am trying to find the best practises for setting up Solr on new 20+
> >> machines  & ZK (5+) and repeating same on other environments.  What's
> >> the best way to download, extract, setup Solr & ZK in an automated way
> >> along with other dependencies like java etc.  Among shell scripts or
> >> puppet or docker or imaged vm's what is being used & suggested from
> >> Dev-Ops perspective.
> >>
> >> Thanks,
> >> Susheel
> >>
>
>


Re: SolrJ stalls/hangs on client.add(); and doesn't return

2015-10-30 Thread Susheel Kumar
Just a suggestion, Markus: sending 50k documents worked in your case, but you
may want to benchmark 5K, 10K, or 20K batches and compare them with the 50K
batches.  It may turn out that a smaller batch size is faster than a very
large one...
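
For illustration, a sketch of that kind of client-side partitioning (the
collection name and batch size are just placeholders to benchmark):

  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;
  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class BatchAdd {
    // Send docs in fixed-size requests instead of one huge client.add() call.
    static void addInBatches(SolrClient client, Iterator<SolrInputDocument> docs,
                             int batchSize) throws Exception {
      List<SolrInputDocument> batch = new ArrayList<>(batchSize);
      while (docs.hasNext()) {
        batch.add(docs.next());
        if (batch.size() >= batchSize) {
          client.add("mycollection", batch);   // one request per batch
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        client.add("mycollection", batch);
      }
    }
  }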

On Fri, Oct 30, 2015 at 7:59 AM, Markus Jelsma 
wrote:

> Hi - Solr doesn't seem to receive anything, and it certainly doesn't log
> anything, nothing is running out of memory. Indeed, i was clearly
> misunderstanding ConcurrentUpdateSolrClient.
>
> I hoped, without reading its code, it would partition input, which it
> clearly doesn't. I changed the code to partition my own input up to 50k
> documents and everything is running fine.
>
> Markus
>
>
>
> -Original message-
> > From:Erick Erickson 
> > Sent: Thursday 29th October 2015 22:28
> > To: solr-user 
> > Subject: Re: SolrJ stalls/hangs on client.add(); and doesn't return
> >
> > You're sending 100K docs in a single packet? It's vaguely possible that
> you're
> > getting a timeout although that doesn't square with no docs being
> indexed...
> >
> > Hmmm, to check you could do a manual commit. Or watch the Solr log to
> > see if update
> > requests ever go there.
> >
> > Or you're running out of memory on the client.
> >
> > Or even exceeding the packet size that the servlet container will accept?
> >
> > But I think at root you're misunderstanding
> > ConcurrentUpdateSolrClient. It doesn't
> > partition up a huge array and send them in parallel, it parallelized
> sending the
> > packet each call is given. So it's trying to send all 100K docs at
> > once. Probably not
> > what you were aiming for.
> >
> > Try making batches of 1,000 docs and sending them through instead.
> >
> > So the parameters are a bit of magic. You can have up to the number of
> threads
> > you specify sending their entire packet to solr in parallel, and up to
> queueSize
> > requests. Note this is the _request_, not the docs in the list if I'm
> > reading the code
> > correctly.
> >
> > Best,
> > Erick
> >
> > On Thu, Oct 29, 2015 at 1:52 AM, Markus Jelsma
> >  wrote:
> > > Hello - we have some processes periodically sending documents to 5.3.0
> in local mode using ConcurrentUpdateSolrClient 5.3.0, it has queueSize 10
> and threadCount 4, just chosen arbitrarily having no idea what is right.
> > >
> > > Usually its a few thousand up to some tens of thousands of rather
> small documents. Now, when the number of documents is around or near a
> hundred thousand, client.add(Iterator docIterator)
> stalls and never returns. It also doesn't index any of the documents. Upon
> calling, it quickly eats CPU and a load of heap but shortly after it goes
> idle, no CPU and memory is released.
> > >
> > > I am puzzled, any ideas to share?
> > > Markus
> >
>


Re: Problem with the Content Field during Solr Indexing

2015-11-02 Thread Susheel Kumar
Hi Shruti,

If you are looking to index images to make them searchable (image search),
then you will have to look at LIRE (Lucene Image Retrieval),
http://www.lire-project.net/, and can follow the LIRE Solr plugin at
https://bitbucket.org/dermotte/liresolr.

Thanks,
Susheel

On Sat, Oct 31, 2015 at 9:46 PM, Zheng Lin Edwin Yeo 
wrote:

> Hi Shruti,
>
> From what I understand, the /update/extract handler is for indexing
> rich-text documents, and does not support ".png" files.
>
> It only supports the following files format: pdf, doc, docx, ppt, pptx,
> xls, xlsx, odt, odp, ods, ott, otp, ots, rtf, htm, html, txt, log
> If you use the default post.jar, I believe the other formats will get
> filtered out.
>
> When I tried to index ".png" file in my custom handler, it just index "
> " in the content.
>
> Regards,
> Edwin
>
>
>
> On 31 October 2015 at 09:35, Shruti Mundra  wrote:
>
> > Hi Edwin,
> >
> > The file extension of the image file is ".png" and we are following this
> > url for indexing:
> > "
> >
> >
> http://blog.thedigitalgroup.com/vijaym/wp-content/uploads/sites/11/2015/07/SolrImageExtract.png
> > "
> >
> > Thanks and Regards,
> > Shruti Mundra
> >
> > On Thu, Oct 29, 2015 at 8:33 PM, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com
> > >
> > wrote:
> >
> > > The "\n" actually means new line as decoded by Solr from the indexed
> > > document.
> > >
> > > What is your file extension of your image file, and which method are
> you
> > > using to do the indexing?
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > > On 30 October 2015 at 04:38, Shruti Mundra  wrote:
> > >
> > > > Hi,
> > > >
> > > > When I'm trying index an image file directly to Solr, the attribute
> > > > content, consists of trails of "\n"s and not the data.
> > > > We are successful in getting the metadata for that image.
> > > >
> > > > Can anyone help us out on how we could get the content along with the
> > > > Metadata.
> > > >
> > > > Thanks!
> > > >
> > > > - Shruti Mundra
> > > >
> > >
> >
>


Solr Search: Access Control / Role based security

2015-11-05 Thread Susheel Kumar
Hi,

I have seen a couple of use cases where we want to restrict search results
based on the role of a user.  For example:

- if user role is admin, any document from the search result will be
returned
- if user role is manager, only documents intended for managers will be
returned
- if user role is worker, only documents intended for workers will be
returned

The typical practice is to tag the documents with roles (using a multi-valued
field) during indexing and then, at search time, append a filter query to
restrict results based on the user's roles.
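
For illustration, with SolrJ that filter might be appended on the server side
like this (a sketch only; the field name roles_ss, the role values, and the
collection name are made up, and solrClient is an existing SolrClient):

  SolrQuery query = new SolrQuery(userQuery);
  // Derive the filter from the authenticated user's roles, never from client input.
  query.addFilterQuery("roles_ss:(manager OR admin)");
  QueryResponse rsp = solrClient.query("documents", query);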

Wondering if there is a better way out there, and whether this common
requirement should be added as a Solr feature/plugin.

The current security plugins are geared more towards securing Solr
APIs/resources than towards securing/controlling the data returned during
search.
https://cwiki.apache.org/confluence/display/solr/Authentication+and+Authorization+Plugins


Please share your thoughts.

Thanks,
Susheel


Re: Solr Search: Access Control / Role based security

2015-11-10 Thread Susheel Kumar
Thanks everyone for the suggestions.

Hi Noble - Was any thought given to utilizing Apache ManifoldCF while
developing the Authentication/Authorization plugins, or is there anything to
add there?

Thanks,
Susheel

On Tue, Nov 10, 2015 at 5:01 AM, Alessandro Benedetti  wrote:

> I've been working for a while with Apache ManifoldCF and Enterprise Search
> in Solr ( with Document level security) .
> Basically you can add a couple of extra fields , for example :
>
> allow_token : containing all the tokens that can view the document
> deny_token : containing all the tokens that are denied to view the document
>
> Apache ManifoldCF provides an integration that add an additional layer, and
> is able to combine different data sources permission schemes.
> The Authority Service endpoint will take in input the user name and return
> all the allow_token values and deny_token.
> At this point you can append the related filter queries to your queries and
> be sure that the user will only see what is supposed to see.
>
> It's basically an extension of the strategy you were proposing, role based.
> Of course keep protected your endpoints and avoid users to put custom fq,
> or all your document security model would be useless :)
>
> Cheers
>
>
> On 9 November 2015 at 21:52, Scott Stults <
> sstu...@opensourceconnections.com
> > wrote:
>
> > Susheel,
> >
> > This is perfectly fine for simple use-cases and has the benefit that the
> > filterCache will help things stay nice and speedy. Apache ManifoldCF
> goes a
> > bit further and ties back to your authentication and authorization
> > mechanism:
> >
> >
> >
> http://manifoldcf.apache.org/release/trunk/en_US/concepts.html#ManifoldCF+security+model
> >
> >
> > k/r,
> > Scott
> >
> > On Thu, Nov 5, 2015 at 2:26 PM, Susheel Kumar 
> > wrote:
> >
> > > Hi,
> > >
> > > I have seen couple of use cases / need where we want to restrict result
> > of
> > > search based on role of a user.  For e.g.
> > >
> > > - if user role is admin, any document from the search result will be
> > > returned
> > > - if user role is manager, only documents intended for managers will be
> > > returned
> > > - if user role is worker, only documents intended for workers will be
> > > returned
> > >
> > > Typical practise is to tag the documents with the roles (using a
> > > multi-valued field) during indexing and then during search append
> filter
> > > query to restrict result based on roles.
> > >
> > > Wondering if there is any other better way out there and if this common
> > > requirement should be added as a Solr feature/plugin.
> > >
> > > The current security plugins are more towards making Solr
> apis/resources
> > > secure not towards securing/controlling data during search.
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/solr/Authentication+and+Authorization+Plugins
> > >
> > >
> > > Please share your thoughts.
> > >
> > > Thanks,
> > > Susheel
> > >
> >
> >
> >
> > --
> > Scott Stults | Founder & Solutions Architect | OpenSource Connections,
> LLC
> > | 434.409.2780
> > http://www.opensourceconnections.com
> >
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>


Re: DevOps question : auto deployment/setup of Solr & Zookeeper on medium-large clusters

2015-11-13 Thread Susheel Kumar
Hi Davis, I wanted to thank you for suggesting Ansible as one of the
automation tools; it has been working very well for automating the
deployments of ZooKeeper and Solr on our clusters.

Thanks,
Susheel

On Wed, Oct 21, 2015 at 10:47 AM, Davis, Daniel (NIH/NLM) [C] <
daniel.da...@nih.gov> wrote:

> Susheel,
>
> Our puppet stuff is very close to our infrastructure, using specific
> Netapp volumes and such, and assuming some files come from NFS.
> It is also personally embarrassing to me that we still use NIS - doh!
>
> -Original Message-
> From: Susheel Kumar [mailto:susheel2...@gmail.com]
> Sent: Tuesday, October 20, 2015 8:34 PM
> To: solr-user@lucene.apache.org
> Subject: Re: DevOps question : auto deployment/setup of Solr & Zookeeper
> on medium-large clusters
>
> Thanks, Davis, Jeff.
>
> We are not using AWS.  Is there any scripts/framework already developed
> using puppet available?
>
> On Tue, Oct 20, 2015 at 7:59 PM, Jeff Wartes 
> wrote:
>
> >
> > If you’re using AWS, there’s this:
> > https://github.com/LucidWorks/solr-scale-tk
> > If you’re using chef, there’s this:
> > https://github.com/vkhatri/chef-solrcloud
> >
> > (There are several other chef cookbooks for Solr out there, but this
> > is the only one I’m aware of that supports Solr 5.3.)
> >
> > For ZK, I’m less familiar, but if you’re using chef there’s this:
> > https://github.com/SimpleFinance/chef-zookeeper
> > And this might be handy to know about too:
> > https://github.com/Netflix/exhibitor/wiki
> >
> >
> > On 10/20/15, 6:37 AM, "Davis, Daniel (NIH/NLM) [C]"
> > 
> > wrote:
> >
> > >Waste of money in my opinion.   I would point you towards other tools -
> > >bash scripts and free configuration managers such as puppet, chef, salt,
> > >or ansible.Depending on what development you are doing, you may want
> > >a continuous integration environment.   For a small company starting
> out,
> > >using a free CI, maybe SaaS, is a good choice.   A professional version
> > >such as Bamboo, TeamCity, Jenkins are almost essential in a large
> > >enterprise if you are doing diverse builds.
> > >
> > >When you create a VM, you can generally specify a script to run after
> the
> > >VM is mostly created.   There is a protocol (PXE Boot) that enables this
> > >- a PXE server listens and hears that a new server with such-and-such
> > >Ethernet Address is starting.   The PXE server makes it boot like a
> > >CD-ROM/DVD install, booting from installation media on the network and
> > >installing.Once that install is down, a custom script may be
> invoked.
> > >  This script is typically a bash script, because you may not be able to
> > >count on too much else being installed.   However, python/perl are also
> > >reasonable choices - just be careful that the modules/libraries you are
> > >using for the script are present.The same PXE protocol is used in
> > >large on-premises installations (vCenter) and in the cloud
> > >(AWS/Digital Ocean).  We don't care about the PXE server - the point
> > >is that you can generally run a bash script after your install.
> > >
> > >The bash script can bootstrap other services such as puppet, chef, or
> > >salt, and/or setup keys so that push configuration management tools such
> > >as ansible can reach the server.   The bash script may even be smart
> > >enough to do all of the setup you need, depending on what other servers
> > >you need to configure.   Smart bash scripts are good for a small
> company,
> > >but for large setups, I'd use puppet, chef, salt, and/or ansible.
> > >
> > >What I tend to do is to deploy things in such a way that puppet
> > >(because it is what we use here) can setup things so that a "solradm"
> > >account can setup everything else, and solr and zookeeper are running
> as a "solrapp"
> > >user using puppet.Then, my continuous integration server, which is
> > >Atlassian Bamboo (you can also use tools such as Jenkins, TeamCity,
> > >BuildBot), installs solr as "solradm" and sets it up to run as
> "solrapp".
> > >
> > >I am not a systems administrator, and I'm not really in "DevOps", my
> > >job is to be above all of that and do "systems architecture" which I
> > >am lucky still involves coding both in system administration and
> applications
> > >development.   So, that's my 2 cents.
> > >
> > >Dan Davis,

Re: capacity of storage a single core

2015-12-09 Thread Susheel Kumar
Hi Jack,

Just to add: the OS disk cache will still keep queries performant even though
the entire index can't be loaded into memory. How much extra latency there is,
compared to the index being completely cached in memory, varies depending on
index size and other factors.  I am trying to clarify this here because a lot
of folks take "fit the index into memory" as a hard guideline and try to come
up with hardware (hundreds of machines) just for the sake of fitting the index
into memory, even though there may not be much load/QPS on the cluster.  This
varies and needs to be tested case by case, but, for example, a machine with
64GB should still provide good performance (not the best) for a 100GB index on
that machine.  Do you agree / any thoughts?

I believe the same is the case with replicas: on a single machine you may have
replicas which, together with the shard index, may not fit into memory either.

Thanks,
Susheel

On Tue, Dec 8, 2015 at 11:31 AM, Jack Krupansky 
wrote:

> Generally, you will be resource limited (memory, cpu) rather than by some
> arbitrary numeric limit (like 2 billion.)
>
> My personal general recommendation is for a practical limit is 100 million
> documents on a machine/node. Depending on your data model and actual data
> that number could be higher or lower. A proof of concept test will allow
> you to determine the actual number for your particular use case, but a
> presumed limit of 100 million is not a bad start.
>
> You should have enough memory to hold the entire index in system memory. If
> not, your query latency will suffer due to I/O required to constantly
> re-read portions of the index into memory.
>
> The practical limit for documents is not per core or number of cores but
> across all cores on the node since it is mostly a memory limit and the
> available CPU resources for accessing that memory.
>
> -- Jack Krupansky
>
> On Tue, Dec 8, 2015 at 8:57 AM, Toke Eskildsen 
> wrote:
>
> > On Tue, 2015-12-08 at 05:18 -0700, Mugeesh Husain wrote:
> > > Capacity regarding 2 simple question:
> > >
> > > 1.) How many document we could store in single core(capacity of core
> > > storage)
> >
> > There is hard limit of 2 billion documents.
> >
> > > 2.) How many core we could create in a single server(single node
> cluster)
> >
> > There is no hard limit. Except for 2 billion cores, I guess. But at this
> > point in time that is a ridiculously high number of cores.
> >
> > It is hard to give a suggestion for real-world limits as indexes vary a
> > lot and the rules of thumb tend to be quite poor when scaling up.
> >
> >
> http://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >
> > People generally seems to run into problems with more than 1000
> > not-too-large cores. If the cores are large, there will probably be
> > performance problems long before that.
> >
> > You will have to build a prototype and test.
> >
> > - Toke Eskildsen, State and University Library, Denmark
> >
> >
> >
>


Re: capacity of storage a single core

2015-12-09 Thread Susheel Kumar
Thanks, Jack, for the quick reply.  By replica vs. shard I mean that on a
given machine there may be two or more replicas, and all of them together may
not fit into memory.

On Wed, Dec 9, 2015 at 11:00 AM, Jack Krupansky 
wrote:

> Yes, there are nuances to any general rule. It's just a starting point, and
> your own testing will confirm specific details for your specific app and
> data. For example, maybe you don't query all fields commonly, so each
> field-specific index may not require memory or not require it so commonly.
> And, yes, each app has its own latency requirements. The purpose of a
> general rule is to generally avoid unhappiness, but if you have an appetite
> and tolerance for unhappiness, then go for it.
>
> Replica vs. shard? They're basically the same - a replica is a copy of a
> shard.
>
> -- Jack Krupansky
>
> On Wed, Dec 9, 2015 at 10:36 AM, Susheel Kumar 
> wrote:
>
> > Hi Jack,
> >
> > Just to add, OS Disk Cache will still make query performant even though
> > entire index can't be loaded into memory. How much more latency compare
> to
> > if index gets completely loaded into memory may vary depending to index
> > size etc.  I am trying to clarify this here because lot of folks takes
> this
> > as a hard guideline (to fit index into memory)  and try to come up with
> > hardware/machines (100's of machines) just for the sake of fitting index
> > into memory even though there may not be much load/qps on the cluster.
> For
> > e.g. this may vary and needs to be tested on case by case basis but a
> > machine with 64GB  should still provide good performance (not the best)
> for
> > 100G index on that machine.  Do you agree / any thoughts?
> >
> > Same i believe is the case with Replicas,   as on a single machine you
> have
> > replicas which itself may not fit into memory as well along with shard
> > index.
> >
> > Thanks,
> > Susheel
> >
> > On Tue, Dec 8, 2015 at 11:31 AM, Jack Krupansky <
> jack.krupan...@gmail.com>
> > wrote:
> >
> > > Generally, you will be resource limited (memory, cpu) rather than by
> some
> > > arbitrary numeric limit (like 2 billion.)
> > >
> > > My personal general recommendation is for a practical limit is 100
> > million
> > > documents on a machine/node. Depending on your data model and actual
> data
> > > that number could be higher or lower. A proof of concept test will
> allow
> > > you to determine the actual number for your particular use case, but a
> > > presumed limit of 100 million is not a bad start.
> > >
> > > You should have enough memory to hold the entire index in system
> memory.
> > If
> > > not, your query latency will suffer due to I/O required to constantly
> > > re-read portions of the index into memory.
> > >
> > > The practical limit for documents is not per core or number of cores
> but
> > > across all cores on the node since it is mostly a memory limit and the
> > > available CPU resources for accessing that memory.
> > >
> > > -- Jack Krupansky
> > >
> > > On Tue, Dec 8, 2015 at 8:57 AM, Toke Eskildsen  >
> > > wrote:
> > >
> > > > On Tue, 2015-12-08 at 05:18 -0700, Mugeesh Husain wrote:
> > > > > Capacity regarding 2 simple question:
> > > > >
> > > > > 1.) How many document we could store in single core(capacity of
> core
> > > > > storage)
> > > >
> > > > There is hard limit of 2 billion documents.
> > > >
> > > > > 2.) How many core we could create in a single server(single node
> > > cluster)
> > > >
> > > > There is no hard limit. Except for 2 billion cores, I guess. But at
> > this
> > > > point in time that is a ridiculously high number of cores.
> > > >
> > > > It is hard to give a suggestion for real-world limits as indexes
> vary a
> > > > lot and the rules of thumb tend to be quite poor when scaling up.
> > > >
> > > >
> > >
> >
> http://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> > > >
> > > > People generally seems to run into problems with more than 1000
> > > > not-too-large cores. If the cores are large, there will probably be
> > > > performance problems long before that.
> > > >
> > > > You will have to build a prototype and test.
> > > >
> > > > - Toke Eskildsen, State and University Library, Denmark
> > > >
> > > >
> > > >
> > >
> >
>


Re: Increasing Solr5 time out from 30 seconds while starting solr

2015-12-09 Thread Susheel Kumar
Yes. Either look into the log files as Erick suggested, or run with -f and
watch the startup error on the console.  Kill any existing instance or remove
any old PID file before starting with -f.
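
For example, something like this (assuming the default PID directory under bin/):

  bin/solr stop -p 8789        # stop any old instance (or delete a stale bin/solr-8789.pid)
  bin/solr start -p 8789 -f    # foreground mode, so startup errors show on the console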

Thnx

On Wed, Dec 9, 2015 at 12:46 PM, Erick Erickson 
wrote:

> What does the Solr log file say? Often this is the result
> of Solr being in a weird state due to startup problems.
>
> Best,
> Erick
>
> On Tue, Dec 8, 2015 at 8:43 PM, Debraj Manna 
> wrote:
> > . After failed attempt to start solr if I try to start solr again on same
> > port it says solr is already running. Try running solr on different port.
> >
> > Can you let me know if it is possible to increase the timeout? So that I
> > can observe how does it behave.
> > On Dec 9, 2015 10:10 AM, "Rahul Ramesh"  wrote:
> >
> >> Hi Debraj,
> >> I dont think increasing the timeout will help. Are you sure solr/ any
> other
> >> program is not running on 8789? Please check the output of lsof -i
> :8789 .
> >>
> >> Regards,
> >> Rahul
> >>
> >> On Tue, Dec 8, 2015 at 11:58 PM, Debraj Manna  >
> >> wrote:
> >>
> >> > Can someone help me on this?
> >> > On Dec 7, 2015 7:55 PM, "D"  wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > Many time while starting solr I see the below message and then the
> solr
> >> > is
> >> > > not reachable.
> >> > >
> >> > > debraj@boutique3:~/solr5$ sudo bin/solr start -p 8789
> >> > > Waiting to see Solr listening on port 8789 [-]  Still not seeing
> Solr
> >> > listening on 8789 after 30 seconds!
> >> > >
> >> > > However when I try to start solr again by trying to execute the same
> >> > > command. It says that *"solr is already running on port 8789. Try
> >> using a
> >> > > different port with -p"*
> >> > >
> >> > > I am having two cores in my local set-up. I am guessing this is
> >> happening
> >> > > because one of the core is a little big. So solr is timing out while
> >> > > loading the core. If I take one of the core out of solr then
> everything
> >> > > works fine.
> >> > >
> >> > > Can some one let me know how can I increase this timeout value from
> >> > > default 30 seconds?
> >> > >
> >> > > I am using Solr 5.2.1 on Debian 7.
> >> > >
> >> > > Thanks,
> >> > >
> >> > >
> >> >
> >>
>


Re: Increasing Solr5 time out from 30 seconds while starting solr

2015-12-10 Thread Susheel Kumar
g/Downloads/software/dev/solr5/contrib/velocity/lib/velocity-tools-2.0.jar'
> to classloader
> WARN  - 2015-12-10 15:37:16.848; [   ]
> org.apache.solr.core.SolrResourceLoader; Can't find (or read) directory to
> add to classloader: /home/jabong/Downloads/software/dev/solr5/dist/
> (resolved as: /home/jabong/Downloads/software/dev/solr5/dist).
> INFO  - 2015-12-10 15:37:17.476; [   ]
> org.apache.solr.update.SolrIndexConfig; IndexWriter infoStream solr logging
> is enabled
> INFO  - 2015-12-10 15:37:17.482; [   ]
> org.apache.solr.update.SolrIndexConfig; IndexWriter infoStream solr logging
> is enabled
> INFO  - 2015-12-10 15:37:17.485; [   ] org.apache.solr.core.SolrConfig;
> Using Lucene MatchVersion: 5.0.0
> INFO  - 2015-12-10 15:37:17.488; [   ] org.apache.solr.core.SolrConfig;
> Using Lucene MatchVersion: 5.2.1
> INFO  - 2015-12-10 15:37:17.702; [   ] org.apache.solr.core.SolrConfig;
> Loaded SolrConfig: solrconfig.xml
> INFO  - 2015-12-10 15:37:17.702; [   ] org.apache.solr.core.SolrConfig;
> Loaded SolrConfig: solrconfig.xml
> INFO  - 2015-12-10 15:37:17.718; [   ] org.apache.solr.schema.IndexSchema;
> Reading Solr Schema from
>
> /home/jabong/Downloads/software/dev/solr5/server/solr/jabong_discovery_visual/conf/schema.xml
> INFO  - 2015-12-10 15:37:17.720; [   ] org.apache.solr.schema.IndexSchema;
> Reading Solr Schema from
>
> /home/jabong/Downloads/software/dev/solr5/server/solr/discovery/conf/schema.xml
> INFO  - 2015-12-10 15:37:17.766; [   ] org.apache.solr.schema.IndexSchema;
> [jabong_discovery_visual] Schema name=example-data-driven-schema
> INFO  - 2015-12-10 15:37:17.782; [   ] org.apache.solr.schema.IndexSchema;
> [discovery] Schema name=example-data-driven-schema
>
>
> On Thu, Dec 10, 2015 at 12:30 AM, Susheel Kumar 
> wrote:
>
> > Yes, Either look into log files as Eric suggested or run with -f and see
> > the startup error on the console.  Kill any existing instance or remove
> any
> > old PID file before starting with -f.
> >
> > Thnx
> >
> > On Wed, Dec 9, 2015 at 12:46 PM, Erick Erickson  >
> > wrote:
> >
> > > What does the Solr log file say? Often this is the result
> > > of Solr being in a weird state due to startup problems.
> > >
> > > Best,
> > > Erick
> > >
> > > On Tue, Dec 8, 2015 at 8:43 PM, Debraj Manna  >
> > > wrote:
> > > > . After failed attempt to start solr if I try to start solr again on
> > same
> > > > port it says solr is already running. Try running solr on different
> > port.
> > > >
> > > > Can you let me know if it is possible to increase the timeout? So
> that
> > I
> > > > can observe how does it behave.
> > > > On Dec 9, 2015 10:10 AM, "Rahul Ramesh"  wrote:
> > > >
> > > >> Hi Debraj,
> > > >> I dont think increasing the timeout will help. Are you sure solr/
> any
> > > other
> > > >> program is not running on 8789? Please check the output of lsof -i
> > > :8789 .
> > > >>
> > > >> Regards,
> > > >> Rahul
> > > >>
> > > >> On Tue, Dec 8, 2015 at 11:58 PM, Debraj Manna <
> > subharaj.ma...@gmail.com
> > > >
> > > >> wrote:
> > > >>
> > > >> > Can someone help me on this?
> > > >> > On Dec 7, 2015 7:55 PM, "D"  wrote:
> > > >> >
> > > >> > > Hi,
> > > >> > >
> > > >> > > Many time while starting solr I see the below message and then
> the
> > > solr
> > > >> > is
> > > >> > > not reachable.
> > > >> > >
> > > >> > > debraj@boutique3:~/solr5$ sudo bin/solr start -p 8789
> > > >> > > Waiting to see Solr listening on port 8789 [-]  Still not seeing
> > > Solr
> > > >> > listening on 8789 after 30 seconds!
> > > >> > >
> > > >> > > However when I try to start solr again by trying to execute the
> > same
> > > >> > > command. It says that *"solr is already running on port 8789.
> Try
> > > >> using a
> > > >> > > different port with -p"*
> > > >> > >
> > > >> > > I am having two cores in my local set-up. I am guessing this is
> > > >> happening
> > > >> > > because one of the core is a little big. So solr is timing out
> > while
> > > >> > > loading the core. If I take one of the core out of solr then
> > > everything
> > > >> > > works fine.
> > > >> > >
> > > >> > > Can some one let me know how can I increase this timeout value
> > from
> > > >> > > default 30 seconds?
> > > >> > >
> > > >> > > I am using Solr 5.2.1 on Debian 7.
> > > >> > >
> > > >> > > Thanks,
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
>


Re: capacity of storage a single core

2015-12-10 Thread Susheel Kumar
I like the details here, Erick, on how you broke memory down into different
parts. I feel that if we can combine this knowledge from your various posts,
the sizing blog above, the Solr wiki pages, and Uwe's article on MMap/heap,
and consolidate and present it in a single place, it would help a lot of new
folks and folks struggling with memory/heap/sizing questions.

Thanks,
Susheel

On Wed, Dec 9, 2015 at 12:40 PM, Erick Erickson 
wrote:

> I object to the question. And the advice. And... ;).
>
> Practically, IMO guidance that "the entire index should
> fit into memory" is misleading, especially for newbies.
> Let's break it down:
>
> 1>  "the entire index". What's this? The size on disk?
> 90% of that size on disk may be stored data which
> uses very little memory, which is limited by the
> documentCache in Solr. OTOH, only 10% of the on-disk
> size might be stored data.
>
> 2> "fit into memory". What memory? Certainly not
> the JVM as much of the Lucene-level data is in
> MMapDirectory which uses the OS memory. So
> this _probably_ means JVM + OS memory, and OS
> memory is shared amongst other processes as well.
>
> 3> Solr and Lucene build in-memory structures that
> aren't reflected in the index size on disk. I've seen
> filterCaches for instance that have been (mis) configured
> that could grow to 100s of G. This is totally not reflected in
> the "index size".
>
> 4> Try faceting on a text field with lots of unique
> values. Bad Practice, but you'll see just how quickly
> the _query_ can change the memory requirements.
>
> 5> Sure, with modern hardware we can create huge JVM
> heaps... that hit GC pauses that'll drive performance
> down, sometimes radically.
>
> I've seen 350M docs, 200-300 fields (aggregate) fit into 12G
> of JVM. I've seen 25M docs (really big ones) strain 48G
> JVM heaps.
>
> Jack's approach is what I use; pick a number and test with it.
> Here's an approach:
>
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> Best,
> Erick
>
> On Wed, Dec 9, 2015 at 8:54 AM, Susheel Kumar 
> wrote:
> > Thanks, Jack for quick reply.  With Replica / Shard I mean to say on a
> > given machine there may be two/more replicas and all of them may not fit
> > into memory.
> >
> > On Wed, Dec 9, 2015 at 11:00 AM, Jack Krupansky <
> jack.krupan...@gmail.com>
> > wrote:
> >
> >> Yes, there are nuances to any general rule. It's just a starting point,
> and
> >> your own testing will confirm specific details for your specific app and
> >> data. For example, maybe you don't query all fields commonly, so each
> >> field-specific index may not require memory or not require it so
> commonly.
> >> And, yes, each app has its own latency requirements. The purpose of a
> >> general rule is to generally avoid unhappiness, but if you have an
> appetite
> >> and tolerance for unhappiness, then go for it.
> >>
> >> Replica vs. shard? They're basically the same - a replica is a copy of a
> >> shard.
> >>
> >> -- Jack Krupansky
> >>
> >> On Wed, Dec 9, 2015 at 10:36 AM, Susheel Kumar 
> >> wrote:
> >>
> >> > Hi Jack,
> >> >
> >> > Just to add, OS Disk Cache will still make query performant even
> though
> >> > entire index can't be loaded into memory. How much more latency
> compare
> >> to
> >> > if index gets completely loaded into memory may vary depending to
> index
> >> > size etc.  I am trying to clarify this here because lot of folks takes
> >> this
> >> > as a hard guideline (to fit index into memory)  and try to come up
> with
> >> > hardware/machines (100's of machines) just for the sake of fitting
> index
> >> > into memory even though there may not be much load/qps on the cluster.
> >> For
> >> > e.g. this may vary and needs to be tested on case by case basis but a
> >> > machine with 64GB  should still provide good performance (not the
> best)
> >> for
> >> > 100G index on that machine.  Do you agree / any thoughts?
> >> >
> >> > Same i believe is the case with Replicas,   as on a single machine you
> >> have
> >> > replicas which itself may not fit into memory as well along with shard
> >> > index.
> >> >
> >> > Thanks,
> >> > Susheel
> >> >
> >> > On Tue, Dec 8, 2015 at 11:31 AM, Jack Krupa

Re: capacity of storage a single core

2015-12-11 Thread Susheel Kumar
Thanks, Alessandro.  We can attempt to come up with such a blog post, and I
can volunteer bullets/headings to start with. I also agree that we can't come
up with a definitive answer, as mentioned in other places, but we can at least
attempt to consolidate all this knowledge into one place.  As of now I see a
few sources that can be referred to for a consolidated write-up:

https://wiki.apache.org/solr/SolrPerformanceProblems
http://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
Uwe's article on MMap
Erick's and others' valuable posts



On Fri, Dec 11, 2015 at 6:20 AM, Alessandro Benedetti  wrote:

> Susheel, this is a very good idea.
> I am a little bit busy this period, so I doubt I can contribute with a blog
> post, but it would be great if anyone has time.
> If not I will add it to my backlog and sooner or later I will do it :)
>
> Furthermore latest observations from Erick are pure gold, and I agree
> completely.
> I have only a question related this :
>
> 1>  "the entire index". What's this? The size on disk?
> > 90% of that size on disk may be stored data which
> > uses very little memory, which is limited by the
> > documentCache in Solr. OTOH, only 10% of the on-disk
> > size might be stored data.
>
>
> If I am correct the documentCache in Solr is a map that relates the Lucene
> document ordinal to the stored fields for that document.
> We have control on that and we can assign our preferred values.
> First question :
> 1) Is this using the JVM memory to store this cache ? I assume yes.
> So we need to take care of our JVM memory if we want to store in memory big
> chunks of the stored index.
>
> 2) MMap index segments are actually only the segments used for searching ?
> Is not the Lucene directory memory mapping the stored segments as well ?
> This was my understanding but maybe I am wrong.
> In the case we first memory map the stored segments and then potentially
> store them on the Solr cache as well, right ?
>
> Cheers
>
>
> On 10 December 2015 at 19:43, Susheel Kumar  wrote:
>
> > Like the details here Eric how you broke memory into different parts. I
> > feel if we can combine lot of this knowledge from your various posts,
> above
> > sizing blog, Solr wiki pages, Uwe article on MMap/heap,  consolidate and
> > present in at single place which may help lot of new folks/folks
> struggling
> > with memory/heap/sizing issues questions etc.
> >
> > Thanks,
> > Susheel
> >
> > On Wed, Dec 9, 2015 at 12:40 PM, Erick Erickson  >
> > wrote:
> >
> > > I object to the question. And the advice. And... ;).
> > >
> > > Practically, IMO guidance that "the entire index should
> > > fit into memory" is misleading, especially for newbies.
> > > Let's break it down:
> > >
> > > 1>  "the entire index". What's this? The size on disk?
> > > 90% of that size on disk may be stored data which
> > > uses very little memory, which is limited by the
> > > documentCache in Solr. OTOH, only 10% of the on-disk
> > > size might be stored data.
> > >
> > > 2> "fit into memory". What memory? Certainly not
> > > the JVM as much of the Lucene-level data is in
> > > MMapDirectory which uses the OS memory. So
> > > this _probably_ means JVM + OS memory, and OS
> > > memory is shared amongst other processes as well.
> > >
> > > 3> Solr and Lucene build in-memory structures that
> > > aren't reflected in the index size on disk. I've seen
> > > filterCaches for instance that have been (mis) configured
> > > that could grow to 100s of G. This is totally not reflected in
> > > the "index size".
> > >
> > > 4> Try faceting on a text field with lots of unique
> > > values. Bad Practice, but you'll see just how quickly
> > > the _query_ can change the memory requirements.
> > >
> > > 5> Sure, with modern hardware we can create huge JVM
> > > heaps... that hit GC pauses that'll drive performance
> > > down, sometimes radically.
> > >
> > > I've seen 350M docs, 200-300 fields (aggregate) fit into 12G
> > > of JVM. I've seen 25M docs (really big ones) strain 48G
> > > JVM heaps.
> > >
> > > Jack's approach is what I use; pick a number and test with it.
> > > Here's an approach:
> > >
> > >
> >
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-

RE: Is it possible to use multiple index data directory in Apache Solr?

2015-03-01 Thread Susheel Kumar
Under the Solr example folder you will find a "multicore" folder, under which you 
can create multiple core/index directories and edit solr.xml to declare each new 
core and its directory.

When you start Solr from the example directory, use a command line like the one 
below to load Solr; you should then be able to see these cores in the Solr admin 
UI and the index data in each core's data directory.

> java -Dsolr.solr.home=multicore -jar start.jar 
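
For reference, the legacy-style solr.xml in that multicore folder looks roughly 
like this (core names here are just examples):

  <solr persistent="false">
    <cores adminPath="/admin/cores">
      <core name="core0" instanceDir="core0" />
      <core name="core1" instanceDir="core1" />
    </cores>
  </solr>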

Thnx

-Original Message-
From: Jou Sung-Shik [mailto:lik...@gmail.com] 
Sent: February 28, 2015 10:03 PM
To: solr-user@lucene.apache.org
Subject: Is it possible to use multiple index data directory in Apache Solr?

I'm new in Apache Lucene/Solr.

I try to move from Elasticsearch to Apache Solr.

So, I have a question about following index data location configuration.


*in Elasticsearch*

# Can optionally include more than one lo # the locations (a la RAID 0) on a 
file l # space on creation. For example:
#
# path.data: /path/to/data1,/path/to/data2

*in Apache Solr*

/var/data/solr/


I want to configure multiple index data directory like Elasticsearch in Apache 
Solr.

Is it possible?

How I can reach the goal?





--
-
BLOG : http://www.codingstar.net
-


Re: SOLR cloud sharding

2016-06-03 Thread Susheel Kumar
Also, I am not sure about your domain, but you may want to double-check
whether you really need 350 fields for searching & storing. Often, when you
weigh this against the higher cost of hardware, you may be able to reduce the
number of searchable/stored fields.

Thanks,
Susheel

On Thu, Jun 2, 2016 at 9:21 AM, Shawn Heisey  wrote:

> On 6/2/2016 1:28 AM, Selvam wrote:
> > We need to run a heavy SOLR with 300 million documents, with each
> > document having around 350 fields. The average length of the fields
> > will be around 100 characters, it may have date and integers fields as
> > well. Now we are not sure whether to have single server or run
> > multiple servers (for each node/shards?). We are using Solr 5.5 and
> > want best performance. We are new to SolrCloud, I would like to
> > request your inputs on how many nodes/shards we need to have and how
> > many servers for best performance. We primarily use geo-statial search.
>
> The really fast answer, which I know isn't really an answer, is this:
>
>
> https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> This is *also* the answer if I take time to really think about it ...
> and I do realize that none of this actually helps you.  You will need to
> prototype.  Ideally, your prototype should be the entire index.
> Performance will generally not scale linearly, so if you make decisions
> based on a small-scale prototype, you might find that you don't have
> enough hardware.
>
> The answer will be *heavily* influenced by how many of those 350 fields
> will be used for searching, sorting, faceting, etc.  It will also be
> influenced by the complexity of the queries, how fast the queries must
> complete, and how many queries per second the cluster must handle.
>
> With the information you have supplied, your whole index is likely to be
> in the 10-20TB range.  Performance on an index that large, even with
> plenty of hardware and good tuning, is probably not going to be
> stellar.  You are likely to need several terabytes of total RAM (across
> all servers) to achieve reasonable performance *on a single copy*.  If
> you want two copies of the index for high availability, your RAM
> requirements will double.  Handling an index this size is not going to
> be inexpensive.
>
> An unavoidable fact about Solr performance:  For best results, Solr must
> be able to read critical data entirely from RAM for queries.  If it must
> go to disk, then performance will not be optimal -- disks are REALLY
> slow.  Putting the data on SSD will help, but even SSD storage is quite
> a lot slower than RAM.
>
> For *perfect* performance, the index data on a server must fit entirely
> into unallocated memory -- which means memory beyond the Java heap and
> the basic operating system requirements.  The operating system (not
> Java) will automatically handle caching the index in this available
> memory.  This perfect situation is usually not required in practice,
> though -- the *entire* index is not needed when you do a query.
>
> Here's something I wrote about the topic of Solr performance.  It is not
> as comprehensive as I would like it to be, because I have tried to make
> it relatively concise and useful:
>
> https://wiki.apache.org/solr/SolrPerformanceProblems
>
> Thanks,
> Shawn
>
>


Re: Solr Schema for same field names within different input entities

2016-06-08 Thread Susheel Kumar
How about creating a schema with temperature, humidity, and a day field (plus
any other fields you may have, like zipcode/city/country)?  Put day="next" or
day="previous" on each document and, at query time, use a filter query such as
fq=day:previous or fq=day:next.
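
For example, each day block could be indexed as its own document and then
filtered at query time (assuming those fields are defined in the schema; the
id values below are made up):

  <add>
    <doc>
      <field name="id">station1_previous</field>
      <field name="day">previous</field>
      <field name="temperature">13</field>
      <field name="humidity">50</field>
    </doc>
    <doc>
      <field name="id">station1_next</field>
      <field name="day">next</field>
      <field name="temperature">15</field>
      <field name="humidity">60</field>
    </doc>
  </add>

  /select?q=temperature:[14 TO *]&fq=day:next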

Thanks,
Susheel

On Wed, Jun 8, 2016 at 2:46 PM, Aniruddh Sharma 
wrote:

> Hi
>
> Request help
>
> I have following XML data to start with
>
> 
> <previousDay>
>   <temperature>13</temperature>
>   <humidity>50</humidity>
> </previousDay>
> <nextDay>
>   <temperature>15</temperature>
>   <humidity>60</humidity>
> </nextDay>
> 
>
>
> Please notice it has "previousDay" and "nextDay" and both of them contains
> details of same field "temperature" and "humidity"
>
> What is best way to create schema for it , where I could query for
> temperature on previousDay as well as on currentDay
>
>
>
> Thanks and Regards
> Aniruddh
>


ImplicitSnitch preferredNodes

2016-06-30 Thread Susheel Kumar
Hello Arcadius, Noble,

I have a single Solr cluster set up across two DCs with good connectivity and
a configuration similar to the one below, and I am looking to use the
preferredNodes feature/rule so that search queries executed from a DC1 client
use all dc1 replicas and queries from a DC2 client use all dc2 replicas.

I am a bit confused by the current documentation (
https://issues.apache.org/jira/browse/SOLR-8522) about which steps need to be
taken on the client and ZooKeeper side.

Can you please summarize what needs to be done as part of the SolrJ client
configuration/properties and the ZooKeeper cluster state (MODIFYCOLLECTION) to
make this work?  In the meantime I'll take a closer look at the tests.

DC1 - 3-dc1 shards replica and 3-dc2 shards replica
DC2 - 3-dc2 shards replica and 3-dc1 shards replica


Thanks,
Susheel


ImplicitSnitch Documentation for querying Multi-DataCenter replicas using preferredNodes

2016-07-05 Thread Susheel Kumar
Hello,

Can someone help me clarify and document how to use the ImplicitSnitch
preferredNodes rule to implement a scenario where search queries executed from
a data center DC1 client use all dc1 replicas and queries from a data center
DC2 client use all dc2 replicas?

The only sources I see are the discussion in the JIRA ticket
https://issues.apache.org/jira/browse/SOLR-8146  and the blog
http://menelic.com/2015/12/05/allowing-solrj-cloudsolrclient-to-have-preferred-replica-for-query-operations/


Thanks,
Susheel


Re: Shard vs Replica

2016-07-06 Thread Susheel Kumar
To understand shards & replicas, let's first understand what sharding is and
why it is needed.

Sharding - Assume your index grows so large that it doesn't fit into a single
machine (e.g. your index is 80GB and your machine has 64GB, in which case the
index won't fit into memory).  Now, to get better performance, you either
increase your RAM or add another machine with a similar configuration and
divide (i.e. partition) your index into two, so that each 40GB half fits on
one of the two machines.  So your complete index = Index1 + Index2 (each
40GB).  Index1, Index2, etc. are called shards, and depending on how big your
index is and your machines' resources, you may plan for N shards.  To do a
complete search, the search has to be performed on all the shards.

Hope that clarifies what a shard is and why sharding is needed.

Replication - Assume one of the above machines goes down; now you won't be
able to search the complete index, since half of the data is not available. To
avoid this single point of failure, you can create replicas (copies of
shards), either on the existing machines or on additional new machines,
depending on requirements:

Machine 1 = Shard1 +  Copy of Shard2 (Shard2_Replica1)
Machine 2 = Shard2 +  Copy of Shard1 (Shard1_Replica1)

So you create replicas to avoid a single point of failure and also to serve a
higher number of queries per second (in case the replicas are created on
additional machines).
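
For example, a two-shard collection with one extra copy of each shard can be
created with the Collections API (collection name and host are placeholders):

  http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&maxShardsPerNode=2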

Hope that clarifies.

Thanks,
Susheel



On Wed, Jul 6, 2016 at 5:32 PM, John Doe  wrote:

> Hey,
>
> I have have the same question on freenode channel , people answered me ,
> but I believe that I still got doubts. Just because I never had approach to
> such data store technologies before it makes me hardly understand what is
> exactly is replica and shard in solr. I believe once I understand what
> exactly are these two, then I would be able to see the difference.
>
> According to English dictionary replica is exact copy of something, which
> sounds like a true to me, but what is shard then here and how is it
> connected with all this context ? Can someone explain this in brief and
> give some examples ?
>
> Thank you in advance
>


Re: SolrCloud - Query performance degrades with multiple servers(Shards)

2016-07-18 Thread Susheel Kumar
Hello,

Question: do you really need sharding, or can you live without it, since you
mentioned only 10K records per shard?  What is your index/document size?

Thanks,
Susheel

On Mon, Jul 18, 2016 at 2:08 AM, kasimjinwala 
wrote:

> currently I am using solrCloud 5.0 and I am facing query performance issue
> while using 3 implicit shards, each shard contain around 10K records.
> when I am specifying shards parameter(*shards=shard1*) in query it gives
> 30K-35K qps. but while removing shards parameter from query it give
> *1000-1500qps*. performance decreases drastically.
>
> please provide comment or suggestion to solve above issue
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SolrCloud-Query-performance-degrades-with-multiple-servers-tp4024660p4287600.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: SolrCloud - Query performance degrades with multiple servers(Shards)

2016-07-19 Thread Susheel Kumar
You may want to utilise the document routing (_route_) option to make your
queries faster, but above you are comparing apples with oranges: your
performance test numbers have to be based either on your actual volumes (e.g.
3-5 million docs per shard) or on enough data to see the advantage of using
sharding.  10K docs is nothing for a performance test and will not tell you
anything.
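
For example, if documents were indexed with a compositeId routing prefix such
as customerA!doc123, a query can be restricted to just the shard(s) holding
that key (host and collection name are placeholders):

  http://localhost:8983/solr/collection1/select?q=*:*&_route_=customerA!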

Otherwise as Eric mentioned don't shard  and add replica's if there is no
need to distribute/divide data into shards.
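
As an illustration only (the collection name and routing key below are
hypothetical), this is roughly how _route_ is used: with the compositeId
router the prefix before the '!' in the document id decides the shard, and the
same prefix can then restrict a query to that shard, while with the implicit
router you address shards by name:

  /solr/mycollection/select?q=coffee&_route_=tenantA!    (compositeId router)
  /solr/mycollection/select?q=coffee&shards=shard1       (implicit router, by shard name)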


See
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud

https://cwiki.apache.org/confluence/display/solr/Advanced+Distributed+Request+Options


Thanks,
Susheel

On Tue, Jul 19, 2016 at 1:41 AM, kasimjinwala 
wrote:

> This is just for performance testing; we have taken 10K records per shard.
> In a live scenario it would be 30L-50L (3-5 million) per shard. I want to
> search documents across all shards, but it slows down and takes too long.
>
> I know that in SolrCloud the query goes to all shard nodes and then returns
> the result. Is there any way to search documents across all shards with the
> best performance (qps)?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SolrCloud-Query-performance-degrades-with-multiple-servers-tp4024660p4287763.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


How to set credentials when querying using SolrJ - Basic Authentication

2016-08-01 Thread Susheel Kumar
Hello,

I am looking to pass a user / password when querying using CloudSolrClient.
The documentation
https://cwiki.apache.org/confluence/display/solr/Basic+Authentication+Plugin
describes setting the credentials when calling the request method, like below:

SolrRequest req; // create a new request object
req.setBasicAuthCredentials(userName, password);
solrClient.request(req);

BUT how do we set the credentials when calling the query method directly?

// HOW do we set credentials before calling the query method below?
solrClient.query(collection, query);

Looking at the CloudSolrClient source code, I see that the query method creates
a new QueryRequest object internally and thus doesn't provide an easy way to
set credentials. Is there any way to easily hook/set credentials when calling
the query method using SolrJ?

Thanks,
Susheel








Re: How to set credentials when querying using SolrJ - Basic Authentication

2016-08-01 Thread Susheel Kumar
Thank you so much, Shawn. I didn't realize that I could call process()
directly. I think it would be helpful to add this code to the Solr
documentation; I'll create a JIRA to update the documentation.

Thanks,
Susheel

On Mon, Aug 1, 2016 at 7:14 PM, Shawn Heisey  wrote:

> On 8/1/2016 1:59 PM, Susheel Kumar wrote:
> > BUT how do we set the credentials when calling
> >
> > //HOW to set credentials before calling query method
> > ???
> > ???
> > solrClient.query(collection, query);method?
>
> Here's an example of setting credentials on an arbitrary request object
> that is then sent to a specific collection; this should always work:
>
>   SolrClient client = new CloudSolrClient("localhost:9893");
>   String collection = "gettingstarted";
>   String username = "test";
>   String password = "password";
>
>   SolrQuery query = new SolrQuery();
>   query.setQuery("*:*");
>   // Do any other query setup needed.
>
>   SolrRequest req = new QueryRequest(query);
>   req.setBasicAuthCredentials(username, password);
>   QueryResponse rsp = req.process(client, collection);
>   System.out.println("numFound: " + rsp.getResults().getNumFound());
>
> You can use a similar approach with UpdateRequest to what I've done here
> with QueryRequest.
>
> When you use the sugar methods (like query, update, commit, etc),
> SolrClient builds a request object and then uses the "process" method on
> the request.  The example above just makes this more explicit.
>
> After I wrote the above code, I discovered that there is an alternate
> request method that includes the collection parameter.  This method
> exists in the 5.4 version of SolrJ, which is the minimum version
> required for setBasicAuthCredentials.  I think the user code would be
> about the same size either way.
>
> Thanks,
> Shawn
>
>


Re: problems with bulk indexing with concurrent DIH

2016-08-02 Thread Susheel Kumar
My experience with DIH was that we couldn't scale to the level we wanted.
SolrJ with multi-threading & batch updates (parallel threads pushing data into
Solr) worked, and we were able to ingest 5K-10K docs per second.
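
For illustration, a minimal sketch of that kind of multi-threaded SolrJ batch
indexing (the zkHost, collection name, document count and field names are all
hypothetical placeholders, and error handling is omitted):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
        client.setDefaultCollection("mycollection");

        int threads = 8;          // parallel indexing threads
        int batchSize = 1000;     // docs per update request
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int t = 0; t < threads; t++) {
            final int offset = t;
            pool.submit(() -> {
                List<SolrInputDocument> batch = new ArrayList<>();
                // each thread indexes its own slice of the data
                for (int i = offset; i < 1_000_000; i += threads) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", "doc-" + i);
                    doc.addField("description", "sample description " + i);
                    batch.add(doc);
                    if (batch.size() >= batchSize) {
                        client.add(batch);   // one batched update request
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    client.add(batch);
                }
                return null;
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        client.commit();   // single commit at the end instead of per batch
        client.close();
    }
}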

Thanks,
Susheel

On Tue, Aug 2, 2016 at 9:15 AM, Mikhail Khludnev  wrote:

> Bernd,
> But why do you have so many deletes? Is it expected?
> When you run DIHs concurrently, do you shard input data by uniqueKey?
>
> On Wed, Jul 27, 2016 at 6:20 PM, Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de> wrote:
>
> > If there is a problem in single index then it might also be in CloudSolr.
> > As far as I could figure out from INFOSTREAM, documents are added to
> > segments
> > and terms are "collected". Duplicate terms are "deleted" (or whatever).
> > These deletes (or whatever) are not concurrent.
> > I have a lines like:
> > BD 0 [Wed Jul 27 13:28:48 GMT+01:00 2016; Thread-27879]: applyDeletes:
> > infos=...
> > BD 0 [Wed Jul 27 13:31:48 GMT+01:00 2016; Thread-27879]: applyDeletes
> took
> > 180028 msec
> > ...
> > BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes:
> > infos=...
> > BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes
> took
> > 3411845 msec
> >
> > 3411545 msec are about 56 minutes where the system is doing what???
> > At least not indexing because only one JAVA process and no I/O at all!
> >
> > How can SolrJ help me now with this problem?
> >
> > Best
> > Bernd
> >
> >
> > Am 27.07.2016 um 16:41 schrieb Erick Erickson:
> > > Well, at least it'll be easier to debug in my experience. Simple
> example.
> > > At some point you'll call CloudSolrClient.add(doc list). Comment just
> > that
> > > out and you'll be able to isolate whether the issue is querying the be
> or
> > > sending to Solr.
> > >
> > > Then CloudSolrClient (assuming SolrCloud) has efficiencies in terms of
> > > routing...
> > >
> > > Best
> > > Erick
> > >
> > > On Jul 27, 2016 7:24 AM, "Bernd Fehling" <
> bernd.fehl...@uni-bielefeld.de
> > >
> > > wrote:
> > >
> > >> So writing some SolrJ doing the same job as the DIH script
> > >> and using that concurrent will solve my problem?
> > >> I'm not using Tika.
> > >>
> > >> I don't think that DIH is my problem, even if it is not the best
> > solution
> > >> right now.
> > >> Nevertheless, you are right SolrJ has higher performance, but what
> > >> if I have the same problems with SolrJ like with DIH?
> > >>
> > >> If it runs with DIH it should run with SolrJ with additional
> performance
> > >> boost.
> > >>
> > >> Bernd
> > >>
> > >>
> > >> On 27.07.2016 at 16:03, Erick Erickson:
> > >>> I'd actually recommend you move to a SolrJ solution
> > >>> or similar. Currently, you're putting a load on the Solr
> > >>> servers (especially if you're also using Tika) in addition
> > >>> to all indexing etc.
> > >>>
> > >>> Here's a sample:
> > >>> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
> > >>>
> > >>> Dodging the question I know, but DIH sometimes isn't
> > >>> the best solution.
> > >>>
> > >>> Best,
> > >>> Erick
> > >>>
> > >>> On Wed, Jul 27, 2016 at 6:59 AM, Bernd Fehling
> > >>>  wrote:
> >  After enhancing the server with SSDs I'm trying to speed up
> indexing.
> > 
> >  The server has 16 CPUs and more than 100G RAM.
> >  JAVA (1.8.0_92) has 24G.
> >  SOLR is 4.10.4.
> >  Plain XML data to load is 218G with about 96M records.
> >  This will result in a single index of 299G.
> > 
> >  I tried with 4, 8, 12 and 16 concurrent DIHs.
> >  16 and 12 was to much because for 16 CPUs and my test continued
> with 8
> > >> concurrent DIHs.
> >  Then i was trying different  and 
> settings
> > >> but now I'm stuck.
> >  I can't figure out what is the best setting for bulk indexing.
> >  What I see is that the indexing is "falling asleep" after some time
> of
> > >> indexing.
> >  It is only producing del-files, like _11_1.del, _w_2.del,
> _h_3.del,...
> > 
> >  <indexConfig>
> >    <maxIndexingThreads>8</maxIndexingThreads>
> >    <ramBufferSizeMB>1024</ramBufferSizeMB>
> >    <maxBufferedDocs>-1</maxBufferedDocs>
> >    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
> >      <int name="maxMergeAtOnce">8</int>
> >      <int name="segmentsPerTier">100</int>
> >      <int name="maxMergedSegmentMB">512</int>
> >    </mergePolicy>
> >    <mergeFactor>8</mergeFactor>
> >    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
> >    <lockType>${solr.lock.type:native}</lockType>
> >    ...
> >  </indexConfig>
> >
> >  <updateHandler>
> >    <!-- no autocommit at all -->
> >    <autoSoftCommit>
> >      <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> >    </autoSoftCommit>
> >  </updateHandler>
> > 
> > 
> > 
> > >>
> >
> command=full-import&optimize=false&clean=false&commit=false&waitSearcher=false
> >  After indexing finishes there is a final optimize.
> > 
> >  My idea is, if 8 DIHs use 8 CPUs then I have 8 CPUs left for merging
> >  (maxIndexingThreads/maxMergeAtOnce/mergeFactor).
> >  It should do no commit, no optimize.
> >  ramBufferSizeMB is high because I have plenty of RAM and I want make
> > >> use the speed of RAM.
> >  segmentsPerTier is high to reduce merging.
> > 
> >  But somewhere is a misconfiguration because

Re: [Non-DoD Source] Re: Solr 6.1.0 issue (UNCLASSIFIED)

2016-08-06 Thread Susheel Kumar
As Erick mentioned, you may want to check your analysis chain and see whether
you are using *KeywordTokenizer* for the content field, or whether the content
field is a String type in your schema.xml. I have seen similar errors before
due to KeywordTokenizer being used.
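
For illustration, a minimal schema.xml sketch of the kind of change being
suggested (the "text_safe" type name and the 1024-character cutoff are
hypothetical; LengthFilterFactory is the filter Erick points to below):

  <field name="content" type="text_safe" indexed="true" stored="true"/>

  <fieldType name="text_safe" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- drop any token longer than 1024 characters so no single term can
           ever exceed the 32766-byte limit -->
      <filter class="solr.LengthFilterFactory" min="1" max="1024"/>
    </analyzer>
  </fieldType>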

Thanks,
Susheel

On Fri, Aug 5, 2016 at 11:46 PM, Erick Erickson 
wrote:

> You also need to find out _why_ you're trying to index such huge
> tokens, they indicate that something you're ingesting isn't
> reasonable
>
> Just truncating the input will index things, true. But a 32K token is
> unexpected, and indicates what's in your index may not be what you
> expect and may not be useful.
>
> But you know what you're indexing best, this is just a general statement.
>
> Erick
>
> On Fri, Aug 5, 2016 at 12:55 PM, Musshorn, Kris T CTR USARMY RDECOM
> ARL (US)  wrote:
> > CLASSIFICATION: UNCLASSIFIED
> >
> > What I did was force nutch to truncate content to 32765 max before
> indexing into solr and it solved my problem.
> >
> >
> > Thanks,
> > Kris
> >
> > ~~
> > Kris T. Musshorn
> > FileMaker Developer - Contractor – Catapult Technology Inc.
> > US Army Research Lab
> > Aberdeen Proving Ground
> > Application Management & Development Branch
> > 410-278-7251
> > kris.t.musshorn@mail.mil
> > ~~
> >
> >
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Friday, August 05, 2016 3:29 PM
> > To: solr-user 
> > Subject: [Non-DoD Source] Re: Solr 6.1.0 issue (UNCLASSIFIED)
> >
> > 
> >
> > what that error is telling you is that you have an unanalyzed term that
> is, well, huge (i.e. > 32K). Is your "content" field by chance a "string"
> type? It's very rare that a term > 32K is actually useful.
> > You can't search on it except with, say, wildcards; there's no stemming
> etc. So the first question is whether the "content" field is appropriately
> defined in your schema for your use case.
> >
> > If your content field is some kind of text-based field (i.e.
> > solr.Textfield), then the second issue may be that you just have wonky
> data coming in, say a base-64 encoded image or something scraped from
> somewhere. In that case you need to NOT index it. You can try Or try
> LengthFilterFactory, see:
> > https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
> >
> > This is a fundamental limitation enforced at the Lucene layer, so if
> that doesn't work, the only real solution is "don't do that". You'll have
> to intercept the doc and omit that data, perhaps write a custom update
> processor to throw out huge fields or the like.
> >
> > Best,
> > Erick
> >
> >
> > On Fri, Aug 5, 2016 at 10:59 AM, Musshorn, Kris T CTR USARMY RDECOM ARL
> (US)  wrote:
> >> CLASSIFICATION: UNCLASSIFIED
> >>
> >> I am trying to index from nutch 1.12 to SOLR 6.1.0.
> >> Got this error.
> >> java.lang.Exception:
> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> > Error from server at http://localhost:8983/solr/ARLInside:
> >> Exception writing document id
> >> https://emcstage.arl.army.mil/inside/fellows/corner/research.vol.3.2/index.cfm
> >> to the index; possible analysis error: Document
> >> contains at least one immense term in field="content" (whose UTF8
> >> encoding is longer than the max length 32766
> >>
> >> How to correct?
> >>
> >> Thanks,
> >> Kris
> >>
> >> ~~
> >> Kris T. Musshorn
> >> FileMaker Developer - Contractor - Catapult Technology Inc.
> >> US Army Research Lab
> >> Aberdeen Proving Ground
> >> Application Management & Development Branch
> >> 410-278-7251
> >> kris.t.musshorn@mail.mil
> >> ~~
> >>
> >>
> >>
> >> CLASSIFICATION: UNCLASSIFIED
> >
> >
> > CLASSIFICATION: UNCLASSIFIED
>


Need Permission to commit feature branch for Pull Request SOLR-8146

2016-08-09 Thread Susheel Kumar
Hello,

I created a feature branch for SOLR-8146 so that I can submit a pull request
(PR) for review. While pushing the feature branch I am getting the error
below. My GitHub id is susheel2...@gmail.com

Thanks,

Susheel

lucene-solr git:(SOLR-8146) git push origin SOLR-8146

Username for 'https://github.com': susheel2...@gmail.com

Password for 'https://susheel2...@gmail.com@github.com':

remote: Permission to apache/lucene-solr.git denied to sushil2777.


Re: Solr 6.1 :: language specific analysis

2016-08-10 Thread Susheel Kumar
BeiderMorse supports phonetic variations like Foto / Photo and has support for
many languages, including German. Please see
https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching
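
For illustration, a minimal schema.xml sketch of a Beider-Morse field type (the
"text_phonetic_de" name is hypothetical, and the attribute values follow the
Phonetic Matching page linked above; languageSet can also be set to "auto"):

  <fieldType name="text_phonetic_de" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- Beider-Morse phonetic encoding, restricted to German rules -->
      <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC"
              ruleType="APPROX" concat="true" languageSet="german"/>
    </analyzer>
  </fieldType>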

Thanks,
Susheel

On Wed, Aug 10, 2016 at 2:47 PM, Alexandre Drouin <
alexandre.dro...@orckestra.com> wrote:

> Can you use Solr's synonym feature?  You can find a German synonym file
> here: https://sites.google.com/site/kevinbouge/synonyms-lists
>
> Alexandre Drouin
>
> -Original Message-
> From: Rainer Gnan [mailto:rainer.g...@bsb-muenchen.de]
> Sent: Wednesday, August 10, 2016 10:21 AM
> To: solr-user@lucene.apache.org
> Subject: Solr 6.1 :: language specific analysis
>
> Hello,
>
> I wonder if Solr offers a feature (class) to handle different orthography
> versions?
> For the German language, for example ... in order to find the same
> documents when searching for "Foto" or "Photo".
>
> I appreciate any help!
>
> Rainer
>
>
> 
> Rainer Gnan
> Bayerische Staatsbibliothek
> BibliotheksVerbund Bayern
> Verbundnahe Dienste
> 80539 München
> Tel.: +49(0)89/28638-2445
> Fax: +49(0)89/28638-2665
> E-Mail: rainer.g...@bsb-muenchen.de
> 
>
>
>
>


Re: Want zero results from SOLR when there are no matches for "querystring"

2016-08-12 Thread Susheel Kumar
Not exactly sure what you are looking for from chaining the results, but
similar functionality is available in streaming expressions, where the result
of an inner expression is passed to the outer expression, and so on:
https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
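
As an illustration only (collection and field names are hypothetical), a
streaming expression nests one expression inside another, with both input
streams sorted on the join key:

  innerJoin(
    search(people, q="*:*",      fl="personId,name",    sort="personId asc"),
    search(pets,   q="type:cat", fl="personId,petName", sort="personId asc"),
    on="personId"
  )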

HTH
Susheel

On Fri, Aug 12, 2016 at 1:08 PM, John Bickerstaff 
wrote:

> Hossman - many thanks again for your comprehensive and very helpful answer!
>
> All,
>
> I am (possibly mis-remembering) reading something about being able to pass
> the results of one query to another query...  Essentially "chaining" result
> sets.
>
> I have looked in docs and can't find anything on a quick search -- I may
> have been reading about the Re-Ranking feature, which doesn't help me (I
> know because I just tried and it seems to return all results anyway, just
> re-ranking the number specified in the reRankDocs flag...)
>
> Is there a way to (cleanly) send the results of one query to another query
> for further processing?  Essentially, pass ONLY the results (including an
> empty set of results) to another query for processing?
>
> thanks...
>
> On Thu, Aug 11, 2016 at 6:19 PM, John Bickerstaff <
> j...@johnbickerstaff.com>
> wrote:
>
> > Thanks!
> >
> > To answer your questions, while I digest the rest of that information...
> >
> > I'm using the hon-lucene-synonyms.5.0.4.jar from here:
> > https://github.com/healthonnet/hon-lucene-synonyms
> >
> > The config looks like this - and IIRC, is simply a copy from the
> > recommended cofig on the site mentioned above.
> >
> >  
> > 
> > 
> >   
> >   
> > 
> > 
> >   solr.PatternTokenizerFactory
> >   
> > 
> > 
> > 
> >   solr.ShingleFilterFactory
> >   true
> >   true
> >   2
> >   4
> > 
> > 
> > 
> >   solr.SynonymFilterFactory
> >   solr.
> KeywordTokenizerFactory
> >   example_synonym_file.txt
> >   true
> >   true
> > 
> >   
> > 
> >   
> >
> >
> >
> > On Thu, Aug 11, 2016 at 6:01 PM, Chris Hostetter <
> hossman_luc...@fucit.org
> > > wrote:
> >
> >>
> >> : First let me say that this is very possibly the "x - y problem" so let
> >> me
> >> : state up front what my ultimate need is -- then I'll ask about the
> >> thing I
> >> : imagine might help...  which, of course, is heavily biased in the
> >> direction
> >> : of my experience coding Java and writing SQL...
> >>
> >> Thank you so much for asking your question this way!
> >>
> >> Right off the bat, the background you've provided seems suspicious...
> >>
> >> : I have a piece of a query that calculates a score based on a
> "weighting"
> >> ...
> >> : The specific line is this:
> >> : product(field(category_weight),20)
> >> :
> >> : What I just realized is that when I query Solr for a string that has
> NO
> >> : matches in the entire corpus, I still get a slew of results because
> >> EVERY
> >> : doc has the weighting value in the category_weight field - and
> therefore
> >> : every doc gets some score.
> >>
> >> ...that is *NOT* how dismax and edismax normally work.
> >>
> >> While both the "bf" and "bq" params result in "additive" boosting, and
> the
> >> implementation of that "additive boost" comes from adding new optional
> >> clauses to the top level BooleanQuery that is executed, that only
> happens
> >> after the "main" query (from your "q" param) is added to that top level
> >> BooleanQuery as a "mandatory" clause.
> >>
> >> So, for example, "bf=true()" and "bq=*:*" should match & boost every
> doc,
> >> but with the techproducts configs/data these requests still don't match
> >> anything...
> >>
> >> /select?defType=edismax&q=bogus&bf=true()&bq=*:*&debug=query
> >> /select?defType=dismax&q=bogus&bf=true()&bq=*:*&debug=query
> >>
> >> ...and if you look at the debug output, the parsed queries shows that
> the
> >> "bogus" part of the query is mandatory...
> >>
> >> +DisjunctionMaxQuery((text:bogus)) MatchAllDocsQuery(*:*)
> >> FunctionQuery(const(true))
> >>
> >> (i didn't use "pf" in that example, but the effect is the same, the "pf"
> >> based clauses are optional, while the "qf" based clauses are mandatory)
> >>
> >> If you compare that example to your debug output, you'll notice a
> >> difference in structure -- it's a bit hard to see in your example, but
> if
> >> you simplify your qf, pf, and q fields it should be more obvious, but
> >> AFAICT the "main" parts of your query are getting wrapped in an extra
> >> layer of parens (ie: an extra BooleanQuery) which is *not* mandatory in
> >> the top level query ... i don't see *any* mandatory clauses in your top
> >> level BooleanQuery, which is why any match on a bf or bq function is
> >> enough to cause a document to match.
> >>
> >> I suspect the reason your parsed query structure is so diff has to do
> with
> >> this...
> >>
> >> :synonym_edismax>
> >>
> >>
> >> 1) how exactly is "s

BasicAuthentication & blockUnknown Issue

2016-08-26 Thread Susheel Kumar
Hello,

I configured Solr for Basic Authentication with blockUnknown:true. It works
well and no issues are observed while ingesting into & querying the Solr Cloud
cluster, but in the Logging tab I see the errors below being logged.

I see SOLR-9188 and SOLR-8326 logged for similar issues. Is there any
workaround/fix so that I can avoid these errors in the log, even though the
cluster is working fine?

Thanks,
Susheel

 https://issues.apache.org/jira/browse/SOLR-9188
https://issues.apache.org/jira/browse/SOLR-8326


security.json
===
{
  "authentication":{
    "blockUnknown": true,
    "class":"solr.BasicAuthPlugin",
    "credentials":{"solr":"pv7VOv0Riny47Gg7B+dEbI6DZNx/2lP1ZRUkoU1zf+k= po+GSKNNGfRmlgWPfo8fOphw8EzVP0F+YfKUBrfNzQA="}},
  "authorization":{
    "class":"solr.RuleBasedAuthorizationPlugin",
    "permissions":[{
      "name":"security-edit",
      "role":"admin"}],
    "user-role":{"solr":"admin"}
  }}

Logging Errors
===
Time (Local)            Level  Core   Logger                   Message
8/25/2016, 11:50:45 PM  ERROR  false  PKIAuthenticationPlugin  Exception trying to get public key from : http://host1:8983/solr
8/25/2016, 11:50:45 PM  ERROR  false  PKIAuthenticationPlugin  Decryption failed, key must be wrong
8/25/2016, 11:50:45 PM  WARN   false  PKIAuthenticationPlugin  Failed to decrypt header, trying after refreshing the key
8/25/2016, 11:50:45 PM  ERROR  false  PKIAuthenticationPlugin  Exception trying to get public key from : http://host1:8983/solr
8/25/2016, 11:50:45 PM  ERROR  false  PKIAuthenticationPlugin  Decryption failed, key must be wrong
8/25/2016, 11:50:45 PM  ERROR  false  PKIAuthenticationPlugin  Could not decipher a header host1:8984_solr UE4VOAXkNYDmZGmMBe34VCPoWJgVoRU5IByP9TCS7bXP6QLTB37D9R6DNCWbeAzPOekQ3t7rB+8dS7YUxrGg== . No principal set


Re: Solrcloud with rest api

2016-09-21 Thread Susheel Kumar
This link has more info
http://lucene.472066.n3.nabble.com/Configure-SolrCloud-for-Loadbalance-for-net-client-td4280074.html


Another suggestion to consider: if you are able to use Java for developing the
search API/service/client, then please explore this option, since that will
make life easier and you would be able to use the SolrJ library to connect to
the SolrCloud cluster etc.

We implemented it this way before for a client, and they were then able to
call the service from .NET apps, though there is an extra hop.
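
For illustration, a minimal sketch of such a Java search service wrapping SolrJ
(the ZooKeeper hosts and collection name are hypothetical placeholders); since
CloudSolrClient is ZooKeeper-aware, the .NET clients only need to call this
service over HTTP and no separate load balancer is required on the Solr side:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchService {
    // hypothetical ZooKeeper ensemble and collection name
    private final CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");

    public SearchService() {
        client.setDefaultCollection("mycollection");
    }

    public QueryResponse search(String text) throws Exception {
        SolrQuery query = new SolrQuery(text);
        query.setRows(10);
        return client.query(query);   // routed via ZooKeeper cluster state
    }
}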

Thanks,
Susheel

On Wed, Sep 21, 2016 at 10:27 AM, Preeti Bhat 
wrote:

> Hi Jon,
>
> Thanks for quick reply. I have below questions:
> 1) zookeeper acts as load balancer right? 2)Do we need to setup separate
> load balancer to access the zookeeper for .net?
>
> Thanks,
> Preeti
>
> Sent from my Windows Phone
> 
> From: Jon Hawkesworth
> Sent: ‎21-‎09-‎2016 18:58
> To: solr-user@lucene.apache.org
> Subject: RE: Solrcloud with rest api
>
> Things may have changed but a couple of months ago when my colleague was
> working on something similar he reported that SolrNet was not able to talk
> to solrCloud via zookeeper, so we had to set up a load balancer so that our
> SolrNet client applications could send updates to our solrCloud cluster.
>
> Hope this helps,
>
> Jon
>
> -Original Message-
> From: Preeti Bhat [mailto:preeti.b...@shoregrp.com]
> Sent: Wednesday, September 21, 2016 2:12 PM
> To: solr-user@lucene.apache.org
> Subject: Solrcloud with rest api
>
> Hi All,
>
> We are trying to access the solrcloud using rest API in .net. But we are
> not sure how to access the zookeeper setup here. Could someone please guide
> us
>
> Thanks,
> Preeti
>
> Sent from my Windows Phone
>
>
>
>
>


Re: SolrJ App Engine Client

2016-09-22 Thread Susheel Kumar
As per this doc, sockets are allowed for paid apps. Not sure if this would
make them unrestricted.

https://cloud.google.com/appengine/docs/java/sockets/

On Thu, Sep 22, 2016 at 3:38 PM, Jay Parashar  wrote:

> I sent a similar message earlier but do not see it. Apologize if its
> duplicated.
>
> I am unable to connect to Solr Cloud zkhost (using CloudSolrClient) from a
> SolrJ client running on Google App Engine.
> The error message is "java.nio.channels.SocketChannel is a restricted
> class. Please see the Google  App Engine developer's guide for more
> details."
>
> Is there a workaround? Its required that the client is SolrJ and running
> on App Engine.
>
> Any feedback is much appreciated. Thanks
>


Re: Connect to SolrCloud using proxy in SolrJ

2016-09-29 Thread Susheel Kumar
As Vincenzo  mentioned above you shall try to check using telnet and if
connection fails, then you should try to set http proxy on terminal/command
line using this and then give try again with telnet.  As long as telnet
works, your code should also be able to connect

export http.proxy=http://<user>:<password>@<proxyhost>:<proxyport>


Thanks,

Susheel

On Thu, Sep 29, 2016 at 8:09 AM, Vincenzo D'Amore 
wrote:

> Well, how have you configured the *client-side* zookeeper connection? I mean
> from the SolrJ side.
>
> And again as simple check to see if the proxy is working correctly with
> zookeeper you could use telnet.
>
> You should use the proxy hostname and proxy port available for zookeeper
> (it should be different from the port used for http proxy)
>
> # telnet proxy-host-name proxy-port
>
> and when the connection is established write:
>
> ruok 
>
> > "The server will respond with imok if it is running."
>
> https://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html
>
>
> On Thu, Sep 29, 2016 at 1:40 PM, Preeti Bhat 
> wrote:
>
> > Hi Vincenzo,
> >
> > We have already verified that zookeeper is not working over http. We are
> > getting an error message stating that there is no response from the server,
> > for both proxy and non-proxy enabled browsers.
> > I have set up Zookeeper in AWS, and SOLR is connecting to Zookeeper using
> > zkCli.sh from SOLR. I have not made any changes in the zookeeper settings
> > other than specifying the data directory and server names in the quorum.
> >
> > Also, I tried applying the SOCKS proxy in my application and got the
> error
> > stating "Malformed reply from SOCKS server". I am currently working with
> > Network team to see if we have separate Socks proxy settings.
> >
> > Thanks and Regards,
> > Preeti Bhat
> >
> > -Original Message-
> > From: Vincenzo D'Amore [mailto:v.dam...@gmail.com]
> > Sent: Thursday, September 29, 2016 5:02 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Connect to SolrCloud using proxy in SolrJ
> >
> > Well, looking around I found at this issue
> > https://issues.apache.org/jira/browse/ZOOKEEPER-2250
> > As far as I know zookeeper doesn't support socks proxy (may be better ask
> > in the zookeeper forum).
> >
> > Anyway in your email you wrote that zookeeper is "whitelisted in the
> proxy
> > under TCP protocol", so I suppose your proxy is able to transparently
> > bridge tcp connections.
> >
> > Given that make sure your proxy configuration for zookeeper is not
> working
> > on http.
> >
> > Just to understand, how have you configured your zookeeper connection?
> >
> >
> >
> > On Thu, Sep 29, 2016 at 11:06 AM, Mikhail Khludnev 
> > wrote:
> >
> > > Zookeeper clients connect on tcp not http. Perhaps SOCKS proxy might
> > > help, but I don't know exactly.
> > >
> > > On Thu, Sep 29, 2016 at 11:55 AM, Preeti Bhat
> > > 
> > > wrote:
> > >
> > > > Hi Vincenzo,
> > > >
> > > > Yes, I have tried using the https protocol.  We are not able to
> > > > connect
> > > to
> > > > Zookeeper's.
> > > >
> > > > I am getting the below error message.
> > > >
> > > > Could not connect to ZooKeeper zkHost within 1 ms
> > > >
> > > > Thanks and Regards,
> > > > Preeti Bhat
> > > >
> > > > -Original Message-
> > > > From: Vincenzo D'Amore [mailto:v.dam...@gmail.com]
> > > > Sent: Thursday, September 29, 2016 1:57 PM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: Connect to SolrCloud using proxy in SolrJ
> > > >
> > > > Hi,
> > > >
> > > > not sure, have you tried to add proxy configuration for https ?
> > > >
> > > > System.setProperty("https.proxyHost", ProxyHost);
> > > > System.setProperty("https.proxyPort", ProxyPort);
> > > >
> > > >
> > > > Bests,
> > > > Vincenzo
> > > >
> > > > On Thu, Sep 29, 2016 at 10:12 AM, Preeti Bhat
> > > > 
> > > > wrote:
> > > >
> > > > > HI All,
> > > > >
> > > > > Pinging this again. Could someone please advise.
> > > > >
> > > > >
> > > > > Thanks and Regards,
> > > > > Preeti Bhat
> > > > >
> > > > > From: Preeti Bhat
> > > > > Sent: Wednesday, September 28, 2016 7:14 PM
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Connect to SolrCloud using proxy in SolrJ
> > > > >
> > > > > Hi All,
> > > > >
> > > > > I am trying to connect to the Solrcloud using the zookeeper host
> > > > > string in my java application(CloudSolrClient). I am able to
> > > > > connect to the solrcloud when there are no proxy settings needed,
> > > > > but when trying to connect to the code using proxy settings, I am
> > > > > getting the
> > > > below error.
> > > > >
> > > > >
> > > > > 1)  Without Proxy
> > > > >
> > > > > System.setProperty("javax.net.ssl.keyStore", keyStore);
> > > > >
> > > > > System.setProperty("javax.net.ssl.keyStorePassword",
> > > > > keyStorePsswd);
> > > > >
> > > > > System.setProperty("javax.net.ssl.trustStore", trustStore);
> > > > >
> > > > > System.setProperty("javax.net.ssl.trustStorePassword",
> > > > > trustStorePsswd);
> > > > >
> > > > >
> > > > > HttpClientBuilder builder = HttpClientBuilder.cre
