Re: SOLR Cloud: 1500+ threads are in TIMED_WAITING status

2018-04-05 Thread Emir Arnautović
Hi,
I’ve seen a similar jump in thread count when DBQ was used. Do you delete
documents while indexing?

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 5 Apr 2018, at 07:56, Doss  wrote:
> 
> @wunder
> 
> Are you sending updates in batches? Are you doing a commit after every
> update? 
> 
>>> We want the system to be near real time, so we are not doing updates in
>>> batches and we are not doing a commit after every update.
>>> autoSoftCommit runs once every minute, and autoCommit once every 10
>>> minutes.
> 
> This thread increase does not happen all the time. During our peak hours,
> when we get the most user interactions, the system works absolutely
> fine; then suddenly this problem creeps up and the system gets into trouble.
> 
> The nproc value was increased to 18000.
> 
> We did the Jetty-related Linux fine tuning described in the link below:
> 
> http://www.eclipse.org/jetty/documentation/current/high-load.html
> 
> Thanks.
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
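
For reference, a minimal solrconfig.xml sketch matching the commit cadence described above (one-minute soft commits for near-real-time visibility, ten-minute hard commits; openSearcher=false is an assumption, since the actual settings were not shown):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxTime>600000</maxTime>        <!-- hard commit every 10 minutes, flushes the transaction log -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>60000</maxTime>         <!-- soft commit every minute, makes updates searchable -->
    </autoSoftCommit>
  </updateHandler>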



RE: ZKPropertiesWriter error DIH (SolrCloud 6.6.1)

2018-04-05 Thread msaunier
I have used this process to create the DIH:

1. Create the .system (BLOB) collection:
* curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=.system"

2. Send the DIH and MySQL connector jars as blobs, and register them as runtime libs for the collection:
* curl -X POST -H 'Content-Type: application/octet-stream' --data-binary @solr-dataimporthandler-6.6.1.jar http://localhost:8983/solr/.system/blob/DataImportHandler
* curl -X POST -H 'Content-Type: application/octet-stream' --data-binary @mysql-connector-java-5.1.46.jar http://localhost:8983/solr/.system/blob/MySQLConnector
* curl http://localhost:8983/solr/advertisements2/config -H 'Content-type:application/json' -d '{"add-runtimelib": { "name":"DataImportHandler", "version":1 }}'
* curl http://localhost:8983/solr/advertisements2/config -H 'Content-type:application/json' -d '{"add-runtimelib": { "name":"MySQLConnector", "version":1 }}'

3. I added the requestHandler to the collection config via the Config API. Result:
###
  "/full-advertisements": {
"runtimeLib": true,
"version": 1,
"class": "org.apache.solr.handler.dataimport.DataImportHandler",
"defaults": {
  "config": "DIH/advertisements.xml"
},
"name": "/full-advertisements"
  },
###

4. I added the .xml definition file with the zkcli.sh script at
/configs/advertisements2/DIH/advertisements.xml
###
















###
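
A sketch of the zkcli command used for step 4 (the ZooKeeper address is an assumption; adjust -zkhost to the real ensemble):

  ./server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 \
      -cmd putfile /configs/advertisements2/DIH/advertisements.xml advertisements.xml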

Thanks for your help.


-----Original Message-----
From: msaunier [mailto:msaun...@citya.com]
Sent: Wednesday, 4 April 2018 09:57
To: solr-user@lucene.apache.org
Cc: fharr...@citya.com
Subject: ZKPropertiesWriter error DIH (SolrCloud 6.6.1)

Hello,
I use SolrCloud and I am testing the DIH system in cloud mode, but I get this error:

Full Import failed: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable
to PropertyWriter implementation:ZKPropertiesWriter
at org.apache.solr.handler.dataimport.DataImporter.createPropertyWriter(DataImporter.java:330)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:474)
at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:457)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:935)
at org.apache.solr.handler.dataimport.DataImporter.createPropertyWriter(DataImporter.java:326)
... 4 more

My DIH definition on the cloud


















Call response :
 

http://localhost:8983/solr/advertisements2/full-advertisements?command=full-import&clean=false&commit=true



0
2


true
1

DIH/advertisements.xml


full-import
idle




I don't understand why I get this error. Can you help me?
Thank you.

 





Re: SOLR Cloud: 1500+ threads are in TIMED_WAITING status

2018-04-05 Thread Doss
Hi Emir,

We do fire delete queries but that is very very minimal.

Thanks!
Mohandoss



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: some parent documents

2018-04-05 Thread Arturas Mazeika
Hi Mikhail et al,

Thanks a lot for sharing the code snippet. I would not have been able to
dig into this Java file myself to investigate the complexity of the search
query. Scanning the code I get the feeling that it is well structured and
well thought out. There are concepts like advance (Parent Approximation),
ParentPhaseTwo, matches, matchCost, BlockJoinScorer, Explanation, and
query rewriting. Is there documentation available on how the architecture
looks and what school of thought/doctrine is used here?

W.r.t. my complexity question, I expected to see an answer in Big-O
notation (rather than as Java code). Typically one makes assumptions there
about the key parameters (e.g., the number of Products N_P, the number of
SKUs N_Sk, the number of storages N_St, the number of vendors N_V, and the
JOIN selectivities (as percentages) p(P,SK), p(SK,ST), p(P,V) between the
corresponding entities) and computes a formula.

What is the complexity of this query in big-O notation?

Cheers,
Arturas



On Wed, Apr 4, 2018 at 6:16 PM, Mikhail Khludnev  wrote:

> >
> > What's happening under the hood of
> > solr in answering query [1] from [2]?
>
> https://github.com/apache/lucene-solr/blob/master/lucene/join/src/java/org/apache/lucene/search/join/ToParentBlockJoinQuery.java#L178
>
> On Wed, Apr 4, 2018 at 3:39 PM, Arturas Mazeika  wrote:
>
> > Hi Mikhail et al,
> >
> > Thanks a lot for a very thorough answer. This is an impressive piece of
> > knowledge you just shared.
> >
> > Not surprisingly, I was caught unprepared by the 'v=...' part of the
> > answer. This brought me to the links you posted (starts with http). From
> > those links I went to the more updated link (starts with https), which
> > brought me to other very resourceful links. Combined with some meditation
> > session, it came into my mind that it is not possible to express block
> > queries using mathematical logic only. The format of the input document
> is
> > deeply built into the query expression and answering. Expressing these
> > queries mathematically / logically may give an impression that solr is
> > capable of answering (NP-?) hard problems. I have a feeling though that
> > solr answers to queries in polynomial (or even almost linear) times.
> >
> > Just to connect the remaining dots.. What's happening under the hood of
> > solr in answering query [1] from [2]? Is it really so that inverted index
> > is used to identify the vectors of ids, that are scanned linearly in a
> hope
> > to get matches on _root_ and other internal variables?
> >
> > [1] q=+{!parent which=type_s:product v=$skuq} +{!parent
> > which=type_s:product v=$vendorq}&skuq=+COLOR_s:Blue +SIZE_s:XL +{!parent
> > which=type_s:sku v='+QTY_i:[10 TO *] +STATE_s:CA'}&vendorq=+NAME_s:Bob
> > +PRICE_i:[20 TO 25]
> > [2]
> > https://blog.griddynamics.com/searching-grandchildren-and-siblings-with-solr-block-join/
> >
> > Thanks!
> > Arturas
> >
> > On Wed, Apr 4, 2018 at 12:36 PM, Mikhail Khludnev 
> wrote:
> >
> > > q=+{!parent which=ntype:p v='+msg:Hello +person:Arturas'} +{!parent
> > which=
> > > ntype:p v='+msg:ciao +person:Vai'}
> > >
> > > On Wed, Apr 4, 2018 at 12:19 PM, Arturas Mazeika 
> > > wrote:
> > >
> > > > Hi Mikhail et al,
> > > >
> > > > It seems to me that the nested documents must include nodes that
> encode
> > > the
> > > > level of nodes (within the document). Therefore, the minimal example
> > must
> > > > include the node type. Is the following structure sufficient?
> > > >
> > > > {
> > > > "id":1,
> > > > "ntype":"p",
> > > > "_childDocuments_":
> > > > [
> > > > {"id":"1_1", "ntype":"c", "person":"Vai", "time":"3:14",
> > > > "msg":"Hello"},
> > > > {"id":"1_2", "ntype":"c", "person":"Arturas", "time":"3:14",
> > > > "msg":"Hello"},
> > > > {"id":"1_3", "ntype":"c", "person":"Vai", "time":"3:15",
> > > > "msg":"Coz Mathias is working on another system- different screen."},
> > > > {"id":"1_4", "ntype":"c", "person":"Vai", "time":"3:15",
> > > > "msg":"It can get annoying"},
> > > > {"id":"1_5", "ntype":"c", "person":"Arturas", "time":"3:15",
> > > > "msg":"Thank you. this is very nice of you"},
> > > > {"id":"1_6", "ntype":"c", "person":"Vai", "time":"3:16",
> > > > "msg":"ciao"},
> > > > {"id":"1_7", "ntype":"c", "person":"Arturas", "time":"3:16",
> > > > "msg":"ciao"}
> > > > ]
> > > > },
> > > > {
> > > > "id":2,
> > > > "ntype":"p",
> > > > "_childDocuments_":
> > > > [
> > > > {"id":"2_1", "ntype":"c", "person":"Vai", "time":"4:14",
> > > > "msg":"Hi"},
> > > > {"id":"2_2", "ntype":"c", "person":"Arturas", "time":"4:14",
> > > > "msg":"IBM Watson"},
> > > > {"id":"2_3", "ntype":"c", "person":"Vai", "time":"4:15",
> > > > "msg":"need to retain content"},
> > > > {"id":"2_4", "ntype":"c", "person":"Vai", "time":"4:15",
> > > > "msg":"It can get annoying"},
> > > > {"id":"

Re: SOLR Cloud: 1500+ threads are in TIMED_WAITING status

2018-04-05 Thread Emir Arnautović
Hi Mohandoss,
I would check to see if thread increase is correlated to DBQ since it does not 
play well with concurrent indexing: 
http://www.od-bits.com/2018/03/dbq-or-delete-by-query.html 


HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 5 Apr 2018, at 10:59, Doss  wrote:
> 
> Hi Emir,
> 
> We do fire delete queries but that is very very minimal.
> 
> Thanks!
> Mohandoss
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



How to create my schema and add document, thank you

2018-04-05 Thread Raymond Xie
I have the data ready for indexing now; it is a JSON file:

{"122": "20180320-08:08:35.038", "49": "VIPER", "382": "0", "151": "1.0",
"9": "653", "10071": "20180320-08:08:35.088", "15": "JPY", "56": "XSVC",
"54": "1", "10202": "APMKTMAKING", "10537": "XOSE", "10217": "Y", "48":
"179492540", "201": "1", "40": "2", "8": "FIX.4.4", "167": "OPT", "421":
"JPN", "10292": "115", "10184": "3379122", "456": "101", "11210":
"3379122", "1133": "G", "10515": "178", "10": "200", "11032": "-1",
"10436": "20180320-08:08:35.038", "10518": "178", "11":
"3379122", "75":
"20180320", "10005": "178", "10104": "Y", "35": "RIO", "10208":
"APAC.VIPER.OOE", "59": "0", "60": "20180320-08:08:35.088", "528": "P",
"581": "13", "1": "TEST", "202": "25375.0", "455": "179492540", "55":
"JNI253D8.OS", "100": "XOSE", "52": "20180320-08:08:35.088", "10241":
"viperooe", "150": "A", "10039": "viperooe", "39": "A", "10438": "RIO.4.5",
"38": "1", "37": "3379122", "372": "D", "660": "102", "44": "2.0",
"10066": "20180320-08:08:35.038", "29": "4", "50": "JPNIK01", "22": "101"}

You can inspect the json here: https://jsonformatter.org/

I need to create an index and enable searching on tags 37, 75 and 10242
(where available; this sample message doesn't have 10242).

My understanding is that I need to create the managed-schema file. I added two
fields as below:




Then I go back to Solr Admin, but I don't see the two new fields in the Schema
section.

Is there anything I am missing here? And once the two fields are in the
managed-schema, can I add the JSON file through upload in Solr Admin?

Thank you very much.


**
*Sincerely yours,*


*Raymond*


Re: How to create my schema and add document, thank you

2018-04-05 Thread Adhyan Arizki
Raymond,

1. Please ensure your Solr instance does indeed load up the correct
managed-schema file. You do not need to create the file; it should have
been created automatically out of the box in newer versions of Solr. You
just need to edit it.
2. Have you reloaded your instance/core after making the modification?
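
As a hedged alternative to hand-editing managed-schema, the fields can also be added through the Schema API; a sketch (the collection name and field attributes are assumptions, adjust to the real tag names such as 37 and 75):

  curl -X POST -H 'Content-type:application/json' --data-binary '{
    "add-field": {"name":"37", "type":"string", "indexed":true, "stored":true},
    "add-field": {"name":"75", "type":"string", "indexed":true, "stored":true}
  }' http://localhost:8983/solr/mycollection/schema

Once the fields exist, the JSON document could be posted with something like:

  curl 'http://localhost:8983/solr/mycollection/update/json/docs?commit=true' \
      -H 'Content-type:application/json' --data-binary @message.json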

On Thu, Apr 5, 2018 at 6:56 PM, Raymond Xie  wrote:

>  I have the data ready for index now, it is a json file:
>
> {"122": "20180320-08:08:35.038", "49": "VIPER", "382": "0", "151": "1.0",
> "9": "653", "10071": "20180320-08:08:35.088", "15": "JPY", "56": "XSVC",
> "54": "1", "10202": "APMKTMAKING", "10537": "XOSE", "10217": "Y", "48":
> "179492540", "201": "1", "40": "2", "8": "FIX.4.4", "167": "OPT", "421":
> "JPN", "10292": "115", "10184": "3379122", "456": "101", "11210":
> "3379122", "1133": "G", "10515": "178", "10": "200", "11032": "-1",
> "10436": "20180320-08:08:35.038", "10518": "178", "11":
> "3379122", "75":
> "20180320", "10005": "178", "10104": "Y", "35": "RIO", "10208":
> "APAC.VIPER.OOE", "59": "0", "60": "20180320-08:08:35.088", "528": "P",
> "581": "13", "1": "TEST", "202": "25375.0", "455": "179492540", "55":
> "JNI253D8.OS", "100": "XOSE", "52": "20180320-08:08:35.088", "10241":
> "viperooe", "150": "A", "10039": "viperooe", "39": "A", "10438": "RIO.4.5",
> "38": "1", "37": "3379122", "372": "D", "660": "102", "44": "2.0",
> "10066": "20180320-08:08:35.038", "29": "4", "50": "JPNIK01", "22": "101"}
>
> You can inspect the json here: https://jsonformatter.org/
>
> I need to create index and enable searching on tags: 37, 75 and 10242
> (where available, this sample message doesn't have it)
>
> My understanding is I need to create the file managed-schema, I added two
> fields as below:
>
>  multiValued="true"/>
>  stored="false" multiValued="true"/>
>
> Then I go back to Solr Admin, I don't see the two new fields in Schema
> section
>
> Anything I am missing here? and once the two fields are put in the
> managed-schema, can I add the json file through upload in Solr Admin?
>
> Thank you very much.
>
>
> **
> *Sincerely yours,*
>
>
> *Raymond*
>



-- 

Best regards,
Adhyan Arizki


Re: SOLR Cloud: 1500+ threads are in TIMED_WAITING status

2018-04-05 Thread Doss
Hi Emir,

Just realised DBQ = Delete By Query. We are not using that; we are deleting
documents using the document id / unique id.

Thanks,
Mohandoss.





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: ZK CLI script giving IOException doing upconfig

2018-04-05 Thread Doug Turnbull
Shawn, that's the ticket... I see where I screwed up now.

My upconfig was also trying to upload the data dir (I had used this as a
solr home in a standalone, non-cloud Solr); I was missing *conf* here:

-confdir solr_home/foo/

Changing to:

-confdir solr_home/foo/conf

works...
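
For completeness, a sketch of the corrected command (the zkhost and confname values are assumptions):

  ./server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181 \
      -cmd upconfig -confdir solr_home/foo/conf -confname foo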

I wonder too if there's anything that can be changed in zkcli to see if the
confdir is a reasonable configuration directory?

Thanks
-Doug


On Wed, Apr 4, 2018 at 3:51 PM Shawn Heisey  wrote:

> On 4/4/2018 12:13 PM, Doug Turnbull wrote:
> > Thanks for the responses. Yeah I thought they were weird errors too... :)
> >
> > Below are the logs from zookeeper running in foreground after a
> connection
> > attempt. But this Exception looks suspicous to me:
> >
> > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@383] -
> Exception
> > causing close of session 0x10024db7e280006: *Len error 5327937*
>
> With that information, I think I can tell you what went wrong.
>
> It looks like one of the files you're trying to upload is 5 megabytes in
> size.  ZooKeeper doesn't allow anything bigger than about 1 megabyte by
> default, because it is not designed for handling large amounts of data.
>
> I think that the ZK uploading functionality probably needs to check the
> size of what it is uploading against the max buffer setting and log a
> useful error message.
>
> You can get this to work, but to do so will require setting a system
> property on *all* ZK clients and servers.  The clients will include Solr
> itself and the zkcli script.  The system property to set is
> "jute.maxbuffer".  Info can be found in ZK documentation.
>
> https://zookeeper.apache.org/doc/r3.4.11/zookeeperAdmin.html
>
> Thanks,
> Shawn
>
> --
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug
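
As a hedged illustration of the jute.maxbuffer workaround described above (the value is arbitrary; per Shawn's note it must be set on every ZooKeeper server and every client, including Solr and the zkcli script):

  # ZooKeeper servers: wherever JVMFLAGS is set for the ensemble
  JVMFLAGS="$JVMFLAGS -Djute.maxbuffer=10485760"

  # Solr nodes: e.g. in solr.in.sh
  SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=10485760"

  # zkcli.sh: add the same -Djute.maxbuffer=... flag to the java invocation in the
  # script if it does not pick one up from the environment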


[ANNOUNCE] Solr Reference Guide for Solr 7.3 released

2018-04-05 Thread Cassandra Targett
The Lucene PMC is pleased to announce that the Solr Reference Guide for
Solr 7.3 is now available.

This 1,295 page PDF is the definitive guide to using Apache Solr, the
search server built on Apache Lucene.

The PDF Guide can be downloaded from:
https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/apache-solr-ref-guide-7.3.pdf

It is also available online at https://lucene.apache.org/solr/guide/7_3.


Re: Solr 7.1.0 - concurrent.ExecutionException building model

2018-04-05 Thread Joe Obernberger
Thank you Shawn - sorry so long to respond, been playing around with 
this a good bit.  It is an amazing capability.  It looks like it could 
be related to certain nodes in the cluster not responding quickly 
enough.  In one case, I got the concurrent.ExecutionException, but it 
looks like the root cause was a SocketTimeoutException.  I'm using HDFS 
for the index and it gets hit pretty hard by other processes running, 
and I'm wondering if that's causing this.


java.io.IOException: java.util.concurrent.ExecutionException: 
java.io.IOException: params 
expr=update(models,+batchSize%3D"50",train(MODEL1033_1522883727011,features(MODEL1033_1522883727011,q%3D"*:*",featureSet%3D"FSet_MODEL1033_1522883727011",field%3D"Text",outcome%3D"out_i",positiveLabel%3D1,numTerms%3D1000),q%3D"*:*",name%3D"MODEL1033",field%3D"Text",outcome%3D"out_i",maxIterations%3D"1000"))&qt=/stream&explain=true&q=*:*&fl=id&sort=id+asc&distrib=false
    at 
org.apache.solr.client.solrj.io.stream.CloudSolrStream.openStreams(CloudSolrStream.java:405)
    at 
org.apache.solr.client.solrj.io.stream.CloudSolrStream.open(CloudSolrStream.java:275)
    at 
com.ngc.bigdata.ie_solrmodelbuilder.SolrModelBuilderProcessor.doWork(SolrModelBuilderProcessor.java:114)
    at 
com.ngc.intelenterprise.intelentutil.utils.Processor.run(Processor.java:140)
    at 
com.ngc.intelenterprise.intelentutil.jms.IntelEntQueueProc.process(IntelEntQueueProc.java:208)
    at 
org.apache.camel.processor.DelegateSyncProcessor.process(DelegateSyncProcessor.java:63)
    at 
org.apache.camel.management.InstrumentationProcessor.process(InstrumentationProcessor.java:77)
    at 
org.apache.camel.processor.RedeliveryErrorHandler.process(RedeliveryErrorHandler.java:460)
    at 
org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:190)
    at 
org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:190)
    at 
org.apache.camel.component.direct.DirectProducer.process(DirectProducer.java:62)
    at 
org.apache.camel.processor.SendProcessor.process(SendProcessor.java:141)
    at 
org.apache.camel.management.InstrumentationProcessor.process(InstrumentationProcessor.java:77)
    at 
org.apache.camel.processor.RedeliveryErrorHandler.process(RedeliveryErrorHandler.java:460)
    at 
org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:190)
    at 
org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:190)
    at 
org.apache.camel.component.jms.EndpointMessageListener.onMessage(EndpointMessageListener.java:114)
    at 
org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:699)
    at 
org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:637)
    at 
org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:605)
    at 
org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:308)
    at 
org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:246)
    at 
org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1144)
    at 
org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1136)
    at 
org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:1033)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: 
params 
expr=update(models,+batchSize%3D"50",train(MODEL1033_1522883727011,features(MODEL1033_1522883727011,q%3D"*:*",featureSet%3D"FSet_MODEL1033_1522883727011",field%3D"Text",outcome%3D"out_i",positiveLabel%3D1,numTerms%3D1000),q%3D"*:*",name%3D"MODEL1033",field%3D"Text",outcome%3D"out_i",maxIterations%3D"1000"))&qt=/stream&explain=true&q=*:*&fl=id&sort=id+asc&distrib=false

    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at 
org.apache.solr.client.solrj.io.stream.CloudSolrStream.openStreams(CloudSolrStream.java:399)

    ... 27 more
Caused by: java.io.IOException: params 
expr=update(models,+batchSize%3D"50",train(MODEL1033_1522883727011,features(MODEL1033_1522883727011,q%3D"*:*",featureSet%3D"FSet_MODEL1

Re: ZK CLI script giving IOException doing upconfig

2018-04-05 Thread Shawn Heisey

On 4/5/2018 7:01 AM, Doug Turnbull wrote:

My upconfig was also trying to upload the data dir (I had used this as a
solr home in a standalone non cloud Solr), I'm missing *conf* here


Oops. :)  Uploading an entire core would be a problem! Glad you figured 
it out.



I wonder too if there's anything that can be changed in zkcli to see if the
confdir is a reasonable configuration directory?


The process could look at what's being uploaded and log a warning that 
says it didn't see any files with names that looked like a config, but I 
wouldn't want it to make any decisions (like aborting the upload before 
it begins) based on that information.


As an example of something we would want to allow:  It is perfectly 
acceptable to make a change to an existing config in zookeeper by 
uploading a directory with only one file in it, say a synonym list.  The 
upconfig action never *deletes* anything from the config in zookeeper.


Thanks,
Shawn



Re: [ANNOUNCE] Solr Reference Guide for Solr 7.3 released

2018-04-05 Thread Terry Steichen
I'm a bit confused because of the issue I was concerned about earlier: 
https://issues.apache.org/jira/browse/SOLR-11622
It was supposed to be fixed and included in (the then-future) 7.3, but I
don't see it there in the listed 7.3.0 changes/bug-fixes.
Am I missing something?


On 04/05/2018 10:05 AM, Cassandra Targett wrote:
> The Lucene PMC is pleased to announce that the Solr Reference Guide for
> Solr 7.3 is now available.
>
> This 1,295 page PDF is the definitive guide to using Apache Solr, the
> search server built on Apache Lucene.
>
> The PDF Guide can be downloaded from:
> https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/apache-solr-ref-guide-7.3.pdf
>
> It is also available online at https://lucene.apache.org/solr/guide/7_3.
>



Re: [ANNOUNCE] Solr Reference Guide for Solr 7.3 released

2018-04-05 Thread Steve Rowe
You’re missing Erick Erickson’s last comment on the issue[1]:

> Fixed as part of SOLR-11701


SOLR-11701[2] is listed in CHANGES[3].

[1] 
https://issues.apache.org/jira/browse/SOLR-11622?focusedCommentId=16303006&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16303006
[2] https://issues.apache.org/jira/browse/SOLR-11701
[3] 
https://lucene.apache.org/solr/7_3_0/changes/Changes.html#v7.3.0.other_changes
--
Steve
www.lucidworks.com

> On Apr 5, 2018, at 11:05 AM, Terry Steichen  wrote:
> 
> I'm a bit confused because of the issue I was concerned about earlier: 
> https://issues.apache.org/jira/browse/SOLR-11622
> It was supposed to be fixed and included in (the then-future) 7.3, but I
> don't see it there in the listed 7.3.0 changes/bug-fixes.
> Am I missing something?
> 
> 
> On 04/05/2018 10:05 AM, Cassandra Targett wrote:
>> The Lucene PMC is pleased to announce that the Solr Reference Guide for
>> Solr 7.3 is now available.
>> 
>> This 1,295 page PDF is the definitive guide to using Apache Solr, the
>> search server built on Apache Lucene.
>> 
>> The PDF Guide can be downloaded from:
>> https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/apache-solr-ref-guide-7.3.pdf
>> 
>> It is also available online at https://lucene.apache.org/solr/guide/7_3.
>> 
> 



Re: [ANNOUNCE] Solr Reference Guide for Solr 7.3 released

2018-04-05 Thread Shawn Heisey

On 4/5/2018 9:05 AM, Terry Steichen wrote:

I'm a bit confused because of the issue I was concerned about earlier:
https://issues.apache.org/jira/browse/SOLR-11622
It was supposed to be fixed and included in (the then-future) 7.3, but I
don't see it there in the listed 7.3.0 changes/bug-fixes.
Am I missing something?


One of the final comments in that issue says "Fixed as part of 
SOLR-11701".  That issue is listed in the CHANGES.txt.


Perhaps the changelog entry for SOLR-11701 should have mentioned any 
other issues that were also fixed by the commit.  In Erick's defense, 
I'll say this:  Making sure that everything for one issue gets handled 
correctly in a decent timeframe can be a little overwhelming.  Details 
like the fact that the commit for one issue also solves another issue 
are easy to miss until later.


Thanks,
Shawn



Re: [ANNOUNCE] Solr Reference Guide for Solr 7.3 released

2018-04-05 Thread Terry Steichen
OK, I guess this means this change has been included in 7.3.0. I really
appreciate what all of the committers do, so please don't take this the wrong way.

Even with this and the preceding comment, I find it difficult to clearly
follow these changes.  Perhaps, as Shawn suggests, any such
consolidation and/or early release might be reflected back in the
original change (11622).

Anyway, I'm a happy camper now.  Thanks to all.


On 04/05/2018 11:37 AM, Shawn Heisey wrote:
> On 4/5/2018 9:05 AM, Terry Steichen wrote:
>> I'm a bit confused because of the issue I was concerned about earlier:
>> https://issues.apache.org/jira/browse/SOLR-11622
>> It was supposed to be fixed and included in (the then-future) 7.3, but I
>> don't see it there in the listed 7.3.0 changes/bug-fixes.
>> Am I missing something?
>
> One of the final comments in that issue says "Fixed as part of
> SOLR-11701".  That issue is listed in the CHANGES.txt.
>
> Perhaps the changelog entry for SOLR-11701 should have mentioned any
> other issues that were also fixed by the commit.  In Erick's defense,
> I'll say this:  Making sure that everything for one issue gets handled
> correctly in a decent timeframe can be a little overwhelming.  Details
> like the fact that the commit for one issue also solves another issue
> are easy to miss until later.
>
> Thanks,
> Shawn
>
>



Basic Security Plugin and Collection Shard Distribution

2018-04-05 Thread Chris Ulicny
Hi all,

I've been periodically running into a strange permissions issue and finally
have some useful information on it. We've run into the issue on v6.3.0
and v7.X clusters.

Assume we have 2 hosts (1 instance on each) with 2 collections. Collection
c1 has 2 shards, and collection c2 has 1 shard. Each only has one copy of
each shard. The distribution is as follows:

host1: c1-shard1
host2: c1-shard2, c2-shard1

We have security enabled on it where the authorization section looks like:

  "authorization":{
"class":"solr.RuleBasedAuthorizationPlugin",
"permissions":[
  {"name":"read","role":"reader"},
  {"name":"security-read","role":"reader"},
  {"name":"schema-read","role":"reader"},
  {"name":"config-read","role":"reader"},
  {"name":"core-admin-read","role":"reader"},
  {"name":"collection-admin-read","role":"reader"},
  {"name":"update","role":"writer"},
  {"name":"security-edit","role":"admin"},
  {"name":"schema-edit","role":"admin"},
  {"name":"config-edit","role":"admin"},
  {"name":"core-admin-edit","role":"admin"},
  {"name":"collection-admin-edit","role":"admin"},
  {"name":"all","role":"admin"}],
"user-role":{
  "solradmin":["reader","writer","admin"],
  "solrreader":["reader"],
  "solrwriter":["reader","writer"]}}

When sending the query http://host1:8983/solr/c2/select?q=*:* as
solrreader or solrwriter a 403 response is returned

However, when sending the query as solradmin, the expected results are returned.

So what are we missing to allow the reader role to query a collection
that is part of the solrcloud instance, but not actually present on
the host?

Thanks,
Chris


Re: Copy field on dynamic fields?

2018-04-05 Thread Alexandre Rafalovitch
Have you tried reading existing example schemas? They show various
permutations of copy fields.

Regards,
Alex

On Thu, Apr 5, 2018, 2:54 AM jatin roy,  wrote:

> Any update?
> 
> From: jatin roy
> Sent: Tuesday, April 3, 2018 12:37 PM
> To: solr-user@lucene.apache.org
> Subject: Copy field on dynamic fields?
>
> Hi,
>
> Can we create copy field on dynamic fields? If yes then how it decide
> which field should be copied to which one?
>
> For example: if I have dynamic field: category_* and while indexing 4
> fields are formed such as:
> category_1
> category_2
> category_3
> category_4
> and now I have to copy the contents of already existing dynamic field
> "category_*" to "new_category_*".
>
> So my question is how the algorithm decides that category_1 data has to be
> indexed in new_category_1 ?
>
> Regards
> Jatin Roy
> Software developer
>
>


Re: Copy field on dynamic fields?

2018-04-05 Thread Chris Hostetter

: Have you tried reading existing example schemas? They show various
: permutations of copy fields.

Hmm... as the example schemas have been simplified/consolidated/purged it 
seems we have lost the specific examples that are relevant to the user's 
question -- the only instance of a glob'ed copyField in any of the 
configsets we ship is with a single destination field.

And the ref guide doesn't mention globs in copyField dest either? 
(created SOLR-12191)

Jatin: what you are asking about is 100% possible -- here are some examples 
from one of our test configs used specifically for testing copyField...

  
  
  

This ensures that any field name ending with "_dynamic" is also copied 
to an "equivalent" field name *starting* with "dynamic_"

so "1234_dynamic" gets copied to "dynamic_1234", "foo_dynamic" gets copied 
to "dynamic_foo" etc...

This "glob" pattern in copyFields also works even if the underlying fields 
are not dynamicField...

  
  
  
  

 so "sku1" and "sku2" will be each copied to "1_s" and "2_s" respectively 
... you could also mix & match that with a  if you wanted sku1 and sku2 to have special types, but some ohther more 
common type for other sku* fields.
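
And a sketch for that second case, again with assumed types:

  <field name="sku1" type="string" indexed="true" stored="true"/>
  <field name="sku2" type="string" indexed="true" stored="true"/>
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  <copyField source="sku*" dest="*_s"/>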






: Regards,
: Alex
: 
: On Thu, Apr 5, 2018, 2:54 AM jatin roy,  wrote:
: 
: > Any update?
: > 
: > From: jatin roy
: > Sent: Tuesday, April 3, 2018 12:37 PM
: > To: solr-user@lucene.apache.org
: > Subject: Copy field on dynamic fields?
: >
: > Hi,
: >
: > Can we create copy field on dynamic fields? If yes then how it decide
: > which field should be copied to which one?
: >
: > For example: if I have dynamic field: category_* and while indexing 4
: > fields are formed such as:
: > category_1
: > category_2
: > category_3
: > category_4
: > and now I have to copy the contents of already existing dynamic field
: > "category_*" to "new_category_*".
: >
: > So my question is how the algorithm decides that category_1 data has to be
: > indexed in new_category_1 ?
: >
: > Regards
: > Jatin Roy
: > Software developer
: >
: >
: 

-Hoss
http://www.lucidworks.com/


Re: PreAnalyzed URP and SchemaRequest API

2018-04-05 Thread David Smiley
Is this really a problem when you could easily enough create a TextField
and call setTokenStream?

Does your remote client have Solr-core and all its dependencies on the
classpath?   That's one way to do it... and presumably the direction you
are going because you're asking how to work with PreAnalyzedParser which is
in solr-core.  *Alternatively*, only bring in Lucene core and construct
things yourself in the right format.  You could copy PreAnalyzedParser into
your codebase so that you don't have to reinvent any wheels, even though
that's awkward.  Perhaps that ought to be in Solrj?  But no we don't want
SolrJ depending on Lucene-core, though it'd make a fine "optional"
dependency.

On Wed, Apr 4, 2018 at 4:53 AM Markus Jelsma 
wrote:

> Hello,
>
> We intend to move to PreAnalyzed URP for analysis offloading. Browsing the
> Javadocs I came across the SchemaRequest API looking for a way to get a
> Field object remotely, which I seem to need for
> JsonPreAnalyzedParser.toFormattedString(Field f). But all I can get from
> the SchemaRequest API is FieldTypeRepresentation, which offers me
> getIndexAnalyzer() but won't allow me to construct a Field object.
> 
> So, to analyze remotely I do need an index-time analyzer. I can get it,
> but not turn it into a Field object, which the PreAnalyzedParser for some
> reason wants.
> 
> Any hints here? I must be looking the wrong way.
>
> Many thanks!
> Markus
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: Largest number of indexed documents used by Solr

2018-04-05 Thread Kelly, Frank
For us we have ~ 350M documents stored using r3.xlarge nodes with 8GB Heap
and about 31GB of RAM

We are using Solr 5.3.1 in a SolrCloud setup (3 collections, each with 3
shards and 3 replicas).

For us, lots of RAM is not as important as CPU (as the EBS disk we
run on top of is quite fast and our memory hit rate is quite low).

Some things that helped
1) Turned off the filter cache (it required too much heap)
2) Set a limit on replication bandwidth (when nodes are recovering they
can tie up a lot of CPU), in particular maxWriteMBPerSec=100
3) Set query timeout to 2 seconds to help kill "heavy" queries
4) Set preferLocalShards=true to help mitigate when any EC2 nodes are
having a "noisy neighbor"
5) We implemented our own CloudWatch based monitoring so that when Solr VM
CPU is high (> 90%) we queue up indexing traffic rather than send it to be
indexed.
We found that if you peg Solr CPU for too long replicas can't keep up,
they go into recovery, which drives CPU even higher and eventually the
cluster thinks the nodes are "down" when they repeatedly fail at recovery.
So we really try to manage Solr CPU load (we'll probably look to switching
to compute optimized nodes in the future)
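
For reference, items 3 and 4 above map onto standard request parameters; a sketch (host and collection names are assumptions):

  curl "http://host1:8983/solr/mycollection/select?q=*:*&timeAllowed=2000&preferLocalShards=true"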

Best

-Frank


On 4/3/18, 9:12 PM, "Steven White"  wrote:

>Hi everyone,
>
>I'm about to start a project that requires indexing 36 million records
>using Solr 7.2.1.  Each record range from 500 KB to 0.25 MB where the
>average is 0.1 MB.
>
>Has anyone indexed this number of records?  What are the things I should
>worry about?  And out of curiosity, what is the largest number of records
>that Solr has indexed which is published out there?
>
>Thanks
>
>Steven



Getting "zip bomb" exception while sending HTML document to solr

2018-04-05 Thread Hanjan, Harinder
Hello!

I'm sending an HTML document to Solr and Tika is throwing the "Zip bomb 
detected!" exception back. It looks like Tika has an arbitrary limit of 100 levels 
of XML element nesting 
(https://github.com/apache/tika/blob/9130bbc1fa6d69419b2ad294917260d6b1cced08/tika-core/src/main/java/org/apache/tika/sax/SecureContentHandler.java#L72-L75).
Luckily, the variable (maxDepth) does have a public setter function, but I am 
not sure if it's possible to set this in Solr.  Is it possible? If so, how 
would I set the value of maxDepth to a higher number?

Thanks!

Here is the full stack trace:
2018-04-05 16:47:48.034 ERROR (qtp1654589030-15) [   x:aconn] 
o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: 
org.apache.tika.exception.TikaException: Zip bomb detected!
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at 
ca.calgary.csc.wds.solr.GsaAconnRequestHandler.handleRequestBody(GsaAconnRequestHandler.java:84)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2503)
at 
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:710)
at 
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1751)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at 
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at 
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
at 
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
at 
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.tika.exception.TikaException: Zip bomb detected!
at 
org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:138)
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
... 35 more
Caused by: org.apache.tika.sax.SecureContentHandler$SecureSAXException: 
Suspected zip bomb: 100 levels of XML element nesting
at 
org.apache.tika.sax

Re: Largest number of indexed documents used by Solr

2018-04-05 Thread Joe Obernberger

50 billion per day?  Wow!  How large are these documents?

We have a cluster with one large collection that contains 2.4 billion 
documents spread across 40 machines, using HDFS for the index.  We store 
our data inside of HBase, and in order to re-index data we pull from 
HBase and index with SolrCloud.  The most we can do is around 57 million 
per day, usually limited by pulling data out of HBase, not Solr.


-Joe


On 4/4/2018 10:57 PM, 苗海泉 wrote:

When we have 49 shards per collection and more than 600 collections,
Solr has serious performance problems. I don't know how to deal with
them. My advice to you is to minimize the number of collections.
Our environment is 49 Solr server nodes, each with 32 CPUs / 128 GB, and the data
volume is about 50 billion documents per day.




2018-04-04 9:23 GMT+08:00 Yago Riveiro :


Hi,

In my company we are running a 12 node cluster with 10 (american) Billion
documents 12 shards / 2 replicas.

We do mainly faceting queries with a very reasonable performance.

36 million documents is not an issue; you can handle that volume of
documents with 2 nodes with SSDs and 32G of RAM.

Regards.

--

Yago Riveiro

On 4 Apr 2018 02:15 +0100, Abhi Basu <9000r...@gmail.com>, wrote:

We have tested Solr 4.10 with 200 million docs with avg doc size of 250 KB.

No issues with performance when using 3 shards / 2 replicas.



On Tue, Apr 3, 2018 at 8:12 PM, Steven White wrote:

Hi everyone,

I'm about to start a project that requires indexing 36 million records
using Solr 7.2.1. Each record range from 500 KB to 0.25 MB where the
average is 0.1 MB.

Has anyone indexed this number of records? What are the things I should
worry about? And out of curiosity, what is the largest number of records
that Solr has indexed which is published out there?

Thanks

Steven




--
Abhi Basu







Storing Ranking Scores And Documents In Separate Indices

2018-04-05 Thread Huynh, Quynh
Hey Solr Community,

We have a collection of product documents that we’d like to add fields to with 
ranking scores generated by our data scientists.

Two options we’re considering are to either:
-  Have a separate index that contains all the documents from our 
product index, but with these additional ranking fields
-  Have an index with just the score fields and a numerical key to 
represent the product that would require a separate lookup

We wanted to know if any Solr users with a similar problem have tried either of 
those options (and the performance implications you faced), or had a different 
approach to structuring documents in separate collections, where the only 
difference between the documents was the ranking fields.


Thanks!
Quynh


RE: Storing Ranking Scores And Documents In Separate Indices

2018-04-05 Thread Markus Jelsma
Hello Quynh,

Solr has support for external file fields [1]. They are a simple key=float 
based text file where key is ID, and the float can be used for boosting/scoring 
documents. This is a much simpler approach than using a separate collection. 
These files can be reloaded every commit and are really easy to use. We use 
them for boosting documents by their popularity.

Hope that helps,
Markus

[1]  
https://lucene.apache.org/solr/guide/6_6/working-with-external-files-and-processes.html
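
A minimal sketch of what that looks like (field and file names here are assumptions):

  In the schema:
    <fieldType name="rankFile" keyField="id" defVal="0" stored="false" indexed="false" class="solr.ExternalFileField"/>
    <field name="rank" type="rankFile"/>

  In the index data directory, a file named external_rank with one line per document:
    doc1=0.85
    doc2=0.42

  To reload the file on each commit, register the reloader listeners in solrconfig.xml:
    <listener event="newSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
    <listener event="firstSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>

  The values are then usable in function queries, e.g. sort=field(rank) desc or as a boost function.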
 
-Original message-
> From:Huynh, Quynh 
> Sent: Thursday 5th April 2018 22:50
> To: solr-user@lucene.apache.org
> Cc: Collazo, Carlos ; Ganesan, VinothKumar 
> 
> Subject: Storing Ranking Scores And Documents In Separate Indices
> 
> Hey Solr Community,
> 
> We have a collection of product documents that we’d like to add fields to 
> with ranking scores generated by our data scientists.
> 
> Two options we’re considering is to either:
> -  Have a separate index that contains all the documents from our 
> product index, but with these additional ranking fields
> -  Have an index with just the score fields and a numerical key to 
> represent the product that would require a separate lookup
> 
> We wanted to know if any Solr users with a similar problem has tried either 
> of those options (and the performance implications you faced), or had a 
> different approach to structuring documents in separate collections, where 
> the only difference between the documents was the ranking fields.
> 
> 
> Thanks!
> Quynh
> 


Re: Solr 7.1.0 - concurrent.ExecutionException building model

2018-04-05 Thread Joe Obernberger
I tried to build a large model based on about 1.2 million documents.  
One of the nodes ran out of memory and killed itself. Is this much data 
not reasonable to use?  The nodes have 16g of heap.  Happy to increase 
it, but not sure if this is possible?


Thank you!

-Joe


On 4/5/2018 10:24 AM, Joe Obernberger wrote:
Thank you Shawn - sorry so long to respond, been playing around with 
this a good bit.  It is an amazing capability.  It looks like it could 
be related to certain nodes in the cluster not responding quickly 
enough.  In one case, I got the concurrent.ExecutionException, but it 
looks like the root cause was a SocketTimeoutException.  I'm using 
HDFS for the index and it gets hit pretty hard by other processes 
running, and I'm wondering if that's causing this.


java.io.IOException: java.util.concurrent.ExecutionException: 
java.io.IOException: params 
expr=update(models,+batchSize%3D"50",train(MODEL1033_1522883727011,features(MODEL1033_1522883727011,q%3D"*:*",featureSet%3D"FSet_MODEL1033_1522883727011",field%3D"Text",outcome%3D"out_i",positiveLabel%3D1,numTerms%3D1000),q%3D"*:*",name%3D"MODEL1033",field%3D"Text",outcome%3D"out_i",maxIterations%3D"1000"))&qt=/stream&explain=true&q=*:*&fl=id&sort=id+asc&distrib=false
    at 
org.apache.solr.client.solrj.io.stream.CloudSolrStream.openStreams(CloudSolrStream.java:405)
    at 
org.apache.solr.client.solrj.io.stream.CloudSolrStream.open(CloudSolrStream.java:275)
    at 
com.ngc.bigdata.ie_solrmodelbuilder.SolrModelBuilderProcessor.doWork(SolrModelBuilderProcessor.java:114)
    at 
com.ngc.intelenterprise.intelentutil.utils.Processor.run(Processor.java:140)
    at 
com.ngc.intelenterprise.intelentutil.jms.IntelEntQueueProc.process(IntelEntQueueProc.java:208)
    at 
org.apache.camel.processor.DelegateSyncProcessor.process(DelegateSyncProcessor.java:63)
    at 
org.apache.camel.management.InstrumentationProcessor.process(InstrumentationProcessor.java:77)
    at 
org.apache.camel.processor.RedeliveryErrorHandler.process(RedeliveryErrorHandler.java:460)
    at 
org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:190)
    at 
org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:190)
    at 
org.apache.camel.component.direct.DirectProducer.process(DirectProducer.java:62)
    at 
org.apache.camel.processor.SendProcessor.process(SendProcessor.java:141)
    at 
org.apache.camel.management.InstrumentationProcessor.process(InstrumentationProcessor.java:77)
    at 
org.apache.camel.processor.RedeliveryErrorHandler.process(RedeliveryErrorHandler.java:460)
    at 
org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:190)
    at 
org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:190)
    at 
org.apache.camel.component.jms.EndpointMessageListener.onMessage(EndpointMessageListener.java:114)
    at 
org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:699)
    at 
org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:637)
    at 
org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:605)
    at 
org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:308)
    at 
org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:246)
    at 
org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1144)
    at 
org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1136)
    at 
org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:1033)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: 
java.io.IOException: params 
expr=update(models,+batchSize%3D"50",train(MODEL1033_1522883727011,features(MODEL1033_1522883727011,q%3D"*:*",featureSet%3D"FSet_MODEL1033_1522883727011",field%3D"Text",outcome%3D"out_i",positiveLabel%3D1,numTerms%3D1000),q%3D"*:*",name%3D"MODEL1033",field%3D"Text",outcome%3D"out_i",maxIterations%3D"1000"))&qt=/stream&explain=true&q=*:*&fl=id&sort=id+asc&distrib=false

    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(Fu

Re: Storing Ranking Scores And Documents In Separate Indices

2018-04-05 Thread Erick Erickson
Also, Solr has updateable docValues fields (single-valued only) that
may be another alternative.

Best,
Erick
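
A sketch of that route (field name, type, and collection are assumptions; in-place updates require the field to be single-valued, non-indexed, non-stored, with docValues enabled):

  In the schema:
    <field name="rank_score" type="pfloat" indexed="false" stored="false" docValues="true"/>

  Then a score can be set atomically, without resending the rest of the document:
    curl -X POST -H 'Content-Type: application/json' \
        'http://localhost:8983/solr/products/update?commit=true' \
        --data-binary '[{"id":"SKU-123", "rank_score":{"set":0.87}}]'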

On Thu, Apr 5, 2018 at 1:59 PM, Markus Jelsma
 wrote:
> Hello Quynh,
>
> Solr has support for external file fields [1]. They are a simple key=float 
> based text file where key is ID, and the float can be used for 
> boosting/scoring documents. This is a much simpler approach than using a 
> separate collection. These files can be reloaded every commit and are really 
> easy to use. We use them for boosting documents by their popularity.
>
> Hope that helps,
> Markus
>
> [1]  
> https://lucene.apache.org/solr/guide/6_6/working-with-external-files-and-processes.html
>
> -Original message-
>> From:Huynh, Quynh 
>> Sent: Thursday 5th April 2018 22:50
>> To: solr-user@lucene.apache.org
>> Cc: Collazo, Carlos ; Ganesan, VinothKumar 
>> 
>> Subject: Storing Ranking Scores And Documents In Separate Indices
>>
>> Hey Solr Community,
>>
>> We have a collection of product documents that we’d like to add fields to 
>> with ranking scores generated by our data scientists.
>>
>> Two options we’re considering is to either:
>> -  Have a separate index that contains all the documents from our 
>> product index, but with these additional ranking fields
>> -  Have an index with just the score fields and a numerical key to 
>> represent the product that would require a separate lookup
>>
>> We wanted to know if any Solr users with a similar problem has tried either 
>> of those options (and the performance implications you faced), or had a 
>> different approach to structuring documents in separate collections, where 
>> the only difference between the documents was the ranking fields.
>>
>>
>> Thanks!
>> Quynh
>>


Re: Solr 7.1.0 - concurrent.ExecutionException building model

2018-04-05 Thread Joel Bernstein
Hi Joe,

Currently you will eventually run into memory problems if the training set
gets too large. Under the covers, on each node it is creating a matrix with
a row for each document and a column for each feature. This can get large
quite quickly. By choosing fewer features you can make this matrix much
smaller.

It's fairly easy to make the train function work on a random sample of the
training set on each iteration rather than the entire training set, but
currently this is not how it's implemented. Feel free to create a ticket
requesting the sampling approach.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Apr 5, 2018 at 5:32 PM, Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> I tried to build a large model based on about 1.2 million documents.  One
> of the nodes ran out of memory and killed itself. Is this much data not
> reasonable to use?  The nodes have 16g of heap.  Happy to increase it, but
> not sure if this is possible?
>
> Thank you!
>
> -Joe
>
>
>
> On 4/5/2018 10:24 AM, Joe Obernberger wrote:
>
>> Thank you Shawn - sorry so long to respond, been playing around with this
>> a good bit.  It is an amazing capability.  It looks like it could be
>> related to certain nodes in the cluster not responding quickly enough.  In
>> one case, I got the concurrent.ExecutionException, but it looks like the
>> root cause was a SocketTimeoutException.  I'm using HDFS for the index and
>> it gets hit pretty hard by other processes running, and I'm wondering if
>> that's causing this.
>>
>> java.io.IOException: java.util.concurrent.ExecutionException:
>> java.io.IOException: params expr=update(models,+batchSize%
>> 3D"50",train(MODEL1033_1522883727011,features(MODEL1033_
>> 1522883727011,q%3D"*:*",featureSet%3D"FSet_MODEL1033_
>> 1522883727011",field%3D"Text",outcome%3D"out_i",
>> positiveLabel%3D1,numTerms%3D1000),q%3D"*:*",name%3D"MODEL10
>> 33",field%3D"Text",outcome%3D"out_i",maxIterations%3D"1000")
>> )&qt=/stream&explain=true&q=*:*&fl=id&sort=id+asc&distrib=false
>> at org.apache.solr.client.solrj.io.stream.CloudSolrStream.openS
>> treams(CloudSolrStream.java:405)
>> at org.apache.solr.client.solrj.io.stream.CloudSolrStream.open(
>> CloudSolrStream.java:275)
>> at com.ngc.bigdata.ie_solrmodelbuilder.SolrModelBuilderProcesso
>> r.doWork(SolrModelBuilderProcessor.java:114)
>> at com.ngc.intelenterprise.intelentutil.utils.Processor.run(
>> Processor.java:140)
>> at com.ngc.intelenterprise.intelentutil.jms.IntelEntQueueProc.
>> process(IntelEntQueueProc.java:208)
>> at org.apache.camel.processor.DelegateSyncProcessor.process(Del
>> egateSyncProcessor.java:63)
>> at org.apache.camel.management.InstrumentationProcessor.process
>> (InstrumentationProcessor.java:77)
>> at org.apache.camel.processor.RedeliveryErrorHandler.process(Re
>> deliveryErrorHandler.java:460)
>> at org.apache.camel.processor.CamelInternalProcessor.process(Ca
>> melInternalProcessor.java:190)
>> at org.apache.camel.processor.CamelInternalProcessor.process(Ca
>> melInternalProcessor.java:190)
>> at org.apache.camel.component.direct.DirectProducer.process(Dir
>> ectProducer.java:62)
>> at org.apache.camel.processor.SendProcessor.process(SendProcess
>> or.java:141)
>> at org.apache.camel.management.InstrumentationProcessor.process
>> (InstrumentationProcessor.java:77)
>> at org.apache.camel.processor.RedeliveryErrorHandler.process(Re
>> deliveryErrorHandler.java:460)
>> at org.apache.camel.processor.CamelInternalProcessor.process(Ca
>> melInternalProcessor.java:190)
>> at org.apache.camel.processor.CamelInternalProcessor.process(Ca
>> melInternalProcessor.java:190)
>> at org.apache.camel.component.jms.EndpointMessageListener.onMes
>> sage(EndpointMessageListener.java:114)
>> at org.springframework.jms.listener.AbstractMessageListenerCont
>> ainer.doInvokeListener(AbstractMessageListenerContainer.java:699)
>> at org.springframework.jms.listener.AbstractMessageListenerCont
>> ainer.invokeListener(AbstractMessageListenerContainer.java:637)
>> at org.springframework.jms.listener.AbstractMessageListenerCont
>> ainer.doExecuteListener(AbstractMessageListenerContainer.java:605)
>> at org.springframework.jms.listener.AbstractPollingMessageListe
>> nerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.
>> java:308)
>> at org.springframework.jms.listener.AbstractPollingMessageListe
>> nerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.
>> java:246)
>> at org.springframework.jms.listener.DefaultMessageListenerConta
>> iner$AsyncMessageListenerInvoker.invokeListener(DefaultMessageLis
>> tenerContainer.java:1144)
>> at org.springframework.jms.listener.DefaultMessageListenerConta
>> iner$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessag
>> eListenerContainer.java:1136)
>> at org.springframew

Re: Solr 7.1.0 - concurrent.ExecutionException building model

2018-04-05 Thread Joe Obernberger
Thank you Joel.  I gave each node in the cluster 24g of heap and it ran, 
but then failed on the 50th iteration (was trying to do 1,000).


This time, I have the error on the node and the exception from the 
client running the stream command.  The node (Doris) has 3 errors that 
occurred at the same time logged:


java.io.IOException: java.io.IOException: 
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at 
org.apache.solr.client.solrj.io.stream.TextLogitStream.read(TextLogitStream.java:498)
    at 
org.apache.solr.client.solrj.io.stream.PushBackStream.read(PushBackStream.java:87)
    at 
org.apache.solr.client.solrj.io.stream.UpdateStream.read(UpdateStream.java:109)
    at 
org.apache.solr.client.solrj.io.stream.ExceptionStream.read(ExceptionStream.java:68)
    at 
org.apache.solr.handler.StreamHandler$TimerStream.read(StreamHandler.java:627)
    at 
org.apache.solr.client.solrj.io.stream.TupleStream.lambda$writeMap$0(TupleStream.java:87)
    at 
org.apache.solr.response.JSONWriter.writeIterator(JSONResponseWriter.java:523)
    at 
org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:180)
    at 
org.apache.solr.response.JSONWriter$2.put(JSONResponseWriter.java:559)
    at 
org.apache.solr.client.solrj.io.stream.TupleStream.writeMap(TupleStream.java:84)
    at 
org.apache.solr.response.JSONWriter.writeMap(JSONResponseWriter.java:547)
    at 
org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:198)
    at 
org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:209)
    at 
org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:325)
    at 
org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:120)
    at 
org.apache.solr.response.JSONResponseWriter.write(JSONResponseWriter.java:71)
    at 
org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:65)
    at 
org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:806)

    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:535)
    at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
    at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
    at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1751)
    at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
    at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
    at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
    at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
    at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
    at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
    at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
    at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at 
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
    at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)

    at org.eclipse.jetty.server.Server.handle(Server.java:534)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
    at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
    at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)

    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
    at 
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
    at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
    at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
    at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
    at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
    at 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)

    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.lang.IndexOutOfBoundsException: 
Index: 0, Size: 0
    at 
org.apache.solr.client.solrj.io.stream.TextLogitStream.getShardUrls(TextLogitStream.java:365)
    at 
org.apache.solr.client.solrj.io.stream.TextLogitStream.read(TextLogitStream.java:457)

    ... 47 more
Caused by: java.la

Data import batch mode for delta

2018-04-05 Thread gadelkareem
Why does the deltaImportQuery use "where id='${dataimporter.id}'" instead of
something like "where id IN ('${dataimporter.id}')"?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Data import batch mode for delta

2018-04-05 Thread Shawn Heisey

On 4/5/2018 7:31 PM, gadelkareem wrote:

Why the deltaImportQuery uses "where id='${dataimporter.id}'" instead of
something like where id IN ('${dataimporter.id})'


Because there's only one value for that property.

If the deltaQuery returns a million rows, then deltaImportQuery is going 
to be executed a million times.  Once for each row returned by the 
deltaQuery.
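
A sketch of the pattern under discussion (table and column names are assumptions):

  <entity name="item"
          query="SELECT * FROM item"
          deltaQuery="SELECT id FROM item WHERE last_modified > '${dataimporter.last_index_time}'"
          deltaImportQuery="SELECT * FROM item WHERE id='${dataimporter.delta.id}'"/>

  deltaQuery runs once and returns the ids of the changed rows; deltaImportQuery then runs
  once per returned id, substituting it into ${dataimporter.delta.id}.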


That IS as inefficient as it sounds.  Think of the dataimport handler as 
a stop-gap solution -- to help you get started with loading data from a 
database, until you can write a proper application to do your indexing.


Thanks,
Shawn