Re: inconsistent result count when doing paging

2017-02-09 Thread cmti95035
Thanks Shawn! I will double-check that the uniqueKey values are really unique across all shards.
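
The uniqueness check mentioned in this reply can be sketched as follows. This is only an illustration, assuming the ids of each shard have already been collected (for example by querying each shard core separately with distrib=false); the shard names and ids are hypothetical.

```python
from collections import defaultdict


def find_cross_shard_duplicates(ids_by_shard):
    """Return {doc_id: sorted shard names} for ids found on more than one shard."""
    shards_for_id = defaultdict(set)
    for shard, ids in ids_by_shard.items():
        for doc_id in ids:
            shards_for_id[doc_id].add(shard)
    return {
        doc_id: sorted(shards)
        for doc_id, shards in shards_for_id.items()
        if len(shards) > 1
    }


# Hypothetical id lists, one per shard (not taken from the thread).
sample = {"shard1": ["a", "b", "c"], "shard2": ["c", "d"], "shard3": ["e"]}
duplicates = find_cross_shard_duplicates(sample)
print(duplicates)
```

An empty result would suggest that duplicated ids are not what makes the count vary between pages.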

Problem with cyrillics letters through Tika OCR indexing

2017-02-09 Thread Абрашин , Игорь Олегович
Hello everyone, I have encountered the error mentioned in the title. The original image is attached, and the recognized text is below: 3ApaBCTyI7ITe 9| )KVIBy xopomo Has anyone faced something similar? I should mention that tesseract recognizes it more correctly with the -l rus option. Thanks in advance! Best regards,

how to get modified field data if it doesn't exist in meta

2017-02-09 Thread Gytis Mikuciunas
Hi, We have started to use Solr for indexing our documents (vsd, vsdx, xls, xlsx, doc, docx, pdf, txt). A modified-date value is needed for each file. MS Office files and PDFs have this value. The problem is with txt files, as they don't have this value in their metadata. Is there any possibility to get it

Problem with collection operations in 6.4.1?

2017-02-09 Thread Walter Underwood
After three hours, I’m still getting this from an async collection delete request. { responseHeader: { status: 0, QTime: 12 }, status: { state: "submitted", msg: "found [wunder0] in submitted tasks" } } 16 node cluster, 4 shards, 4 replicas, 14.7 million documents. Also, shutting down a node t
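
A minimal sketch of how such an async request could be polled, assuming the Collections API REQUESTSTATUS action and its commonly documented states (completed, failed, running, submitted, notfound). The base URL is a placeholder, and the response dict mirrors the one quoted in the message.

```python
from urllib.parse import urlencode

# States after which further polling is pointless (assumed from the Solr docs).
TERMINAL_STATES = {"completed", "failed", "notfound"}


def request_status_url(base_url, request_id):
    """Build the REQUESTSTATUS URL for an async Collections API request."""
    query = urlencode(
        {"action": "REQUESTSTATUS", "requestid": request_id, "wt": "json"}
    )
    return f"{base_url}/admin/collections?{query}"


def is_finished(response):
    """True once the async task has reached a terminal state."""
    return response["status"]["state"] in TERMINAL_STATES


# Placeholder host; "wunder0" is the request id shown in the message.
url = request_status_url("http://localhost:8983/solr", "wunder0")

# The response reported in the message, written as a Python dict.
stuck = {
    "responseHeader": {"status": 0, "QTime": 12},
    "status": {"state": "submitted", "msg": "found [wunder0] in submitted tasks"},
}
print(url, is_finished(stuck))
```

A task that stays in "submitted" for hours, as described here, never reaches any of the terminal states.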

Migrate Documents to Another Collection

2017-02-09 Thread alias
Hello, please help me with this question. It concerns index migration in Solr 6.3 (/admin/collections?Action=MIGRATE). I really do not know how to solve it and hope that someone can help answer. I would be very grateful. http://lucene.472066.n3.nabble.com/Migrate-Documents-to-Another-Collection

Re: Removing duplicate terms from query

2017-02-09 Thread Erick Erickson
This is a common misunderstanding of RemoveDuplicatesTokenFilter. It removes tokens _introduced_ by certain other filters, not duplicates that were part of the original. This is the relevant part of the docs: "if they have the same text and position values". An input of "hey hey hey" has a differen
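
The rule quoted from the docs ("same text and position values") can be illustrated with a small simulation. This is a sketch of the rule only, not Lucene's implementation; tokens are modelled as hypothetical (text, position) pairs.

```python
def remove_duplicates(tokens):
    """Drop a token only if the same text was already seen at the same position."""
    seen = set()
    kept = []
    for text, position in tokens:
        if (text, position) not in seen:
            seen.add((text, position))
            kept.append((text, position))
    return kept


# "hey hey hey": same text, but each token sits at a different position.
plain = [("hey", 1), ("hey", 2), ("hey", 3)]

# Hypothetical output of an upstream filter that stacked a token on position 1.
stacked = [("run", 1), ("run", 1), ("fast", 2)]

print(remove_duplicates(plain))
print(remove_duplicates(stacked))
```

Only the stacked example loses a token, which matches the point that the filter targets duplicates introduced by other filters.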

Re: Huh? What does this even mean? Not enough time left to update replicas. However, the schema is updated already.

2017-02-09 Thread Erick Erickson
Well, managed schema in SolrCloud is a bit heavy-weight. When you change the schema, two things need to happen: 1> the change has to be pushed to ZooKeeper 2> the replicas in the collection need to be reloaded to make the changes available to all replicas for the _next_ doc that comes in.

Re: Huh? What does this even mean? Not enough time left to update replicas. However, the schema is updated already.

2017-02-09 Thread Shawn Heisey
On 2/9/2017 10:29 AM, Michael Joyner wrote: > > Huh? What does this even mean? If the schema is updated already how > can we be out of time to update it? > > Not enough time left to update replicas. However, the schema is > updated already. The code where the waitForOtherReplicasToUpdate method (t

RE: Removing duplicate terms from query

2017-02-09 Thread Markus Jelsma
Yeah, what does that do anyway? Omit both, but not one in particular? And where was omitTermFreq all this time? Does it make sense? Not to me at least, so I never tried it and just overrode the similarity in place. M. -Original message- > From:Alexandre Rafalovitch > Sent: Thursda

Re: procedure to restart solrcloud, and config/collection consistency

2017-02-09 Thread xavier jmlucjav
hi Shawn, as I replied to Markus, of course I know (and use) the collections api to reload the config. I am asking what would happen in that scenario: - config updated (but collection not reloaded) - i restart one node now one node has the new config and the rest the old one?? To which he alrea

Re: values for fairnessPolicy?

2017-02-09 Thread Walter Underwood
The code needs a boolean. In HttpShardHandlerFactory.java: BlockingQueue blockingQueue = (this.queueSize == -1) ? new SynchronousQueue(this.accessPolicy) : new ArrayBlockingQueue(this.queueSize, this.accessPolicy); Also, what is a “reasonable size of queue” for sizeOfQueue?

values for fairnessPolicy?

2017-02-09 Thread Walter Underwood
The default is “false”. I tried “true” and it fails because it can’t parse that as an int. The docs need to describe legal values for this. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)

Re: creating collection using collection API with SSL enabled SolrCloud

2017-02-09 Thread Bryan Bende
You should be able to start your Solr instances with "-h ". On Thu, Feb 9, 2017 at 12:09 PM, Xie, Sean wrote: > Thank you Hrishikesh, > > The cluster property solved the issue. > > Now we need to figure out a way to give the instance a host name to solve the > SSL error that IP not matching the

Huh? What does this even mean? Not enough time left to update replicas. However, the schema is updated already.

2017-02-09 Thread Michael Joyner
Huh? What does this even mean? If the schema is updated already how can we be out of time to update it? Not enough time left to update replicas. However, the schema is updated already.

Re: creating collection using collection API with SSL enabled SolrCloud

2017-02-09 Thread Xie, Sean
Thank you Hrishikesh, The cluster property solved the issue. Now we need to figure out a way to give the instance a host name to solve the SSL error that IP not matching the SSL name. Sean On 2/9/17, 11:35 AM, "Hrishikesh Gadre" wrote: Hi Sean, Have you configured the "urlSche

Re: Removing duplicate terms from query

2017-02-09 Thread Alexandre Rafalovitch
Would omitTermFreqAndPositions help here? Though that's probably overkill, as that disables phrase searches too. I am not sure if it is possible to do omitTermFreqAndPositions=true omitPositions=false to just skip frequencies. Regards, Alex. http://www.solr-start.com/ - Resources for Sol

Re: difference in json update handler update/json and update/json/docs

2017-02-09 Thread Florian Meier
this was the right lead, thanks Alex > Am 08.02.2017 um 22:20 schrieb Alexandre Rafalovitch : > > /update/json expects Solr JSON update format. > /update is an auto-route that should be equivalent to /update/json > with the right content type/extension. > > /update/json/docs expects random JSON
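
To make the distinction concrete, here is a rough sketch of the two kinds of request body. The field names and the nested record are hypothetical, and how /update/json/docs maps arbitrary JSON onto fields depends on the handler configuration.

```python
import json


def solr_command_body(doc):
    """Body in Solr's JSON update command format, as expected by /update/json."""
    return json.dumps({"add": {"doc": doc}})


def custom_json_body(record):
    """Arbitrary JSON document, as accepted by /update/json/docs."""
    return json.dumps(record)


# Hypothetical flat document for the command format.
command = solr_command_body({"id": "1", "title": "hello"})

# Hypothetical nested record that does not follow Solr's update format.
custom = custom_json_body({"id": "1", "meta": {"title": "hello"}})

print(command)
print(custom)
```

Sending the second body to /update/json would likely fail, because "meta" is not a recognized update command.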

Re: Removing duplicate terms from query

2017-02-09 Thread Walter Underwood
1. I don’t think this is a good idea. It means that a search for “hey hey hey” won’t score that document higher. 2. Maybe you want to change how tf is calculated. Ignore multiple occurrences of a word. I ran into this with the movie title “New York, New York” at Netflix. It isn’t twice as much
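
The second suggestion, ignoring multiple occurrences of a word, amounts to swapping the term-frequency function. A small sketch, assuming the square-root tf used by Lucene's classic TF-IDF similarity as the baseline; binary_tf is a hypothetical replacement, not an existing Solr setting.

```python
import math


def classic_tf(freq):
    """Square-root term frequency, as in Lucene's classic TF-IDF similarity."""
    return math.sqrt(freq)


def binary_tf(freq):
    """Hypothetical alternative: a term counts once, however often it occurs."""
    return 1.0 if freq > 0 else 0.0


# "New York, New York" contains each term twice.
for freq in (1, 2, 4):
    print(freq, classic_tf(freq), binary_tf(freq))
```

With the binary variant, a title that repeats a word scores the same on that term as one that contains it once.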

Re: creating collection using collection API with SSL enabled SolrCloud

2017-02-09 Thread Hrishikesh Gadre
Hi Sean, Have you configured the "urlScheme" cluster property (i.e. urlScheme=https) ? https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-CLUSTERPROP:ClusterProperties Thanks Hrishikesh On Thu, Feb 9, 2017 at 8:23 AM, Xie, Sean wrote: > Hi All, > > When trying to
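
For reference, the cluster property can be set through the CLUSTERPROP action of the Collections API. The sketch below only builds the URL; the host is a placeholder, and the name and val parameter names are taken from the Collections API documentation linked in the message.

```python
from urllib.parse import urlencode


def clusterprop_url(base_url, name, value):
    """Build a Collections API CLUSTERPROP URL that sets one cluster property."""
    query = urlencode({"action": "CLUSTERPROP", "name": name, "val": value})
    return f"{base_url}/admin/collections?{query}"


# Placeholder host; urlScheme=https is the property suggested in the message.
url = clusterprop_url("https://localhost:8983/solr", "urlScheme", "https")
print(url)
```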

creating collection using collection API with SSL enabled SolrCloud

2017-02-09 Thread Xie, Sean
Hi All, When trying to create the collection using the API when there are a few replicas, I’m getting an error because the call seems to be trying to use HTTP for the replicas. https://IP_1:8983/solr/admin/collections?action=CREATE&name=My_COLLECTION&numShards=1&replicationFactor=1&collection.configN

Re: alerting system with Solr's Streaming Expressions

2017-02-09 Thread Susheel Kumar
got it, Thanks, Joel. On Thu, Feb 9, 2017 at 11:17 AM, Susheel Kumar wrote: > I increased from 250 to 2500 and 100 to 1000 when I didn't get the expected > result. Let me add more examples. > > Thanks, > Susheel > > On Thu, Feb 9, 2017 at 11:03 AM, Joel Bernstein > wrote: > >> A few things that I see

Re: alerting system with Solr's Streaming Expressions

2017-02-09 Thread Susheel Kumar
I increased from 250 to 2500 and 100 to 1000 when I didn't get the expected result. Let me add more examples. Thanks, Susheel On Thu, Feb 9, 2017 at 11:03 AM, Joel Bernstein wrote: > A few things that I see right off: > > 1) 2500 terms is too many. I was testing with 100-250 terms > 2) 1000 iteration

Re: alerting system with Solr's Streaming Expressions

2017-02-09 Thread Joel Bernstein
Also you can see in the final iteration of the model that there are 8 true positives and 8 false positives. So this model classifies everything as positive. At that point you know that it's not a good model. Joel Bernstein http://joelsolr.blogspot.com/ On Thu, Feb 9, 2017 at 11:03 AM, Joel Bernstein w
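
The arithmetic behind this observation can be worked through in a few lines. The true and false positive counts come from the message; the zero true negatives and false negatives are an assumption implied by "classifies everything as positive".

```python
def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def predicts_everything_positive(tp, fp, tn, fn):
    """True when the model never produced a negative prediction."""
    return tn == 0 and fn == 0 and (tp + fp) > 0


tp, fp = 8, 8  # counts quoted in the message
tn, fn = 0, 0  # assumption: no negative predictions at all

print(precision(tp, fp), predicts_everything_positive(tp, fp, tn, fn))
```

A precision of 0.5 on a balanced set of 16 examples is what a model gets by labelling every example positive.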

Re: alerting system with Solr's Streaming Expressions

2017-02-09 Thread Joel Bernstein
A few things that I see right off: 1) 2500 terms is too many. I was testing with 100-250 terms 2) 1000 iterations is too high. If the model hasn't converged by 100 iterations it's likely not going to converge. 3) You're going to need more examples. You may want to run features first and see what it

Re: Solr Heap Dump: Any suggestions on what to look for?

2017-02-09 Thread Shawn Heisey
On 2/9/2017 6:19 AM, Kelly, Frank wrote: > Got a heap dump on an Out of Memory error. > Analyzing the dump now in Visual VM > > Seeing a lot of byte[] arrays (77% of our 8GB Heap) in > > * TreeMap$Entry > * FieldCacheImpl$SortedDocValues > > We’re considering switching over to DocValues but would

RE: DistributedUpdateProcessorFactory was explicitly disabled from this updateRequestProcessorChain

2017-02-09 Thread Pratik Thaker
Hi Friends, Can you please try to give me some details about the issue below? Regards, Pratik Thaker From: Pratik Thaker Sent: 07 February 2017 17:12 To: 'solr-user@lucene.apache.org' Subject: DistributedUpdateProcessorFactory was explicitly disabled from this updateRequestProcessorChain Hi All,

Re: procedure to restart solrcloud, and config/collection consistency

2017-02-09 Thread Shawn Heisey
On 2/9/2017 5:24 AM, xavier jmlucjav wrote: > I always wondered, if this was not really needed, and I could just call > 'restart' in every node, in a quick loop, and forget about it. Does anyone > know if this is the case? > > My doubt is in regards to changing some config, and then doing the above

Re: Could not find configName for collection

2017-02-09 Thread Shawn Heisey
On 2/9/2017 4:03 AM, Sedat Kestepe wrote: > When I try to create a collection through Solr or create an index through > Hue using a csv file, I get the below error: > > { "message": > "{\"responseHeader\":{\"status\":400,\"QTime\":16025},\"error\":{\"metadata\":[\"error-class\",\"org.apache.solr.c

Re: inconsistent result count when doing paging

2017-02-09 Thread Shawn Heisey
On 2/8/2017 9:35 PM, cmti95035 wrote: > I noticed in our production environment that the returned result count is > inconsistent when doing paging. > > For example, for a certain query, for the first page (start = 0, rows = 30), > the corresponding "numFound" is 3402; and then it returned 3378, 336

Re: alerting system with Solr's Streaming Expressions

2017-02-09 Thread Susheel Kumar
Hello Joel, Here is the final iteration in json format. https://www.dropbox.com/s/g3a3606ms6cu8q4/final_iteration.json?dl=0 Below is the expression used update(models, batchSize="50", train(trainingSet, features(trainingSet,

Re: Solr partial update

2017-02-09 Thread Mike Thomsen
Set the fl parameter equal to the fields you want and then query for id:(SOME_ID OR SOME_ID OR SOME_ID) On Thu, Feb 9, 2017 at 5:37 AM, Midas A wrote: > Hi, > > i want solr doc partially if unique id exist else we donot want to do any > thing . > > how can i achieve this . > > Regards, > Midas >
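
The suggested query can be assembled like this. It is a sketch only: it assumes the uniqueKey field is called id, the ids and field names are hypothetical, and ids containing query syntax characters would need escaping first.

```python
from urllib.parse import parse_qs, urlencode


def select_params(ids, fields):
    """Build q and fl parameters that fetch selected fields for a set of ids."""
    query = "id:(" + " OR ".join(ids) + ")"
    return urlencode({"q": query, "fl": ",".join(fields)})


# Hypothetical ids and field list.
params = select_params(["doc1", "doc2", "doc3"], ["id", "title"])

# Decode again to show the values the server would receive.
decoded = parse_qs(params)
print(params)
print(decoded)
```

Ids that do not exist simply produce no hits, so the response shows which documents are present.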

RE: Solr Heap Dump: Any suggestions on what to look for?

2017-02-09 Thread Markus Jelsma
-Original message- > From:Kelly, Frank > Sent: Thursday 9th February 2017 15:42 > To: solr-user@lucene.apache.org > Subject: Re: Solr Heap Dump: Any suggestions on what to look for? > > Thanks for the fast reply. > > I think we’re going to focus on using doc values. > > You also said

Re: Solr Heap Dump: Any suggestions on what to look for?

2017-02-09 Thread Kelly, Frank
Thanks for the fast reply. I think we’re going to focus on using doc values. You also said "facet on fewer fields" - how does one do that? Thanks! -Frank Frank Kelly Principal Software Engineer HERE 5 Wayside Rd, Burlington, MA 01803, USA 42° 29' 7" N 71° 11' 32" W

RE: DataImportHandler - Unable to load Tika Config Processing Document # 1

2017-02-09 Thread Anatharaman, Srinatha (Contractor)
Shawn, Thanks again for your input. As I said in my last email, I successfully completed this in standalone Solr. My requirement is to index emails that have already been converted to text files (there are no attachments). Once these text files are indexed, the Solr search result should bring me back

RE: Solr Heap Dump: Any suggestions on what to look for?

2017-02-09 Thread Markus Jelsma
Hello - FieldCache is your problem. This can be solved in many ways, but only one is really beneficial: decrease the number of documents, increase heap, facet on fewer fields, don't do function queries on many fields. Or, of course, reindex with doc values. And you get a bonus, you can also drastically re

Solr Heap Dump: Any suggestions on what to look for?

2017-02-09 Thread Kelly, Frank
Got a heap dump on an Out of Memory error. Analyzing the dump now in Visual VM Seeing a lot of byte[] arrays (77% of our 8GB Heap) in * TreeMap$Entry * FieldCacheImpl$SortedDocValues We’re considering switching over to DocValues but would rather be definitive about the root cause before we

Re: Removing duplicate terms from query

2017-02-09 Thread Ere Maijala
Thanks Emir. I was thinking of something very simple like doing what RemoveDuplicatesTokenFilter does but ignoring positions. It would of course still be possible to have the same term multiple times, but at least the adjacent ones could be deduplicated. The reason I'm not too eager to do it

RE: Removing duplicate terms from query

2017-02-09 Thread Markus Jelsma
How about a pattern replace char filter that checks for repeating groups? It's probably not the fastest option but should work right away. -Original message- > From:Emir Arnautovic > Sent: Thursday 9th February 2017 13:52 > To: solr-user@lucene.apache.org > Subject: Re: Removing duplica
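
The kind of pattern such a char filter might use can be tried out in isolation. This sketch uses Python's re module to test the idea and only collapses adjacent repeats of a single word; the exact pattern is an assumption, not something given in the thread.

```python
import re

# A word followed by one or more whitespace-separated copies of itself.
REPEATED_WORD = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)


def collapse_repeats(text):
    """Replace each run of an adjacent repeated word with a single copy."""
    return REPEATED_WORD.sub(r"\1", text)


print(collapse_repeats("hey hey hey"))
print(collapse_repeats("term term term other term"))
print(collapse_repeats("new york new york"))
```

Repeats that are not adjacent, as in the last example, are left alone by this pattern.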

RE: procedure to restart solrcloud, and config/collection consistency

2017-02-09 Thread Markus Jelsma
Hello - see inline. -Original message- > From:xavier jmlucjav > Sent: Thursday 9th February 2017 13:46 > To: solr-user > Subject: Re: procedure to restart solrcloud, and config/collection consistency > > Hi Markus, > > yes, of course I know (and use) the collections api to reload the

Re: Removing duplicate terms from query

2017-02-09 Thread Emir Arnautovic
Hi Ere, I don't think that there is such a filter. Implementing such a filter would require looking backward, which violates the streaming approach of token filters and leads to unpredictable memory usage. I would do it as part of a query preprocessor and not necessarily as part of Solr. HTH, Emir On 09.02.
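
A query preprocessor of the kind suggested here could be as small as the sketch below. It assumes a simple whitespace-separated, non-phrase query, as in the original question; quoted phrases, operators and field prefixes would need real parsing.

```python
def dedupe_terms(query):
    """Drop repeated terms from a simple query, keeping first-seen order."""
    # dict.fromkeys preserves insertion order, so each term keeps its first slot.
    return " ".join(dict.fromkeys(query.split()))


print(dedupe_terms("term term term"))
print(dedupe_terms("new york new york"))
```

Because this runs before the query reaches Solr, it avoids the look-back problem that a token filter would have.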

Re: procedure to restart solrcloud, and config/collection consistency

2017-02-09 Thread xavier jmlucjav
Hi Markus, yes, of course I know (and use) the collections api to reload the config. I am asking what would happen in that scenario: - config updated (but collection not reloaded) - i restart one node now one node has the new config and the rest the old one?? Regarding restarting many hosts, my

RE: procedure to restart solrcloud, and config/collection consistency

2017-02-09 Thread Markus Jelsma
Hello - if you just want to use updated configuration, you can use Solr's collection reload API call. For restarting we rely on remote provisioning tools such as Salt, other managing tools can probably execute commands remotely as well. If you operate more than just a very few machines, i'd rea

procedure to restart solrcloud, and config/collection consistency

2017-02-09 Thread xavier jmlucjav
Hi, When I need to restart a SolrCloud cluster, I always do this: - log in to host nb1, stop solr - log in to host nb2, stop solr -... - log in to host nbX, stop solr - verify all hosts did stop - in host nb1, start solr - in host nb2, start solr -... I always wondered, if this was not rea

Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

2017-02-09 Thread Bryant, Michael
Hi all, I'm converting my legacy facets to JSON facets and am seeing much better performance, especially with high cardinality facet fields. However, the one issue I can't seem to resolve is excessive memory usage (and OOM errors) when trying to simulate the effect of "group.facet" to sort face

Removing duplicate terms from query

2017-02-09 Thread Ere Maijala
Hi, I just noticed that while we use RemoveDuplicatesTokenFilter during query time, it will consider term positions and not really do anything e.g. if query is 'term term term'. As far as I can see the term positions make no difference in a simple non-phrase search. Is there a built-in way to

Could not find configName for collection

2017-02-09 Thread Sedat Kestepe
Hi, I am having a problem with my Solr on Ambari + HDP Stack. When I try to create a collection through Solr or create an index through Hue using a csv file, I get the below error: { "message": "{\"responseHeader\":{\"status\":400,\"QTime\":16025},\"error\":{\"metadata\":[\"error-class\",\"org.

Solr partial update

2017-02-09 Thread Midas A
Hi, I want to update a Solr doc partially only if its unique id exists; otherwise we do not want to do anything. How can I achieve this? Regards, Midas

inconsistent result count when doing paging

2017-02-09 Thread cmti95035
Hi, I noticed in our production environment that the returned result count is inconsistent when doing paging. For example, for a certain query, for the first page (start = 0, rows = 30), the corresponding "numFound" is 3402; and then it returned 3378, 3361 for the 2nd and 3rd page, respectively (