Re: Is there a way to retrieve a term's position/offset in Solr

2017-03-29 Thread Bjarke Buur Mortensen
OK, so the next thing to do would be to index and store the rich text ... is it HTML? Because then you can use HTMLStripCharFilterFactory in your analyzer, and still get the correct highlight back with hl.fragsize=0. I would think that you will have a hard time using the term positions, if what yo
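A request along the lines Bjarke describes might look like the following Python sketch. The host, core name (mycore), field name (content_html), and query term are placeholders, not taken from the thread; hl, hl.fl, and hl.fragsize are standard Solr highlighting parameters. It only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode

# Placeholder endpoint and field names (assumptions); adjust to the real core and schema.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"

params = {
    "q": "content_html:solr",   # hypothetical query against the stored rich-text field
    "hl": "true",               # turn highlighting on
    "hl.fl": "content_html",    # field to highlight
    "hl.fragsize": 0,           # 0 asks for the whole field value instead of fragments
    "wt": "json",
}

request_url = SOLR_SELECT_URL + "?" + urlencode(params)
print(request_url)
```

With hl.fragsize=0 the highlighted response should carry the full stored value, so the markup of the original text is kept around the highlighted terms.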

Re: format data at source or format data during indexing?

2017-03-29 Thread Derek Poh
Hi Alex, Thank you for pointing out the UpdateRequestProcessor option. On 3/30/2017 11:43 AM, Alexandre Rafalovitch wrote: I am not sure I can tell how to decide on one or another. However, I wanted to mention that you also have an option of doing it in the UpdateRequestProcessor chain. That's st

Re: format data at source or format data during indexing?

2017-03-29 Thread Derek Poh
Hi Erick, So I could also not use the query analyzer stage to append the code to the search keyword? Have the front-end application append the code for every query it issues instead? On 3/30/2017 12:20 PM, Erick Erickson wrote: I generally prefer index-time work to query-time work on the theo

Re: format data at source or format data during indexing?

2017-03-29 Thread Erick Erickson
I generally prefer index-time work to query-time work on the theory that the index-time work is done once and the query time work is done for each query. That said, for a corpus this size (and presumably without a large query rate) I doubt you'd be able to measure any difference. So basically cho

Re: Indexing speed reduced significantly with OCR

2017-03-29 Thread Zheng Lin Edwin Yeo
Thanks for your reply. From what I see, getting more hardware to do the OCR is inevitable? Even if we run the OCR outside of Solr indexing stream, it will still take a long time to process it if it is on just one machine. And we still need to wait for the OCR to finish converting before we can r

Re: format data at source or format data during indexing?

2017-03-29 Thread Alexandre Rafalovitch
I am not sure I can tell how to decide on one or another. However, I wanted to mention that you also have an option of doing it in the UpdateRequestProcessor chain. That's still within Solr (and therefore is consistent with multiple clients feeding into Solr) but is before individual field processi

format data at source or format data during indexing?

2017-03-29 Thread Derek Poh
Hi, I need to create a field that will be prefixed and suffixed with the code 'z01x'. This field needs to have the code in the index and during query. I can either 1. have the source data of the field formatted with the code before indexing (outside solr). use a charFilter in the query stage of the field
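Option 1 from the question (formatting the value outside Solr) could be sketched as below. The helper name wrap_with_code and the sample value are hypothetical, and the sketch assumes the code is attached directly with no separator, which the question does not specify.

```python
CODE = "z01x"  # the code named in the question


def wrap_with_code(value, code=CODE):
    """Return the value with the code as both prefix and suffix (no separator assumed)."""
    return f"{code}{value}{code}"


# The same helper is applied to the source data before indexing and to the
# search keyword before querying, so both sides carry the code.
indexed_value = wrap_with_code("widget")
query_term = wrap_with_code("widget")
print(indexed_value)
```

Using one helper for both paths keeps the indexed form and the query form identical, which is what makes the two match.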

Re: Avoiding Transient state during a long running background process

2017-03-29 Thread Erick Erickson
It's an LRU cache time. See the docs for LinkedHashMap; this form of the c'tor is used in SolrCores.allocateLazyCores: transientCores = new LinkedHashMap(Math.min(cacheSize, 1000), 0.75f, true) { which is a special form of the c'tor that creates an access-ordered map. I had a terrible moment seein
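What "access-ordered" means for eviction can be illustrated with a small Python analogue. This is not Solr's code: the class LruCache, its capacity argument, and the core names are hypothetical, and OrderedDict stands in for Java's access-ordered LinkedHashMap.

```python
from collections import OrderedDict


class LruCache:
    """Toy access-ordered cache, an analogue of an access-ordered LinkedHashMap."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()
        self.evicted = []  # keys evicted so far, kept only for illustration

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # a read counts as an access
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            # Drop the least recently accessed entry.
            oldest, _ = self._items.popitem(last=False)
            self.evicted.append(oldest)


cache = LruCache(capacity=2)
cache.put("core1", "open")
cache.put("core2", "open")
cache.get("core1")          # touching core1 makes core2 the eldest entry
cache.put("core3", "open")  # exceeds capacity, so core2 is evicted
print(cache.evicted)
```

The point of the sketch is that eviction follows the order of last access rather than the order of insertion, so a core that is not touched for a while becomes the candidate for unloading.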

Re: Avoiding Transient state during a long running background process

2017-03-29 Thread Shashank Pedamallu
Thanks again for the information Shawn. 1) The long running process I told earlier was about Backup. I have written a custom BackupHandler to backup the index files to a Cloud storage following the ReplicationHandler class. I’m just wondering how does switching between transient state affect s

Re: Avoiding Transient state during a long running background process

2017-03-29 Thread Shawn Heisey
On 3/29/2017 4:50 PM, Shashank Pedamallu wrote: > Thank you very much for the response. Is there no definite way of > ensuring that Solr does not switch transient states by an api? Like > solrCore.open() and solrCore.close()? I am not aware of any way to tell Solr to NOT unload a core when all of

Re: Avoiding Transient state during a long running background process

2017-03-29 Thread Shashank Pedamallu
Hi Shawn, Thank you very much for the response. Is there no definite way of ensuring that Solr does not switch transient states by an api? Like solrCore.open() and solrCore.close()? Thanks, Shashank Pedamallu MTS MBU vCOps Dev US-CA-Promontory E, E 1035 Email: spedama...@vmware.com Office: 650.

Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Erick Erickson
You might be helped by "distributed IDF". see: SOLR-1632 On Wed, Mar 29, 2017 at 1:56 PM, Chris Hostetter wrote: > > The thing to keep in mind, is that w/o a fully deterministic sort, > the underlying problem statement "doc may appear on multiple pages" can > exist even in a single node solr inde

Re: in-place updates

2017-03-29 Thread Ishan Chattopadhyaya
> For in-place updates, the documentation states that only the fields being > modified are updated, but does that mean that all other fields don't need > to be stored? Correct, in general there's no need to store the other fields. However, there's a niche case where if a simultaneous DeleteByQuery

Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Chris Hostetter
The thing to keep in mind, is that w/o a fully deterministic sort, the underlying problem statement "doc may appear on multiple pages" can exist even in a single node solr index, even if no documents are added/deleted between page requests: because background merges / searcher re-opening may h

Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Mikhail Khludnev
Great explanation, Alessandro! Let me briefly explain my experience. I have a tiny test with 2 shards and 2 replicas, indexing about a hundred docs. And then when I fully paginate search results with score ranking, I've got duplicates across pages. And the reason is deletes, which occur probably d

Re: Duplicate documents

2017-03-29 Thread Wenjie Zhang
Hi Alex, Thanks for your time and please see response inline. Thanks, Jack On Wed, Mar 29, 2017 at 11:36 AM, Alexandre Rafalovitch wrote: > There are too many things here. As far as I understand: > *) You should not need to use Signature chain [Jack] I have the same > feeling as well, but I j

in-place updates

2017-03-29 Thread Elaine Cario
I need some clarity on atomic vs in-place updates. For atomic I understand that all fields need to be stored, either explicitly or through docValues, since the entire document is re-indexed. For in-place updates, the documentation states that only the fields being modified are updated, but does t
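For reference, both kinds of update are sent with the same modifier syntax; a minimal Python sketch of such a payload follows. The document id and the field name popularity are hypothetical. As the Solr documentation describes it, Solr can apply the change in place only when the target field is a single-valued, non-indexed, non-stored numeric docValues field; otherwise the same request is handled as a regular atomic update.

```python
import json

# Hypothetical document id and field name, for illustration only.
update_docs = [
    {"id": "doc1", "popularity": {"inc": 1}},   # increment the current value
    {"id": "doc1", "popularity": {"set": 42}},  # overwrite the current value
]

# JSON body that would be posted to the collection's update handler.
payload = json.dumps(update_docs)
print(payload)
```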

Re: Duplicate documents

2017-03-29 Thread Alexandre Rafalovitch
There are too many things here. As far as I understand: *) You should not need to use Signature chain *) You should have a uniqueID assigned to the child record *) You should not assign parentID to the child record, it will be assigned automatically *) Double check that your unique_key field type i

Re: Duplicate documents

2017-03-29 Thread Wenjie Zhang
BTW, we only have one node and the collection has just one shard. On Wed, Mar 29, 2017 at 10:52 AM, Wenjie Zhang wrote: > Hi there, > > We are in solr 6.0.1, here is our solr schema and config: > > _unique_key > > > > solr.StrField > 32766 > > > > > > [^

Re: Avoiding Transient state during a long running background process

2017-03-29 Thread Shawn Heisey
On 3/29/2017 11:17 AM, Shashank Pedamallu wrote: > I’m performing some long-running operation on a Background thread on a > Core and I observed that since the core has the property “transient” > set to true, before this operation completes, the core is being > CLOSED and OPENED by Solr (even th

Duplicate documents

2017-03-29 Thread Wenjie Zhang
Hi there, We are in solr 6.0.1, here is our solr schema and config: _unique_key solr.StrField 32766 [^\w-\.] _ When having above configuration, and doing following operations, we will see duplicate documents (two documents have same _

Avoiding Transient state during a long running background process

2017-03-29 Thread Shashank Pedamallu
Hi, I’m performing some long-running operation on a Background thread on a Core and I observed that since the core has the property “transient” set to true, before this operation completes, the core is being CLOSED and OPENED by Solr (even though the operation continues without interruption

Managing between or avoiding Transient state for specific period of time in Solr

2017-03-29 Thread Shashank Pedamallu
Hi, I’m performing some long-running operation on a Background thread on a Core and I observed that since the core has the property “transient” set to true, before this operation completes, the core is being CLOSED and OPENED by Solr (even though the operation continues without interruption

Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread alessandro.benedetti
The reason Mikhail mentioned that is probably related to: *The way how number of document calculated is changed (LUCENE-6711)* /The number of documents (docCount) is used to calculate term specificity (idf) and average document length (avdl). Prior to LUCENE-6711, collectionStats.maxDoc() was us
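To see why a different document count on two replicas can change scores, here is a small Python sketch. It assumes the BM25-style idf formula log(1 + (N - df + 0.5) / (df + 0.5)); the function name and all numbers are made up for illustration and are not taken from the thread.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """BM25-style idf: larger when the term is rarer relative to doc_count."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical replicas: same term frequency, but one still counts
# 20 deleted documents that have not been merged away yet.
idf_replica_a = bm25_idf(doc_count=100, doc_freq=10)
idf_replica_b = bm25_idf(doc_count=120, doc_freq=10)

print(idf_replica_a, idf_replica_b)
```

Because the two values differ, the same document can receive a different score depending on which replica answers a given page request, which is one way score-sorted pages can end up inconsistent.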

RE: Solr | cluster | behaviour

2017-03-29 Thread Prateek Jain J
Thanks Shawn. Regards, Prateek Jain -Original Message- From: Shawn Heisey [mailto:apa...@elyograg.org] Sent: 29 March 2017 01:27 PM To: solr-user@lucene.apache.org Subject: Re: Solr | cluster | behaviour On 3/29/2017 3:21 AM, Prateek Jain J wrote: > We are having solr deployment in ac

Re: The downsides of not splitting on whitespace in edismax (the old albino elephant prob)

2017-03-29 Thread Doug Turnbull
Good info-- Yes thanks for your correction! And thanks for the very welcome change to edismax! Interesting point on finding consistent offsets in the text. That would be an interesting approach Doug On Wed, Mar 29, 2017 at 11:36 AM Steve Rowe wrote: > Thanks Doug, excellent analysis! > > In im

Re: The downsides of not splitting on whitespace in edismax (the old albino elephant prob)

2017-03-29 Thread Steve Rowe
Thanks Doug, excellent analysis! In implementing the SOLR-9185 changes, I considered a compromise approach to the term-centric / field-centric axis you describe in the case of differing field analysis pipelines: finding common source-text-offset bounded slices in all per-field queries, and the

Re: Fieldtype json supported in SOLR 5.4.0 or 5.4.1

2017-03-29 Thread Abhijit Pawar
Thanks Rick. Does that mean I need to define managed-schema.xml? I thought it gets created by default on installing, but only on later versions of SOLR (6.0 or later). Will managed-schema help in indexing the JSON type fields in MongoDB? How do I define the managed-schema in SOLR 5.4.0?

Re: The downsides of not splitting on whitespace in edismax (the old albino elephant prob)

2017-03-29 Thread Doug Turnbull
What triggered me to send this was seeing this > When per-field query structures differ, e.g. when one field's analyzer removes stopwords and another's doesn't, edismax's DisjunctionMaxQuery structure when sow=false differs from that produced when sow=true. Briefly, sow=true produces a boolean que

The downsides of not splitting on whitespace in edismax (the old albino elephant prob)

2017-03-29 Thread Doug Turnbull
So with regards to this JIRA (https://issues.apache.org/jira/browse/SOLR-9185), which makes Solr splitting on whitespace optional: I want to point out that there's not a simple fix to multi-term synonyms in part because of specific tradeoffs. Splitting on whitespace is *sometimes a good thing*. Not

Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Erick Erickson
I can answer at least one bit... If all the sort fields are equal, the _internal_ Lucene document ID (not <uniqueKey>) is used to break the tie. The kicker is that the internal Lucene ID can change when merging segments. Further, the internal ID for two given docs can change relative to each other. I.e. star
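The effect Erick describes can be simulated in a few lines of Python. This is a toy model, not Lucene: the documents, the field names price and id, and the "internal" numbers before and after a merge are all invented for illustration.

```python
def fetch_page(docs, sort_key, start, rows):
    """Return the ids of one page of results under the given sort."""
    ordered = sorted(docs, key=sort_key)
    return [doc["id"] for doc in ordered[start:start + rows]]


def make_docs(internal_ids):
    # Every document has the same price, so the sort field alone cannot order them.
    return [{"id": doc_id, "price": 10, "internal": internal}
            for doc_id, internal in internal_ids.items()]


# Hypothetical internal ids before and after a segment merge.
before_merge = make_docs({"a": 0, "b": 1, "c": 2, "d": 3})
after_merge = make_docs({"a": 2, "b": 0, "c": 1, "d": 3})


def by_price_only(doc):
    # Ties fall back to the internal id, which is not stable across merges.
    return (doc["price"], doc["internal"])


def by_price_then_id(doc):
    # Ties fall back to the unique key, which never changes.
    return (doc["price"], doc["id"])


# Page 1 is fetched before the merge, page 2 after it.
unstable = (fetch_page(before_merge, by_price_only, 0, 2)
            + fetch_page(after_merge, by_price_only, 2, 2))
stable = (fetch_page(before_merge, by_price_then_id, 0, 2)
          + fetch_page(after_merge, by_price_then_id, 2, 2))

print(unstable)  # ['a', 'b', 'a', 'd']
print(stable)    # ['a', 'b', 'c', 'd']
```

In a Solr request the stable variant corresponds to adding the unique key as a secondary sort, for example sort=price asc,id asc, where price and id are placeholder field names.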

Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Pablo Anzorena
Mikhail, effectively maxDocs are different and also deletedDocs, but numDocs are ok. I don't really get it, but can that be the problem? 2017-03-29 10:35 GMT-03:00 Mikhail Khludnev : > Can it happen that replicas are different by deleted docs? I mean numDocs > is the same, but maxDocs is differ

Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Mikhail Khludnev
Can it happen that replicas are different by deleted docs? I mean numDocs is the same, but maxDocs is different by number of deleted docs, you can see it in solr admin at the core page. On Wed, Mar 29, 2017 at 4:16 PM, Pablo Anzorena wrote: > Shawn, > > Yes, the field has duplicate values and ye

Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Pablo Anzorena
Shawn, Yes, the field has duplicate values and yes, if I add the secondary sort by the uniqueKey it solves the issue. Those 2 situations you mentioned are not occurring, none of them. The index is replicated, but not sharded. Does solr sort by an internal id if no uniqueKey is present in the sort

Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Shawn Heisey
On 3/29/2017 6:35 AM, Pablo Anzorena wrote: > I was paginating the results of a query and noticed that some > documents were repeated across pagination buckets of 100 rows. When I > sort by the unique field there is no repeated document but when I sort > by another field then repeated documents app

Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Pablo Anzorena
Let me try. It is really hard to replicate it, but I will try it out and come back when I've got it. 2017-03-29 9:40 GMT-03:00 Erik Hatcher : > Certainly not intended behavior. Can you show us a way to replicate the > issue? > > > > On Mar 29, 2017, at 8:35 AM, Pablo Anzorena > wrote: > > > > Hey, >

Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Erik Hatcher
Certainly not intended behavior. Can you show us a way to replicate the issue? > On Mar 29, 2017, at 8:35 AM, Pablo Anzorena wrote: > > Hey, > > I was paginating the results of a query and noticed that some documents > were repeated across pagination buckets of 100 rows. > When I sort by the

Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Pablo Anzorena
Hey, I was paginating the results of a query and noticed that some documents were repeated across pagination buckets of 100 rows. When I sort by the unique field there is no repeated document but when I sort by another field then repeated documents appear. I assume it is a bug and it's not the intend

Re: Solr | cluster | behaviour

2017-03-29 Thread Shawn Heisey
On 3/29/2017 3:21 AM, Prateek Jain J wrote: > We are having solr deployment in active passive mode. So, ideally only > one instance should be up and running at a time. That's true we only > see one instance serving requests but we do see some activity in CPU > for the standby solr instance. These i

Re: why leader replica does not call HdfsTransactionLog.finish

2017-03-29 Thread Yang.Liu
Thanks, Erick. I am not sure about HDFS transaction logs; they are intricate.

RE: Solr | cluster | behaviour

2017-03-29 Thread Prateek Jain J
Just to update, we are using solr 4.8.1 Regards, Prateek Jain -Original Message- From: Prateek Jain J Sent: 29 March 2017 10:22 AM To: solr-user@lucene.apache.org Subject: Solr | cluster | behaviour Hi All, We are having solr deployment in active passive mode. So, ideally only one

Solr | cluster | behaviour

2017-03-29 Thread Prateek Jain J
Hi All, We are having solr deployment in active passive mode. So, ideally only one instance should be up and running at a time. That's true we only see one instance serving requests but we do see some activity in CPU for the standby solr instance. These instances are writing to shared disk.