Solr | cluster | behaviour

2017-03-29 Thread Prateek Jain J

Hi All,

  We have a Solr deployment in active-passive mode, so ideally only one
instance should be up and running at a time. That is true: we only see one
instance serving requests, but we do see some CPU activity on the standby
Solr instance. These instances write to a shared disk. My question is: why
do we see activity on the standby Solr even though it is not serving any
requests? Is there any replication of in-memory indexes between these
instances? Any pointers are welcome.

Regards,
Prateek Jain



RE: Solr | cluster | behaviour

2017-03-29 Thread Prateek Jain J

Just to update: we are using Solr 4.8.1.


Regards,
Prateek Jain

-Original Message-
From: Prateek Jain J 
Sent: 29 March 2017 10:22 AM
To: solr-user@lucene.apache.org
Subject: Solr | cluster | behaviour


Hi All,

  We have a Solr deployment in active-passive mode, so ideally only one
instance should be up and running at a time. That is true: we only see one
instance serving requests, but we do see some CPU activity on the standby
Solr instance. These instances write to a shared disk. My question is: why
do we see activity on the standby Solr even though it is not serving any
requests? Is there any replication of in-memory indexes between these
instances? Any pointers are welcome.

Regards,
Prateek Jain



Re: why leader replica does not call HdfsTransactionLog.finish

2017-03-29 Thread Yang.Liu
Thanks, Erick
I am not sure about HDFS transaction logs; they are intricate.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/why-leader-replica-does-not-call-HdfsTransactionLog-finish-tp4327139p4327399.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr | cluster | behaviour

2017-03-29 Thread Shawn Heisey
On 3/29/2017 3:21 AM, Prateek Jain J wrote:
> We have a Solr deployment in active-passive mode, so ideally only one
> instance should be up and running at a time. That is true: we only see
> one instance serving requests, but we do see some CPU activity on the
> standby Solr instance. These instances write to a shared disk. My
> question is: why do we see activity on the standby Solr even though it
> is not serving any requests? Is there any replication of in-memory
> indexes between these instances? Any pointers are welcome.

Solr is not designed to work in this way.  Sharing an index directory
between two instances is not recommended and may cause serious issues. 
Each instance expects exclusive access to every index it is managing. 
You can disable the locking that normally prevents such sharing, but
that locking exists for a reason and should not be disabled.

Instead, you should use either SolrCloud's inherent index duplication
technology, or the old master-slave replication that existed before
SolrCloud.  Disks are cheap.  Taking risks with your data isn't.

Thanks,
Shawn



Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Pablo Anzorena
Hey,

I was paginating the results of a query and noticed that some documents
were repeated across pagination buckets of 100 rows.
When I sort by the unique field there are no repeated documents, but when I
sort by another field then repeated documents appear.
I assume it is a bug and not the intended behaviour, right?

Solr version: 5.2.1

Regards,
Pablo.


Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Erik Hatcher
Certainly not intended behavior.  Can you show us a way to replicate the issue?


> On Mar 29, 2017, at 8:35 AM, Pablo Anzorena  wrote:
> 
> Hey,
> 
> I was paginating the results of a query and noticed that some documents
> were repeated across pagination buckets of 100 rows.
> When I sort by the unique field there are no repeated documents, but when I
> sort by another field then repeated documents appear.
> I assume it is a bug and not the intended behaviour, right?
> 
> Solr version: 5.2.1
> 
> Regards,
> Pablo.



Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Pablo Anzorena
Let me try. It is really hard to replicate, but I will try and come
back when I get it.

2017-03-29 9:40 GMT-03:00 Erik Hatcher :

> Certainly not intended behavior.  Can you show us a way to replicate the
> issue?
>
>
> > On Mar 29, 2017, at 8:35 AM, Pablo Anzorena 
> wrote:
> >
> > Hey,
> >
> > I was paginating the results of a query and noticed that some documents
> > were repeated across pagination buckets of 100 rows.
> > When I sort by the unique field there is no repeated document but when I
> > sort by another field then repeated documents appear.
> > I assume is a bug and it's not the intended behaviour, right?
> >
> > Solr version:5.2.1
> >
> > Regards,
> > Pablo.
>
>


Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Shawn Heisey
On 3/29/2017 6:35 AM, Pablo Anzorena wrote:
> I was paginating the results of a query and noticed that some
> documents were repeated across pagination buckets of 100 rows. When I
> sort by the unique field there are no repeated documents, but when I
> sort by another field then repeated documents appear. I assume it is a
> bug and not the intended behaviour, right?

There is a potential situation that can cause this problem that is NOT a
bug.

If the field you are sorting on contains duplicate values (same value in
multiple documents), then I am pretty sure that the sort order of
documents with the same value in the sort field is non-deterministic in
these situations:

1) A distributed (sharded) index.
2) When the index contents can change between a request for one page and
a request for the next page -- documents being added, deleted, or changed.

Because the sort order of documents with the same value can change, one
document that may have ended up on the first page on the first query may
end up on the second page on the second query.

Sorting by a field with no duplicate values (the unique field you
mentioned) will always result in the exact same sort order ... but if
you add documents that sort to near the start of the sort order between
queries, the behavior you have noticed can still happen.

If this is what you are encountering, adding a secondary sort on the
uniqueKey field would probably clear up the problem.  If your uniqueKey
field is "id", use something like this:

sort=someField desc,id desc

Thanks,
Shawn
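
For illustration, a minimal SolrJ sketch of that suggestion (URL, collection,
and field names are hypothetical; assumes the uniqueKey is "id"):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TieBreakerPaging {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
            new HttpSolrClient("http://localhost:8983/solr/mycollection");
        SolrQuery query = new SolrQuery("*:*");
        // The secondary sort on the uniqueKey makes the order fully
        // deterministic, so a document cannot drift between pages.
        query.set("sort", "someField desc,id desc");
        query.setStart(0);    // offset of the current page
        query.setRows(100);   // 100 rows per page, as in the report above
        QueryResponse rsp = client.query(query);
        System.out.println("numFound: " + rsp.getResults().getNumFound());
        client.close();
    }
}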



Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Pablo Anzorena
Shawn,

Yes, the field has duplicate values, and yes, adding the secondary sort by
the uniqueKey solves the issue.

Neither of the two situations you mentioned is occurring. The index
is replicated, but not sharded.

Does Solr sort by an internal id if no uniqueKey is present in the sort?

2017-03-29 9:58 GMT-03:00 Shawn Heisey :

> On 3/29/2017 6:35 AM, Pablo Anzorena wrote:
> > I was paginating the results of a query and noticed that some
> > documents were repeated across pagination buckets of 100 rows. When I
> > sort by the unique field there is no repeated document but when I sort
> > by another field then repeated documents appear. I assume is a bug and
> > it's not the intended behaviour, right?
>
> There is a potential situation that can cause this problem that is NOT a
> bug.
>
> If the field you are sorting on contains duplicate values (same value in
> multiple documents), then I am pretty sure that the sort order of
> documents with the same value in the sort field is non-deterministic in
> these situations:
>
> 1) A distributed (sharded) index.
> 2) When the index contents can change between a request for one page and
> a request for the next page -- documents being added, deleted, or changed.
>
> Because the sort order of documents with the same value can change, one
> document that may have ended up on the first page on the first query may
> end up on the second page on the second query.
>
> Sorting by a field with no duplicate values (the unique field you
> mentioned) will always result in the exact same sort order ... but if
> you add documents that sort to near the start of the sort order between
> queries, the behavior you have noticed can still happen.
>
> If this is what you are encountering, adding secondary sort on the
> uniqueKey field would probably clear up the problem.  If your uniqueKey
> field is "id", something like this:
>
> sort=someField desc,id desc
>
> Thanks,
> Shawn
>
>


Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Mikhail Khludnev
Could it be that the replicas differ by deleted docs? I mean numDocs
is the same, but maxDocs differs by the number of deleted docs; you can
see it in the Solr admin UI on the core page.
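
For what it's worth, a hedged SolrJ sketch of checking those numbers
programmatically through the Luke handler (accessor names as I recall them
from SolrJ's LukeResponse; the core URL is hypothetical):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.LukeRequest;
import org.apache.solr.client.solrj.response.LukeResponse;

public class IndexStats {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
            new HttpSolrClient("http://localhost:8983/solr/mycore");
        LukeRequest luke = new LukeRequest();
        luke.setNumTerms(0);  // we only want index-level statistics
        LukeResponse rsp = luke.process(client);
        // numDocs excludes deleted docs, maxDoc includes them; if replicas
        // agree on numDocs but not maxDoc, they differ only by deletes.
        System.out.println("numDocs: " + rsp.getNumDocs());
        System.out.println("maxDoc:  " + rsp.getMaxDoc());
        client.close();
    }
}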

On Wed, Mar 29, 2017 at 4:16 PM, Pablo Anzorena 
wrote:

> Shawn,
>
> Yes, the field has duplicate values and yes, if I add the secondary sort by
> the uniqueKey it solve the issue.
>
> Those 2 situations you mentioned are not occurring, none of them. The index
> is replicated, but not sharded.
>
> Does solr sort by an internal id if no uniqueKey is present in the sort?
>
> 2017-03-29 9:58 GMT-03:00 Shawn Heisey :
>
> > On 3/29/2017 6:35 AM, Pablo Anzorena wrote:
> > > I was paginating the results of a query and noticed that some
> > > documents were repeated across pagination buckets of 100 rows. When I
> > > sort by the unique field there is no repeated document but when I sort
> > > by another field then repeated documents appear. I assume is a bug and
> > > it's not the intended behaviour, right?
> >
> > There is a potential situation that can cause this problem that is NOT a
> > bug.
> >
> > If the field you are sorting on contains duplicate values (same value in
> > multiple documents), then I am pretty sure that the sort order of
> > documents with the same value in the sort field is non-deterministic in
> > these situations:
> >
> > 1) A distributed (sharded) index.
> > 2) When the index contents can change between a request for one page and
> > a request for the next page -- documents being added, deleted, or
> changed.
> >
> > Because the sort order of documents with the same value can change, one
> > document that may have ended up on the first page on the first query may
> > end up on the second page on the second query.
> >
> > Sorting by a field with no duplicate values (the unique field you
> > mentioned) will always result in the exact same sort order ... but if
> > you add documents that sort to near the start of the sort order between
> > queries, the behavior you have noticed can still happen.
> >
> > If this is what you are encountering, adding secondary sort on the
> > uniqueKey field would probably clear up the problem.  If your uniqueKey
> > field is "id", something like this:
> >
> > sort=someField desc,id desc
> >
> > Thanks,
> > Shawn
> >
> >
>



-- 
Sincerely yours
Mikhail Khludnev


Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Pablo Anzorena
Mikhail,

Indeed, maxDocs are different and so are deletedDocs, but numDocs are ok.

I don't really get it, but could that be the problem?

2017-03-29 10:35 GMT-03:00 Mikhail Khludnev :

> Can it happen that replicas are different by deleted docs? I mean numDocs
> is the same, but maxDocs is different by number of deleted docs, you can
> see it in solr admin at the core page.
>
> On Wed, Mar 29, 2017 at 4:16 PM, Pablo Anzorena 
> wrote:
>
> > Shawn,
> >
> > Yes, the field has duplicate values and yes, if I add the secondary sort
> by
> > the uniqueKey it solve the issue.
> >
> > Those 2 situations you mentioned are not occurring, none of them. The
> index
> > is replicated, but not sharded.
> >
> > Does solr sort by an internal id if no uniqueKey is present in the sort?
> >
> > 2017-03-29 9:58 GMT-03:00 Shawn Heisey :
> >
> > > On 3/29/2017 6:35 AM, Pablo Anzorena wrote:
> > > > I was paginating the results of a query and noticed that some
> > > > documents were repeated across pagination buckets of 100 rows. When I
> > > > sort by the unique field there is no repeated document but when I
> sort
> > > > by another field then repeated documents appear. I assume is a bug
> and
> > > > it's not the intended behaviour, right?
> > >
> > > There is a potential situation that can cause this problem that is NOT
> a
> > > bug.
> > >
> > > If the field you are sorting on contains duplicate values (same value
> in
> > > multiple documents), then I am pretty sure that the sort order of
> > > documents with the same value in the sort field is non-deterministic in
> > > these situations:
> > >
> > > 1) A distributed (sharded) index.
> > > 2) When the index contents can change between a request for one page
> and
> > > a request for the next page -- documents being added, deleted, or
> > changed.
> > >
> > > Because the sort order of documents with the same value can change, one
> > > document that may have ended up on the first page on the first query
> may
> > > end up on the second page on the second query.
> > >
> > > Sorting by a field with no duplicate values (the unique field you
> > > mentioned) will always result in the exact same sort order ... but if
> > > you add documents that sort to near the start of the sort order between
> > > queries, the behavior you have noticed can still happen.
> > >
> > > If this is what you are encountering, adding secondary sort on the
> > > uniqueKey field would probably clear up the problem.  If your uniqueKey
> > > field is "id", something like this:
> > >
> > > sort=someField desc,id desc
> > >
> > > Thanks,
> > > Shawn
> > >
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Erick Erickson
I can answer at least one bit...

If all the sort fields are equal, the _internal_ Lucene document ID
(not the uniqueKey) is used to break the tie. The kicker is that the
internal Lucene ID can change when merging segments. Further, the
internal IDs of two given docs can change relative to each other. I.e.

starting state:

unique key  internal Lucene doc ID
1   1
2   2

Sometime after merging:

unique key  internal Lucene doc ID
1   2
2   1

So if this problem only occurs while you're _also_ indexing, this could
be what is happening.

Best,
Erick

On Wed, Mar 29, 2017 at 6:40 AM, Pablo Anzorena  wrote:
> Mikhail,
>
> Indeed, maxDocs are different and so are deletedDocs, but numDocs are ok.
>
> I don't really get it, but could that be the problem?
>
> 2017-03-29 10:35 GMT-03:00 Mikhail Khludnev :
>
>> Can it happen that replicas are different by deleted docs? I mean numDocs
>> is the same, but maxDocs is different by number of deleted docs, you can
>> see it in solr admin at the core page.
>>
>> On Wed, Mar 29, 2017 at 4:16 PM, Pablo Anzorena 
>> wrote:
>>
>> > Shawn,
>> >
>> > Yes, the field has duplicate values and yes, if I add the secondary sort
>> by
>> > the uniqueKey it solve the issue.
>> >
>> > Those 2 situations you mentioned are not occurring, none of them. The
>> index
>> > is replicated, but not sharded.
>> >
>> > Does solr sort by an internal id if no uniqueKey is present in the sort?
>> >
>> > 2017-03-29 9:58 GMT-03:00 Shawn Heisey :
>> >
>> > > On 3/29/2017 6:35 AM, Pablo Anzorena wrote:
>> > > > I was paginating the results of a query and noticed that some
>> > > > documents were repeated across pagination buckets of 100 rows. When I
>> > > > sort by the unique field there is no repeated document but when I
>> sort
>> > > > by another field then repeated documents appear. I assume is a bug
>> and
>> > > > it's not the intended behaviour, right?
>> > >
>> > > There is a potential situation that can cause this problem that is NOT
>> a
>> > > bug.
>> > >
>> > > If the field you are sorting on contains duplicate values (same value
>> in
>> > > multiple documents), then I am pretty sure that the sort order of
>> > > documents with the same value in the sort field is non-deterministic in
>> > > these situations:
>> > >
>> > > 1) A distributed (sharded) index.
>> > > 2) When the index contents can change between a request for one page
>> and
>> > > a request for the next page -- documents being added, deleted, or
>> > changed.
>> > >
>> > > Because the sort order of documents with the same value can change, one
>> > > document that may have ended up on the first page on the first query
>> may
>> > > end up on the second page on the second query.
>> > >
>> > > Sorting by a field with no duplicate values (the unique field you
>> > > mentioned) will always result in the exact same sort order ... but if
>> > > you add documents that sort to near the start of the sort order between
>> > > queries, the behavior you have noticed can still happen.
>> > >
>> > > If this is what you are encountering, adding secondary sort on the
>> > > uniqueKey field would probably clear up the problem.  If your uniqueKey
>> > > field is "id", something like this:
>> > >
>> > > sort=someField desc,id desc
>> > >
>> > > Thanks,
>> > > Shawn
>> > >
>> > >
>> >
>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>


The downsides of not splitting on whitespace in edismax (the old albino elephant prob)

2017-03-29 Thread Doug Turnbull
So with regards to this JIRA (
https://issues.apache.org/jira/browse/SOLR-9185), which makes Solr's
splitting on whitespace optional:

I want to point out that there's not a simple fix to multi-term synonyms,
in part because of specific tradeoffs. Splitting on whitespace is *sometimes
a good thing*. Not splitting on whitespace (or enforcing some other
cross-field consistent token splitting behavior) actually recreates an old
problem that was the reason for creating dismax strategies in the first
place. So I'm glad we're leaving the sow option :)

If you're interested, what follows summarizes a bunch of historical research
I did into Lucene code for my book on why splitting on whitespace is often a
good thing.

Currently the behavior of edismax is intentionally designed to be
term-centric. There's a bias towards having more of your query terms in a
relevant hit. This comes out of an old problem called "albino elephant"
that was the original reason dismax strategies came about. So if a user
searches for

albino elephant

The original Lucene query parser for search across fields would do
something like:

(title:albino OR title:elephant) OR (text:albino OR text:elephant)

TF*IDF held constant for each term, a document that matches "albino" in two
fields has the same value as a document that matches BOTH albino and
elephant. Both get 2 "hits" in the OR query above. Most users consider this
not good! I want albino elephants, not just albino things nor just elephant
things!

So DisjunctionMaxQuery came about because somebody realized that if they
took the per-term maximum, they could bias towards results that had more of
the user's search terms.

(title:albino | title:albino) OR (text:elephant | text:elephant)

Here the highest scored result has BOTH search terms. So a result that has
both elephant and albino will come to the top. What users typically expect.

I call this strategy "term centric" -- it biases results towards documents
with more of the users search terms. I contrast this with "field centric"
search which focuses more on the specific analysis/matching behavior of one
field (shingles/synonyms/auto phrasing/taxonomies/whatever)

This strategy by necessity requires you to have a consistent, global
definition of what's a "search term" independent of fields, either via a
common analyzer across fields or by just splitting on whitespace. A common
analyzer is what BlendedTermQuery in Lucene enforces (used by ES's
cross_field search).

In other words splitting on whitespace has *benefits* and *drawbacks.* The
drawback is what we experience with Solr multiterm synonyms. If you have
one field that breaks up by shingles/some multi-term synonym behavior and
another field that tokenizes on whitespace, you can't easily pick the
document with the "most search terms" as there's no consistent definition
of search terms.

I don't know where I'm going with this, but I want to point out that fixing
multi-term synonyms won't be a silver bullet. People should still expect to
be frustrated :). We should all be aware we likely recreate another problem
with a simple fix to multi-term synonyms. I think there's value in some
strategy that does something like:

- Base relevance with edismax, splitting on whitespace to bias towards more
search terms
- Boosts with edismax w/o splitting on whitespace (or some other QP) to
layer in the effects you want for multiterm synonyms

How you balance these ranking signals is tricky and domain specific, but I
have found this sort of strategy balances both concerns

Ok this probably should have just been a blog post, but I wanted to just
use my history degree for something useful for a change...
Best!
-Doug
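
For concreteness, a sketch of the kind of edismax request described above
(SolrJ; URL, collection, and field names are hypothetical):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class EdismaxExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
            new HttpSolrClient("http://localhost:8983/solr/mycollection");
        SolrQuery query = new SolrQuery("albino elephant");
        query.set("defType", "edismax");
        // Term-centric: edismax splits on whitespace and builds one dismax
        // query per term across the qf fields, so a doc matching BOTH terms
        // outscores a doc matching one term in two fields.
        query.set("qf", "title text");
        query.set("tie", "0.1");  // let non-max fields contribute a little
        System.out.println(client.query(query).getResults());
        client.close();
    }
}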


Re: The downsides of not splitting on whitespace in edismax (the old albino elephant prob)

2017-03-29 Thread Doug Turnbull
What triggered me to send this was seeing this:

> When per-field query structures differ, e.g. when one field's analyzer
removes stopwords and another's doesn't, edismax's DisjunctionMaxQuery
structure when sow=false differs from that produced when sow=true. Briefly,
sow=true produces a boolean query containing one dismax query per query
term, while sow=false produces a dismax query containing one boolean query
per field. Min-should-match processing does (what I think is) the right
thing here. See
TestExtendedDismaxParser.testSplitOnWhitespace_Different_Field_Analysis() for
some examples of this. *Note*: when sow=false and all queried fields' query
structure is the same, edismax does what it has always done: produce a
boolean query containing one dismax query per term.

So just be careful, because this switches edismax towards a per-field dismax
(correct me if I'm wrong here) as opposed to per-term. If I understand this
correctly, you may run into a different set of problems along the albino
elephant spectrum when sow=false.
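
If it helps, a small sketch of toggling that behavior per request (assumes a
Solr version where SOLR-9185's sow parameter is available, reportedly 6.5+;
field names are hypothetical):

import org.apache.solr.client.solrj.SolrQuery;

public class SowToggle {
    // Build an edismax query whose tokenization behavior is switchable.
    public static SolrQuery buildQuery(String userText, boolean splitOnWhitespace) {
        SolrQuery query = new SolrQuery(userText);
        query.set("defType", "edismax");
        query.set("qf", "title text");
        // sow=false keeps the whole string together per field, so multi-term
        // synonyms and shingles can fire; sow=true restores the term-centric
        // per-term dismax structure discussed above.
        query.set("sow", Boolean.toString(splitOnWhitespace));
        return query;
    }
}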

On Wed, Mar 29, 2017 at 10:45 AM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> So with regards to this JIRA (
> https://issues.apache.org/jira/browse/SOLR-9185) Which makes Solr
> splitting on whitespace optional.
>
> I want to point out that there's not a simple fix to multi-term synonyms
> in part because of specific tradeoffs. Splitting on whitespace is *someimes
> a good thing*. Not splitting on whitespace (or enforcing some other
> cross-field consistent token splitting behavior) actually recreates an old
> problem that was the reason for creating dismax strategies in the first
> place. So I'm glad we're leaving the sow option :)
>
> If you're interested, this summarizes a bunch of historical research I did
> into Lucene code for my book for why splitting on whitespace is often a
> good thing
>
> Currently the behavior of edismax is intentionally designed to be
> term-centric. There's a bias towards having more of your query terms in a
> relevant hit. This comes out of an old problem called "albino elephant"
> that was the original reason dismax strategies came about. So if a user
> searches for
>
> albino elephant
>
> The original Lucene query parser for search across fields would do
> something like:
>
> (title:albino OR title:elephant) OR (text:albino OR text:elephant)
>
> TF*IDF held constant for each term, a document that matches "albino" in
> two fields has the same value as a document that matches BOTH albino and
> elephant. Both get 2 "hits" in the OR query above. Most users consder this
> not good! I want albino elephants, not just albino things nor just elephant
> things!
>
> So disjunctionmaxquery came about because somebody realized that if they
> took the per-term maximum, they could bias towards results that had more of
> the user's search terms.
>
> (title:albino | title:albino) OR (text:elephant | text:elephant)
>
> Here the highest scored result has BOTH search terms. So a result that has
> both elephant and albino will come to the top. What users typically expect.
>
> I call this strategy "term centric" -- it biases results towards documents
> with more of the users search terms. I contrast this with "field centric"
> search which focuses more on the specific analysis/matching behavior of one
> field (shingles/synonyms/auto phrasing/taxonomies/whatever)
>
> This strategy by necessity requires you to have a consistent, global
> definition of what's a "search term" independent of fields either by a
> common analyzer across fields or by just splitting on whitespace. A common
> analyzer is what BlendedTermQuery in Lucene enforces (used by ES's
> cross_field search)
>
> In other words splitting on whitespace has *benefits* and *drawbacks.* The
> drawback is what we experience with Solr multiterm synonyms. If you have
> one field that breaks up by shingles/some multi-term synonym behavior and
> another field that tokenizes on whitespace, you can't easily pick the
> document with the "most search terms" as there's no consistent definition
> of search terms.
>
> I don't know where I'm going with this, but I want to point out that
> fixing multiterm synonym won't have a silver bullet. People should still
> expect to be frustrated :). We should all be aware we likely recreate
> another problem with a simple fix to multiterm synonym. I think there's
> value in some strategy that does something like
>
> - Base relevance with edismax, splitting on whitespace to bias towards
> more search terms
> - Boosts with edismax w/o splitting on whitespace (or some other QP) to
> layer in the effects you want for multiterm synonyms
>
> How you balance these ranking signals is tricky and domain specific, but I
> have found this sort of strategy balances both concerns
>
> Ok this probably should have just been a blog post, but I wanted to just
> use my history degree for something useful for a change...
> Best!
> -Doug
>


Re: Fieldtype json supported in SOLR 5.4.0 or 5.4.1

2017-03-29 Thread Abhijit Pawar
Thanks Rick.

Does that mean I need to define the managed-schema myself? I thought it gets
created by default on install, but only on later versions of SOLR (6.0
or later).

Will the managed-schema help in indexing the JSON-type fields in MongoDB?
How do I define the managed-schema in SOLR 5.4.0?



Best Regards,


Abhijit Pawar
Office : +1 (469) 287 2005 x 110




On Tue, Mar 28, 2017 at 6:20 PM, Rick Leir  wrote:

> Abhijit
> In Mongo you probably have one JSON record per document. You can post that
> JSON record to Solr, and the JSON fields get indexed. The github project
> you mention does just that. If you use the Solr managed schema then Solr
> will automatically define fields based on what it receives. Otherwise you
> will need to carefully design a schema.xml.
> Cheers -- Rick
>
> On March 28, 2017 6:08:40 PM EDT, Abhijit Pawar <
> abhijit.ibizs...@gmail.com> wrote:
> >Hello All,
> >
> >I am working on a requirement to index field of type JSON (in mongoDB
> >collection) in SOLR 5.4.0.
> >
> >I am using mongo-jdbc-dih which I found on GitHub :
> >
> >https://github.com/hrishik/solr-mongodb-dih
> >
> >However I could not find a fieldtype on Apache SOLR wiki page which
> >would
> >support JSON datatype in mongoDB.
> >
> >Can someone please recommend a way to include datatype / fieldtype in
> >SOLR
> >schema to support or index JSON data field from mongoDB.
> >Thanks.
> >
> >R​egards,
> >
> >Abhijit​
>
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
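
A minimal sketch of what Rick describes above -- posting one JSON record to
Solr so the managed schema can create fields on the fly (plain Java; the URL,
collection, and field names are assumptions; /update/json/docs should exist
on Solr 5.x):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PostJsonDoc {
    public static void main(String[] args) throws Exception {
        // One JSON record, e.g. exported from a Mongo document.
        String json = "{\"id\":\"doc1\",\"title\":\"hello\",\"tags\":[\"a\",\"b\"]}";
        URL url = new URL(
            "http://localhost:8983/solr/mycollection/update/json/docs?commit=true");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        // 200 means the fields were accepted (and auto-created under a
        // managed schema with field guessing enabled).
        System.out.println("HTTP " + conn.getResponseCode());
    }
}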


Re: The downsides of not splitting on whitespace in edismax (the old albino elephant prob)

2017-03-29 Thread Steve Rowe
Thanks Doug, excellent analysis!

In implementing the SOLR-9185 changes, I considered a compromise approach to
the term-centric / field-centric axis you describe in the case of differing
field analysis pipelines: finding common source-text-offset bounded slices in
all per-field queries, and then producing dismax queries over these slices;
this is a generalization of what happens in the sow=true case, where slice
points are pre-determined by whitespace.  However, it looked really complicated
to maintain source text offsets with queries (if you’re interested, you can see
an example of the kind of thing I’m talking about in my initial patch on
<https://issues.apache.org/jira/browse/LUCENE-7533>, which I ultimately decided
against committing), so I decided to go with per-field dismax when structural
differences are encountered in the per-field queries.  While I won’t be doing
any work on this short term, I still think the above-described approach could
improve the situation in the sow=false/differing-field-analysis case.  Patches
welcome!

One copy/paste-o in your writeup (I think), illustrating term-centric dismax 
queries:

>> (title:albino | title:albino) OR (text:elephant | text:elephant)


This should instead be:

(title:albino | text:albino) OR (title:elephant | text:elephant)  

--
Steve
www.lucidworks.com

> On Mar 29, 2017, at 10:49 AM, Doug Turnbull 
>  wrote:
> 
> What triggered me to send this was seeing this
> 
>> When per-field query structures differ, e.g. when one field's analyzer
> removes stopwords and another's doesn't, edismax's DisjunctionMaxQuery
> structure when sow=false differs from that produced when sow=true. Briefly,
> sow=true produces a boolean query containing one dismax query per query
> term, while sow=false produces a dismax query containing one boolean query
> per field. Min-should-match processing does (what I think is) the right
> thing here. See
> TestExtendedDismaxParser.testSplitOnWhitespace_Different_Field_Analysis() for
> some examples of this. *Note*: when sow=false and all queried fields' query
> structure is the same, edismax does what it has always done: produce a
> boolean query containing one dismax query per term.
> 
> So just be careful because this switches edismax towards a per-field dismax
> (correct me if I'm wrong here) as opposed to per-term. If I understand this
> correctly, you may run into a different set of problems along the albino
> elephant spectrum when sow=true
> 
> On Wed, Mar 29, 2017 at 10:45 AM Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
> 
>> So with regards to this JIRA (
>> https://issues.apache.org/jira/browse/SOLR-9185) Which makes Solr
>> splitting on whitespace optional.
>> 
>> I want to point out that there's not a simple fix to multi-term synonyms
>> in part because of specific tradeoffs. Splitting on whitespace is *someimes
>> a good thing*. Not splitting on whitespace (or enforcing some other
>> cross-field consistent token splitting behavior) actually recreates an old
>> problem that was the reason for creating dismax strategies in the first
>> place. So I'm glad we're leaving the sow option :)
>> 
>> If you're interested, this summarizes a bunch of historical research I did
>> into Lucene code for my book for why splitting on whitespace is often a
>> good thing
>> 
>> Currently the behavior of edismax is intentionally designed to be
>> term-centric. There's a bias towards having more of your query terms in a
>> relevant hit. This comes out of an old problem called "albino elephant"
>> that was the original reason dismax strategies came about. So if a user
>> searches for
>> 
>> albino elephant
>> 
>> The original Lucene query parser for search across fields would do
>> something like:
>> 
>> (title:albino OR title:elephant) OR (text:albino OR text:elephant)
>> 
>> TF*IDF held constant for each term, a document that matches "albino" in
>> two fields has the same value as a document that matches BOTH albino and
>> elephant. Both get 2 "hits" in the OR query above. Most users consder this
>> not good! I want albino elephants, not just albino things nor just elephant
>> things!
>> 
>> So disjunctionmaxquery came about because somebody realized that if they
>> took the per-term maximum, they could bias towards results that had more of
>> the user's search terms.
>> 
>> (title:albino | title:albino) OR (text:elephant | text:elephant)
>> 
>> Here the highest scored result has BOTH search terms. So a result that has
>> both elephant and albino will come to the top. What users typically expect.
>> 
>> I call this strategy "term centric" -- it biases results towards documents
>> with more of the users search terms. I contrast this with "field centric"
>> search which focuses more on the specific analysis/matching behavior of one
>> field (shingles/synonyms/auto phrasing/taxonomies/whatever)
>> 
>> This strategy by necessity requires you to have a consistent, global
>> definition of what's a "search term" ind

Re: The downsides of not splitting on whitespace in edismax (the old albino elephant prob)

2017-03-29 Thread Doug Turnbull
Good info -- yes, thanks for your correction! And thanks for the very welcome
change to edismax!


Interesting point on finding consistent offsets in the text. That would be
an interesting approach.


Doug

On Wed, Mar 29, 2017 at 11:36 AM Steve Rowe  wrote:

> Thanks Doug, excellent analysis!
>
> In implementing the SOLR-9185 changtes, I considered a compromise approach
> to the term-centric / field-centric axis you describe in the case of
> differing field analysis pipelines: finding common source-text-offset
> bounded slices in all per-field queries, and then producing dismax queries
> over these slices; this is a generalization of what happens in the sow=true
> case, where slice points are pre-determined by whitespace.  However, it
> looked really complicated to maintain source text offsets with queries (if
> you’re interested, you can see an example of the kind of thing I’m talking
> about in my initial patch on <
> https://issues.apache.org/jira/browse/LUCENE-7533>, which I ultimately
> decided against committing), so I decided to go with per-field dismax when
> structural differences are encountered in the per-field queries.  While I
> won’t be doing any work on this short term, I still think the
> above-described approach could improve the situation in the
> sow=false/differing-field-analysis case.  Patches welcome!
>
> One copy/paste-o in your writeup (I think), illustrating term-centric
> dismax queries:
>
> >> (title:albino | title:albino) OR (text:elephant | text:elephant)
>
>
> This should instead be:
>
> (title:albino | text:albino) OR (title:elephant | text:elephant)
>
> --
> Steve
> www.lucidworks.com
>
> > On Mar 29, 2017, at 10:49 AM, Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
> >
> > What triggered me to send this was seeing this
> >
> >> When per-field query structures differ, e.g. when one field's analyzer
> > removes stopwords and another's doesn't, edismax's DisjunctionMaxQuery
> > structure when sow=false differs from that produced when sow=true.
> Briefly,
> > sow=true produces a boolean query containing one dismax query per query
> > term, while sow=false produces a dismax query containing one boolean
> query
> > per field. Min-should-match processing does (what I think is) the right
> > thing here. See
> >
> TestExtendedDismaxParser.testSplitOnWhitespace_Different_Field_Analysis()
> for
> > some examples of this. *Note*: when sow=false and all queried fields'
> query
> > structure is the same, edismax does what it has always done: produce a
> > boolean query containing one dismax query per term.
> >
> > So just be careful because this switches edismax towards a per-field
> dismax
> > (correct me if I'm wrong here) as opposed to per-term. If I understand
> this
> > correctly, you may run into a different set of problems along the albino
> > elephant spectrum when sow=true
> >
> > On Wed, Mar 29, 2017 at 10:45 AM Doug Turnbull <
> > dturnb...@opensourceconnections.com> wrote:
> >
> >> So with regards to this JIRA (
> >> https://issues.apache.org/jira/browse/SOLR-9185) Which makes Solr
> >> splitting on whitespace optional.
> >>
> >> I want to point out that there's not a simple fix to multi-term synonyms
> >> in part because of specific tradeoffs. Splitting on whitespace is
> *someimes
> >> a good thing*. Not splitting on whitespace (or enforcing some other
> >> cross-field consistent token splitting behavior) actually recreates an
> old
> >> problem that was the reason for creating dismax strategies in the first
> >> place. So I'm glad we're leaving the sow option :)
> >>
> >> If you're interested, this summarizes a bunch of historical research I
> did
> >> into Lucene code for my book for why splitting on whitespace is often a
> >> good thing
> >>
> >> Currently the behavior of edismax is intentionally designed to be
> >> term-centric. There's a bias towards having more of your query terms in
> a
> >> relevant hit. This comes out of an old problem called "albino elephant"
> >> that was the original reason dismax strategies came about. So if a user
> >> searches for
> >>
> >> albino elephant
> >>
> >> The original Lucene query parser for search across fields would do
> >> something like:
> >>
> >> (title:albino OR title:elephant) OR (text:albino OR text:elephant)
> >>
> >> TF*IDF held constant for each term, a document that matches "albino" in
> >> two fields has the same value as a document that matches BOTH albino and
> >> elephant. Both get 2 "hits" in the OR query above. Most users consder
> this
> >> not good! I want albino elephants, not just albino things nor just
> elephant
> >> things!
> >>
> >> So disjunctionmaxquery came about because somebody realized that if they
> >> took the per-term maximum, they could bias towards results that had
> more of
> >> the user's search terms.
> >>
> >> (title:albino | title:albino) OR (text:elephant | text:elephant)
> >>
> >> Here the highest scored result has BOTH search terms. So a result that
> has
> >> both elephant an

RE: Solr | cluster | behaviour

2017-03-29 Thread Prateek Jain J

Thanks Shawn. 

Regards,
Prateek Jain

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: 29 March 2017 01:27 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr | cluster | behaviour

On 3/29/2017 3:21 AM, Prateek Jain J wrote:
> We have a Solr deployment in active-passive mode, so ideally only one
> instance should be up and running at a time. That is true: we only see
> one instance serving requests, but we do see some CPU activity on the
> standby Solr instance. These instances write to a shared disk. My
> question is: why do we see activity on the standby Solr even though it
> is not serving any requests? Is there any replication of in-memory
> indexes between these instances? Any pointers are welcome.

Solr is not designed to work in this way.  Sharing an index directory between 
two instances is not recommended and may cause serious issues. 
Each instance expects exclusive access to every index it is managing. 
You can disable the locking that normally prevents such sharing, but that 
locking exists for a reason and should not be disabled.

Instead, you should use either SolrCloud's inherent index duplication 
technology, or the old master-slave replication that existed before SolrCloud.  
Disks are cheap.  Taking risks with your data isn't.

Thanks,
Shawn



Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread alessandro.benedetti
The reason Mikhail mentioned that is probably related to:

*The way the number of documents is calculated has changed (LUCENE-6711)*
/The number of documents (docCount) is used to calculate term specificity
(idf) and average document length (avdl). Prior to LUCENE-6711,
collectionStats.maxDoc() was used for the statistics. Now,
collectionStats.docCount() is used whenever possible; if not, maxDoc() is
used.
Assume that a collection contains 100 documents, and 50 of them have a
"keywords" field. In this example, maxDoc is 100 while docCount is 50 for
the "keywords" field. The total number of tokens for the "keywords" field is
divided by docCount to obtain avdl. Therefore docCount, which is the total
number of documents that have at least one term for the field, is a more
precise metric for optional fields.
DefaultSimilarity does not leverage avdl, so this change has a relatively
minor effect on the result list, because the relative idf values of terms
remain the same. However, when combined with other factors such as term
frequency, the relative ranking of documents could change. Some Similarity
implementations (such as the ones instantiated with NormalizationH2 and
BM25) take avdl into account and would show a notable change in the ranked
list, especially if you have a collection of documents with varying lengths,
because NormalizationH2 tends to punish documents longer than avdl./
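
(To make that arithmetic concrete with assumed token counts: if those 50
documents hold 500 "keywords" tokens in total, avdl = 500 / 50 = 10 under
docCount, versus 500 / 100 = 5 under the old maxDoc denominator -- so
length normalization can shift noticeably for optional fields.)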

This means that if you are load balancing, the page 2 query could go to
another replica, where the doc is scored differently, ending up in a
different position (and maybe appearing again as a final effect).
This scenario concerns score-based ranking, so it will not affect sorting
(and I believe in your initial mail you were not referring to sorting).

Cheers


Pablo wrote
> Mikhail,
> 
> Indeed, maxDocs are different and so are deletedDocs, but numDocs are ok.
> 
> I don't really get it, but could that be the problem?





-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Pagination-bug-when-sorting-by-a-field-not-unique-field-tp4327408p4327461.html
Sent from the Solr - User mailing list archive at Nabble.com.


Managing between or avoiding Transient state for specific period of time in Solr

2017-03-29 Thread Shashank Pedamallu
Hi,

I’m performing a long-running operation on a background thread on a core, and
I observed that since the core has the property “transient” set to true, the
core is CLOSED and OPENED by Solr before this operation completes (even though
the operation continues without interruption). Is there a way to avoid/block
OPENING/CLOSING of the core while this operation is in progress?

Thanks,
Shashank Pedamallu
MTS MBU vCOps Dev
US-CA-Promontory E, E 1035
Email: spedama...@vmware.com
Office: 650.427.6280 x76280


Avoiding Transient state during a long running background process

2017-03-29 Thread Shashank Pedamallu
Hi,

I’m performing a long-running operation on a background thread on a core, and
I observed that since the core has the property “transient” set to true, the
core is CLOSED and OPENED by Solr before this operation completes (even though
the operation continues without interruption). Is there a way to avoid/block
OPENING/CLOSING of the core while this operation is in progress?

Thanks,
Shashank


Duplicate documents

2017-03-29 Thread Wenjie Zhang
Hi there,

We are on Solr 6.0.1; here is our Solr schema and config:

_unique_key


   
solr.StrField
32766
  




  [^\w-\.]
  _


  

When having above configuration, and doing following operations, we will
see duplicate documents (two documents have same _unique_key)

1. Add a document:

final SolrInputDocument document = new SolrInputDocument();
document.setField("_unique_key", "key1");
final UpdateRequest request = new UpdateRequest();
request.add(document);
solrClient.request(request, collectionName);

2. Overwrite the document with:

final SolrInputDocument document = new SolrInputDocument();
document.setField("_unique_key", "key1");
final SolrInputDocument childDocument = new SolrInputDocument();
childDocument.setField("name", "name");
childDocument.setField("parent_id", "key1");
document.addChildDocument(childDocument);
final UpdateRequest request = new UpdateRequest();
request.add(document);
solrClient.request(request, collectionName);

After this, we will see three documents in our collection: one for the
child document we added, and two for the parent document, both having
"_unique_key" as "key1".


After doing some research, we found the "SignatureUpdateProcessorFactory",
so we modified our solrconfig.xml to add it.

 

  signatureField
  true
  _entityKey
  org.apache.solr.update.processor.Lookup3Signature


solr.StrField
32766
   




  [^\w-\.]
  _



  

After the change, we ran the code against a new collection; the duplicate
document issue is gone, but the child document is also no longer shown in the
search results when searching (*:*).
However, the block join ({!parent which="_unique_key:*"}name:*) works fine,
but the join ({!join from=parent_id to=_unique_key}) does not; it returns
nothing.

Any idea?


Thanks,
Jack


Re: Avoiding Transient state during a long running background process

2017-03-29 Thread Shawn Heisey
On 3/29/2017 11:17 AM, Shashank Pedamallu wrote:
> I’m performing a long-running operation on a background thread on a
> core, and I observed that since the core has the property “transient”
> set to true, the core is CLOSED and OPENED by Solr before this
> operation completes (even though the operation continues without
> interruption). Is there a way to avoid/block OPENING/CLOSING of the
> core while this operation is in progress?

I see four choices:

* Change the transient property to false and restart Solr, so that core
cannot ever be unloaded.
* Increase the transientCacheSize value in solr.xml so that more
transient cores can be loaded at the same time.
* Don't access other transient cores during the long-running operation.
* Don't use LotsOfCores functionality (transient cores) at all.

https://wiki.apache.org/solr/LotsOfCores

Thanks,
Shawn



Re: Duplicate documents

2017-03-29 Thread Wenjie Zhang
BTW, we only have one node and the collection has just one shard.

On Wed, Mar 29, 2017 at 10:52 AM, Wenjie Zhang 
wrote:

> Hi there,
>
> We are in solr 6.0.1, here is our solr schema and config:
>
> _unique_key
>
> 
>
> solr.StrField
> 32766
>   
> 
> 
> 
> 
>   [^\w-\.]
>   _
> 
> 
>   
>
> When having above configuration, and doing following operations, we will
> see duplicate documents (two documents have same _unique_key)
>
> 1, Add document:
>
> *final SolrInputDocument document = new SolrInputDocument();*
>
> * document.setField("_unique_key", "key1");*
>
> * final UpdateRequest request = new UpdateRequest();*
>
> *request.add(document);*
>
> *solrClient.request(request,collectionName);*
> 2, Overwrite the document with
>
> *final SolrInputDocument document = new SolrInputDocument();*
>
> *document.setField("_unique_key", **"key1"**);*
>
> *final SolrInputDocument childDocument = **new SolrInputDocument();*
>
> *childDocument**.setField("name", "name");*
>
> *childDocument**.setField("parent_id", "**key1**");*
>
> * document.addChildDocument(childDocument);*
>
> *final UpdateRequest request = new UpdateRequest();*
>
> *request.add(document);*
>
> *solrClient.request(request,collectionName);*
>
> After this, we will see three documents in our collection, one for the
> child document we added, two for the parent document and both have
> "_unique_key" as "key1".
>
>
> After doing some researching, we found the "SignatureUpdateProcessorFacto
> ry", so we modified our solrConfig.xml to add "
> SignatureUpdateProcessorFactory".
>
>  
> 
>   signatureField
>   true
>   _entityKey
>   org.apache.solr.update.processor.
> Lookup3Signature
> 
> 
> solr.StrField
> 32766
>
> 
> 
> 
> 
>   [^\w-\.]
>   _
> 
> 
>
>   
>
> After the change, we run the code in a new collection, the duplicate
> document issue is gone, but the child document is also not shown in the
> search result when searching (*:*),.
> However, the block join ({!parent which="_unique_key:*"}name:*) works
> fine, but not the join ({!join from=parent_id to=_unique_key}), it
> returns nothing.
>
> Any idea?
>
>
> Thanks,
> Jack
>


Re: Duplicate documents

2017-03-29 Thread Alexandre Rafalovitch
There are too many things here. As far as I understand:
*) You should not need to use the Signature chain
*) You should have a uniqueID assigned to the child record
*) You should not assign a parentID to the child record; it will be
assigned automatically
*) Double-check that your unique_key field type is string (not text or
similar), though this does not seem to be the issue

Make sure to run the test against a clean/empty index, just to see if
maybe something is hanging around. I would also test against the
latest Solr, just in case something was fixed in the meantime.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced
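
A sketch of the indexing pattern suggested above (SolrJ; "_unique_key" and
"name" follow the earlier mails, while the child id and URL are made-up
conventions):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParentChildIndexing {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
            new HttpSolrClient("http://localhost:8983/solr/mycollection");
        SolrInputDocument parent = new SolrInputDocument();
        parent.setField("_unique_key", "key1");

        SolrInputDocument child = new SolrInputDocument();
        child.setField("_unique_key", "key1_child1");  // child gets its own id
        child.setField("name", "name");
        // No hand-set parent_id: the block join relationship is positional
        // and maintained by Solr when the block is indexed together.
        parent.addChildDocument(child);

        client.add(parent);
        client.commit();
        client.close();
    }
}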


On 29 March 2017 at 14:30, Wenjie Zhang  wrote:
> BTW, we only have one node and the collection has just one shard.
>
> On Wed, Mar 29, 2017 at 10:52 AM, Wenjie Zhang 
> wrote:
>
>> Hi there,
>>
>> We are in solr 6.0.1, here is our solr schema and config:
>>
>> _unique_key
>>
>> 
>>
>> solr.StrField
>> 32766
>>   
>> 
>> 
>> 
>> 
>>   [^\w-\.]
>>   _
>> 
>> 
>>   
>>
>> When having above configuration, and doing following operations, we will
>> see duplicate documents (two documents have same _unique_key)
>>
>> 1, Add document:
>>
>> *final SolrInputDocument document = new SolrInputDocument();*
>>
>> * document.setField("_unique_key", "key1");*
>>
>> * final UpdateRequest request = new UpdateRequest();*
>>
>> *request.add(document);*
>>
>> *solrClient.request(request,collectionName);*
>> 2, Overwrite the document with
>>
>> *final SolrInputDocument document = new SolrInputDocument();*
>>
>> *document.setField("_unique_key", **"key1"**);*
>>
>> *final SolrInputDocument childDocument = **new SolrInputDocument();*
>>
>> *childDocument**.setField("name", "name");*
>>
>> *childDocument**.setField("parent_id", "**key1**");*
>>
>> * document.addChildDocument(childDocument);*
>>
>> *final UpdateRequest request = new UpdateRequest();*
>>
>> *request.add(document);*
>>
>> *solrClient.request(request,collectionName);*
>>
>> After this, we will see three documents in our collection, one for the
>> child document we added, two for the parent document and both have
>> "_unique_key" as "key1".
>>
>>
>> After doing some researching, we found the "SignatureUpdateProcessorFacto
>> ry", so we modified our solrConfig.xml to add "
>> SignatureUpdateProcessorFactory".
>>
>>  
>> 
>>   signatureField
>>   true
>>   _entityKey
>>   org.apache.solr.update.processor.
>> Lookup3Signature
>> 
>> 
>> solr.StrField
>> 32766
>>
>> 
>> 
>> 
>> 
>>   [^\w-\.]
>>   _
>> 
>> 
>>
>>   
>>
>> After the change, we run the code in a new collection, the duplicate
>> document issue is gone, but the child document is also not shown in the
>> search result when searching (*:*),.
>> However, the block join ({!parent which="_unique_key:*"}name:*) works
>> fine, but not the join ({!join from=parent_id to=_unique_key}), it
>> returns nothing.
>>
>> Any idea?
>>
>>
>> Thanks,
>> Jack
>>


in-place updates

2017-03-29 Thread Elaine Cario
I need some clarity on atomic vs. in-place updates.  For atomic updates I
understand that all fields need to be stored, either explicitly or through
docValues, since the entire document is re-indexed.

For in-place updates, the documentation states that only the fields being
modified are updated, but does that mean that all other fields don't need
to be stored?

My real problem is that we are calculating authority-like and popularity
scores for our documents in an external process, and would actually prefer
to use something like ExternalFileField for those values.  But EFF is not
really SolrCloud-friendly, and we have a fairly mature indexing process in
place already.  It might be more palatable for us to do an in-place update,
rather than deal with the operational issues of dropping files into every
shard of a collection, but only IF we don't need to change our existing
schemas to store everything which might not be stored today.

Thanks in advance.
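
If it helps, a hedged SolrJ sketch of the "set" syntax for such an update
(assumes Solr 6.5+ for true in-place updates and a single-valued,
non-indexed, non-stored, docValues-only numeric field, hypothetically named
"popularity"):

import java.util.Collections;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class InPlaceUpdate {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
            new HttpSolrClient("http://localhost:8983/solr/mycollection");
        SolrInputDocument doc = new SolrInputDocument();
        doc.setField("id", "doc1");
        // A "set" on a docValues-only field updates just that value in place;
        // the rest of the document is untouched, so other fields do not have
        // to be stored for this to work.
        doc.setField("popularity", Collections.singletonMap("set", 42));
        client.add(doc);
        client.commit();
        client.close();
    }
}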


Re: Duplicate documents

2017-03-29 Thread Wenjie Zhang
Hi Alex,

Thanks for your time, and please see my responses inline.

Thanks,
Jack

On Wed, Mar 29, 2017 at 11:36 AM, Alexandre Rafalovitch 
wrote:

> There are too many things here. As far as I understand:
> *) You should not need to use the Signature chain
[Jack] I have the same feeling as well, but I just do not know why this
should change the behavior of child documents.
> *) You should have a uniqueID assigned to the child record
[Jack] Ideally, a child document should not require a unique key, since it
is just a child document and is linked internally with the parent document
by Solr, but I will try adding a unique key to the child document as well
to see how it behaves.
> *) You should not assign a parentID to the child record; it will be
> assigned automatically
[Jack] You are right, but we add this parentId to the child document
because of the orphan child issue
(https://issues.apache.org/jira/browse/SOLR-5211), so we can overcome the
orphan child issue by doing a join. I am just confused by the behavior
change after adding the Signature chain.
> *) Double-check that your unique_key field type is string (not text or
> similar), though this does not seem to be the issue
[Jack] Yes, it is a string.
>
> Make sure to run the test against a clean/empty index, just to see if
> maybe something is hanging around. I would also test against the
> latest Solr, just in case something was fixed in the meantime.
[Jack] Makes sense, I will also test it in the latest version.
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 29 March 2017 at 14:30, Wenjie Zhang  wrote:
> > BTW, we only have one node and the collection has just one shard.
> >
> > On Wed, Mar 29, 2017 at 10:52 AM, Wenjie Zhang <
> wenjiezhang2...@gmail.com>
> > wrote:
> >
> >> Hi there,
> >>
> >> We are in solr 6.0.1, here is our solr schema and config:
> >>
> >> _unique_key
> >>
> >> 
> >>
> >> solr.StrField
> >> 32766
> >>   
> >> 
> >> 
> >> 
> >> 
> >>   [^\w-\.]
> >>   _
> >> 
> >> 
> >>   
> >>
> >> When having above configuration, and doing following operations, we will
> >> see duplicate documents (two documents have same _unique_key)
> >>
> >> 1, Add document:
> >>
> >> *final SolrInputDocument document = new SolrInputDocument();*
> >>
> >> * document.setField("_unique_key", "key1");*
> >>
> >> * final UpdateRequest request = new UpdateRequest();*
> >>
> >> *request.add(document);*
> >>
> >> *solrClient.request(request,collectionName);*
> >> 2, Overwrite the document with
> >>
> >> *final SolrInputDocument document = new SolrInputDocument();*
> >>
> >> *document.setField("_unique_key", **"key1"**);*
> >>
> >> *final SolrInputDocument childDocument = **new SolrInputDocument();*
> >>
> >> *childDocument**.setField("name", "name");*
> >>
> >> *childDocument**.setField("parent_id", "**key1**");*
> >>
> >> * document.addChildDocument(childDocument);*
> >>
> >> *final UpdateRequest request = new UpdateRequest();*
> >>
> >> *request.add(document);*
> >>
> >> *solrClient.request(request,collectionName);*
> >>
> >> After this, we will see three documents in our collection, one for the
> >> child document we added, two for the parent document and both have
> >> "_unique_key" as "key1".
> >>
> >>
> >> After doing some researching, we found the
> "SignatureUpdateProcessorFacto
> >> ry", so we modified our solrConfig.xml to add "
> >> SignatureUpdateProcessorFactory".
> >>
> >>  
> >> 
> >>   signatureField
> >>   true
> >>   _entityKey
> >>   org.apache.solr.update.processor.
> >> Lookup3Signature
> >> 
> >> 
> >> solr.StrField
> >> 32766
> >>
> >> 
> >> 
> >> 
> >> 
> >>   [^\w-\.]
> >>   _
> >> 
> >> 
> >>
> >>   
> >>
> >> After the change, we run the code in a new collection, the duplicate
> >> document issue is gone, but the child document is also not shown in the
> >> search result when searching (*:*),.
> >> However, the block join ({!parent which="_unique_key:*"}name:*) works
> >> fine, but not the join ({!join from=parent_id to=_unique_key}), it
> >> returns nothing.
> >>
> >> Any idea?
> >>
> >>
> >> Thanks,
> >> Jack
> >>
>


Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Mikhail Khludnev
Great explanation, Alessandro!

Let me briefly explain my experience. I have a tiny test with 2 shards and
2 replicas, indexing about a hundred docs. And then, when I fully paginate
the search results with score ranking, I get duplicates across pages. And
the reason is deletes, which probably occur due to update/failover. Every
paging request lands on a different replica. There are a few workarounds:
land consecutive requests on the same replica; also  fixes
duplicates; but tie-breaking is the best way for sure.

On Wed, Mar 29, 2017 at 7:10 PM, alessandro.benedetti 
wrote:

> The reason Mikhail mentioned that, is probably related to :
>
> *The way how number of document calculated is changed (LUCENE-6711)*
> /The number of documents (docCount) is used to calculate term specificity
> (idf) and average document length (avdl). Prior to LUCENE-6711,
> collectionStats.maxDoc() was used for the statistics. Now,
> collectionStats.docCount() is used whenever possible, if not maxDocs() is
> used.
> Assume that a collection contains 100 documents, and 50 of them have
> "keywords" field. In this example, maxDocs is 100 while docCount is 50 for
> the "keywords" field. The total number of tokens for "keywords" field is
> divided by docCount to obtain avdl. Therefore, docCount which is the total
> number of documents that have at least one term for the field, is a more
> precise metric for optional fields.
> DefaultSimilarity does not leverage avdl, so this change would have
> relatively minor change in the result list. Because relative idf values of
> terms will remain same. However, when combined with other factors such as
> term frequency, relative ranking of documents could change. Some Similarity
> implementations (such as the ones instantiated with NormalizationH2 and
> BM25) take avdl into account and would show a notable change in the ranked list.
> Especially if you have a collection of documents with varying lengths.
> Because NormalizationH2 tends to punish documents longer than avdl./
>
> This means that if you are load balancing, the page 2 query could go to
> another replica, where the doc is scored differently, ending up on a
> different position ( and maybe appearing again as a final effect).
> This scenario applies to scored ranking, so it will not affect sorting
> (and I believe in your initial mail you were not referring to sorting).
>
> Cheers
>
>
> Pablo wrote
> > Mikhall,
> >
> > effectively maxDocs are different and also deletedDocs, but numDocs are
> > ok.
> >
> > I don't really get it, but can that be the problem?
>
>
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Pagination-bug-when-sorting-by-a-field-not-unique-field-
> tp4327408p4327461.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev


Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Chris Hostetter

The thing to keep in mind is that w/o a fully deterministic sort,
the underlying problem statement "doc may appear on multiple pages" can
exist even in a single-node Solr index, even if no documents are
added/deleted between page requests: because background merges /
searcher re-opening may happen in between those page requests.

The best practice, if you really care about ensuring no (non-updated) doc
is ever returned twice in subsequent pages, is to use a fully
deterministic sort, with a "tie breaker" clause that is unique to every
document (ie: the uniqueKey field).
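
For example, with SolrJ (a minimal sketch; the sort field "price", the
uniqueKey "id", and the page size are placeholders for your own setup):

    import org.apache.solr.client.solrj.SolrQuery;

    // Deterministic paging: primary sort on a field, ties broken by uniqueKey.
    SolrQuery q = new SolrQuery("*:*");
    q.setSort(SolrQuery.SortClause.asc("price")); // the field you page on
    q.addSort(SolrQuery.SortClause.asc("id"));    // uniqueKey tie-breaker
    q.setStart(100);                              // second page of 100 rows
    q.setRows(100);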




-Hoss
http://www.lucidworks.com/


Re: in-place updates

2017-03-29 Thread Ishan Chattopadhyaya
> For in-place updates, the documentation states that only the fields being
> modified are updated, but does that mean that all other fields don't need
> to be stored?

Correct, in general there's no need to store the other fields. However,
there's a niche case where if a simultaneous DeleteByQuery command and
in-place update command get re-ordered on a replica, and it ends up
deleting the document that is required to be updated, the replica will
fetch the document from the leader. At that point, non-stored fields can be
lost. So, if you don't use DeleteByQuery commands you can safely stop
storing other fields.

> we are calculating authority-like and popularity
> scores for our documents

This is the ideal use case for in-place updates.
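
For example, a minimal SolrJ sketch (the solrClient, the collection name,
the uniqueKey "id" and the field "popularity" are placeholders; "popularity"
is assumed to be defined single-valued with docValues=true, indexed=false,
stored=false, as in-place updates require):

    import java.util.Collections;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-42");
    // "set" (or "inc") on a docValues-only field triggers an in-place update
    doc.addField("popularity", Collections.singletonMap("set", 0.87));
    new UpdateRequest().add(doc).process(solrClient, "mycollection");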

On Thu, Mar 30, 2017 at 1:13 AM, Elaine Cario  wrote:

> I need some clarity on atomic vs in-place updates.  For atomic I understand
> that all fields need to be stored, either explicitly or through docValues,
> since the entire document is re-indexed.
>
> For in-place updates, the documentation states that only the fields being
> modified are updated, but does that mean that all other fields don't need
> to be stored?
>
> My real problem is that we are calculating authority-like and popularity
> scores for our documents in an external process, and would actually prefer
> to use something like ExternalFileField for those values.  But EFF is not
> really SolrCloud-friendly, and we have a fairly mature indexing process in
> place already.  It might be more palatable for us to do an in-place update,
> rather than deal with the operational issues of dropping files into every
> shard of a collection, but only IF we don't need to change our existing
> schemas to store everything which might not be stored today.
>
> Thanks in advance.
>


Re: Pagination bug? when sorting by a field (not unique field)

2017-03-29 Thread Erick Erickson
You might be helped by "distributed IDF"; see SOLR-1632. In Solr 5.0+ it is
enabled by configuring a statsCache (e.g. ExactStatsCache) in solrconfig.xml.

On Wed, Mar 29, 2017 at 1:56 PM, Chris Hostetter
 wrote:
>
> The thing to keep in mind is that w/o a fully deterministic sort,
> the underlying problem statement "doc may appear on multiple pages" can
> exist even in a single-node Solr index, even if no documents are
> added/deleted between page requests: because background merges /
> searcher re-opening may happen in between those page requests.
>
> The best practice, if you really care about ensuring no (non-updated) doc
> is ever returned twice in subsequent pages, is to use a fully
> deterministic sort, with a "tie breaker" clause that is unique to every
> document (ie: the uniqueKey field).
>


Re: Avoiding Transient state during a long running background process

2017-03-29 Thread Shashank Pedamallu
Hi Shawn,

Thank you very much for the response. Is there no definite way of ensuring that
Solr does not switch transient states via an API, like solrCore.open() and
solrCore.close()?

Thanks,
Shashank Pedamallu
MTS MBU vCOps Dev
US-CA-Promontory E, E 1035
Email: spedama...@vmware.com
Office: 650.427.6280 x76280







On 3/29/17, 11:11 AM, "Shawn Heisey"  wrote:

>On 3/29/2017 11:17 AM, Shashank Pedamallu wrote:
>> I’m performing some long running operation on a Background thread on a
>> Core and I observed that since the core has the property “transient”
>> set to true, in between this operation completes, the core is being
>> CLOSED and OPENED by Solr (even though the operation continues without
>> interruption). Is there a way to avoid/block OPENING/CLOSING of the
>> core when this operation is in progress?
>
>I see four choices:
>
>* Change the transient property to false and restart Solr, so that core
>cannot ever be unloaded.
>* Increase the transientCacheSize value in solr.xml so that more
>transient cores can be loaded at the same time.
>* Don't access other transient cores during the long-running operation.
>* Don't use LotsOfCores functionality (transient cores) at all.
>
>https://wiki.apache.org/solr/LotsOfCores
>
>
>Thanks,
>Shawn
>


Re: Avoiding Transient state during a long running background process

2017-03-29 Thread Shawn Heisey
On 3/29/2017 4:50 PM, Shashank Pedamallu wrote:
> Thank you very much for the response. Is there no definite way of
> ensuring that Solr does not switch transient states via an API, like
> solrCore.open() and solrCore.close()?

I am not aware of any way to tell Solr to NOT unload a core when all of
the following conditions have been met:

1) Another transient core must be loaded because it has been accessed.
2) The core in question has been marked transient.
3) The transientCacheSize has already been reached.
4) The core in question is the one with the earliest timestamp.

I checked the code, but could not determine whether the oldest core is
decided by core load time or by core access time. My guess is that it is
decided by the load time, because this is the option that would have the
best performance.

If it's important that this core never gets unloaded, then you'll want
to remove the transient property.

Thanks,
Shawn



Re: Avoiding Transient state during a long running background process

2017-03-29 Thread Shashank Pedamallu
Thanks again for the information Shawn. 

1) The long-running process I mentioned earlier is a backup. I have written a
custom BackupHandler, modeled on the ReplicationHandler class, to back up the
index files to cloud storage. I'm just wondering how switching between
transient states affects such processes. Would the ReplicationHandler not be
affected if a replication job is in progress and Solr decides to unload the
core in the middle of the process?

Thanks,
Shashank







Re: Avoiding Transient state during a long running background process

2017-03-29 Thread Erick Erickson
It's access time; it's an LRU cache. See the docs for LinkedHashMap: this form
of the c'tor is used in SolrCores.allocateLazyCores

transientCores = new LinkedHashMap<String, SolrCore>(Math.min(cacheSize, 1000), 0.75f, true) {

which is a special form of the c'tor that creates an access-ordered map.
I had a terrible moment seeing this line in the code where
transientCores is declared:

protected Map<String, SolrCore> transientCores = new
LinkedHashMap<>(); // For "lazily loaded" cores

which would have created an insertion-ordered LRU cache. Turns out
that this is just a placeholder to keep from having to check if the
transientCores map is null before it's really allocated.
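
For illustration, a minimal stand-alone sketch of that access-ordered idiom
(the capacity of 3 and the String values are just for demonstration):

    import java.util.LinkedHashMap;
    import java.util.Map;

    final int capacity = 3;
    // accessOrder=true: get()/put() moves an entry to the most-recent end
    Map<String, String> lru = new LinkedHashMap<String, String>(capacity, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
            return size() > capacity; // evict the least recently accessed entry
        }
    };
    lru.put("core1", "a"); lru.put("core2", "b"); lru.put("core3", "c");
    lru.get("core1");      // touch core1 so it becomes most-recent
    lru.put("core4", "d"); // evicts core2, the least recently accessed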


bq: My guess is that it is decided by the load time, because this is
the option that would have the best performance.

Not at all. The theory here is that this is to support the pattern
where some transient cores are used all the time and some cores are
only used for a while then go quiet. E.g. searching an organization's
documents. A large organization might have users searching all day. A
small organization may search the docs once a week.

If it was insertion-order, then those users who signed on and worked
all day would have their cores unloaded periodically even though other
cores were last accessed a long time ago. Of course there will be some
access patterns for which this is a bad assumption.

I'm in the middle of pulling all this out into a pluggable framework,
see SOLR-8906. So if this is truly important in 6.6+ you should be
able to define your own plugin.

Shawn's comments on how to avoid unloading the core are spot on, and
the only options that exist currently.

Your BackupHandler should be OK. The core is reloaded whenever it's
accessed, but since the underlying index hasn't changed (it couldn't
because the core was unloaded!) it should be in the same state it was
in last time you accessed it.

If your custom BackupHandler is not holding the core open or, more
specifically a searcher, then even if the core wasn't unloaded you
have the possibility of the index changing out from underneath you due
to indexing activity between calls and having an inconsistent backup.
Could you use the fetchindex replication API command? See:
https://cwiki.apache.org/confluence/display/solr/Index+Replication#IndexReplication-HTTPAPICommandsfortheReplicationHandler.
Solr relies on this "doing the right thing" so that there are
consistent indexes every time, it might save you a lot of grief.

This does work with SolrCloud (I'm not assuming you're using
SolrCloud, just sayin'), but note that the machine being replicated
_to_ (that, BTW, doesn't even have to be part of the collection) won't
be able to serve queries while the replication is going on. I'm
thinking something like use a dummy Solr instance to issue the
fetchindex to _then_ move the result to your Cloud storage.

Best,
Erick



format data at source or format data during indexing?

2017-03-29 Thread Derek Poh

Hi

I need to create a field that will be prefixed and suffixed with the code
'z01x'. This field needs to have the code in the index and during query.

I can either

1.
have the source data of the field formatted with the code before
indexing (outside Solr), and
use a charFilter in the query stage of the field type to add the
code during query:

<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="^(.*)$" replacement="z01x $1 z01x" />

OR

2.
use the charFilter before the tokenizer class during both the index and
query analyzer stages of the field type.

The collection has between 100k-200k documents currently but it may
increase in the future.
The indexing time with option 2 and the current indexing time are almost
the same, between 2-3 minutes.

Which option would you advise?

Derek


Re: format data at source or format data during indexing?

2017-03-29 Thread Alexandre Rafalovitch
I am not sure I can tell you how to decide on one or the other. However, I
wanted to mention that you also have the option of doing it in the
UpdateRequestProcessor chain. That's still within Solr (and therefore
is consistent with multiple clients feeding into Solr) but is before
individual field processing (so will survive - for example - a
copyField).
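
A minimal sketch of such a processor (the class name, the field name
"my_field" and the z01x wrapping are illustrative, not an existing Solr
class); it would then be registered in an updateRequestProcessorChain in
solrconfig.xml:

    import java.io.IOException;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class CodeWrapProcessorFactory extends UpdateRequestProcessorFactory {
      @Override
      public UpdateRequestProcessor getInstance(SolrQueryRequest req,
          SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
          @Override
          public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            Object v = doc.getFieldValue("my_field");
            if (v != null) {
              // wrap the value once, at index time, before field analysis
              doc.setField("my_field", "z01x " + v + " z01x");
            }
            super.processAdd(cmd);
          }
        };
      }
    }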

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 29 March 2017 at 23:38, Derek Poh  wrote:
> Hi
>
> I need to create a field that will be prefixed and suffixed with the code
> 'z01x'. This field needs to have the code in the index and during query.
> I can either
> 1.
> have the source data of the field formatted with the code before indexing
> (outside Solr), and
> use a charFilter in the query stage of the field type to add the code during
> query:
>
> <charFilter class="solr.PatternReplaceCharFilterFactory"
>             pattern="^(.*)$" replacement="z01x $1 z01x" />
>
> OR
>
> 2.
> use the charFilter before the tokenizer class during both the index and query
> analyzer stages of the field type.
>
> The collection has between 100k-200k documents currently but it may
> increase in the future.
> The indexing time with option 2 and the current indexing time are almost the
> same, between 2-3 minutes.
>
> Which option would you advise?
>
> Derek
>


Re: Indexing speed reduced significantly with OCR

2017-03-29 Thread Zheng Lin Edwin Yeo
Thanks for your reply.

From what I see, getting more hardware to do the OCR is inevitable?

Even if we run the OCR outside of the Solr indexing stream, it will still take
a long time to process if it runs on just one machine. And we still need
to wait for the OCR to finish converting before we can run the indexing into
Solr.
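
A minimal sketch of the "OCR only when needed" decision step Phil describes
below, assuming the pdftotext utility is on the PATH (the file names and the
1% threshold are placeholders):

    import java.io.File;
    import java.nio.file.Files;

    File pdf = new File("doc.pdf");
    File txt = new File("doc.txt");
    // extract the existing text layer with the pdftotext utility
    new ProcessBuilder("pdftotext", pdf.getPath(), txt.getPath())
        .inheritIO().start().waitFor();
    // if the extracted text is under ~1% of the PDF size, assume the PDF is
    // scanned images and queue it for OCR (ideally on separate hardware)
    boolean needsOcr = Files.size(txt.toPath()) < pdf.length() / 100;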

Regards,
Edwin


On 29 March 2017 at 04:40, Phil Scadden  wrote:

> Well I haven’t had to deal with a problem that size, but it seems to me
> that you have little alternative except to throw more computer hardware at
> it. For the job I did, I OCRed to convert PDF to searchable PDF outside the
> indexing workflow. I used pdftotext utility to extract text from pdf. If
> text extracted was <1% document size, then I assumed it needed to be OCRed
> otherwise didn’t bother. You could look at a more sophisticated method to
> determine whether OCR was necessary. Doing it outside indexing stream means
> you can use different hardware for OCR. Converting to searchable PDF means
> you do it only once - a reindex doesn’t need to do OCR.
>


Re: format data at source or format data during indexing?

2017-03-29 Thread Erick Erickson
I generally prefer index-time work to query-time work on the theory
that the index-time work is done once and the query time work is done
for each query.

That said, for a corpus this size (and presumably without a large
query rate) I doubt you'd be able to measure any difference.

So basically choose the easiest to implement IMO.

Best,
Erick

On Wed, Mar 29, 2017 at 8:43 PM, Alexandre Rafalovitch
 wrote:
> I am not sure I can tell you how to decide on one or the other. However, I
> wanted to mention that you also have the option of doing it in the
> UpdateRequestProcessor chain. That's still within Solr (and therefore
> is consistent with multiple clients feeding into Solr) but is before
> individual field processing (so will survive - for example - a
> copyField).
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>


Re: format data at source or format data during indexing?

2017-03-29 Thread Derek Poh

Hi Erick

So I could also skip the query analyzer stage for appending the code to
the search keyword, and
have the front-end application append the code for every query it issues
instead?



On 3/30/2017 12:20 PM, Erick Erickson wrote:

I generally prefer index-time work to query-time work on the theory
that the index-time work is done once and the query time work is done
for each query.

That said, for a corpus this size (and presumably without a large
query rate) I doubt you'd be able to measure any difference.

So basically choose the easiest to implement IMO.

Best,
Erick


Re: format data at source or format data during indexing?

2017-03-29 Thread Derek Poh

Hi Alex

Thank you for pointing out the UpdateRequestProcessor option.

On 3/30/2017 11:43 AM, Alexandre Rafalovitch wrote:

I am not sure I can tell you how to decide on one or the other. However, I
wanted to mention that you also have the option of doing it in the
UpdateRequestProcessor chain. That's still within Solr (and therefore
is consistent with multiple clients feeding into Solr) but is before
individual field processing (so will survive - for example - a
copyField).

Regards,
Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced



Re: Is there a way to retrieve the a term's position/offset in Solr

2017-03-29 Thread Bjarke Buur Mortensen
OK, so the next thing to do would be to index and store the rich text ...
is it HTML? Because then you can use HTMLStripCharFilterFactory in your
analyzer, and still get the correct highlight back with hl.fragsize=0.
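
For example, with SolrJ (a minimal sketch; the solrClient, the collection
name and the stored field "body" are placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;

    SolrQuery q = new SolrQuery("body:solr");
    q.setHighlight(true);
    q.addHighlightField("body");
    q.setHighlightFragsize(0); // 0 = highlight the whole stored field value
    QueryResponse rsp = solrClient.query("mycollection", q);
    // the highlighted field values come back via rsp.getHighlighting()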

I would think that you will have a hard time using the term positions, if
what you are indexing is somehow transformed before indexing and you want
to map the positions back to the untransformed text.

2017-03-29 4:44 GMT+02:00 forest_soup :

> Thanks All!
>
> Actually we are going to show the highlighted words in a rich text format
> instead of the plain text which was indexed. So the hl.fragsize=0 seems not
> work for me..
>
> And for the patch(SOLR-4722), haven't tried it. Hope it can return the
> position/offset info.
>
> Thanks!
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Is-there-a-way-to-retrieve-the-a-term-s-
> position-offset-in-Solr-tp4326931p4327339.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>