RE: Position search

2019-10-16 Thread Kaminski, Adi
Hi,
Thanks for the responses.

It's a soft boundary that results from dynamic syntax in our application,
so it may vary between user searches: one user can search for "word1" in the
first 30 words, and another can search for "word2" in the first 10 words.
The use case is to match terms/phrases at specific places in the document in
order to identify scripts/specific word occurrences.

So I guess copy field won't work here.

Any other suggestions/thoughts ?
Maybe some hidden position filters at the native level to limit matches to the
start/end of the document?

Thanks,
Adi

-Original Message-
From: Tim Casey 
Sent: Tuesday, October 15, 2019 11:05 PM
To: solr-user@lucene.apache.org
Subject: Re: Position search

If this is about a normalized query, I would put the normalization text into a 
specific field.  The reason for this is you may want to search the overall text 
during any form of expansion phase of searching for data.
That is, maybe you want to know the context of up to the 120th word.  At least 
you have both.
Also, you may want to note which normalized fields were truncated or were 
simply too small. This would give some guidance as to the bias of the 
normalization.  If 95% of the fields were not truncated, there is a chance you 
are not normalizing well, because you have a set of particularly short 
messages.  So I would expect a small set of side fields remarking on this.  This 
would allow you to carry the measures along with the data.

tim

On Tue, Oct 15, 2019 at 12:19 PM Alexandre Rafalovitch 
wrote:

> Is the 100 words a hard boundary or a soft one?
>
> If it is a hard one (always 100 words), the easiest is probably copy
> field and in the (unstored) copy, trim off whatever you don't want to
> search. Possibly using regular expressions. Of course, "what's a word"
> is an important question here.
>
> Similarly, you could do that with Update Request Processors and
> clone/process field even before it hits the schema. Then you could
> store the extract for highlighting purposes.
>
> Regards,
>Alex.
>
> On Tue, 15 Oct 2019 at 02:25, Kaminski, Adi 
> wrote:
> >
> > Hi,
> > What's the recommended way to search in Solr (assuming 8.2 is used)
> > for
> specific terms/phrases/expressions while limiting the search from
> position perspective.
> > For example to search only in the first/last 100 words of the document ?
> >
> > Is there any built-in functionality for that ?
> >
> > Thanks in advance,
> > Adi
> >
> >
>




Re: Atomic Updates with PreAnalyzedField

2019-10-16 Thread Mikhail Khludnev
Hello, Oleksandr.
It deserves JIRA, please raise one.

On Tue, Oct 15, 2019 at 8:17 PM Oleksandr Drapushko 
wrote:

> Hello Community,
>
> I've discovered a data loss bug and couldn't find any mention of it. Please
> confirm this bug hasn't been reported yet.
>
>
> Description:
>
> If you try to update non pre-analyzed fields in a document using atomic
> updates, data in pre-analyzed fields (if there is any) will be lost. The
> bug was discovered in Solr 8.2 and 7.7.2.
>
>
> Steps to reproduce:
>
> 1. Index this document into techproducts
> {
>   "id": "a",
>   "n_s": "s1",
>   "pre":
>
> "{\"v\":\"1\",\"str\":\"Alaska\",\"tokens\":[{\"t\":\"alaska\",\"s\":0,\"e\":6,\"i\":1}]}"
> }
>
> 2. Query the document
> {
>   "response":{"numFound":1,"start":0,"maxScore":1.0,"docs":[
>   {
> "id":"a",
> "n_s":"s1",
> "pre":"Alaska",
> "_version_":1647475215142223872}]
>   }}
>
> 3. Update using atomic syntax
> {
>   "add": {
> "doc": {
>   "id": "a",
>   "n_s": {"set": "s2"}
> }
>   }
> }
>
> 4. Observe the warning in solr log
> UI:
> WARN  x:techproducts_shard2_replica_n6  PreAnalyzedField  Error parsing
> pre-analyzed field 'pre'
>
> solr.log:
> WARN  (qtp1384454980-23) [c:techproducts s:shard2 r:core_node8
> x:techproducts_shard2_replica_n6] o.a.s.s.PreAnalyzedField Error parsing
> pre-analyzed field 'pre' => java.io.IOException: Invalid JSON type
> java.lang.String, expected Map
> at
>
> org.apache.solr.schema.JsonPreAnalyzedParser.parse(JsonPreAnalyzedParser.java:86)
>
> 5. Query the document again
> {
>   "response":{"numFound":1,"start":0,"maxScore":1.0,"docs":[
>   {
> "id":"a",
> "n_s":"s2",
> "_version_":1647475461695995904}]
>   }}
>
> Result: There is no 'pre' field in the document anymore.
>
>
> My thoughts on it:
>
> 1. Data loss can be prevented if the warning is replaced with an error
> (re-throwing the exception). Atomic updates for such documents still won't
> work, but the updates will be explicitly rejected.
>
> 2. Solr tries to read the document from the index, merge it with the input
> document and re-index it, but when it reads the indexed pre-analyzed fields
> the format is different, so Solr cannot parse and re-index those fields
> properly.
>
>
> Thank you,
> Oleksandr
>


-- 
Sincerely yours
Mikhail Khludnev


Re: Solr-Cloud, join and collection collocation

2019-10-16 Thread Nicolas Paris
Sadly, the join performance is poor.
The joined collection is 12M documents, and queries take about 6,000 ms
versus 60 ms when I compare to the denormalized field.

Apparently, the performance does not change when the filter on the
joined collection is changed: it is still ~6,000 ms whether the subset is
12M documents or 1 document in size. So join performance looks correlated
with the size of the joined collection, not with the kind of filter applied
to it.

I will explore the streaming expressions.
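
A first sketch I plan to try (collection and field names here are mine; both
streams must be sorted on the join key and use the /export handler):

  innerJoin(
    search(main, qt="/export", q="*:*", fl="id,join_id", sort="join_id asc"),
    search(joined, qt="/export", q="vals:42", fl="id", sort="id asc"),
    on="join_id=id"
  )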

On Wed, Oct 16, 2019 at 08:00:43AM +0200, Nicolas Paris wrote:
> > You can certainly replicate the joined collection to every shard. It
> > must fit in one shard and a replica of that shard must be co-located
> > with every replica of the “to” collection.
> 
> Yes, I found this in the documentation, with a clear example just after
> this mail. I will test it today. I also read your blog about join
> performances[1] and I suspect the performance impact of joins will be
> huge because the joined collection is about 10M documents (only two
> fields, unique id and an array of longs and a filter applied to the
> array, join key is 10M unique IDs).
> 
> > Have you looked at streaming and “streaming expressions"? It does not
> > have the same problem, although it does have its own limitations.
> 
> > I never tested them, and I am not very comfortable yet with how to test
> > them. Is it possible to mix query parsers and streaming expressions in
> > the client call via HTTP parameters - or are streaming expressions applied
> > programmatically only?
> 
> [1] https://lucidworks.com/post/solr-and-joins/
> 
> On Tue, Oct 15, 2019 at 07:12:25PM -0400, Erick Erickson wrote:
> > You can certainly replicate the joined collection to every shard. It must 
> > fit in one shard and a replica of that shard must be co-located with every 
> > replica of the “to” collection.
> > 
> > Have you looked at streaming and “streaming expressions"? It does not have 
> > the same problem, although it does have its own limitations.
> > 
> > Best,
> > Erick
> > 
> > > On Oct 15, 2019, at 6:58 PM, Nicolas Paris  
> > > wrote:
> > > 
> > > Hi
> > > 
> > > I have several large collections that cannot fit in a standalone solr
> > > instance. They are split over multiple shards in solr-cloud mode.
> > > 
> > > Those collections are supposed to be joined to an other collection to
> > > retrieve subset. Because I am using distributed collections, I am not
> > > able to use the solr join feature.
> > > 
> > > For this reason, I denormalize the information by adding the joined
> > > collection within every collections. Naturally, when I want to update
> > > the joined collection, I have to update every one of the distributed
> > > collections.
> > > 
> > > In standalone mode, I only would have to update the joined collection.
> > > 
> > > I wonder if there is a way to overcome this limitation. For example, by
> > > replicating the joined collection to every shard - or other method I am
> > > ignoring.
> > > 
> > > Any thought ? 
> > > -- 
> > > nicolas
> > 
> 
> -- 
> nicolas
> 

-- 
nicolas


Re: Highlighting Solr 8

2019-10-16 Thread sasarun
Hi Eric,

The Unified highlighter does not have an option to provide an alternate field
when highlighting. That option is available with the Original and FastVector
highlighters. As indicated in the Solr documentation, Unified is the
recommended method for highlighting and meets most use cases. Please do
share more details in case you are facing any specific issue with
highlighting. 
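
For example, to fall back to the Original highlighter with an alternate field,
the request parameters would look roughly like this (field names are only
placeholders):

  hl=true&hl.method=original&hl.fl=content&hl.alternateField=summary&hl.maxAlternateFieldLength=200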

Thanks,

Arun 






Problems with TokenFilter, but only in wildcard queries

2019-10-16 Thread Björn Keil
Hello,

I am having a problem with a primitive self-written TokenFilter, namely the
GermanUmlautFilter in the example below. It's being used for both queries
and indexing.
It works perfectly most of the time: it replaces ä with ae, ö with oe, and so
forth, before ICUFoldingFilter replaces the remaining non-ASCII symbols.

However, it does cause odd behaviour in wildcard queries, e.g.:
The query title:todesmä* matches todesmarsch, which it should not, because
an ä is supposed to be replaced with ae; however, it also matches
todesmärchen, as it should.
The query title:todesmär* still matches todesmarsch, but not todesmärchen.

That is odd: it is as though the replacement did not take place while
performing a wildcard query, even though it did work during indexing. In
other circumstances it works, however, e.g.:
The query title:härte does correctly not match harte, but it does match
härte.
The query title:haerte is equivalent to the query title:härte.
The query title:harte does correctly not match haerte, but it does match
harte.

While debugging the GermanUmlautFilter, I did not find any obvious mistake.
The only thing that is a bit strange is that the CharTermAttribute's
(implemented by PackedTokenAttributeImpl) endOffset attribute does not appear
to change. However, if it is supposed to indicate the last character's
offset in bytes, that would be the expected result: the filter replaces a
single two-byte character with two one-byte characters in the examples above.

Does anybody have an idea what's going on here? What's so different about
wildcard queries?

From the schema.xml (the fieldType definition itself was stripped by the mail
archiver):

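Reconstructed from the description above, it presumably looked roughly like
this (the tokenizer choice and the factory class name are guesses):

  <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="de.example.analysis.GermanUmlautFilterFactory"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
  </fieldType>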
GermanUmlautFilter code:

package de.example.analysis;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * This TokenFilter replaces German umlauts and the character ß with a
normalized form in ASCII characters.
 *
 * ü => ue
 * ß => ss
 * etc.
 *
 * This enables a sort order according DIN 5007, variant 2, the so
called "phone book" sort order.
 *
 * @see org.apache.lucene.analysis.TokenStream
 *
 */
public class GermanUmlautFilter extends TokenFilter {

    private final CharTermAttribute termAtt =
            addAttribute(CharTermAttribute.class);

    /**
     * @see org.apache.lucene.analysis.TokenFilter#TokenFilter(TokenStream)
     * @param input TokenStream with the tokens to filter
     */
    public GermanUmlautFilter(TokenStream input) {
        super(input);
    }

    /**
     * Performs the actual filtering upon request by the consumer.
     *
     * @see org.apache.lucene.analysis.TokenStream#incrementToken()
     * @return true if a token was produced, false when the stream is exhausted
     */
    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            int countReplacements = 0;
            char[] origBuffer = termAtt.buffer();
            int origLength = termAtt.length();
            // Count the replacements to work out the size of the new buffer;
            // each umlaut (and ß) expands to two characters, adding one.
            for (int i = 0; i < origLength; i++) {
                if (origBuffer[i] == 'ü'
                        || origBuffer[i] == 'ä'
                        || origBuffer[i] == 'ö'
                        || origBuffer[i] == 'ß'
                        || origBuffer[i] == 'Ä'
                        || origBuffer[i] == 'Ö'
                        || origBuffer[i] == 'Ü') {
                    countReplacements++;
                }
            }

            // If there is a replacement, create a new buffer of the
            // appropriate length...
            if (countReplacements != 0) {
                int newLength = origLength + countReplacements;
                char[] target = new char[newLength];
                int j = 0;
                // ... perform the replacements ...
                for (int i = 0; i < origLength; i++) {
                    switch (origBuffer[i]) {
                        case 'ä': target[j++] = 'a'; target[j++] = 'e'; break;
                        case 'ö': target[j++] = 'o'; target[j++] = 'e'; break;
                        case 'ü': target[j++] = 'u'; target[j++] = 'e'; break;
                        case 'Ä': target[j++] = 'A'; target[j++] = 'e'; break;
                        case 'Ö': target[j++] = 'O'; target[j++] = 'e'; break;
                        case 'Ü': target[j++] = 'U'; target[j++] = 'e'; break;
                        case 'ß': target[j++] = 's'; target[j++] = 's'; break;
                        default:  target[j++] = origBuffer[i];
                    }
                }
                // ... and copy the result back into the term attribute.
                termAtt.copyBuffer(target, 0, newLength);
            }
            return true;
        }
        return false;
    }
}

Re: Atomic Updates with PreAnalyzedField

2019-10-16 Thread Oleksandr Drapushko
https://issues.apache.org/jira/browse/SOLR-13850

On Wed, Oct 16, 2019 at 11:25 AM Mikhail Khludnev  wrote:

> Hello, Oleksandr.
> It deserves JIRA, please raise one.
>
> On Tue, Oct 15, 2019 at 8:17 PM Oleksandr Drapushko 
> wrote:
>
> > Hello Community,
> >
> > I've discovered data loss bug and couldn't find any mention of it. Please
> > confirm this bug haven't been reported yet.
> >
> >
> > Description:
> >
> > If you try to update non pre-analyzed fields in a document using atomic
> > updates, data in pre-analyzed fields (if there is any) will be lost. The
> > bug was discovered in Solr 8.2 and 7.7.2.
> >
> >
> > Steps to reproduce:
> >
> > 1. Index this document into techproducts
> > {
> >   "id": "a",
> >   "n_s": "s1",
> >   "pre":
> >
> >
> "{\"v\":\"1\",\"str\":\"Alaska\",\"tokens\":[{\"t\":\"alaska\",\"s\":0,\"e\":6,\"i\":1}]}"
> > }
> >
> > 2. Query the document
> > {
> >   "response":{"numFound":1,"start":0,"maxScore":1.0,"docs":[
> >   {
> > "id":"a",
> > "n_s":"s1",
> > "pre":"Alaska",
> > "_version_":1647475215142223872}]
> >   }}
> >
> > 3. Update using atomic syntax
> > {
> >   "add": {
> > "doc": {
> >   "id": "a",
> >   "n_s": {"set": "s2"}
> > }
> >   }
> > }
> >
> > 4. Observe the warning in solr log
> > UI:
> > WARN  x:techproducts_shard2_replica_n6  PreAnalyzedField  Error parsing
> > pre-analyzed field 'pre'
> >
> > solr.log:
> > WARN  (qtp1384454980-23) [c:techproducts s:shard2 r:core_node8
> > x:techproducts_shard2_replica_n6] o.a.s.s.PreAnalyzedField Error parsing
> > pre-analyzed field 'pre' => java.io.IOException: Invalid JSON type
> > java.lang.String, expected Map
> > at
> >
> >
> org.apache.solr.schema.JsonPreAnalyzedParser.parse(JsonPreAnalyzedParser.java:86)
> >
> > 5. Query the document again
> > {
> >   "response":{"numFound":1,"start":0,"maxScore":1.0,"docs":[
> >   {
> > "id":"a",
> > "n_s":"s2",
> > "_version_":1647475461695995904}]
> >   }}
> >
> > Result: There is no 'pre' field in the document anymore.
> >
> >
> > My thoughts on it:
> >
> > 1. Data loss can be prevented if the warning will be replaced with error
> > (re-throwing exception). Atomic updates for such documents still won't
> > work, but updates will be explicitly rejected.
> >
> > 2. Solr tries to read the document from index, merge it with input
> document
> > and re-index the document, but when it reads indexed pre-analyzed fields
> > the format is different, so Solr cannot parse and re-index those fields
> > properly.
> >
> >
> > Thank you,
> > Oleksandr
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Re: Position search

2019-10-16 Thread Alexandre Rafalovitch
So are these really text positions, or rather actually sections of the
document? If the latter, can you parse out sections during indexing?

Regards,
 Alex

On Wed, Oct 16, 2019, 3:57 AM Kaminski, Adi, 
wrote:

> Hi,
> Thanks for the responses.
>
> It's a soft boundary which is resulted by dynamic syntax from our
> application. So may vary from different user searches, one user can search
> some "word1" in starting 30 words, and another can search "word2" in
> starting 10 words. The use case is to match some terms/phrase in specific
> document places in order to identify scripts/specific word occurrences.
>
> So I guess copy field won't work here.
>
> Any other suggestions/thoughts ?
> Maybe some hidden position filters in native level to limit from start/end
> of the document ?
>
> Thanks,
> Adi
>
> -Original Message-
> From: Tim Casey 
> Sent: Tuesday, October 15, 2019 11:05 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Position search
>
> If this is about a normalized query, I would put the normalization text
> into a specific field.  The reason for this is you may want to search the
> overall text during any form of expansion phase of searching for data.
> That is, maybe you want to know the context of up to the 120th word.  At
> least you have both.
> Also, you may want to note which normalized fields were truncated or were
> simply too small. This would give some guidance as to the bias of the
> normalization.  If 95% of the fields were not truncated, there is a chance
> you are not doing good at normalizing because you have a set of
> particularly short messages.  So I would expect a small set of side fields
> remarking this.  This would allow you to carry the measures along with the
> data.
>
> tim
>
> On Tue, Oct 15, 2019 at 12:19 PM Alexandre Rafalovitch  >
> wrote:
>
> > Is the 100 words a hard boundary or a soft one?
> >
> > If it is a hard one (always 100 words), the easiest is probably copy
> > field and in the (unstored) copy, trim off whatever you don't want to
> > search. Possibly using regular expressions. Of course, "what's a word"
> > is an important question here.
> >
> > Similarly, you could do that with Update Request Processors and
> > clone/process field even before it hits the schema. Then you could
> > store the extract for highlighting purposes.
> >
> > Regards,
> >Alex.
> >
> > On Tue, 15 Oct 2019 at 02:25, Kaminski, Adi 
> > wrote:
> > >
> > > Hi,
> > > What's the recommended way to search in Solr (assuming 8.2 is used)
> > > for
> > specific terms/phrases/expressions while limiting the search from
> > position perspective.
> > > For example to search only in the first/last 100 words of the document
> ?
> > >
> > > Is there any built-in functionality for that ?
> > >
> > > Thanks in advance,
> > > Adi
> > >
> > >
> >
>
>
>


Re: Using Tesseract OCR to extract PDF files in EML file attachment

2019-10-16 Thread Charlie Hull
My colleagues Eric Pugh and Dan Worley covered OCR and Solr in a 
presentation at our recent London Lucene/Solr Meetup:

https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/264579498/
(direct link to slides if you can't find it in the comments 
https://www.slideshare.net/o19s/payloads-and-ocr-with-solr)


HTH

Charlie


On 14/10/2019 11:40, Retro wrote:

Hello, thanks for the answer, but let me explain the setup. We are running our
own backup solution for emails (messages from Exchange in MSG format).
The content of these messages is then indexed in Solr. But Solr cannot process
attachments within those MSG files and cannot OCR them. This is what I need:
to OCR attachments and get their content indexed in Solr.

Davis, Daniel (NIH/NLM) [C] wrote

Nuance and ABBYY provide OCR capabilities as well.
Looking at higher level solutions, both indexengines.com and Comvault can
do email remediation for legal issues.

AJ Weber wrote

There are alternative, paid libraries to parse and extract attachments
from EML files as well.
EML attachments will have a mimetype associated with their metadata.









--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk



Query related APACHE SOLR 8.2.0

2019-10-16 Thread Rohit Rasal
Hello,

We are trying to implement Apache Solr 8.2.0 in our organization. In our 
organization we use Tomcat for deployment of web applications, and the server 
OS is SUSE Linux (SLES v12-sp3).
So we have some queries related to the software requirements of Apache Solr 8.2.0:


1.   Which minimum and maximum Tomcat versions are required for Apache Solr 
8.2.0?

2.   Which minimum and maximum OS versions are supported for Apache Solr 
8.2.0?




Regards,
Rohit Rasal |  Assistant Manager |  NSDL e-Governance Infrastructure Limited  | 
 (CIN U72900MH1995PLC095642)
Direct: 8347  |Email: roh...@nsdl.co.in  | 
Website: https://egov-nsdl.co.in/

"Winner of Golden Peacock Award for Innovation Management - 2018"

Sapphire Chambers, 4th floor, Riviresa Society, Baner, Pune, Maharashtra 411045





Re: Query related APACHE SOLR 8.2.0

2019-10-16 Thread sasarun
Hi Rohit, 

Solr ships with a Jetty server by default and does not require a
Tomcat instance to run. Even though earlier versions of Solr were distributed
as a war file, Solr 5.0 and higher no longer support user-defined
containers. Details of the same are available in the link below for
reference. 

https://cwiki.apache.org/confluence/display/solr/WhyNoWar
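
Out of the box you simply start Solr with its bundled scripts, e.g.:

  bin/solr start -p 8983   # starts Solr with the embedded Jetty
  bin/solr status          # verifies the instance is running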


Details of system requirements are available in the below link

https://lucene.apache.org/solr/guide/8_2/solr-system-requirements.html#supported-operating-systems

Thanks,

Arun 






RE: Position search

2019-10-16 Thread Kaminski, Adi
Hi,
These are really text positions.
For example, I have a document: "hello thanks for calling the support how can I 
help you"

And in the application I would like to search for documents that match "thanks" 
NEAR "support" only in the first 30 words of the document (the greeting part, 
for example), and not in the middle/end part of the document.

Regards,
Adi

-Original Message-
From: Alexandre Rafalovitch 
Sent: Wednesday, October 16, 2019 12:48 PM
To: solr-user 
Subject: Re: Position search

So are these really text locations or rather actually sections of the document. 
If later, can you parse out sections during indexing?

Regards,
 Alex

On Wed, Oct 16, 2019, 3:57 AM Kaminski, Adi, 
wrote:

> Hi,
> Thanks for the responses.
>
> It's a soft boundary which is resulted by dynamic syntax from our
> application. So may vary from different user searches, one user can
> search some "word1" in starting 30 words, and another can search
> "word2" in starting 10 words. The use case is to match some
> terms/phrase in specific document places in order to identify 
> scripts/specific word occurrences.
>
> So I guess copy field won't work here.
>
> Any other suggestions/thoughts ?
> Maybe some hidden position filters in native level to limit from
> start/end of the document ?
>
> Thanks,
> Adi
>
> -Original Message-
> From: Tim Casey 
> Sent: Tuesday, October 15, 2019 11:05 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Position search
>
> If this is about a normalized query, I would put the normalization
> text into a specific field.  The reason for this is you may want to
> search the overall text during any form of expansion phase of searching for 
> data.
> That is, maybe you want to know the context of up to the 120th word.
> At least you have both.
> Also, you may want to note which normalized fields were truncated or
> were simply too small. This would give some guidance as to the bias of
> the normalization.  If 95% of the fields were not truncated, there is
> a chance you are not doing good at normalizing because you have a set
> of particularly short messages.  So I would expect a small set of side
> fields remarking this.  This would allow you to carry the measures
> along with the data.
>
> tim
>
> On Tue, Oct 15, 2019 at 12:19 PM Alexandre Rafalovitch
>  >
> wrote:
>
> > Is the 100 words a hard boundary or a soft one?
> >
> > If it is a hard one (always 100 words), the easiest is probably copy
> > field and in the (unstored) copy, trim off whatever you don't want
> > to search. Possibly using regular expressions. Of course, "what's a word"
> > is an important question here.
> >
> > Similarly, you could do that with Update Request Processors and
> > clone/process field even before it hits the schema. Then you could
> > store the extract for highlighting purposes.
> >
> > Regards,
> >Alex.
> >
> > On Tue, 15 Oct 2019 at 02:25, Kaminski, Adi
> > 
> > wrote:
> > >
> > > Hi,
> > > What's the recommended way to search in Solr (assuming 8.2 is
> > > used) for
> > specific terms/phrases/expressions while limiting the search from
> > position perspective.
> > > For example to search only in the first/last 100 words of the
> > > document
> ?
> > >
> > > Is there any built-in functionality for that ?
> > >
> > > Thanks in advance,
> > > Adi
> > >
> > >
> >
>
>
>



Need help with Solr Streaming query

2019-10-16 Thread Prasenjit Sarkar
Hi,


I am facing an issue while working with Solr streaming expressions. I am using 
/export for emitting tuples out of a streaming query. However, when I try to use 
the NOT operator in the Solr query it does not work. The same works with /select.


Please find the below query 


top(n=105,search(<collection>,qt="/export",q="-(<field>: *c*)",fl="<fields>",sort="<sortField> asc"),sort="<sortField> asc")



In the above query the NOT operator q="-(<field>: *c*)" is not 
working with /export. However, the same query works when I combine any positive 
search criterion with the NOT expression, like q="-(<field>: *c*) AND 
(<field2>: **)". Can you please help here? Running only a NOT query 
with /export should be a valid use case. I have also checked the Solr logs and 
found no errors when running the NOT query. The query just does not return any 
value, and it comes back with no results very fast.





Regards,
Prasenjit Sarkar






Re: Position search

2019-10-16 Thread Erick Erickson
Three things off the top of my head, in order of how long it’d take to 
implement:

***
If it’s _always_ some distance from the start or end, index special beginning 
and end tags, perhaps a nonsense string like BEGINslkdjfhsldkfhsdkfh and 
ENDslakshalskdfhj. Now your searches become phrase queries with slop. Searching 
for “erick in the first 100 words” becomes:

"BEGINslkdjfhsldkfhsdkfh erick”~100

***
Index each term with a payload indicating its position and use a payload 
function to determine whether the term should count as a hit. You’d probably 
have to have a field telling you how long the field is to know what offset “50 
words from the end” is.
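
A rough sketch of that approach (all names here are invented): index a side
field whose analyzer keeps delimited payloads, emit each token as
term|position at index time, and then filter with the payload() function.
Giving payload() a large default keeps documents that lack the term from
matching the range:

  <fieldType name="payload_text" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|" encoder="float"/>
    </analyzer>
  </fieldType>

  indexed field value: "hello|1 thanks|2 for|3 calling|4 ..."

  fq={!frange u=30}payload(body_payloads,thanks,9999)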

***
Get into the low-level Lucene code. After all, if you index the position 
information to support phrase queries, you have exactly the position of the 
word. NOTE: you’d also probably have to index a separate field with the total 
length of the field in it so you know what position “100 words from the end” 
is. I suspect you could make this the most efficient, but I wouldn’t go here 
unless your performance is poor, as it’d take some development work.

Note: I haven’t thought these out very carefully so caveat emptor.

Here’s a place to get started with payloads if you decide to go that route:

https://lucidworks.com/post/solr-payloads/

Best,
Erick


> On Oct 16, 2019, at 5:47 AM, Alexandre Rafalovitch  wrote:
> 
> So are these really text locations or rather actually sections of the
> document. If later, can you parse out sections during indexing?
> 
> Regards,
> Alex
> 
> On Wed, Oct 16, 2019, 3:57 AM Kaminski, Adi, 
> wrote:
> 
>> Hi,
>> Thanks for the responses.
>> 
>> It's a soft boundary which is resulted by dynamic syntax from our
>> application. So may vary from different user searches, one user can search
>> some "word1" in starting 30 words, and another can search "word2" in
>> starting 10 words. The use case is to match some terms/phrase in specific
>> document places in order to identify scripts/specific word occurrences.
>> 
>> So I guess copy field won't work here.
>> 
>> Any other suggestions/thoughts ?
>> Maybe some hidden position filters in native level to limit from start/end
>> of the document ?
>> 
>> Thanks,
>> Adi
>> 
>> -Original Message-
>> From: Tim Casey 
>> Sent: Tuesday, October 15, 2019 11:05 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Position search
>> 
>> If this is about a normalized query, I would put the normalization text
>> into a specific field.  The reason for this is you may want to search the
>> overall text during any form of expansion phase of searching for data.
>> That is, maybe you want to know the context of up to the 120th word.  At
>> least you have both.
>> Also, you may want to note which normalized fields were truncated or were
>> simply too small. This would give some guidance as to the bias of the
>> normalization.  If 95% of the fields were not truncated, there is a chance
>> you are not doing good at normalizing because you have a set of
>> particularly short messages.  So I would expect a small set of side fields
>> remarking this.  This would allow you to carry the measures along with the
>> data.
>> 
>> tim
>> 
>> On Tue, Oct 15, 2019 at 12:19 PM Alexandre Rafalovitch >> 
>> wrote:
>> 
>>> Is the 100 words a hard boundary or a soft one?
>>> 
>>> If it is a hard one (always 100 words), the easiest is probably copy
>>> field and in the (unstored) copy, trim off whatever you don't want to
>>> search. Possibly using regular expressions. Of course, "what's a word"
>>> is an important question here.
>>> 
>>> Similarly, you could do that with Update Request Processors and
>>> clone/process field even before it hits the schema. Then you could
>>> store the extract for highlighting purposes.
>>> 
>>> Regards,
>>>   Alex.
>>> 
>>> On Tue, 15 Oct 2019 at 02:25, Kaminski, Adi 
>>> wrote:
 
 Hi,
 What's the recommended way to search in Solr (assuming 8.2 is used)
 for
>>> specific terms/phrases/expressions while limiting the search from
>>> position perspective.
 For example to search only in the first/last 100 words of the document
>> ?
 
 Is there any built-in functionality for that ?
 
 Thanks in advance,
 Adi
 
 
>>> 
>> 
>> 

Re: Need help with Solr Streaming query

2019-10-16 Thread Erick Erickson
The NOT operator isn’t a Boolean NOT, so it requires some care; Chris Hostetter 
wrote a good blog about that. Try

q=*:* -(<field>:*c*)

The query q=-something really isn’t valid syntax, but some query parsers help 
you out by silently putting the *:* in front of it. That’s not guaranteed 
across all parsers, though.
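
In the streaming expression that would look something like (placeholder names,
since the real ones were stripped from your mail):

  top(n=105,
      search(<collection>, qt="/export", q="*:* -(<field>:*c*)",
             fl="<fields>", sort="<sortField> asc"),
      sort="<sortField> asc")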

Best,
Erick

> On Oct 16, 2019, at 8:14 AM, Prasenjit Sarkar  
> wrote:
> 
> Hi,
> 
> 
> I am facing issue while working with solr streamimg expression. I am using 
> /export for emiting tuples out of streaming query.Howver when I tried to use 
> not operator in solr query it is not working.The same is working with /select.
> 
> 
> Please find the below query 
> 
> 
> top(n=105,search(,qt="/export",q="-(: *c*) 
> ",fl="",sort=" asc"),sort=" asc")
> 
> 
> 
> In the above query the not operator q="-(: *c*)" is not 
> working with /export.However the same query works when I combine any postive 
> search criteria with the not expression like q="-(: *c*) AND 
> (: **)". Can you please help here. As running only not query 
> with /export should be a valid use case. I have also checked the solr logs 
> and found no errors when running the not query.The query is just not 
> returning any value and it is returning with no result very fast.
> 
> 
> 
> 
> 
> Regards,
> Prasenjit Sarkar
> 
> 
> 
> 



Re: Solr-Cloud, join and collection collocation

2019-10-16 Thread Mikhail Khludnev
Note: try adding score=none as a local param. It switches to a different join
algorithm, pulled in from the score-join implementation.
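
Something like this (collection and field names as in your setup):

  fq={!join fromIndex=joined from=id to=join_id score=none}vals:42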

On Wed, Oct 16, 2019 at 11:37 AM Nicolas Paris 
wrote:

> Sadly, the join performances are poor.
> The joined collection is 12M documents, and the performances are 6k ms
> versus 60ms when I compare to the denormalized field.
>
> Apparently, the performances does not change when the filter on the
> joined collection is changed. It is still 6k ms when the subset is 12M
> or 1 document in size. So the performance of join looks correlated to
> size of joined collection and not the kind of filter applied to it.
>
> I will explore the streaming expressions
>
> On Wed, Oct 16, 2019 at 08:00:43AM +0200, Nicolas Paris wrote:
> > > You can certainly replicate the joined collection to every shard. It
> > > must fit in one shard and a replica of that shard must be co-located
> > > with every replica of the “to” collection.
> >
> > Yes, I found this in the documentation, with a clear example just after
> > this mail. I will test it today. I also read your blog about join
> > performances[1] and I suspect the performance impact of joins will be
> > huge because the joined collection is about 10M documents (only two
> > fields, unique id and an array of longs and a filter applied to the
> > array, join key is 10M unique IDs).
> >
> > > Have you looked at streaming and “streaming expressions"? It does not
> > > have the same problem, although it does have its own limitations.
> >
> > I never tested them, and I am not very comfortable yet with how to test
> > them. Is it possible to mix query parsers and streaming expression in
> > the client call via http parameters - or is streaming expression apply
> > programmatically only ?
> >
> > [1] https://lucidworks.com/post/solr-and-joins/
> >
> > On Tue, Oct 15, 2019 at 07:12:25PM -0400, Erick Erickson wrote:
> > > You can certainly replicate the joined collection to every shard. It
> must fit in one shard and a replica of that shard must be co-located with
> every replica of the “to” collection.
> > >
> > > Have you looked at streaming and “streaming expressions"? It does not
> have the same problem, although it does have its own limitations.
> > >
> > > Best,
> > > Erick
> > >
> > > > On Oct 15, 2019, at 6:58 PM, Nicolas Paris 
> wrote:
> > > >
> > > > Hi
> > > >
> > > > I have several large collections that cannot fit in a standalone solr
> > > > instance. They are split over multiple shards in solr-cloud mode.
> > > >
> > > > Those collections are supposed to be joined to an other collection to
> > > > retrieve subset. Because I am using distributed collections, I am not
> > > > able to use the solr join feature.
> > > >
> > > > For this reason, I denormalize the information by adding the joined
> > > > collection within every collections. Naturally, when I want to update
> > > > the joined collection, I have to update every one of the distributed
> > > > collections.
> > > >
> > > > In standalone mode, I only would have to update the joined
> collection.
> > > >
> > > > I wonder if there is a way to overcome this limitation. For example,
> by
> > > > replicating the joined collection to every shard - or other method I
> am
> > > > ignoring.
> > > >
> > > > Any thought ?
> > > > --
> > > > nicolas
> > >
> >
> > --
> > nicolas
> >
>
> --
> nicolas
>


-- 
Sincerely yours
Mikhail Khludnev


Re: Re: Query on autoGeneratePhraseQueries

2019-10-16 Thread Shubham Goswami
Hi Rohan/Audrey

I have implemented the sow=false property with the eDismax query parser, but
it still does not have any effect on the query: it is still parsed as separate
terms instead of a phrase.

On Tue, Oct 15, 2019 at 8:25 PM Rohan Kasat  wrote:

> Also check ,
> pf , pf2 , pf3
> ps , ps2, ps3 parameters for phrase searches.
>
> Regards,
> Rohan K
>
> On Tue, Oct 15, 2019 at 6:41 AM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
> > I'm not sure how your config file is set up, but I know that the way we do
> > multi-token synonyms is to have the sow (split on whitespace) parameter set
> > to false while using the edismax parser. I'm not sure if this would work
> > with phrase queries, but it might be worth a try!
> >
> > In our config file we do something like this (the XML tags were stripped by
> > the mail archiver; reconstructed with guessed element names, the defaults
> > amount to defType=edismax, tie=1.0, echoParams=explicit, rows=100, the
> > content_en/w3json_en fields, and sow=false):
> >
> >   <requestHandler name="/select" class="solr.SearchHandler">
> >     <lst name="defaults">
> >       <str name="defType">edismax</str>
> >       <float name="tie">1.0</float>
> >       <str name="echoParams">explicit</str>
> >       <int name="rows">100</int>
> >       <str name="df">content_en</str>
> >       <str name="wt">w3json_en</str>
> >       <bool name="sow">false</bool>
> >     </lst>
> >   </requestHandler>
> >
> > You can read a bit about the parameter here:
> >
> https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities/
> >
> > Best,
> > Audrey
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 10/15/19, 5:50 AM, "Shubham Goswami" 
> > wrote:
> >
> > Hi kshitij
> >
> > Thanks for the reply!
> > I tried to debug it and found that the raw query (black company) is parsed
> > as two separate queries, black and company, and the results are returned
> > based on the black query. Instead it should have been parsed as a single
> > phrase query ("black company"), because I am using
> > autoGeneratePhraseQueries.
> > Do you have any idea about this? Please correct me if I am wrong.
> >
> > Thanks
> > Shubham
> >
> > On Tue, Oct 15, 2019 at 1:58 PM kshitij tyagi <
> > kshitij.shopcl...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > Try debugging your solr query and understand how it gets parsed.
> Try
> > using
> > > "debug=true" for the same
> > >
> > > On Tue, Oct 15, 2019 at 12:58 PM Shubham Goswami <
> > > shubham.gosw...@hotwax.co>
> > > wrote:
> > >
> > > > *Hi all,*
> > > >
> > > > I am a beginner to solr framework and I am trying to implement
> > > > *autoGeneratePhraseQueries* property in a fieldtype of
> > > type=text_general, i
> > > > kept the property value as true and restarted the solr server but
> > still
> > > it
> > > > is not taking my two words query like(Black company) as a phrase
> > without
> > > > double quotes and returning the results only for Black.
> > > >
> > > >  Can somebody please help me to understand what am i
> > missing ?
> > > > Following is my Schema.xml file code and i am using solr 7.5
> > version.
> > > > (fieldType reconstructed; the tags were stripped by the mail archiver,
> > > > and the tokenizer/filter classes are assumed to be the stock
> > > > text_general ones)
> > > > <fieldType name="text_general" class="solr.TextField"
> > > >     positionIncrementGap="100" multiValued="true"
> > > >     autoGeneratePhraseQueries="true">
> > > >   <analyzer type="index">
> > > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > > >     <filter class="solr.StopFilterFactory" words="stopwords.txt"
> > > >             ignoreCase="true"/>
> > > >     <filter class="solr.LowerCaseFilterFactory"/>
> > > >   </analyzer>
> > > >   <analyzer type="query">
> > > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > > >     <filter class="solr.StopFilterFactory" words="stopwords.txt"
> > > >             ignoreCase="true"/>
> > > >     <filter class="solr.SynonymGraphFilterFactory" expand="true"
> > > >             ignoreCase="true" synonyms="synonyms.txt"/>
> > > >     <filter class="solr.LowerCaseFilterFactory"/>
> > > >   </analyzer>
> > > > </fieldType>
> > > >
> > > >
> > > > --
> > > > *Thanks & Regards*
> > > > Shubham Goswami
> > > > Enterprise Software Engineer
> > > > *HotWax Systems*
> > > > *Enterprise open source experts*
> > > > cell: +91-7803886288
> > > > office: 0731-409-3684
> > > >
> >
> http://www.hotwaxsystems.com
> > > >
> > >
> >
> >
> > --
> > *Thanks & Regards*
> > Shubham Goswami
> > Enterprise Software Engineer
> > *HotWax Systems*
> > *Enterprise open source experts*
> > cell: +91-7803886288
> > office: 0731-409-3684
> >
> >
> http://www.hotwaxsystems.com
> >
> >
> > --
>
> *Regards,Rohan Kasat*
>


-- 
*Thanks & Regards*
Shubham Goswami
Enterprise Software Engineer
*HotWax Systems*
*Enterprise open source experts*
cell: +91-7803886288
office: 0731-409-3684
http://www.hotwaxsystems.com


Re: Query on autoGeneratePhraseQueries

2019-10-16 Thread Shawn Heisey

On 10/16/2019 7:14 AM, Shubham Goswami wrote:

I have implemented the sow=false property with eDismax Query parser but
still it does not has any effect
on the query as it is still parsing as separate terms instead of phrased
one.


We have seen reports that when sow=false, which is the default setting 
since Solr 7.0, autoGeneratePhraseQueries does not work.  Try setting 
sow=true and see whether you get the results you expect.
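
A quick way to check, assuming edismax and your own field names:

  q=black company&defType=edismax&qf=title&sow=true&debugQuery=true

and then compare the parsedquery entries in the debug section of the response.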


I do not know whether this behavior is a bug or if it is expected.

Thanks,
Shawn


The Visual Guide to Streaming Expressions and Math Expressions

2019-10-16 Thread Joel Bernstein
Hi,

The Visual Guide to Streaming Expressions and Math Expressions is now
complete. It's been published to Github at the following location:

https://github.com/apache/lucene-solr/blob/visual-guide/solr/solr-ref-guide/src/math-expressions.adoc#streaming-expressions-and-math-expressions

The guide will eventually be part of Solr's release when the Ref Guide is
ready to accommodate it. In the meantime it's been designed to be easily
read directly from GitHub.

The guide contains close to 200 visualizations and examples showing how to
use Streaming Expressions and Math Expressions for data analysis and
visualization. The visual guide is also designed to guide users that are
not experts in math in how to apply the functions to analysis and visualize
data.

The new visual data loading feature in Solr 8.3 is also covered in the
guide. This feature should cut down on the time it takes to load CSV files
so that more time can be spent on analysis and visualization.

https://github.com/apache/lucene-solr/blob/visual-guide/solr/solr-ref-guide/src/loading.adoc#loading-data

Joel Bernstein


Re: The Visual Guide to Streaming Expressions and Math Expressions

2019-10-16 Thread Pratik Patel
Hi Joel,

Looks like this is going to be very helpful, thank you! I am wondering
whether the visualizations are generated through a third-party library or
whether it is something which would be part of the Solr distribution?
https://github.com/apache/lucene-solr/blob/visual-guide/solr/solr-ref-guide/src/visualization.adoc#visualization


Thanks,
Pratik


On Wed, Oct 16, 2019 at 10:54 AM Joel Bernstein  wrote:

> Hi,
>
> The Visual Guide to Streaming Expressions and Math Expressions is now
> complete. It's been published to Github at the following location:
>
>
> https://github.com/apache/lucene-solr/blob/visual-guide/solr/solr-ref-guide/src/math-expressions.adoc#streaming-expressions-and-math-expressions
>
> The guide will eventually be part of Solr's release when the RefGuide is
> ready to accommodate it. In the meantime its been designed to be easily
> read directly from Github.
>
> The guide contains close to 200 visualizations and examples showing how to
> use Streaming Expressions and Math Expressions for data analysis and
> visualization. The visual guide is also designed to guide users that are
> not experts in math in how to apply the functions to analysis and visualize
> data.
>
> The new visual data loading feature in Solr 8.3 is also covered in the
> guide. This feature should cut down on the time it takes to load CSV files
> so that more time can be spent on analysis and visualization.
>
>
> https://github.com/apache/lucene-solr/blob/visual-guide/solr/solr-ref-guide/src/loading.adoc#loading-data
>
> Joel Bernstein
>


Do backups of collections need to be taken on the Leader?

2019-10-16 Thread Koen De Groote
I'm trying to restore a couple of collections, and one keeps failing. It
happens to be the only one whose leader isn't on the host that the backup
was taken from.


The backup was done on server1, for all collections.

For this collection that is failing, the Leader was on server2. All other
collections had their leader on server1. All collections had 1 replica, on
the other server.

I would think that having the replica there would be enough to perform a
restore.

Or does the backup need to happen on the actual leader?
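
For reference, the calls look like this (names and the location path are
illustrative; note the location must be on storage every node can reach):

  /admin/collections?action=BACKUP&collection=mycoll&name=mycoll_backup&location=/backups
  /admin/collections?action=RESTORE&collection=mycoll_restored&name=mycoll_backup&location=/backups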

Kind regards,
Koen De Groote


Re: The Visual Guide to Streaming Expressions and Math Expressions

2019-10-16 Thread Joel Bernstein
Hi Pratik,

The visualizations are all done using Apache Zeppelin and the Zeppelin-Solr
interpreter. The getting-started part of the user guide provides links for
Zeppelin-Solr. The install process is pretty quick. This is all open
source, freely available software. It's possible that Zeppelin-Solr can be
incorporated into the Solr code eventually, but the test frameworks are
quite different. I think some simple scripts can be included with Solr
to automate the downloads for Zeppelin and Zeppelin-Solr.

Joel Bernstein
http://joelsolr.blogspot.com/


On Wed, Oct 16, 2019 at 11:27 AM Pratik Patel  wrote:

> Hi Joel,
>
> Looks like this is going to be very helpful, thank you! I am wondering
> whether the visualizations are generated through third party library or is
> it something which would be part of solr distribution?
>
> https://github.com/apache/lucene-solr/blob/visual-guide/solr/solr-ref-guide/src/visualization.adoc#visualization
>
>
> Thanks,
> Pratik
>
>
> On Wed, Oct 16, 2019 at 10:54 AM Joel Bernstein 
> wrote:
>
> > Hi,
> >
> > The Visual Guide to Streaming Expressions and Math Expressions is now
> > complete. It's been published to Github at the following location:
> >
> >
> >
> https://github.com/apache/lucene-solr/blob/visual-guide/solr/solr-ref-guide/src/math-expressions.adoc#streaming-expressions-and-math-expressions
> >
> > The guide will eventually be part of Solr's release when the RefGuide is
> > ready to accommodate it. In the meantime its been designed to be easily
> > read directly from Github.
> >
> > The guide contains close to 200 visualizations and examples showing how
> to
> > use Streaming Expressions and Math Expressions for data analysis and
> > visualization. The visual guide is also designed to guide users that are
> > not experts in math in how to apply the functions to analysis and
> visualize
> > data.
> >
> > The new visual data loading feature in Solr 8.3 is also covered in the
> > guide. This feature should cut down on the time it takes to load CSV
> files
> > so that more time can be spent on analysis and visualization.
> >
> >
> >
> https://github.com/apache/lucene-solr/blob/visual-guide/solr/solr-ref-guide/src/loading.adoc#loading-data
> >
> > Joel Bernstein
> >
>


Re: Query on autoGeneratePhraseQueries

2019-10-16 Thread Michael Gibney
Going back to the initial question, the wording is a little ambiguous
and it occurs to me that it's possible there's a misunderstanding of what
autoGeneratePhraseQueries does. It really only auto-generates phrase
*subqueries*. To use the example from the initial request, a query like
(black company) would always generate a non-phrase query (respecting mm,
q.op, etc. -- but in any case not a top-level phrase query), regardless of
the setting of autoGeneratePhraseQueries.

autoGeneratePhraseQueries (when set to true) only kicks in (in different
ways depending on analysis chain, and setting of "sow") for a query like
(the black-company manufactures), which would be transformed to something
more like (the "black company" manufactures). The idea is that there's some
extra indication that the two words should be bundled together for purposes
of querying.
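
With debugQuery=true, and assuming a field named "text" with
autoGeneratePhraseQueries=true, you would expect parsed queries roughly like:

  q=black-company  =>  parsedquery: text:"black company"        (phrase subquery)
  q=black company  =>  parsedquery: +text:black +text:company   (no phrase)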

If you want to auto-generate a top-level phrase query, some other approach
would be called for.

Apologies if this is obvious and/or not helpful, Shubham!

On Wed, Oct 16, 2019 at 10:10 AM Shawn Heisey  wrote:

> On 10/16/2019 7:14 AM, Shubham Goswami wrote:
> > I have implemented the sow=false property with eDismax Query parser but
> > still it does not has any effect
> > on the query as it is still parsing as separate terms instead of phrased
> > one.
>
> We have seen reports that when sow=false, which is the default setting
> since Solr 7.0, autoGeneratePhraseQueries does not work.  Try setting
> sow=true and see whether you get the results you expect.
>
> I do not know whether this behavior is a bug or if it is expected.
>
> Thanks,
> Shawn
>


Re: Position search

2019-10-16 Thread Alexandre Rafalovitch
Well, after some digging and trying to recall things:
1) XMLParser allows specifying a query in a different way from normal
query parameters:
https://lucene.apache.org/solr/guide/8_1/other-parsers.html#xml-query-parser
2) SpanFirst allows anchoring the search to the start of the text and
providing the initial number of tokens to search within. It is not well
documented, but apparently somebody did some tests:
https://coding-art.blogspot.com/2016/05/apache-solr-xml-query-parser.html
3) SpanFirst is actually a simpler use case of a more general matcher
(SpanPositionRangeQuery)
4) SpanPositionRangeQuery is not yet exposed in Solr, but will be in
8.3: https://issues.apache.org/jira/browse/SOLR-13663

So, I would test your example with XMLParser and SpanFirst (perhaps on the
latest 8.x Solr). If that works, you have an approach for at least the
initial-X-words query and know you have an easy upgrade when 8.3 is out
(soon). Alternatively, you can play with SpanFirst and a reversal of the
field.
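Untested, but for the earlier "thanks NEAR support in the first 30 words"
example the XML form would be roughly (assuming the field is named "text"):

  q={!xmlparser}
    <SpanFirst end="30">
      <SpanNear slop="10" inOrder="false">
        <SpanTerm fieldName="text">thanks</SpanTerm>
        <SpanTerm fieldName="text">support</SpanTerm>
      </SpanNear>
    </SpanFirst>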

Regards,
   Alex.
P.S. Also, SpanFirst apparently boosts matches early in the text
higher than those later. That's in the mailing list archive
discussions, which you can search on the web. E.g.
https://lists.apache.org/thread.html/014db9dcef44a8f9641600d19cfaa528f33bac676b7ac68903537b75@%3Csolr-user.lucene.apache.org%3E

On Wed, 16 Oct 2019 at 08:17, Kaminski, Adi  wrote:
>
> Hi,
> These are really text positions.
> For example I have a document: "hello thanks for calling the support how can 
> I help you"
>
> And in the application I would like to search for documents that match 
> "thanks" NEAR "support" only in first 30 words of the document (greeting part 
> for example), and not in the middle/end part of the document.
>
> Regards,
> Adi
>
> -Original Message-
> From: Alexandre Rafalovitch 
> Sent: Wednesday, October 16, 2019 12:48 PM
> To: solr-user 
> Subject: Re: Position search
>
> So are these really text locations or rather actually sections of the 
> document. If later, can you parse out sections during indexing?
>
> Regards,
>  Alex
>
> On Wed, Oct 16, 2019, 3:57 AM Kaminski, Adi, 
> wrote:
>
> > Hi,
> > Thanks for the responses.
> >
> > It's a soft boundary which is resulted by dynamic syntax from our
> > application. So may vary from different user searches, one user can
> > search some "word1" in starting 30 words, and another can search
> > "word2" in starting 10 words. The use case is to match some
> > terms/phrase in specific document places in order to identify 
> > scripts/specific word occurrences.
> >
> > So I guess copy field won't work here.
> >
> > Any other suggestions/thoughts ?
> > Maybe some hidden position filters in native level to limit from
> > start/end of the document ?
> >
> > Thanks,
> > Adi
> >
> > -Original Message-
> > From: Tim Casey 
> > Sent: Tuesday, October 15, 2019 11:05 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Position search
> >
> > If this is about a normalized query, I would put the normalization
> > text into a specific field.  The reason for this is you may want to
> > search the overall text during any form of expansion phase of searching for 
> > data.
> > That is, maybe you want to know the context of up to the 120th word.
> > At least you have both.
> > Also, you may want to note which normalized fields were truncated or
> > were simply too small. This would give some guidance as to the bias of
> > the normalization.  If 95% of the fields were not truncated, there is
> > a chance you are not doing good at normalizing because you have a set
> > of particularly short messages.  So I would expect a small set of side
> > fields remarking this.  This would allow you to carry the measures
> > along with the data.
> >
> > tim
> >
> > On Tue, Oct 15, 2019 at 12:19 PM Alexandre Rafalovitch
> >  > >
> > wrote:
> >
> > > Is the 100 words a hard boundary or a soft one?
> > >
> > > If it is a hard one (always 100 words), the easiest is probably copy
> > > field and in the (unstored) copy, trim off whatever you don't want
> > > to search. Possibly using regular expressions. Of course, "what's a word"
> > > is an important question here.
> > >
> > > Similarly, you could do that with Update Request Processors and
> > > clone/process field even before it hits the schema. Then you could
> > > store the extract for highlighting purposes.
> > >
> > > Regards,
> > >Alex.
> > >
> > > On Tue, 15 Oct 2019 at 02:25, Kaminski, Adi
> > > 
> > > wrote:
> > > >
> > > > Hi,
> > > > What's the recommended way to search in Solr (assuming 8.2 is
> > > > used) for
> > > specific terms/phrases/expressions while limiting the search from
> > > position perspective.
> > > > For example to search only in the first/last 100 words of the
> > > > document
> > ?
> > > >
> > > > Is there any built-in functionality for that ?
> > > >
> > > > Thanks in advance,
> > > > Adi
> > > >
> > > >
> > > > This electronic message may contain proprietary and confidential
> > > information of Verint Sy

Re: Position search

2019-10-16 Thread Tim Casey
Adi,

If you are looking for something specific, you might want to try a
different approach.  Before you search 'the end of a document', you
might think about segmenting the document and searching specific
segments.  At the end of a lot of things, like email, there will be
signatures.  Those are fairly standard language: mostly the same in
meaning, though they differ in specific wording.  They are a common
segment.

If you are searching something like research papers, then you would be
thinking about the conclusion (?), bibliography (?).  The exact set does
not matter, but there will be specific segments.

I think you will find the last N tokens of a document produce some odd
categories within the search results.  I might guess you have a
different purpose in mind.  Either way, you would likely do better to
segment what you are searching, as sketched below.
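
As an illustration only (the segment and field names here are invented,
not a prescribed schema), each document could be indexed with one field
per segment:

  {
    "id": "call-1",
    "greeting_t": "hello thanks for calling the support",
    "body_t": "... middle of the conversation ...",
    "signature_t": "regards adi"
  }

A search restricted to the greeting then becomes an ordinary fielded
query, e.g. q=greeting_t:(thanks AND support), with no positional
tricks required.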

tim

On Mon, Oct 14, 2019 at 11:25 PM Kaminski, Adi 
wrote:

> Hi,
> What's the recommended way to search in Solr (assuming 8.2 is used) for
> specific terms/phrases/expressions while limiting the search from position
> perspective.
> For example to search only in the first/last 100 words of the document ?
>
> Is there any built-in functionality for that ?
>
> Thanks in advance,
> Adi


Re: solr 8.1.1 many times slower returning query results than solr 4.10.4 or solr 6.5.1

2019-10-16 Thread Russell Bahr
Hi Shawn,
Just checking to see if you saw my reply and had any feedback. Thank you again 
for your help. It is much appreciated.
Thank you,
Russ


From: Russell Bahr 
Date: Tuesday, October 15, 2019 at 11:50 AM
To: "solr-user@lucene.apache.org" 
Subject: Re: solr 8.1.1 many times slower returning query results than solr 
4.10.4 or solr 6.5.1

Hi Shawn,
I included the wrong file for solr4 and did not realize until you pointed out 
the heap size.  The correct file that is setting the Java environment is "Solr 
4 tomcat setenv" I have uploaded that to the shared folder along with the 
requested screenshots "Solr 4 top screenshot","Solr 6 top screenshot","Solr 8 
top screenshot".

I have also uploaded the solr.log, solr_gc.log, and solr_slow_requests.log from 
a 2-hour period during which I was running the email load test against the 
Solr 8 implementation, where the queued tasks take too long to complete.

solr_gc.log, solr_gc.log.1, solr_gc.log.2, solr.log, solr.log.10, solr.log.6, 
solr.log.7, solr.log.8, solr.log.9, solr_slow_requests.log

Let me know if there is any other information that I can provide that may help 
to work through this.

Manzama
a MODERN GOVERNANCE company

Russell Bahr
Lead Infrastructure Engineer

USA & CAN Office: +1 (541) 306 3271
USA & CAN Support: +1 (541) 706 9393
UK Office & Support: +44 (0)203 282 1633
AUS Office & Support: +61 (0) 2 8417 2339

543 NW York Drive, Suite 100, Bend, OR 97703



On Tue, Oct 15, 2019 at 2:28 AM Shawn Heisey  wrote:
On 10/14/2019 1:36 PM, Russell Bahr wrote:
> Backend replacement of solr4 and hopefully Frontend replacement as well.
> solr-spec 8.1.1
> lucene-spec 8.1.1
> Runtime Oracle Corporation OpenJDK 64-Bit Server VM 12 12+33
> 1 collection, 6 shards, 5 replicas per shard, 17,919,889 current documents
> (35 days' worth of documents) - indexing new documents regularly throughout
> the day, deleting aged-out documents nightly.

Java 12 is not recommended.  It is one of the "new feature" releases
that only gets 6 months of support.  We would recommend Java 8 or Java
11.  These are the versions with long term support.  Probably a good
thing to be using OpenJDK, as the official Oracle Java now requires
paying for a license.

Solr 8 ships with settings that enable the G1GC collector instead of
CMS, because CMS is deprecated and will disappear in a future Java
version.  We have seen problems with this when the system is
misconfigured as far as heap size.  When the system is properly sized,
G1 tends to do better than CMS, but when the heap is too large or too
small, it has a tendency to amplify garbage collection problems in
comparison.

Looking at your solr.in.sh files for each version ... the 
Solr 4 install
appears to be setting the heap to 512 megabytes.  This is definitely not
enough for millions of documents, and if this is what the heap size is
actually set to, would almost certainly run into memory errors
frequently and have absolutely terrible performance.  But you are saying
that it works well, so I don't think the heap is actually set to 512
megabytes.  Maybe the bin/solr script has been modified directly to set
the memory size instead of setting it in solr.in.sh where it 
should be set.
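
For reference, the heap is normally set in solr.in.sh along these lines
(the values are illustrative only, not a recommendation):

  # solr.in.sh
  SOLR_HEAP="16g"
  # or, equivalently:
  # SOLR_JAVA_MEM="-Xms16g -Xmx16g"

If neither variable is set there, bin/solr falls back to its built-in
512m default, which would match the 512-megabyte symptom described
above.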

Solr 6 has a heap size of just under 27 gigabytes.  Solr 8 has a heap
size of just under 8 gigabytes.  With millions of documents, it is
likely that 8GB of heap is not quite big enough.

For each of your installations (Solr 4, Solr 6, and Solr 8) can you
provide the screenshot described at this wiki page?

https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems#SolrPerformanceProblems-Askingforhelponamemory/performanceissue

It would also be helpful to see the GC logs from Solr 8.  We would need
at least one GC log, making sure that they cover at least a few hours,
including the timeframe when the slow indexing and slow queries were
observed.

Thanks,
Shawn


Re: Solr-Cloud, join and collection collocation

2019-10-16 Thread Nicolas Paris
> Note: try adding score=none as a local param. It turns on another join
> algorithm, driven from the "from" side of the join.

Indeed, with the score=none local param the query time is correlated
with the size of the subset selected from the joined collection. For a
subset of 100k documents the query time is 1 second, it is 4 seconds for
1M, and I get a client timeout (15 sec) for anything above 5M.
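
For reference, the query under discussion has roughly this shape (the
collection, field, and filter names are placeholders, not the actual
schema):

  q={!join fromIndex=joined from=join_key to=join_key score=none}array_field:42

where "joined" is the small side collection and join_key is the key
shared by both collections.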

On this basis I guess some redesign will be necessary to find the right
middle ground between normalization and de-normalization for the
insertion/selection speed trade-off.
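
Regarding the streaming-expressions alternative raised in the quoted
discussion below: they are sent as a plain HTTP parameter to the
/stream handler, so no programmatic client is required. A sketch of an
equivalent join, with collection and field names again invented for the
example:

  curl --data-urlencode 'expr=innerJoin(
    search(big_collection, q="*:*", fl="id,join_key",
           sort="join_key asc", qt="/export"),
    search(joined, q="array_field:42", fl="join_key",
           sort="join_key asc", qt="/export"),
    on="join_key")' http://localhost:8983/solr/big_collection/stream

Both input streams must be sorted on the join key, and the /export
handler requires the fl fields to have docValues.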

Thanks



On Wed, Oct 16, 2019 at 03:32:33PM +0300, Mikhail Khludnev wrote:
> Note: try adding score=none as a local param. It turns on another join
> algorithm, driven from the "from" side of the join.
> 
> On Wed, Oct 16, 2019 at 11:37 AM Nicolas Paris 
> wrote:
> 
> > Sadly, the join performance is poor.
> > The joined collection is 12M documents, and the query time is 6k ms
> > versus 60 ms when I compare to the denormalized field.
> >
> > Apparently, the performance does not change when the filter on the
> > joined collection is changed. It is still 6k ms whether the subset is
> > 12M or 1 document in size. So the join performance looks correlated to
> > the size of the joined collection and not the kind of filter applied
> > to it.
> >
> > I will explore the streaming expressions
> >
> > On Wed, Oct 16, 2019 at 08:00:43AM +0200, Nicolas Paris wrote:
> > > > You can certainly replicate the joined collection to every shard. It
> > > > must fit in one shard and a replica of that shard must be co-located
> > > > with every replica of the “to” collection.
> > >
> > > Yes, I found this in the documentation, with a clear example just after
> > > this mail. I will test it today. I also read your blog about join
> > > performances[1] and I suspect the performance impact of joins will be
> > > huge because the joined collection is about 10M documents (only two
> > > fields, unique id and an array of longs and a filter applied to the
> > > array, join key is 10M unique IDs).
> > >
> > > > Have you looked at streaming and “streaming expressions"? It does not
> > > > have the same problem, although it does have its own limitations.
> > >
> > > I never tested them, and I am not very comfortable yet with how to
> > > test them. Is it possible to mix query parsers and streaming
> > > expressions in the client call via HTTP parameters - or are streaming
> > > expressions applied programmatically only?
> > >
> > > [1] https://lucidworks.com/post/solr-and-joins/
> > >
> > > On Tue, Oct 15, 2019 at 07:12:25PM -0400, Erick Erickson wrote:
> > > > You can certainly replicate the joined collection to every shard. It
> > must fit in one shard and a replica of that shard must be co-located with
> > every replica of the “to” collection.
> > > >
> > > > Have you looked at streaming and “streaming expressions"? It does not
> > have the same problem, although it does have its own limitations.
> > > >
> > > > Best,
> > > > Erick
> > > >
> > > > > On Oct 15, 2019, at 6:58 PM, Nicolas Paris 
> > wrote:
> > > > >
> > > > > Hi
> > > > >
> > > > > I have several large collections that cannot fit in a standalone solr
> > > > > instance. They are split over multiple shards in solr-cloud mode.
> > > > >
> > > > > Those collections are supposed to be joined to an other collection to
> > > > > retrieve subset. Because I am using distributed collections, I am not
> > > > > able to use the solr join feature.
> > > > >
> > > > > For this reason, I denormalize the information by adding the joined
> > > > > collection within every collections. Naturally, when I want to update
> > > > > the joined collection, I have to update every one of the distributed
> > > > > collections.
> > > > >
> > > > > In standalone mode, I only would have to update the joined
> > collection.
> > > > >
> > > > > I wonder if there is a way to overcome this limitation. For example,
> > by
> > > > > replicating the joined collection to every shard - or other method I
> > am
> > > > > ignoring.
> > > > >
> > > > > Any thought ?
> > > > > --
> > > > > nicolas
> > > >
> > >
> > > --
> > > nicolas
> > >
> >
> > --
> > nicolas
> >
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev

-- 
nicolas


RE: Highlighting Solr 8

2019-10-16 Thread Eric Allen
Thanks for the reply.

Currently we are migrating from Solr 4 to Solr 8. Under Solr 4 we wrote our 
own highlighter because the provided one was too slow for our documents.

We deal with many large documents, but we have full term vectors already.  So 
as I understand it from my reading of the code the unified highlighter should 
be fast even on these large documents.

The concern about alternate fields was that, if the highlighter turned out to 
be slow, we could return highlights from just one field when they existed and, 
if not, then highlight the other fields.

From my research I'm leaning towards returning highlights from all the fields 
we are interested in, because I feel it will be fast.

Eric Allen - Software Developer, NetDocuments
eric.al...@netdocuments.com | O: 801.989.9691 | C: 801.989.9691

-Original Message-
From: sasarun  
Sent: Wednesday, October 16, 2019 2:45 AM
To: solr-user@lucene.apache.org
Subject: Re: Highlighting Solr 8

Hi Eric,

The unified highlighter does not have an option to provide an alternate field 
when highlighting. That option (hl.alternateField) is available with the 
original and FastVector highlighters. As indicated in the Solr documentation, 
unified is the recommended highlighting method and meets most use cases. 
Please do share more details in case you are facing any specific issue with 
highlighting.
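
For reference, a minimal request along those lines might look like this
(the field names are illustrative, not from the actual schema):

  q=contract&hl=true&hl.method=unified&hl.fl=title,body&hl.snippets=2

hl.method selects the implementation (unified, original, or fastVector),
and hl.fl lists the fields to return highlights for.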

Thanks,

Arun 




--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Solr JVM performance challenge with Updates

2019-10-16 Thread GaneshSe
Any help on this is much appreciated.



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Solr JVM Tuning - 7.2.1

2019-10-16 Thread Sethuraman, Ganesh
Hi,

We are using Solr 7.2.1 with 2 nodes (245GB RAM each) and a 3-node ZK cluster 
in production. We are using Java 8 with default GC settings (with NewRatio=3) 
and a 15 GB max heap, changed to 16 GB after the performance issue mentioned 
below.

We have about 90 collections (~8 shards each), about 50 of which are actively 
used. About 3 collections are actively updated using SolrJ update queries with 
a soft commit of 30 secs. Other collections go through update handler batch 
CSV updates.

We had read timeout/slowness issues when young generation usage peaked, as you 
can see in the GC report below from the problem window. After that we 
increased the overall heap size to 16 GB (from 15 GB) and, as you can see, we 
did not see the read issue any more.

  1.  I see our heap is very large, and we are seeing high usage of the young 
generation. Is this due to the SolrJ updates (concurrent single-record updates)?
  2.  Should we change NewRatio to 2 (so that the young generation gets more 
room), given that we are seeing only 58% usage of the old gen? (See the sketch 
after this list.)
  3.  We are also seeing a behavior where, if we restart Solr in production 
while updates are happening, one server starts up but does not have all 
collections and shards up; when we restart both servers, it comes up fine. Is 
this behavior also related to the SolrJ updates?
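
For reference, a NewRatio change of the kind raised in question 2 would
normally go through GC_TUNE in solr.in.sh. A sketch only, with values
that must be validated against your own GC logs before use:

  GC_TUNE="-XX:NewRatio=2 \
    -XX:SurvivorRatio=4 \
    -XX:+UseConcMarkSweepGC \
    -XX:+CMSParallelRemarkEnabled"

(The stock Solr 7.x settings are CMS-based and include -XX:NewRatio=3;
overriding GC_TUNE replaces the whole default set, so the full default
flag list should be carried over, not just the changed ratio.)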



Problem GC Report  
https://gceasy.io/my-gc-report.jsp?p=YXJjaGl2ZWQvMjAxOS8xMC83Ly0tMDJfc29scl9nYy5sb2cuNi5jdXJyZW50LS0xNC00My01OA==&channel=WEB

No-problem GC report (still seeing high young generation use)  
https://gceasy.io/my-gc-report.jsp?p=YXJjaGl2ZWQvMjAxOS8xMC85Ly0tMDJfX3NvbHJfZ2MubG9nLjIuY3VycmVudC0tMjAtNDQtMjY=&channel=WEB

Any help on the above questions is appreciated.

Thanks & Regards,

Ganesh






Help with Stream Graph

2019-10-16 Thread Rajeswari Natarajan
Hi,

Since the stream graph query for my use case didn't work, I took the
data from a Solr source code test and also copied the schema and
solrconfig.xml from the Solr 7.6 source code. I had to substitute a few
variables.

I posted the data below:

curl -X POST http://localhost:8983/solr/knr/update -H
'Content-type:text/csv' -d '
id, basket_s, product_s, prics_f
90,basket1,product1,20
91,basket1,product3,30
92,basket1,product5,1
93,basket2,product1,2
94,basket2,product6,5
95,basket2,product7,10
96,basket3,product4,20
97,basket3,product3,10
98,basket3,product1,10
99,basket4,product4,40
110,basket4,product3,10
111,basket4,product1,10'
After this I committed and made sure the data got published to Solr.

curl --data-urlencode
'expr=gatherNodes(knr,walk="product1->product_s",gather="basket_s")'
http://localhost:8983/solr/knr/stream

{

  "result-set":{

"docs":[{

"EOF":true,

"RESPONSE_TIME":4}]}}


And if I add scatter="branches, leaves", there is one doc:



curl --data-urlencode
'expr=gatherNodes(knr,walk="product1->product_s",gather="basket_s",scatter="branches,
leaves")' http://localhost:8983/solr/knr/stream

{

  "result-set":{

"docs":[{

"node":"product1",

"collection":"knr",

"field":"node",

"level":0}

  ,{

"EOF":true,

"RESPONSE_TIME":4}]}}




The data above is what I got from
https://github.com/apache/lucene-solr/blob/branch_7_6/solr/solrj/src/test/org/apache/solr/client/solrj/io/graph/GraphExpressionTest.java#L271



According to this test, 4 docs are expected.


I am not sure what I am missing. Any pointers, please?


Thank you,

Rajeswari


Re: Query on autoGeneratePhraseQueries

2019-10-16 Thread Shubham Goswami
Hi Michael/Shawn

Thanks for the response.
Michael, you are right: autoGeneratePhraseQueries works for a query like
black-company with the setting sow=true.
Thanks for your great support.

Best
Shubham
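
For anyone following along, the difference is easy to see with
debug=query (the field name "text" is an assumption about the schema,
and the parsed forms below are approximate):

  q=black-company&sow=true&debug=query

  autoGeneratePhraseQueries="true"  -> parsedquery ~ text:"black company"
  autoGeneratePhraseQueries="false" -> parsedquery ~ (text:black text:company)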

On Wed, Oct 16, 2019 at 9:22 PM Michael Gibney 
wrote:

> Going back to the initial question, the wording is a little ambiguous
> and it occurs to me that it's possible there's a misunderstanding of what
> autoGeneratePhraseQueries does. It really only auto-generates phrase
> *subqueries*. To use the example from the initial request, a query like
> (black company) would always generate a non-phrase query (respecting mm,
> q.op, etc. -- but in any case not a top-level phrase query), regardless of
> the setting of autoGeneratePhraseQueries.
>
> autoGeneratePhraseQueries (when set to true) only kicks in (in different
> ways depending on analysis chain, and setting of "sow") for a query like
> (the black-company manufactures), which would be transformed to something
> more like (the "black company" manufactures). The idea is that there's some
> extra indication that the two words should be bundled together for purposes
> of querying.
>
> If you want to auto-generate a top-level phrase query, some other approach
> would be called for.
>
> Apologies if this is obvious and/or not helpful, Shubham!
>
> On Wed, Oct 16, 2019 at 10:10 AM Shawn Heisey  wrote:
>
> > On 10/16/2019 7:14 AM, Shubham Goswami wrote:
> > > I have implemented the sow=false property with eDismax Query parser but
> > > still it does not has any effect
> > > on the query as it is still parsing as separate terms instead of
> phrased
> > > one.
> >
> > We have seen reports that when sow=false, which is the default setting
> > since Solr 7.0, autoGeneratePhraseQueries does not work.  Try setting
> > sow=true and see whether you get the results you expect.
> >
> > I do not know whether this behavior is a bug or if it is expected.
> >
> > Thanks,
> > Shawn
> >
>


-- 
*Thanks & Regards*
Shubham Goswami
Enterprise Software Engineer
*HotWax Systems*
*Enterprise open source experts*
cell: +91-7803886288
office: 0731-409-3684
http://www.hotwaxsystems.com


Query regarding positionIncrementGap

2019-10-16 Thread Shubham Goswami
Hi Community

I am a beginner in Solr and I am trying to understand how
positionIncrementGap works, but I am still not clear how exactly it
applies to phrase queries and general queries.
   Can somebody please help me to understand this with the help of an
example?
Any help will be appreciated. Thanks in advance.
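
For reference, positionIncrementGap is an attribute on a field type in
the schema, e.g. (a generic illustration, not any particular schema):

  <fieldType name="text_general" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer>...</analyzer>
  </fieldType>
  <field name="comments" type="text_general" indexed="true"
         stored="true" multiValued="true"/>

For a multiValued field, the gap is added between the last token of one
value and the first token of the next, so a phrase query spanning two
adjacent values will not match unless its slop exceeds the gap. For
single-valued fields the setting has no visible effect.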

-- 
*Thanks & Regards*
Shubham Goswami
Enterprise Software Engineer
*HotWax Systems*
*Enterprise open source experts*
cell: +91-7803886288
office: 0731-409-3684
http://www.hotwaxsystems.com