Re: IndexMergeTool to adhere to TieredMergePolicyFactory settings

2017-12-01 Thread Zheng Lin Edwin Yeo
Hi,

Is it possible for the merging not to merge to one large segment?

Regards,
Edwin


On 28 November 2017 at 12:05, Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> I'm currently using Solr 6.5.1.
>
> In IndexMergeTool.java, I found that there is this line, which sets
> maxNumSegments to 1:
>
> writer.forceMerge(1);
>
>
> For this, does it mean that there will always be only 1 segment after the
> merging? From what I see, that seems to be the case.
>
> Is there any way we can allow the merge to produce multiple segments,
> with each segment of a certain size? For example, if we want each segment
> to be 20GB, which is what is set in the TieredMergePolicyFactory?
>
> <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
>   <int name="maxMergeAtOnce">10</int>
>   <int name="segmentsPerTier">10</int>
>   <double name="maxMergedSegmentMB">20480</double>
> </mergePolicyFactory>
>
>
> Regards,
> Edwin
>


Solr score use cases

2017-12-01 Thread Faraz Fallahi
Hi

A simple question: what are the most common use cases for the Solr score of
documents retrieved after firing queries?
I don't have a real understanding of its purpose at the moment.

Thx for helping


Range facet over currency field

2017-12-01 Thread Aman Tandon
Hi,

I have a question about how to do a range facet in a different currency.

I have indexed the price data in USD in the field price_usd_c, and I have a
currency.xml which is generated by a process.

If I want to do a range facet on the field price_usd_c in the Euro currency,
how can I do it, and what is the syntax? Is there any way to do so? If so,
kindly help.

Regards,
Aman


Passing Metadata from an RTF-file via TIKA to SOLR ...

2017-12-01 Thread Jan . Christopher . Schluchtmann-EXT
Hi there!
I am quite new to Lucene/Solr/Tika, etc., so I would appreciate your help
with the following matter.


I have an RTF document that I want to index in Solr using Tika.
The RTF indexing works in general, but since I changed the Solr schema,
the indexer complains about missing mandatory fields, like "module-id".
The RTF file is generated by me, and I added the metadata fields to the
"userprops" section of the RTF file (see below) -- so Tika should be able
to read them and provide them.

The problem is: I don't know HOW or WHERE Tika provides this metadata, so
I don't know how to access it. As a result, I don't know how I can map it
to the respective Solr fields, like "module-id", that are mandatory in my
Solr schema.

Can someone give me a hint, please? 
I am running out of ideas here ... :-/




{\rtf1\fbidis\ansi\ansicpg1252\deff0\deflang1031{\fonttbl{\f0\fnil\fcharset0 
Arial;}}
{\colortbl ;\red0\green0\blue0;}
{\userprops
{\propname module-id}\proptype30{\staticval 000ba8a6}
}
}


Mit freundlichen Grüßen/ With kind regards

Jan Schluchtmann
Systems Engineering Cluster Instruments
VW Group
Continental Automotive GmbH
Division Interior
ID S3 RM
VDO-Strasse 1, 64832 Babenhausen, Germany

Telefon/Phone: +49 6073 12-4346
Telefax: +49 6073 12-79-4346

Re: Solr score use cases

2017-12-01 Thread Aman Tandon
Hi Faraz,

The Solr score, which you can retrieve by adding it to the fl parameter,
can be helpful for the following:

1) Search relevance ranking: seeing what score Solr has given to the top
and second documents; with debug=true you can better understand what is
causing that score.

2) You can use a function query to multiply the score with some feature,
e.g. a paid-customer score, popularity score, etc., to improve relevance
as per the business needs. See the sketch below.

I can only think of these few points; someone else can shed more light
if I am missing anything. I hope this is what you wanted to know. 😊
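
For illustration, a rough sketch of both points (the query text and the
popularity field are hypothetical):

  # 1) return the score per document and explain how it was computed
  q=laptop&fl=id,title,score&debugQuery=true

  # 2) multiply the score by a function of a popularity field
  q={!boost b=log(sum(popularity,1))}laptop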

Regards,
Aman

On Dec 1, 2017 13:38, "Faraz Fallahi"  wrote:

Hi

A simple question: what are the most common use cases for the solr score of
documents retrieved after firing queries?
I dont have a real understanding of its purpose at the moment.

Thx for helping


Re: Solr score use cases

2017-12-01 Thread Faraz Fallahi
OK, but if I just make a simple query with a "where clause" and sort by a
field, I see no sense in calculating a score, right?

On 01.12.2017 at 16:33, "Aman Tandon" wrote:

> Hi Faraz,
>
> Solr score which you could retrieved by adding in fl parameter could be
> helpful to understand the following:
>
> 1) search relevance ranking: how much score solr has given to the top &
> second top document, and with debug=true you could better understand what
> is causing that score.
>
> 2) You could use the function query to multiply score with some feature
> e.g. paid customers score, popularity score, etc to improve the relevance
> as per the business.
>
> I am able to think these few points only, someone can also put more light
> if I am missing anything. I hope this is what you want to know. 😊
>
> Regards,
> Aman
>
> On Dec 1, 2017 13:38, "Faraz Fallahi" 
> wrote:
>
> Hi
>
> A simple question: what are the most common use cases for the solr score of
> documents retrieved after firing queries?
> I dont have a real understanding of its purpose at the moment.
>
> Thx for helping
>


Re: Solr score use cases

2017-12-01 Thread Faraz Fallahi
Or does the score even get calculated when I sort, or not?

On 01.12.2017 at 4:38 PM, "Faraz Fallahi" <
faraz.fall...@googlemail.com> wrote:

> Oki but If ID Just make an simple query with a "where Claude" and sort by
> a field i See no sense in calculating a score right?
>
> On 01.12.2017 at 16:33, "Aman Tandon" wrote:
>
>> Hi Faraz,
>>
>> Solr score which you could retrieved by adding in fl parameter could be
>> helpful to understand the following:
>>
>> 1) search relevance ranking: how much score solr has given to the top &
>> second top document, and with debug=true you could better understand what
>> is causing that score.
>>
>> 2) You could use the function query to multiply score with some feature
>> e.g. paid customers score, popularity score, etc to improve the relevance
>> as per the business.
>>
>> I am able to think these few points only, someone can also put more light
>> if I am missing anything. I hope this is what you want to know. 😊
>>
>> Regards,
>> Aman
>>
>> On Dec 1, 2017 13:38, "Faraz Fallahi" 
>> wrote:
>>
>> Hi
>>
>> A simple question: what are the most common use cases for the solr score
>> of
>> documents retrieved after firing queries?
>> I dont have a real understanding of its purpose at the moment.
>>
>> Thx for helping
>>
>


JVM GC Issue

2017-12-01 Thread Dominique Bejean
Hi,

We are encountering an issue with GC.

Randomly, nearly once a day, there are consecutive full GCs with no memory
reclaimed.

So the old generation heap usage grows up to the limit.

Solr stops responding and we need to force a restart.


We are using Solr 6.6.1 with an Oracle 1.8 JVM. The JVM settings are the
defaults, with the CMS GC.


Solr is configured as standalone with master/slave replication. Replication
is polled every 10 minutes.


The hardware has 4 vCPUs and 16GB of RAM.

Both Xms and Xmx are set to 6GB.

The core contains 1,200,000 documents, with a lot of fields both stored and
indexed, without docValues enabled.

The core size on the file system is 11GB.


All Solr caches are sized to 1024 with autowarmCount set to 256.


The query rate is low (10 per second). The only memory-consuming features are
sorting and faceting. There are 10 facets with 10 to 2,000 distinct values.


I don't see errors or warnings in the Solr logs about opening or warming new
searchers.

Here is a link to the gceasy report:
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTcvMTIvMS8tLXNvbHJfZ2MubG9nLjAuemlwLS0xMS0xNC0zMA==

Gceasy suggests increasing the heap size, but I do not agree.

Any suggestions about this issue?

Regards.

Dominique
-- 
Dominique Béjean
06 08 46 12 43


Re: Solr LTR plugin - Training

2017-12-01 Thread ilayaraja
Also, when I want to add a phrase match as a feature, as below, it is not
supported:
{
"store" : "tsf",
"name" : "productTitleMatchGuestQuery",
"class" : "org.apache.solr.ltr.feature.SolrFeature",
"params" : { "q" : "{!dismax qf=p_title^1.0}${user_query}" }
  },
{
"store" : "tsf",
"name" : "productTitlePfMatchGuestQuery",
"class" : "org.apache.solr.ltr.feature.SolrFeature",
"params" : *{ "q" : "{!dismax pf=p_title^2.0}${user_query}*" }
  },

I am not sure how to add phrase matching here. I also want to add pf2 and pf3
as features if this works. Thanks



-
--Ilay


Re:LTR model upload

2017-12-01 Thread Christine Poerschke (BLOOMBERG/ QUEEN VIC)
Hi Brian,

Thank you for this question! It sounds like you may be experiencing the issue 
reported in https://issues.apache.org/jira/browse/SOLR-11049 and perhaps the 
same workaround could work for you?

And in the upcoming Solr 7.2 release 
https://issues.apache.org/jira/browse/SOLR-11250 will provide an alternative 
way for handling large models.

Hope that helps.

Regards,

Christine

- Original Message -
From: solr-user@lucene.apache.org
To: solr-user@lucene.apache.org
At: 11/28/17 23:00:11

When I upload, I can see my model when I hit 
"solr/collection_name/schema/model-store/" but when I reload the collection, 
it's gone.

Is there a size limit for LTR models? I have a 1.6mb / 49,000 line long 
lambdamart model (that's what ranklib spit out for me) which I didn't think 
would be a huge problem. I decided to test by cutting down the model size by 
deleting 99% of it and it worked after reload. Does this mean my model is too 
big or do I possibly have a syntax bug somewhere in my model.json?


  *   Brian



Re: JVM GC Issue

2017-12-01 Thread Toke Eskildsen
Dominique Bejean  wrote:
> We are encountering issue with GC.

> Randomly nearly once a day there are consecutive full GC with no memory
> reclaimed.

[... 1.2M docs, Xmx 6GB ...]

> Gceasy suggest to increase heap size, but I do not agree

It does seem strange, with your apparently modest index & workload. Nothing you
say sounds problematic to me, and you have covered the usual culprits:
overlapping searchers, faceting and filterCache.

Is it possible for you to share the solr.log around the two times that memory 
usage peaked? 2017-11-30 17:00-19:00 and 2017-12-01 08:00-12:00.

If you cannot share, please check whether you have excessive traffic around that
time or whether there is a lot of UnInverting going on (triggered by faceting on
non-DocValues String fields). I know your post implies that you have already
done so, so this is more of a sanity check.


- Toke Eskildsen


Request return behavior when trigger DataImportHandler updates

2017-12-01 Thread Nathan Friend
Hello,

When triggering the DataImportHandler to update via an HTTP request, I've 
noticed the handler behaves differently depending on how I specify the request 
parameters.

If I use query parameters, e.g. 
http://localhost:8983/solr/mycore/dataimport?command=full-import, the HTTP 
request is not fulfilled until the import is completed.

However, if I instead make a request to 
http://localhost:8983/solr/mycore/dataimport with a form-data key of 
"command"="full-import", the HTTP request returns immediately. In this case, 
the documentation seems to suggest the best way to detect when the import has 
completed is to poll Solr for status updates until the status switches from 
"busy" back to "idle".

Is this difference in behavior intentional?  I would like to be able to rely on 
the first behavior - to expect that the import will be finished when the HTTP 
call returns.  However, I can't seem to find any documentation on this 
difference; I'd prefer not to have my component rely on what might be a 
convenient bug :).

I'm using Solr 6.5.1.

Thanks!

Nathan



Re: Fwd: solr-security-proxy

2017-12-01 Thread Rick Leir
The default blacklist is qt and stream, because there are examples of nasty
things which can be done using those params. But it seems much wiser to
whitelist just the params your web app needs to use. Am I missing something?
Is there a simpler way to protect a Solr installation which just serves a few
AJAX GETs?

Cheers -- Rick

On November 30, 2017 3:10:14 PM EST, Rick Leir  wrote:
>Hi all
>I have just been looking at solr-security-proxy, which seems to be a
>great little app to put in front of Solr (link below). But would it
>make more sense to use a whitelist of Solr parameters instead of a
>blacklist?
>Thanks
>Rick
>
>https://github.com/dergachev/solr-security-proxy
>
>solr-security-proxy
>Node.js based reverse proxy to make a solr instance read-only,
>rejecting requests that have the potential to modify the solr index.
>--invalidParams   Block these query params (comma separated)  [default:
>"qt,stream"]
>
>
>-- 
>Sorry for being brief. Alternate email is rickleir at yahoo dot com

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

starting SolrCloud nodes

2017-12-01 Thread Steve Pruitt
Thanks for the previous help. I have a ZK ensemble of three nodes running. I
have uploaded the config for my collection and the solr.xml file.
I have Solr installed on three machines.

I think my next steps are:

Start up each Solr instance:  bin/solr start -c -z
"zoo1:2181,zoo2:2181,zoo3:2181"  // I have ZK_HOST set to my ZKs, but the
documentation seems to say I need to provide the list here to prevent the
embedded ZK from getting used.

From one of the Solr instances create a collection:  bin/solr create -c 
nameofuploadedconfig -s 3 -rf 2 //for example.

The documentation, I think, implies that all of the Solr instances are
automatically set up with the collection.  Is there any further action needed
at this point?

Thanks.

-S


Re: JVM GC Issue

2017-12-01 Thread Erick Erickson
Your autowarm counts are rather high, but as Toke says this doesn't
seem outrageous.

I have seen situations where Solr is running close to the limits of
its heap and GC only reclaims a tiny bit of memory each time. When you
say "full GC with no memory reclaimed", is that really no memory _at
all_? Or "almost no memory"? This situation can be alleviated by
allocating more memory to the JVM.

Your JVM pressure would certainly be reduced by enabling docValues on
any field you sort, facet, or group on. That would require a full
reindex, of course. Note that this makes your index on disk bigger, but
it reduces JVM pressure by roughly the same amount, so it's a win in
this situation. A sketch of such a schema change is below.
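
For example, a field definition with docValues enabled might look like this
(the field name is hypothetical):

  <field name="category" type="string" indexed="true" stored="true" docValues="true"/>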

Have you attached a memory profiler to the running Solr instance? I'd
be curious where the memory is being allocated.

Best,
Erick

On Fri, Dec 1, 2017 at 8:31 AM, Toke Eskildsen  wrote:
> Dominique Bejean  wrote:
>> We are encountering issue with GC.
>
>> Randomly nearly once a day there are consecutive full GC with no memory
>> reclaimed.
>
> [... 1.2M docs, Xmx 6GB ...]
>
>> Gceasy suggest to increase heap size, but I do not agree
>
> It does seem strange, with your apparently modest index & workload. Nothing 
> you say sounds problematic to me and you have covered the usual culprits 
> overlapping searchers, faceting and filterCache.
>
> Is it possible for you to share the solr.log around the two times that memory 
> usage peaked? 2017-11-30 17:00-19:00 and 2017-12-01 08:00-12:00.
>
> If you cannot share, please check if you have excessive traffic around that 
> time or if there is a lot of UnInverting going on (triggered by faceting on 
> non.DocValues String fields). I know your post implies that you have already 
> done so, so this is more of a sanity check.
>
>
> - Toke Eskildsen


Re: starting SolrCloud nodes

2017-12-01 Thread Erick Erickson
Looks good. If you set ZK_HOST in your solr.in.sh you can forgo
setting it when you start, but that's not necessary.

Best,
Erick

On Fri, Dec 1, 2017 at 9:13 AM, Steve Pruitt  wrote:
> Thanks to previous help.  I have a ZK ensemble of three nodes running.  I 
> have uploaded the config for my collection and the solr.xml file.
> I have Solr installed on three machines.
>
> I think my next steps are:
>
> Start up each Solr instance:  bin/solr start -c -z 
> "zoo1:2181,zoo2:2181,zoo3:2181"  // I have ZK_Hosts set to my ZK's, but the 
> documentation seems to say I need to provide the list here to prevent 
> embedded ZK from getting used.
>
> From one of the Solr instances create a collection:  bin/solr create -c 
> nameofuploadedconfig -s 3 -rf 2 //for example.
>
> The documentation I think implies that all of the Solr instances are 
> automatically set with the collection.  There is no further action at this 
> point?
>
> Thanks.
>
> -S


Re: [EXTERNAL] - Re: Basic SolrCloud help

2017-12-01 Thread Shawn Heisey
On 11/30/2017 8:53 AM, Steve Pruitt wrote:
> I took the hint and looked at both solr.in.cmd and solr.in.sh.  Clearly 
> setting ZK_HOST is a first step.  I am sure this is explained somewhere, but 
> I overlooked it.
> From here, once I have Solr installed, I can run the Control Script to upload 
> a config set either when creating a collection or independent of creating the 
> collection.
>
> When I install Solr on the three nodes I have planned, I run the Control 
> Script and just point to somewhere on disk I have the config set stored.
>
> One question buried in my first missive was about mixing the Solr machines.  
> I was thinking of installing Solr on two VM's running Centos and then make my 
> third Solr node on my local machine, i.e. Windows.  I can't think of why this 
> could be an issue,  as long as everything is setup with the right ZK hosts, 
> etc.  Does anyone know of any potential issues doing this?

There shouldn't be any issue with different nodes running on different
operating systems.  There is no service install script for Windows, so
you would have to do that yourself (usually people use NSSM or SRVANY)
or you'd just have to start it manually, which I think means the include
script (solr.in.cmd) will be in the bin directory.

If the local machine you're describing is your personal machine, that is
not something I would do.  Personal machines typically need to be taken
down and/or rebooted for various reasons on a frequent basis.  Also, if
the system resources on the local machine are significantly different
than the VMs, SolrCloud's load balancing can't take that into account
and send a different number of requests to each machine.  Things work
best when all the machines are substantially similar to each other.

Also, since it's likely a personal Windows machine will be running a
client OS like Windows 7 or Windows 10, I have heard from multiple
sources that client operating systems from Microsoft are intentionally
crippled in ways that the server operating systems (which cost a LOT
more) aren't, so that they don't run server loads very well.  I've never
been able to find concrete information about exactly how Microsoft
limits the functionality in the client OS.  If anybody on this list has
a source for concrete information, I'd love to see it.

Last, I will mention my personal feelings about Windows compared to
Linux.  Performance seems to be generally better in Linux.  NTFS is not
as good as the native filesystems in Linux, although Lucene tends
towards large files, which aren't as problematic as tons of little
files.  Memory management appears to me to be much better in Linux.  The
single biggest complaint I have about Windows is cost, especially for
the server operating systems, which I think are the only good choice for
a program like Solr.

> I am not sure what "definition" and *config* are referencing.  When I 
> initially install Solr it will not have a config set.  I haven't created a 
> Collection yet.   The running Solr instance upon initial install has no 
> config yet.  But, I think I am not understanding what "definition" and 
> "*config*" mean.

When I said "definition" I was talking about ZK_HOST or the argument to
the -z option.  The "config" would be the include script (usually
solr.in.sh or solr.in.cmd) -- providing configuration options for Solr
startup like heap size, zk info, gc tuning, etc.

Thanks,
Shawn



Re: Solr score use cases

2017-12-01 Thread Erick Erickson
Sorting certainly ignores scoring; I'm pretty sure it's just not
calculated in that case.

If your sorting results in multiple documents in the same bin, people
will combine the primary sort with a secondary sort on score, and in
that case the score is definitely calculated, i.e. "&sort=day asc, score
desc".

Returning the score with documents is usually for development
purposes. Scores are _not_ comparable except within a single query, so
IMO telling users that a doc from one search has a score of X and a
doc from another search has a score of Y is useless-to-misleading
information. A score of 2X is _not_ necessarily "twice as good" (or
even as good) as a score of X in another search.

FWIW,
Erick

On Fri, Dec 1, 2017 at 6:31 AM, Faraz Fallahi
 wrote:
> Or does the Score even get calculated when i sort or Not?
>
> On 01.12.2017 at 4:38 PM, "Faraz Fallahi" <
> faraz.fall...@googlemail.com> wrote:
>
>> Oki but If ID Just make an simple query with a "where Claude" and sort by
>> a field i See no sense in calculating a score right?
>>
>> On 01.12.2017 at 16:33, "Aman Tandon" wrote:
>>
>>> Hi Faraz,
>>>
>>> Solr score which you could retrieved by adding in fl parameter could be
>>> helpful to understand the following:
>>>
>>> 1) search relevance ranking: how much score solr has given to the top &
>>> second top document, and with debug=true you could better understand what
>>> is causing that score.
>>>
>>> 2) You could use the function query to multiply score with some feature
>>> e.g. paid customers score, popularity score, etc to improve the relevance
>>> as per the business.
>>>
>>> I am able to think these few points only, someone can also put more light
>>> if I am missing anything. I hope this is what you want to know. 😊
>>>
>>> Regards,
>>> Aman
>>>
>>> On Dec 1, 2017 13:38, "Faraz Fallahi" 
>>> wrote:
>>>
>>> Hi
>>>
>>> A simple question: what are the most common use cases for the solr score
>>> of
>>> documents retrieved after firing queries?
>>> I dont have a real understanding of its purpose at the moment.
>>>
>>> Thx for helping
>>>
>>


Reg:- Indexing MySQL data with Solr

2017-12-01 Thread @Nandan@
Hi ,
I am working on an Ecommerce database . I have more then 40 tables and
around 20GB of data.
I want to index data with solr for more effective search feature.
Please tell me how to index MySQL data with Apache solr.
Thanks in advance
Nandan Priyadarshi


Re: Request return behavior when trigger DataImportHandler updates

2017-12-01 Thread Shawn Heisey
On 12/1/2017 9:37 AM, Nathan Friend wrote:
> When triggering the DataImportHandler to update via an HTTP request, I've 
> noticed the handler behaves differently depending on how I specify the 
> request parameters.
>
> If I use query parameters, e.g. 
> http://localhost:8983/solr/mycore/dataimport?command=full-import, the HTTP 
> request is not fulfilled until the import is completed.
>
> However, if I instead make a request to 
> http://localhost:8983/solr/mycore/dataimport with a form-data key of 
> "command"="full-import", the HTTP request returns immediately. In this case, 
> the documentation seems to suggest the best way to detect when the import has 
> completed is to poll Solr for status updates until the status switches from 
> "busy" back to "idle".

At one time, way back in version 1.4 and 3.2, I was making such requests
from a Perl program, and they were definitely GET requests.  These
requests did NOT wait until the import completed.

I just checked my most recent program, which uses SolrJ, and I
discovered that I am using Method.POST there, so I can't confirm or deny
what you're seeing yet.  At one point, I went through all my SolrJ code
and made sure all requests were using POST, so that I would never run
into the limitations on the URL size.  It wasn't really necessary for my
code that starts DIH, but I was making the change everywhere.

> Is this difference in behavior intentional?  I would like to be able to rely 
> on the first behavior - to expect that the import will be finished when the 
> HTTP call returns.  However, I can't seem to find any documentation on this 
> difference; I'd prefer not to have my component rely on what might be a 
> convenient bug :).

As far as I know, Solr should be unaware of any difference between these
ways of supplying the parameters, because the servlet container (usually
Jetty) should just make parameters available to Solr, no matter how it
received them.  I think that the dataimport handler has always been
async in nature.  I've NEVER seen an import command wait to respond
until the import is complete.  My imports take many hours to run.  If
that's what you're seeing happen with a GET request, I'd call that a
bug.  We need to confirm the problem in the most recent versions before
filing an issue.

Thanks,
Shawn



Re: Reg:- Indexing MySQL data with Solr

2017-12-01 Thread Sujay Bawaskar
You can use the data import handler with a cache; it's faster (sketch below).

Check document :
https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html
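
For example, a child entity can be cached roughly like this (a sketch only;
the table, column, and entity names are hypothetical):

  <entity name="detail" processor="SqlEntityProcessor"
          query="SELECT * FROM order_details"
          cacheImpl="SortedMapBackedCache"
          cacheKey="ORDER_ID" cacheLookup="order.ID"/>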

On Sat, Dec 2, 2017 at 12:21 AM, @Nandan@ 
wrote:

> Hi ,
> I am working on an Ecommerce database . I have more then 40 tables and
> around 20GB of data.
> I want to index data with solr for more effective search feature.
> Please tell me how to index MySQL data with Apache solr.
> Thanks in advance
> Nandan Priyadarshi
>



-- 
Thanks,
Sujay P Bawaskar
M:+91-77091 53669


RE: [Marketing Mail] Re: Request return behavior when trigger DataImportHandler updates

2017-12-01 Thread Nathan Friend
Thanks for the reply, Shawn.  One additional detail I should have mentioned - 
in both scenarios, I'm making a POST request, not a GET.  Even when using a 
POST, Solr seems to be able to accept URL query parameters.

I'm making test requests using Postman (https://www.getpostman.com/), so it's 
also possible this issue is related to how Postman is sending the request 
(although I doubt it).

Either way, it sounds like relying on the synchronous behavior of the HTTP 
request is not a good option.  I've been looking into the "onImportEnd" event 
listener functionality and think this feature will provide a more robust way of 
generating notifications when imports have finished.
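
For reference, a rough sketch of how such a listener is wired up in
data-config.xml (the listener class name is hypothetical; it would implement
DIH's EventListener interface):

  <dataConfig>
    <document onImportEnd="com.example.MyImportEndListener">
      <!-- entities ... -->
    </document>
  </dataConfig>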

Thanks!

Nathan

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: December 1, 2017 3:04 PM
To: solr-user@lucene.apache.org
Subject: [Marketing Mail] Re: Request return behavior when trigger 
DataImportHandler updates

On 12/1/2017 9:37 AM, Nathan Friend wrote:
> When triggering the DataImportHandler to update via an HTTP request, I've 
> noticed the handler behaves differently depending on how I specify the 
> request parameters.
>
> If I use query parameters, e.g. 
> http://localhost:8983/solr/mycore/dataimport?command=full-import, the HTTP 
> request is not fulfilled until the import is completed.
>
> However, if I instead make a request to 
> http://localhost:8983/solr/mycore/dataimport with a form-data key of 
> "command"="full-import", the HTTP request returns immediately. In this case, 
> the documentation seems to suggest the best way to detect when the import has 
> completed is to poll Solr for status updates until the status switches from 
> "busy" back to "idle".

At one time, way back in version 1.4 and 3.2, I was making such requests from a 
Perl program, and they were definitely GET requests.  These requests did NOT 
wait until the import completed.

I just checked my most recent program, which uses SolrJ, and I discovered that 
I am using Method.POST there, so I can't confirm or deny what you're seeing 
yet.  At one point, I went through all my SolrJ code and made sure all requests 
were using POST, so that I would never run into the limitations on the URL 
size.  It wasn't really necessary for my code that starts DIH, but I was making 
the change everywhere.

> Is this difference in behavior intentional?  I would like to be able to rely 
> on the first behavior - to expect that the import will be finished when the 
> HTTP call returns.  However, I can't seem to find any documentation on this 
> difference; I'd prefer not to have my component rely on what might be a 
> convenient bug :).

As far as I know, Solr should be unaware of any difference between these ways 
of supplying the parameters, because the servlet container (usually
Jetty) should just make parameters available to Solr, no matter how it received 
them.  I think that the dataimport handler has always been async in nature.  
I've NEVER seen an import command wait to respond until the import is complete. 
 My imports take many hours to run.  If that's what you're seeing happen with a 
GET request, I'd call that a bug.  We need to confirm the problem in the most 
recent versions before filing an issue.

Thanks,
Shawn




Java 9 and Solr 6.6

2017-12-01 Thread marotosg
HI all.

Would you recommend installing Solr 6.6.1 with Java 9 for a production
environment?


Thanks,
Sergio





Re: Reg:- Indexing MySQL data with Solr

2017-12-01 Thread Erick Erickson
Or, depending on your environment and comfort with Java, use SolrJ.
Here's an example:

https://lucidworks.com/2012/02/14/indexing-with-solrj/
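
A minimal SolrJ sketch (the core URL and field names are hypothetical):

  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class IndexOne {
      public static void main(String[] args) throws Exception {
          HttpSolrClient client =
              new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "1");
          doc.addField("name", "example product");
          client.add(doc);    // send the document
          client.commit();    // make it searchable
          client.close();
      }
  }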

Best,
Erick

On Fri, Dec 1, 2017 at 11:22 AM, Sujay Bawaskar  wrote:
> You can use data import handler with cache, its faster.
>
> Check document :
> https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html
>
> On Sat, Dec 2, 2017 at 12:21 AM, @Nandan@ 
> wrote:
>
>> Hi ,
>> I am working on an Ecommerce database . I have more then 40 tables and
>> around 20GB of data.
>> I want to index data with solr for more effective search feature.
>> Please tell me how to index MySQL data with Apache solr.
>> Thanks in advance
>> Nandan Priyadarshi
>>
>
>
>
> --
> Thanks,
> Sujay P Bawaskar
> M:+91-77091 53669


RE: Re:LTR model upload

2017-12-01 Thread Brian Yee
Thanks Christine, that makes sense. I am currently working around the issue by
limiting how many trees RankLib produces, which reduces the file size.

--Brian

-Original Message-
From: Christine Poerschke (BLOOMBERG/ QUEEN VIC) 
[mailto:cpoersc...@bloomberg.net] 
Sent: Friday, December 01, 2017 11:03 AM
To: solr-user@lucene.apache.org
Subject: Re:LTR model upload

Hi Brian,

Thank you for this question! It sounds like you may be experiencing the issue 
reported in https://issues.apache.org/jira/browse/SOLR-11049 and perhaps the 
same workaround could work for you?

And in the upcoming Solr 7.2 release 
https://issues.apache.org/jira/browse/SOLR-11250 will provide an alternative 
way for handling large models.

Hope that helps.

Regards,

Christine

- Original Message -
From: solr-user@lucene.apache.org
To: solr-user@lucene.apache.org
At: 11/28/17 23:00:11

When I upload, I can see my model when I hit 
"solr/collection_name/schema/model-store/" but when I reload the collection, 
it's gone.

Is there a size limit for LTR models? I have a 1.6mb / 49,000 line long 
lambdamart model (that's what ranklib spit out for me) which I didn't think 
would be a huge problem. I decided to test by cutting down the model size by 
deleting 99% of it and it worked after reload. Does this mean my model is too 
big or do I possibly have a syntax bug somewhere in my model.json?


  *   Brian



Re: JVM GC Issue

2017-12-01 Thread Dominique Bejean
Hi,


Thank you both for your responses.


I only have the Solr log for the very last period of the GC log.


A grep command allows me to count the queries per minute with hits > 1000 or
> 10000, i.e. those with the biggest impact on memory and CPU during faceting:


> 1000 hits:

 59 11:13
 45 11:14
 36 11:15
 45 11:16
 59 11:17
 40 11:18
 95 11:19
123 11:20
137 11:21
123 11:22
 86 11:23
 26 11:24
 19 11:25
 17 11:26

> 10000 hits:

 55 11:19
 78 11:20
 48 11:21
134 11:22
 93 11:23
 10 11:24


So we see that at the time the GC starts to go nuts, the number of large
result sets increases.


The query field includes a phonetic filter, and because of it the results are
really not relevant. I will suggest to:

1/ remove the phonetic filter, in order to have fewer irrelevant results and
so get smaller result sets;

2/ enable docValues on the fields used for faceting.

I expect this to decrease the GC requirements and stabilize GC.


Regards


Dominique





On Fri, Dec 1, 2017 at 18:17, Erick Erickson wrote:

> Your autowarm counts are rather high, bit as Toke says this doesn't
> seem outrageous.
>
> I have seen situations where Solr is running close to the limits of
> its heap and GC only reclaims a tiny bit of memory each time, when you
> say "full GC with no memory
> reclaimed" is that really no memory _at all_? Or "almost no memory"?
> This situation can be alleviated by allocating more memory to the JVM
> .
>
> Your JVM pressure would certainly be reduced by enabling docValues on
> any field you sort,facet or group on. That would require a full
> reindex of course. Note that this makes your index on disk bigger, but
> reduces JVM pressure by roughly the same amount so it's a win in this
> situation.
>
> Have you attached a memory profiler to the running Solr instance? I'd
> be curious where the memory is being allocated.
>
> Best,
> Erick
>
> On Fri, Dec 1, 2017 at 8:31 AM, Toke Eskildsen  wrote:
> > Dominique Bejean  wrote:
> >> We are encountering issue with GC.
> >
> >> Randomly nearly once a day there are consecutive full GC with no memory
> >> reclaimed.
> >
> > [... 1.2M docs, Xmx 6GB ...]
> >
> >> Gceasy suggest to increase heap size, but I do not agree
> >
> > It does seem strange, with your apparently modest index & workload.
> Nothing you say sounds problematic to me and you have covered the usual
> culprits overlapping searchers, faceting and filterCache.
> >
> > Is it possible for you to share the solr.log around the two times that
> memory usage peaked? 2017-11-30 17:00-19:00 and 2017-12-01 08:00-12:00.
> >
> > If you cannot share, please check if you have excessive traffic around
> that time or if there is a lot of UnInverting going on (triggered by
> faceting on non.DocValues String fields). I know your post implies that you
> have already done so, so this is more of a sanity check.
> >
> >
> > - Toke Eskildsen
>
-- 
Dominique Béjean
06 08 46 12 43


Re: starting SolrCloud nodes

2017-12-01 Thread Shawn Heisey
On 12/1/2017 10:13 AM, Steve Pruitt wrote:
> Thanks to previous help.  I have a ZK ensemble of three nodes running.  I 
> have uploaded the config for my collection and the solr.xml file.
> I have Solr installed on three machines.
>
> I think my next steps are:
>
> Start up each Solr instance:  bin/solr start -c -z 
> "zoo1:2181,zoo2:2181,zoo3:2181"  // I have ZK_Hosts set to my ZK's, but the 
> documentation seems to say I need to provide the list here to prevent 
> embedded ZK from getting used.

The embedded ZK will only get started if you use the -c option and there
is no ZK_HOST variable and no -z option on the commandline.

If you use both -z and ZK_HOST, then the info I've seen says the -z
option will take priority.  I haven't looked at the script closely
enough to confirm, but that would be the most logical way to operate, so
it's probably correct.

If ZK_HOST is defined or you use the -z option, you do not need to
include the -c option when starting Solr.  SolrCloud mode is assumed
when ZK info is available.  The only time the -c option is *required* is
when you want to start the embedded zookeeper.  Having the option won't
hurt anything, though.

To start a Solr *service* in cloud mode, all you need is to add ZK_HOST
to /etc/default/<service>.in.sh, where <service> is the service name, which
defaults to solr.

> From one of the Solr instances create a collection:
> bin/solr create -c nameofuploadedconfig -s 3 -rf 2 //for example.

Nitpick:  The -c option on the create command is the name of the
collection.  To specify the name of the uploaded config, if it happens
to be different from the collection name, use the -n option.  You can
use the -d option to point at a config on disk, and it will be uploaded
to a config in zookeeper named after the -n option or the collection. 
The collection name and the config name are not required to match.  You
can use the same config for multiple collections.
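
For example (the collection, config, and path names are hypothetical):

  bin/solr create -c products -d /path/to/configdir -n products_conf -s 3 -rf 2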

> The documentation I think implies that all of the Solr instances are 
> automatically set with the collection.  There is no further action at this 
> point?

Solr will make an automatic decision as to which nodes will be used to
hold the collection.  If you use the Collections API directly rather
than the commandline, you can give Solr an explicit list of nodes to
use.  Without the explicit list, Solr will spread the collection across
the cluster as widely as it can.

https://lucene.apache.org/solr/guide/6_6/collections-api.html#CollectionsAPI-create

The "bin/solr create" command, when used on SolrCloud, just makes an
HTTP request to the Collections API, unless you use the -d option, in
which case it will upload a config to zookeeper before calling the
Collections API.

Thanks,
Shawn



Re: Solr score use cases

2017-12-01 Thread Faraz Fallahi
Thx for the clarification
Best regards

On 01.12.2017 at 18:25, "Erick Erickson" wrote:

> Sorting certainly ignores scoring, I'm pretty sure it's just not
> calculated in that case.
>
> If your sorting results in multiple documents in the same bin, people
> will combine the primary sort with a secondary sort on score, so in
> that case the score is definitely calculated, ie "&sort=day asc, score
> desc"
>
> Returning the score with documents is usually for development
> purposes. Scores are _not_ comparable except within a single query, so
> IMO telling users that a doc from one search has a score of X and a
> doc from another search has a score of Y is useless-to-misleading
> information. A score of 2X is _not_ necessarily "twice as good" (or
> even as good) as a score of X in another search.
>
> FWIW,
> Erick
>
> On Fri, Dec 1, 2017 at 6:31 AM, Faraz Fallahi
>  wrote:
> > Or does the Score even get calculated when i sort or Not?
> >
> > On 01.12.2017 at 4:38 PM, "Faraz Fallahi" <
> > faraz.fall...@googlemail.com> wrote:
> >
> >> Oki but If ID Just make an simple query with a "where Claude" and sort
> by
> >> a field i See no sense in calculating a score right?
> >>
> >> On 01.12.2017 at 16:33, "Aman Tandon" wrote:
> >>
> >>> Hi Faraz,
> >>>
> >>> Solr score which you could retrieved by adding in fl parameter could be
> >>> helpful to understand the following:
> >>>
> >>> 1) search relevance ranking: how much score solr has given to the top &
> >>> second top document, and with debug=true you could better understand
> what
> >>> is causing that score.
> >>>
> >>> 2) You could use the function query to multiply score with some feature
> >>> e.g. paid customers score, popularity score, etc to improve the
> relevance
> >>> as per the business.
> >>>
> >>> I am able to think these few points only, someone can also put more
> light
> >>> if I am missing anything. I hope this is what you want to know. 😊
> >>>
> >>> Regards,
> >>> Aman
> >>>
> >>> On Dec 1, 2017 13:38, "Faraz Fallahi" 
> >>> wrote:
> >>>
> >>> Hi
> >>>
> >>> A simple question: what are the most common use cases for the solr
> score
> >>> of
> >>> documents retrieved after firing queries?
> >>> I dont have a real understanding of its purpose at the moment.
> >>>
> >>> Thx for helping
> >>>
> >>
>


Re: Reg:- Indexing MySQL data with Solr

2017-12-01 Thread Shawn Heisey
On 12/1/2017 11:51 AM, @Nandan@ wrote:
> I am working on an Ecommerce database . I have more then 40 tables and
> around 20GB of data.
> I want to index data with solr for more effective search feature.
> Please tell me how to index MySQL data with Apache solr.

There are multiple possible ways to do this.

One of the most straightforward is to use the dataimport handler.  DIH
is reasonably efficient, although the one major limitation it has is
that it's single threaded.  Depending on how your multiple tables relate
to each other, it may be a little bit challenging to configure
properly.  The best results with DIH are obtained when you can write a
single SQL query (possibly using the view feature in MySQL) that can
retrieve every bit of data you want to index.
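
For illustration, a minimal sketch of a DIH data-config.xml built around one
such query (the driver, connection, table, and field names are hypothetical):

  <dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost:3306/shop" user="solr" password="secret"/>
    <document>
      <entity name="product"
              query="SELECT p.id, p.name, c.name AS category
                     FROM products p JOIN categories c ON p.category_id = c.id">
        <field column="id" name="id"/>
        <field column="name" name="name"/>
        <field column="category" name="category"/>
      </entity>
    </document>
  </dataConfig>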

For best performance, you would want to write a custom program that can
retrieve the data from your database and then start multiple threads or
multiple processes to do indexing in parallel.  You may even want to use
multiple threads for the retrieval process, depending on how fast a
single request against the database can return data.

Thanks,
Shawn



Re: Java 9 and Solr 6.6

2017-12-01 Thread Shawn Heisey
On 12/1/2017 12:32 PM, marotosg wrote:
> Would you recommend installing Solr 6.6.1 with Java 9 for a production
> environement?

Solr 7.x has been tested with Java 9 and should work with no problems. 
I believe that code changes were required to achieve this compatibility,
so 6.6 might have issues with Java 9.

The release information for 6.6 only mentions Java 8, while the release
information for 7.0 explicitly says that it works with Java 9.

I would not try running 6.6 with Java 9 in production without first
testing every part of the implementation on a dev server ... and based
on the limited information I know about, I'm not confident that those
tests would pass.

Thanks,
Shawn



Re: JVM GC Issue

2017-12-01 Thread Erick Erickson
Dominique:

Actually, the memory requirements shouldn't really go up as the number
of hits increases. The general algorithm is (say rows=10):

Calculate the score of each doc.
If the score is zero, ignore the doc.
If the score is > the minimum in my current top 10, replace the lowest-
scoring doc in my current top 10 with the new doc (a PriorityQueue,
last I knew).
Else discard the doc.

When all docs have been scored, assemble the return from the top 10
(or whatever rows is set to). A toy sketch in Java follows.
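
Not Lucene's actual code -- just a toy illustration of that loop:

  import java.util.PriorityQueue;

  public class TopNCollector {
      static class ScoredDoc {
          final int docId;
          final float score;
          ScoredDoc(int docId, float score) { this.docId = docId; this.score = score; }
      }

      static PriorityQueue<ScoredDoc> collect(Iterable<ScoredDoc> docs, int rows) {
          // Min-heap by score: the head is the lowest-scoring doc in the current top N.
          PriorityQueue<ScoredDoc> topN =
                  new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
          for (ScoredDoc doc : docs) {
              if (doc.score <= 0) continue;            // zero scores are ignored
              if (topN.size() < rows) {
                  topN.add(doc);                       // heap not full yet
              } else if (doc.score > topN.peek().score) {
                  topN.poll();                         // evict the current lowest
                  topN.add(doc);
              }
          }
          return topN;                                 // the docs used to assemble the return
      }
  }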

The key here is that most of the Solr index is kept in
MMapDirecotry/OS space, see Uwe's excellent blog here:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html.
In terms of _searching_, very little of the Lucene index structures
are kept in memory.

That said, faceting plays a bit loose with the rules. If you have
docValues set to true, most of the memory structures are in the OS
memory space, not the JVM. If you have docValues set to false, then
the "uninverted" structure is built in the JVM heap space.

Additionally, the JVM requirements are sensitive to the number of
unique values in the field being faceted on. For instance, let's say you
faceted by a date field with just facet.field=some_date_field. A
bucket would have to be allocated to hold the count for each and
every unique date value, i.e. one for each millisecond in your search,
which might be something you're seeing. Conceptually this is just an
array[uniqueValues] of ints (longs? I'm not sure). This should be
relatively easily testable by omitting the facets while measuring.
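
One way around that for dates is range faceting, which buckets values
explicitly instead of allocating one bucket per unique value -- a rough
sketch (the field name is hypothetical; in a raw URL the + in the gap must
be encoded as %2B):

  facet=true&facet.range=some_date_field
  &facet.range.start=NOW/DAY-30DAYS&facet.range.end=NOW&facet.range.gap=+1DAY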

Where the number of rows _does_ make a difference is in the return
packet. Say I have rows=10. In that case I create a single return
packet with all 10 docs "fl" field. If rows = 10,000 then that return
packet is obviously 1,000 times as large and must be assembled in
memory.

I rather doubt the phonetic filter is to blame. But you can test this
by just omitting the field containing the phonetic filter in the
search query. I've certainly been wrong before.

Best,
Erick

On Fri, Dec 1, 2017 at 2:31 PM, Dominique Bejean
 wrote:
> Hi,
>
>
> Thank you both for your responses.
>
>
> I just have solr log for the very last period of the CG log.
>
>
> Grep command allows me to count queries per minute with hits > 1000 or >
> 1 and so with the biggest impact on memory and cpu during faceting
>
>
>> 1000
>
>  59 11:13
>
>  45 11:14
>
>  36 11:15
>
>  45 11:16
>
>  59 11:17
>
>  40 11:18
>
>  95 11:19
>
> 123 11:20
>
> 137 11:21
>
> 123 11:22
>
>  86 11:23
>
>  26 11:24
>
>  19 11:25
>
>  17 11:26
>
>
>> 1
>
>  55 11:19
>
>  78 11:20
>
>  48 11:21
>
> 134 11:22
>
>  93 11:23
>
>  10 11:24
>
>
> So we see that at the time GC start become nuts, large result set count
> increase.
>
>
> The query field include phonetic filter and results are really not relevant
> due to this. I will suggest to :
>
>
> 1/ remove the phonetic filter in order to have less irrelevant results and
> so get smaller result set
>
> 2/ enable docValues on field use for faceting
>
>
> I expect decrease GC requirements and stabilize GC.
>
>
> Regards
>
>
> Dominique
>
>
>
>
>
> On Fri, Dec 1, 2017 at 18:17, Erick Erickson wrote:
>
>> Your autowarm counts are rather high, bit as Toke says this doesn't
>> seem outrageous.
>>
>> I have seen situations where Solr is running close to the limits of
>> its heap and GC only reclaims a tiny bit of memory each time, when you
>> say "full GC with no memory
>> reclaimed" is that really no memory _at all_? Or "almost no memory"?
>> This situation can be alleviated by allocating more memory to the JVM
>> .
>>
>> Your JVM pressure would certainly be reduced by enabling docValues on
>> any field you sort,facet or group on. That would require a full
>> reindex of course. Note that this makes your index on disk bigger, but
>> reduces JVM pressure by roughly the same amount so it's a win in this
>> situation.
>>
>> Have you attached a memory profiler to the running Solr instance? I'd
>> be curious where the memory is being allocated.
>>
>> Best,
>> Erick
>>
>> On Fri, Dec 1, 2017 at 8:31 AM, Toke Eskildsen  wrote:
>> > Dominique Bejean  wrote:
>> >> We are encountering issue with GC.
>> >
>> >> Randomly nearly once a day there are consecutive full GC with no memory
>> >> reclaimed.
>> >
>> > [... 1.2M docs, Xmx 6GB ...]
>> >
>> >> Gceasy suggest to increase heap size, but I do not agree
>> >
>> > It does seem strange, with your apparently modest index & workload.
>> Nothing you say sounds problematic to me and you have covered the usual
>> culprits overlapping searchers, faceting and filterCache.
>> >
>> > Is it possible for you to share the solr.log around the two times that
>> memory usage peaked? 2017-11-30 17:00-19:00 and 2017-12-01 08:00-12:00.
>> >
>> > If you cannot share, please check if you have excessive traffic around
>> that time or if there is a lot of UnInverting going on (triggered by
>> faceting on non.DocValues String fields). I know your post implies that you
>> have already done so, so this is more of a sanity check.
>> >
>> >
>> > - Toke Eskildsen