Solr performance issue

2018-02-15 Thread Srinivas Kashyap
Hi,

I have implemented 'SortedMapBackedCache' in my SqlEntityProcessor for the 
child entities in data-config.xml, and I'm using it for full-import only. At the 
beginning of my implementation, I had written a delta-import query to index the 
modified changes. But my requirements grew and I now have 17 child entities for 
a single parent entity. When doing a delta-import over huge data, the number of 
requests made to the datasource (database) increased, and CPU utilization hit 
100% when concurrent users started modifying the data. So instead of calling 
delta-import, which imports based on the last index time, I ran a 
full-import ('SortedMapBackedCache') based on the last index time.

Though the parent entity query returns only the records that were modified, the 
child entity queries pull all the data from the database, and the indexing 
happens 'in-memory', which is causing the JVM to run out of memory.

Is there a way to specify in the child entity query, in full-import mode, that it 
should pull only the records related to the parent entity?

Thanks and Regards,
Srinivas Kashyap


Re: Reading data from Oracle

2018-02-15 Thread Bernd Fehling
And where is the bottleneck?

Is it reading from Oracle or injecting to Solr?

Regards
Bernd


On 15.02.2018 at 08:34, LOPEZ-CORTES Mariano-ext wrote:
> Hello
> 
> We have to delete our Solr collection and feed it periodically from an Oracle 
> database (up to 40M rows).
> 
> We've done the following test: From a java program, we read chunks of data 
> from Oracle and inject to Solr (via Solrj).
> 
> The problem : It is really really slow (1'5 nights).
> 
> Is there one faster method to do that ?
> 
> Thanks in advance.
> 


Re: Solr Recommended setup

2018-02-15 Thread Emir Arnautović
Hi Wael,
It is hard to give a recommendation since every data set and access pattern 
differs. There are some guidelines that can be followed, but you will need to 
test to see which setup suits you.
I am guessing that you are running Solr in standalone mode. The problem with 
such an approach is that you have to scale it vertically, and eventually you will 
reach limits. Most likely it will be query latency that forces you to split your 
index. At that moment it is usually better to switch to SolrCloud and let Solr 
handle shards rather than doing it on your own.
Re splitting reading/writing - I guess you are talking about master-slave mode, 
where you index on the master and query the slaves. When/if you switch to 
SolrCloud you will no longer need/be able to do that (though there is some work 
going on to support such a scenario).

Here are links to some blogposts that explain how you can estimate the right 
setup for you:
https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
 

http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html 


HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 14 Feb 2018, at 17:41, Wael Kader  wrote:
> 
> Hi,
> 
> I would like to get a recommendation for the SOLR setup I have.
> 
> I have an index getting around 2 Million records per day. The index used is
> in Cloudera Search (Solr).
> I am running everything on one node. I run SOLR commits for whatever data
> that comes to the index every 5 minutes.
> The whole Cloudera VM has 64 GB of Ram.
> 
> Its working fine till now having around 80 Million records but Solr gets
> slow once a week so I restart the VM for things to work.
> I would like to get a recommendation on the setup. Note that I can add VM's
> for my setup if needed.
> I read somewhere that its wrong to index and read data from the same place.
> I am doing this now and I do know I am doing things wrong.
> How can I do a setup on Cloudera for SOLR to do indexing in one VM and do
> the reading on another and what recommendations should I do for my setup.
> 
> 
> -- 
> Regards,
> Wael



RE: Reading data from Oracle

2018-02-15 Thread LOPEZ-CORTES Mariano-ext
Injecting too many rows into Solr throws a Java heap exception (more memory? We 
have 8GB per node).

Does DIH have support for paging queries?

Thanks!

-Original Message-
From: Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de] 
Sent: Thursday, 15 February 2018 10:13
To: solr-user@lucene.apache.org
Subject: Re: Reading data from Oracle

And where is the bottleneck?

Is it reading from Oracle or injecting to Solr?

Regards
Bernd


Am 15.02.2018 um 08:34 schrieb LOPEZ-CORTES Mariano-ext:
> Hello
> 
> We have to delete our Solr collection and feed it periodically from an Oracle 
> database (up to 40M rows).
> 
> We've done the following test: From a java program, we read chunks of data 
> from Oracle and inject to Solr (via Solrj).
> 
> The problem : It is really really slow (1'5 nights).
> 
> Is there one faster method to do that ?
> 
> Thanks in advance.
> 


RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-15 Thread Alessandro Benedetti
@Pratik: you should have investigated. I understand that solved your issue,
but if you needed norms it doesn't make sense that they caused your index to
grow by a factor of 30. You must have hit a nasty bug if it was just
the norms.

@Howe : 

*Compound File*  .cfs, .cfe  An optional "virtual" file consisting of all the
other index files, for systems that frequently run out of file handles.

*Frequencies*  .doc  Contains the list of docs which contain each term along
with frequency.

*Field Data*  .fdt  The stored fields for documents.

*Positions*  .pos  Stores position information about where a term occurs in
the index.

*Term Index*  .tip  The index into the Term Dictionary.

So, David, you confirm that those two indexes have:

1) same number of documents
2) identical documents ( + 1 new field each not indexed)
3) same number of deleted documents
4) they both were born from scratch ( an empty index)

The matter is still suspicious:
- .cfs seems to highlight some sort of malfunction during
indexing/committing in relation to the OS. What way of committing
were you using?

- .doc, .pos, .tip -> they shouldn't change; assuming both indexes are
optimised and you are adding a non-indexed field, those data structures
shouldn't be affected

- the stored content as well: that is too much of an increment

Can you send us the full configuration for the new field?
You don't want norms, positions or frequencies for it.
But if they are the issue, you may have found a very unusual edge case,
because even enabling all of them shouldn't incur such a penalty for
just one additional tiny field.



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Reading data from Oracle

2018-02-15 Thread Michal Hlavac
Did you try to use ConcurrentUpdateSolrClient instead of HttpSolrClient?

m.

On Thursday, 15 February 2018 at 8:34:06 CET, LOPEZ-CORTES Mariano-ext wrote:
> Hello
> 
> We have to delete our Solr collection and feed it periodically from an Oracle 
> database (up to 40M rows).
> 
> We've done the following test: From a java program, we read chunks of data 
> from Oracle and inject to Solr (via Solrj).
> 
> The problem : It is really really slow (1'5 nights).
> 
> Is there one faster method to do that ?
> 
> Thanks in advance.


RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-15 Thread Howe, David

Hi Alessandro,

Some interesting testing today that seems to have gotten me closer to what the 
issue is.  When I run the version of the index that is working correctly 
against my database table that has the extra field in it, the index suddenly 
increases in size.  This is even though the data importer is running the same 
SELECT as before (which doesn't include the extra column) and loads the same 
number of rows.

After scratching my head for a bit and browsing through both versions of the 
table I am loading from (with and without the extra field), I noticed that the 
natural ordering of the tables is different.  These tables are "staging" tables 
that I populate with another set of queries and inserts to get the data into a 
format that is easy to ingest into Solr.  When I add the extra field to these 
queries, it changes the Oracle query plan as the field is contained in a 
different table that I need to join to.  As I don't specify an "ORDER BY" on 
the query (as I didn't think it would make a difference and would slow the 
query down), Oracle is free to choose how it orders the result set.  Adding the 
extra field changes that natural ordering, which affects the order things go 
into my staging table.  As I don't specify an "ORDER BY" when I select things 
out of the staging table, my data in the scenario that is working is being 
loaded in a different order to the scenario which doesn't work.

I am currently running full loads to verify this under each scenario, as I have 
now forced the data in the scenario that doesn't work to be in the same order 
as the scenario that does.  Will see how this load goes overnight.

This leads to the question of what difference does it make to Solr what order I 
load the data in?

I also noticed that the .cfs file is quite large in the second scenario, even 
though this is supposed to be disabled by default in Solr.  I checked my Solr 
config and there is no override of the default.

In answer to your questions:

1) same number of documents - YES ~14,000,000 documents
2) identical documents ( + 1 new field each not indexed) - YES, the second 
scenario has one extra field that is stored but not indexed
3) same number of deleted documents - YES, there are zero deleted documents in 
both scenarios
4) they both were born from scratch ( an empty index) - YES, both start from a 
brand new virtual server with a brand new installation of Solr

I am using the default auto commit, which I think is 15000.

Thanks again for your assistance.

Regards,

David

David Howe
Java Domain Architect
Postal Systems
Level 16, 111 Bourke Street Melbourne VIC 3000

T  0391067904

M  0424036591

E  david.h...@auspost.com.au

W  auspost.com.au
W  startrack.com.au



Re: Reading data from Oracle

2018-02-15 Thread Bernd Fehling
So it is not SolrJ, but Solr is your problem?

In your first email there was nothing about heap exceptions, only about the 
loading runtime.

What do you mean by "injecting too many rows" - what is "too many"?

Some numbers while loading from scratch:
- single node 412GB index
- 92 fields
- 123.6 million docs
- 1.937 billion terms
- loading from file system
- indexing time 9 hrs 5 min
- using SolrJ ConcurrentUpdateSolrClient
--- queueSize=1, threads=12
--- waitFlush=true, waitSearcher=true, softcommit=false
And, Solr must be configured to "swallow" all this :-)
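
A rough sketch of that kind of setup in SolrJ (the URL, queue size and thread 
count below are just placeholders, not the values above):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkLoader {
    public static void main(String[] args) throws Exception {
        // URL, queue size and thread count are placeholders for this sketch.
        try (ConcurrentUpdateSolrClient client =
                 new ConcurrentUpdateSolrClient("http://localhost:8983/solr/mycollection", 10000, 12)) {
            for (int i = 0; i < 1000; i++) {     // stand-in for the real read loop
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                client.add(doc);                 // buffered and sent by background threads
            }
            client.blockUntilFinished();         // wait for the queues to drain
            client.commit();
        }
    }
}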


You say "8GB per node" so it is SolrCloud?

Anything else besides the heap exception?

How many commits?

Regards
Bernd


On 15.02.2018 at 10:31, LOPEZ-CORTES Mariano-ext wrote:
> Injecting too many rows into Solr throws Java heap exception (Higher memory? 
> We have 8GB per node).
> 
> Have DIH support for paging queries?
> 
> Thanks!
> 
> -Message d'origine-
> De : Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de] 
> Envoyé : jeudi 15 février 2018 10:13
> À : solr-user@lucene.apache.org
> Objet : Re: Reading data from Oracle
> 
> And where is the bottleneck?
> 
> Is it reading from Oracle or injecting to Solr?
> 
> Regards
> Bernd
> 
> 
> Am 15.02.2018 um 08:34 schrieb LOPEZ-CORTES Mariano-ext:
>> Hello
>>
>> We have to delete our Solr collection and feed it periodically from an 
>> Oracle database (up to 40M rows).
>>
>> We've done the following test: From a java program, we read chunks of data 
>> from Oracle and inject to Solr (via Solrj).
>>
>> The problem : It is really really slow (1'5 nights).
>>
>> Is there one faster method to do that ?
>>
>> Thanks in advance.
>>


Re: Multiple context fields in suggester component

2018-02-15 Thread Renuka Srishti
Thanks, Alessandro Benedetti, for the response. Can you please share the
resources, so that I can explore more about customizing the context filter?

On Tue, Feb 13, 2018 at 5:01 PM, Alessandro Benedetti 
wrote:

> Simple answer is No.
> Only one context field is supported out of the box.
> The query you provide as context filtering query ( suggest.cfq= ) is
> going to be parsed and a boolean query for the context field is created
> [1].
>
> You will need some customizations if you are targeting that behavior.
>
> [1] query = new
> StandardQueryParser(contextFilterQueryAnalyzer).parse(contextFilter,
> CONTEXTS_FIELD_NAME);
>
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Multiple context fields in suggester component

2018-02-15 Thread Alessandro Benedetti
You can start from here :

org/apache/solr/spelling/suggest/SolrSuggester.java:265

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


solr read timeout

2018-02-15 Thread Prateek Jain J

Hi All,

I am using Solr 4.8.1 in one of our applications and sometimes it gives a read 
timeout error. SolrJ is used on the client side. How can I increase this default 
read timeout?


Regards,
Prateek Jain



Re: solr read timeout

2018-02-15 Thread Jason Gerlowski
Hi Prateek,

Depending on the SolrServer/SolrClient implementation your application
is using, you can make use of the "setSoTimeout" method, which
controls the socket (read) timeout in milliseconds.  e.g.
http://lucene.apache.org/solr/4_8_1/solr-solrj/org/apache/solr/client/solrj/impl/HttpSolrServer.html#setSoTimeout(int)
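
For example, a minimal sketch (the URL and timeout values are placeholders):

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class TimeoutConfig {
    public static void main(String[] args) {
        // The URL and timeout values are placeholders for this example.
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        server.setConnectionTimeout(5000);  // time allowed to establish the connection, in ms
        server.setSoTimeout(60000);         // socket (read) timeout, in ms
        // ... use server.query(...) / server.add(...) as usual ...
        server.shutdown();
    }
}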

Best,

Jason

On Thu, Feb 15, 2018 at 9:58 AM, Prateek Jain J
 wrote:
>
> Hi All,
>
> I am using solr 4.8.1 in one of our application and sometimes it gives read 
> timeout error. SolrJ is used from client side. How can I increase this 
> default read timeout?
>
>
> Regards,
> Prateek Jain
>


Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-15 Thread Pratik Patel
@Alessandro I will see if I can reproduce the same issue just by turning
off omitNorms on field type. I'll open another mail thread if required.
Thanks.

On Thu, Feb 15, 2018 at 6:12 AM, Howe, David 
wrote:

>
> Hi Alessandro,
>
> Some interesting testing today that seems to have gotten me closer to what
> the issue is.  When I run the version of the index that is working
> correctly against my database table that has the extra field in it, the
> index suddenly increases in size.  This is even though the data importer is
> running the same SELECT as before (which doesn't include the extra column)
> and loads the same number of rows.
>
> After scratching my head for a bit and browsing through both versions of
> the table I am loading from (with and without the extra field), I noticed
> that the natural ordering of the tables is different.  These tables are
> "staging" tables that I populate with another set of queries and inserts to
> get the data into a format that is easy to ingest into Solr.  When I add
> the extra field to these queries, it changes the Oracle query plan as the
> field is contained in a different table that I need to join to.  As I don't
> specify an "ORDER BY" on the query (as I didn't think it would make a
> difference and would slow the query down), Oracle is free to chose how it
> orders the result set.  Adding the extra field changes that natural
> ordering, which affects the order things go into my staging table.  As I
> don't specify an "ORDER BY" when I select things out of the staging table,
> my data in the scenario that is working is being loaded in a different
> order to the scenario which doesn't work.
>
> I am currently running full loads to verify this under each scenario, as I
> have now forced the data in the scenario that doesn't work to be in the
> same order as the scenario that does.  Will see how this load goes
> overnight.
>
> This leads to the question of what difference does it make to Solr what
> order I load the data in?
>
> I also noticed that the .cfs file is quite large in the second scenario,
> even though this is supposed to be disabled by default in Solr.  I checked
> my Solr config and there is no override of the default.
>
> In answer to your questions:
>
> 1) same number of documents - YES ~14,000,000 documents
> 2) identical documents ( + 1 new field each not indexed) - YES, the second
> scenario has one extra field that is stored but not indexed
> 3) same number of deleted documents - YES, there are zero deleted
> documents in both scenarios
> 4) they both were born from scratch ( an empty index) - YES, both start
> from a brand new virtual server with a brand new installation of Solr
>
> I am using the default auto commit, which I think is 15000.
>
> Thanks again for your assistance.
>
> Regards,
>
> David
>
> David Howe
> Java Domain Architect
> Postal Systems
> Level 16, 111 Bourke Street Melbourne VIC 3000
>
> T  0391067904
>
> M  0424036591
>
> E  david.h...@auspost.com.au
>
> W  auspost.com.au
> W  startrack.com.au
>
>


Re: facet.method=uif not working in solr cloud?

2018-02-15 Thread Yonik Seeley
On Wed, Feb 14, 2018 at 7:24 PM, Wei  wrote:
> Thanks Yonik. If uif has big upfront cost when hits solr the first time,
> in solr cloud the same faceting request could hit different replicas in the
> same shard, so that cost will happen at least for the number of replicas?
> If we are doing frequent auto commits, fieldvaluecache will be invalidated
> and uif will have to pay the upfront cost again after each commit?

Right.  It's not good for frequently changing indexes.

-Yonik

>
>
> On Wed, Feb 14, 2018 at 11:51 AM, Yonik Seeley  wrote:
>
>> On Wed, Feb 14, 2018 at 2:28 PM, Wei  wrote:
>> > Thanks all!   It's really great learning.  A bit off the topic, after I
>> > enabled facet.method = uif in solr cloud,  the faceting performance is
>> > actually much worse than the original fc( ~1000 ms with uif  vs ~200 ms
>> > with fc). My cloud has 8 shards with 6 replicas in each shard.  I do see
>> > that fieldValueCache is getting utilized.  Any reason uif could be so
>> > slow?
>>
>> I haven't seen that before.  Are you sure it's not the first time
>> faceting on a field?  uif has big upfront cost, but is usually faster
>> once that cost has been paid.
>>
>>
>> -Yonik
>>
>> > On Tue, Feb 13, 2018 at 7:41 AM, Yonik Seeley  wrote:
>> >
>> >> Great, thanks for tracking that down!
>> >> It's interesting that a mincount of 0 disables uif processing in the
>> >> first place.  IIRC, it's only the hash-based method (as opposed to
>> >> array-based) that can't return zero counts.
>> >>
>> >> -Yonik
>> >>
>> >>
>> >> On Tue, Feb 13, 2018 at 6:17 AM, Alessandro Benedetti
>> >>  wrote:
>> >> > *Update* : This has been actually already solved by Hoss.
>> >> >
>> >> > https://issues.apache.org/jira/browse/SOLR-11711 and this is the Pull
>> >> > Request : https://github.com/apache/lucene-solr/pull/279/files
>> >> >
>> >> > This should go live with 7.3
>> >> >
>> >> > Cheers
>> >> >
>> >> >
>> >> >
>> >> > -
>> >> > ---
>> >> > Alessandro Benedetti
>> >> > Search Consultant, R&D Software Engineer, Director
>> >> > Sease Ltd. - www.sease.io
>> >> > --
>> >> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>> >>
>>


solr ltr jar is not able to recognize MultipleAdditiveTreesModel

2018-02-15 Thread kusha.pande
Hi, I am trying to upload a training model generated from the RankLib jar using
LambdaMART.

The model looks like this:
{"class":"org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
"name":"lambdamartmodel",
"params" : {
"trees" :[
   {
  "id": "1",
  "weight": "0.1",
  "split": {
 "feature": "8",
 "threshold": "7.111333",
 "split": [
{
   "pos": "left",
   "feature": "8",
   "threshold": "5.223557",
   "split": [
  {
 "pos": "left",
 "feature": "8",
 "threshold": "3.2083516",
 "split": [
{
   "pos": "left",
   "feature": "1",
   "threshold": "100.0",
   "split": [
  {
 "pos": "left",
 "feature": "8",
 "threshold": "2.2626402",
 "split": [
{
   "pos": "left",
   "feature": "8",
   "threshold": "2.2594802",
   "split": [
  {
 "pos": "left",
 "output": "-1.6371088"
  },
  {
 "pos": "right",
 "output": "-2.0"
  }
   ]
},
{
   "pos": "right",
   "feature": "8",
   "threshold": "2.4438097",
   "split": [
  {
 "pos": "left",
 "feature": "2",
 "threshold": "0.05",
 "split": [
{
   "pos": "left",
   "output": "2.0"
},
..


getting an exception as :
Exception: Status: 400 Bad Request
Response: {
  "responseHeader":{
"status":400,
"QTime":43},
  "error":{
"metadata":[
  "error-class","org.apache.solr.common.SolrException",
  "root-error-class","java.lang.RuntimeException"],
"msg":"org.apache.solr.ltr.model.ModelException: Model type does not
exist org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
"code":400}}
.

I have used RankLib-2.1-patched.jar to generate the model and converted the
generated xml to json.





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: solr ltr jar is not able to recognize MultipleAdditiveTreesModel

2018-02-15 Thread Brian Yee
I'm not sure if this will solve your problem, but you are using a very old 
version of Ranklib. The most recent version is 2.9.
https://sourceforge.net/projects/lemur/files/lemur/RankLib-2.9/


-Original Message-
From: kusha.pande [mailto:kusha.pa...@gmail.com] 
Sent: Thursday, February 15, 2018 8:12 AM
To: solr-user@lucene.apache.org
Subject: solr ltr jar is not able to recognize MultipleAdditiveTreesModel

Hi I am trying to upload a training model generated from ranklib jar using 
lamdamart mart.

The model is like
{"class":"org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
"name":"lambdamartmodel",
"params" : {
"trees" :[
   {
  "id": "1",
  "weight": "0.1",
  "split": {
 "feature": "8",
 "threshold": "7.111333",
 "split": [
{
   "pos": "left",
   "feature": "8",
   "threshold": "5.223557",
   "split": [
  {
 "pos": "left",
 "feature": "8",
 "threshold": "3.2083516",
 "split": [
{
   "pos": "left",
   "feature": "1",
   "threshold": "100.0",
   "split": [
  {
 "pos": "left",
 "feature": "8",
 "threshold": "2.2626402",
 "split": [
{
   "pos": "left",
   "feature": "8",
   "threshold": "2.2594802",
   "split": [
  {
 "pos": "left",
 "output": "-1.6371088"
  },
  {
 "pos": "right",
 "output": "-2.0"
  }
   ]
},
{
   "pos": "right",
   "feature": "8",
   "threshold": "2.4438097",
   "split": [
  {
 "pos": "left",
 "feature": "2",
 "threshold": "0.05",
 "split": [
{
   "pos": "left",
   "output": "2.0"
}, ..


getting an exception as :
Exception: Status: 400 Bad Request
Response: {
  "responseHeader":{
"status":400,
"QTime":43},
  "error":{
"metadata":[
  "error-class","org.apache.solr.common.SolrException",
  "root-error-class","java.lang.RuntimeException"],
"msg":"org.apache.solr.ltr.model.ModelException: Model type does not
exist org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
"code":400}}
.

I have used RankLib-2.1-patched.jar to generate the model and converted the
generated xml to json.





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-15 Thread Erick Erickson
David:

Rats, the cfs files make everything I'd hoped to understand from the
sizes ambiguous, since they conceal the underlying sizes of the other
extensions. We can approach it a bit differently though. Take one
segment that's _not_ in cfs format, where the total size of all files
making up that segment is near 5GB (the default max segment size), and
compare the individual files for that segment only. What I'm hoping
to find out, of course, is which extensions vary dramatically. But
let's assume for the nonce that the numbers you already have are
comparable if we ignore the .cfs files.

.doc   1094.68   2767.53 - term frequencies
.fdt   1633.21   5387.92 - stored data
.pos    809.23   1272.70 - position information

So the file differences (if borne out) indicate the following:

- .doc: you have more documents or more terms or different options on
your terms [1]
- .fdt: you're storing more fields than you used to. [1]
- .pos: you have more docs or more terms or have position information
turned on where you didn't before. [1]

[1] or lots of deleted docs that haven't been merged away. This
information should be on the admin page for any particular core. I
think this is unlikely, but who knows? NOTE: just because you get 14M from
querying *:* does _not_ say anything about the deleted docs, which
take up space. This is highly unlikely to be your problem, but let's
eliminate the easy stuff ;)

Where I'd go from here, after checking that these ratios are true for a
single like-sized segment in both cases:

1> the LukeRequestHandler can tell you information about exactly how
the index is defined, and using Luke itself can provide you a much
more detailed look at what's actually _in_ your index. You could also
have Luke reconstruct the same doc from your index in each case and
compare. Perhaps your SQL is doing something really unexpected. This
_should_ show you the realized meta-data for each field and let you
pinpoint any different options that have been enabled.

2> compare your Oracle intermediate tables, are they _really_
identical? The ordering shouldn't make any difference at all to Solr
assuming the same docs are being indexed (plus any expected delta).
There's an edge case I can imagine if you hit a "perfect storm" and
one version has a lot more deleted docs than the other that's possibly
the result of reordering, but that's unlikely. The edge case I'm
imagining would be easily verifiable by the two versions having a
radically different number of deleted docs

Best,
Erick




On Thu, Feb 15, 2018 at 7:13 AM, Pratik Patel  wrote:
> @Alessandro I will see if I can reproduce the same issue just by turning
> off omitNorms on field type. I'll open another mail thread if required.
> Thanks.
>
> On Thu, Feb 15, 2018 at 6:12 AM, Howe, David 
> wrote:
>
>>
>> Hi Alessandro,
>>
>> Some interesting testing today that seems to have gotten me closer to what
>> the issue is.  When I run the version of the index that is working
>> correctly against my database table that has the extra field in it, the
>> index suddenly increases in size.  This is even though the data importer is
>> running the same SELECT as before (which doesn't include the extra column)
>> and loads the same number of rows.
>>
>> After scratching my head for a bit and browsing through both versions of
>> the table I am loading from (with and without the extra field), I noticed
>> that the natural ordering of the tables is different.  These tables are
>> "staging" tables that I populate with another set of queries and inserts to
>> get the data into a format that is easy to ingest into Solr.  When I add
>> the extra field to these queries, it changes the Oracle query plan as the
>> field is contained in a different table that I need to join to.  As I don't
>> specify an "ORDER BY" on the query (as I didn't think it would make a
>> difference and would slow the query down), Oracle is free to chose how it
>> orders the result set.  Adding the extra field changes that natural
>> ordering, which affects the order things go into my staging table.  As I
>> don't specify an "ORDER BY" when I select things out of the staging table,
>> my data in the scenario that is working is being loaded in a different
>> order to the scenario which doesn't work.
>>
>> I am currently running full loads to verify this under each scenario, as I
>> have now forced the data in the scenario that doesn't work to be in the
>> same order as the scenario that does.  Will see how this load goes
>> overnight.
>>
>> This leads to the question of what difference does it make to Solr what
>> order I load the data in?
>>
>> I also noticed that the .cfs file is quite large in the second scenario,
>> even though this is supposed to be disabled by default in Solr.  I checked
>> my Solr config and there is no override of the default.
>>
>> In answer to your questions:
>>
>> 1) same number of documents - YES ~14,000,000 documents
>> 2) identical documents ( + 1 new field eac

Re: Reading data from Oracle

2018-02-15 Thread Erick Erickson
Very simple way to know where to start looking: just don't send the
docs to Solr.

Somewhere you have some code like:

SolrClient client = new CloudSolrClient...
while (more docs from the DB) {
doc_list = build_document_list()
client.add(doc_list);
}

Just comment out the client.add line and run the program. Very
often, performance like this is slow in getting the data from the DB
rather than Solr being the bottleneck.

Another very easy thing to look at: Are your CPUs running hot? If your
Solr nodes are just idling along at, say, 10% then you're not feeding
docs fast enough.

Final thing to check: Are you batching? Batching docs can
significantly increase throughput, see:
https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
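
As a rough sketch of batching with SolrJ (the ZooKeeper address, collection
name, batch size and the DB-reading helpers are placeholders):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchingExample {
    public static void main(String[] args) throws Exception {
        // The ZooKeeper address, collection name and batch size are placeholders.
        try (SolrClient client = new CloudSolrClient.Builder().withZkHost("zkhost:2181").build()) {
            List<SolrInputDocument> batch = new ArrayList<>();
            while (moreRowsFromTheDb()) {              // stand-in for the JDBC read loop
                batch.add(buildDoc());
                if (batch.size() >= 1000) {            // send batches, not one doc per request
                    client.add("mycollection", batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add("mycollection", batch);     // flush the last partial batch
            }
            client.commit("mycollection");
        }
    }

    // Placeholders standing in for the database read and document construction.
    private static boolean moreRowsFromTheDb() { return false; }
    private static SolrInputDocument buildDoc() { return new SolrInputDocument(); }
}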

And Solr simply shouldn't be running out of memory unless you're
sending utterly massive documents or sending huge numbers of docs.
When indexing, the default ramBufferSizeMB is 100, meaning that when
the in-memory structures exceed 100MB, they are flushed to disk. What
are your commit intervals (both soft and hard)?

Best,
Erick

On Thu, Feb 15, 2018 at 5:56 AM, Bernd Fehling
 wrote:
> So it is not SolrJ, but Solr is your problem?
>
> In your first email there was nothing about heap exceptions, only the runtime 
> about loading.
>
> What do you means by "injecting too many rows", what is "too many"?
>
> Some numbers while loading from scratch:
> - single node 412GB index
> - 92 fields
> - 123.6 million docs
> - 1.937 billion terms
> - loading from file system
> - indexing time 9 hrs 5 min
> - using SolJ ConcurrentUpdateSolrClient
> --- queueSize=1, threads=12
> --- waitFlush=true, waitSearcher=true, softcommit=false
> And, Solr must be configured to "swallow" all this :-)
>
>
> You say "8GB per node" so it is SolrCloud?
>
> Anyhting else than heap exception?
>
> How many commits?
>
> Regards
> Bernd
>
>
> Am 15.02.2018 um 10:31 schrieb LOPEZ-CORTES Mariano-ext:
>> Injecting too many rows into Solr throws Java heap exception (Higher memory? 
>> We have 8GB per node).
>>
>> Have DIH support for paging queries?
>>
>> Thanks!
>>
>> -Message d'origine-
>> De : Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de]
>> Envoyé : jeudi 15 février 2018 10:13
>> À : solr-user@lucene.apache.org
>> Objet : Re: Reading data from Oracle
>>
>> And where is the bottleneck?
>>
>> Is it reading from Oracle or injecting to Solr?
>>
>> Regards
>> Bernd
>>
>>
>> Am 15.02.2018 um 08:34 schrieb LOPEZ-CORTES Mariano-ext:
>>> Hello
>>>
>>> We have to delete our Solr collection and feed it periodically from an 
>>> Oracle database (up to 40M rows).
>>>
>>> We've done the following test: From a java program, we read chunks of data 
>>> from Oracle and inject to Solr (via Solrj).
>>>
>>> The problem : It is really really slow (1'5 nights).
>>>
>>> Is there one faster method to do that ?
>>>
>>> Thanks in advance.
>>>


Re: Solr performance issue

2018-02-15 Thread Erick Erickson
Srinivas:

Not an answer to your question, but when DIH starts getting this
complicated, I start to seriously think about SolrJ, see:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

In particular, it moves the heavy lifting of acquiring the data from a
Solr node (which I'm assuming also has to index docs) to "some
client". It also lets you play some tricks with the code to make
things faster.

Best,
Erick

On Thu, Feb 15, 2018 at 1:00 AM, Srinivas Kashyap
 wrote:
> Hi,
>
> I have implemented 'SortedMapBackedCache' in my SqlEntityProcessor for the 
> child entities in data-config.xml. And i'm using the same for full-import 
> only. And in the beginning of my implementation, i had written delta-import 
> query to index the modified changes. But my requirement grew and i have 17 
> child entities for a single parent entity now. When doing delta-import for 
> huge data, the number of requests being made to datasource(database)  became 
> more and CPU utilization was 100% when concurrent users started modifying the 
> data. For this instead of calling delta-import which imports based on last 
> index time, I did full-import('SortedMapBackedCache' ) based on last index 
> time.
>
> Though the parent entity query would return only records that are modified, 
> the child entity queries pull all the data from the database and the indexing 
> happens 'in-memory' which is causing the JVM memory go out of memory.
>
> Is there a way to specify in the child query entity to pull the record 
> related to parent entity in the full-import mode.
>
> Thanks and Regards,
> Srinivas Kashyap
>


RE: Solr search word NOT followed by another word

2018-02-15 Thread Allison, Timothy B.
I've been away from the ComplexQueryParser for a while, and I was wrong when I 
said in my earlier email that no currently included Solr parser generates a 
SpanNotQuery.  

You're right, Emir, that the ComplexQueryParser does generate a SpanNotQuery, 
and, yes, I just tried this with 7.2.1, and it retrieves "Leonardo is the name of 
Leonardo da Vinci".

However, it fails to retrieve:
a) "Leonardo da is the name of Leonardo da Vinci"
and
b) "Leonardo Vinci is the name of Leonardo da Vinci"

because the SpanNot exclude is a SpanOr ("da" or "vinci") after the rewrite: 

spanNot(name:leonardo, spanNear([name:leonardo, spanOr([name:da, name:vinci])], 
0, true), 0, 0)







-Original Message-
From: Emir Arnautović [mailto:emir.arnauto...@sematext.com] 
Sent: Tuesday, February 13, 2018 11:23 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr search word NOT followed by another word

Hi Ivan,
Which version of Solr do you use? I’ve just tried it on 6.5.1 and it returned 
expected.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch 
Consulting Support Training - http://sematext.com/



> On 13 Feb 2018, at 16:08, ivan  wrote:
> 
> Hi Emir,
> 
> unfortunately that does not work, since i'm not getting a match for my 
> third example ("Leonardo is the name of Leonardo da Vinci") because i 
> have both "Leonardo" and "Leonardo da Vinci" in the same field. I'm 
> fine with having "Leonardo da Vinci" as long as i have another 
> "Leonardo" (NOT followed by da Vinci).
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Issue Using JSON Facet API Buckets in Solr 6.6

2018-02-15 Thread Antelmo Aguilar
Hi,

Here are two pastebins.  The first is the full complete response with the
search parameters used.  The second is the stack trace from the logs:

https://pastebin.com/rsHvKK63

https://pastebin.com/8amxacAj

I am not using any custom code or plugins with the Solr instance.

Please let me know if you need anything else and thanks for looking into
this.

-Antelmo

On Wed, Feb 14, 2018 at 12:56 PM, Yonik Seeley  wrote:

> Could you provide the full stack trace containing "Invalid Date
> String"  and the full request that causes it?
> Are you using any custom code/plugins in Solr?
> -Yonik
>
>
> On Mon, Feb 12, 2018 at 4:55 PM, Antelmo Aguilar  wrote:
> > Hi,
> >
> > I was using the following part of a query to get facet buckets so that I
> > can use the information in the buckets for some post-processing:
> >
> > "json":
> > "{\"filter\":[\"bundle:pop_sample\",\"has_abundance_data_
> b:true\",\"has_geodata:true\",\"${project}\"],\"facet\":{\"
> term\":{\"type\":\"terms\",\"limit\":-1,\"field\":\"${term:
> species_category}\",\"facet\":{\"collection_dates\":{\"type\
> ":\"terms\",\"limit\":-1,\"field\":\"collection_date\",\"
> facet\":{\"collection\":
> > {\"type\":\"terms\",\"field\":\"collection_assay_id_s\",\"
> facet\":{\"abnd\":\"sum(div(sample_size_i,
> > collection_duration_days_i))\""
> >
> > Sorry if it is hard to read.  Basically what is was doing was getting the
> > following buckets:
> >
> > First bucket will be categorized by "Species category" by default unless
> we
> > pass in the request the "term" parameter which we will categories the
> first
> > bucket by whatever "term" is set to.  Then inside this first bucket, we
> > create another buckets of the "Collection date" category.  Then inside
> the
> > "Collection date" category buckets, we would use some functions to do
> some
> > calculations and return those calculations inside the "Collection date"
> > category buckets.
> >
> > This query is working fine in Solr 6.2, but I upgraded our instance of
> Solr
> > 6.2 to the latest 6.6 version.  However it seems that upgrading to Solr
> 6.6
> > broke the above query.  Now it complains when trying to create the
> buckets
> > of the "Collection date" category.  I get the following error:
> >
> > Invalid Date String:'Fri Aug 01 00:00:00 UTC 2014'
> >
> > It seems that when creating the buckets of a date field, it does some
> > conversion of the way the date is stored and causes the error to appear.
> > Does anyone have an idea as to why this error is happening?  I would
> really
> > appreciate any help.  Hopefully I was able to explain my issue well.
> >
> > Thanks,
> > Antelmo
>


RE: Solr search word NOT followed by another word

2018-02-15 Thread Allison, Timothy B.
I just updated the SpanQueryParser (LUCENE-5205) and its Solr plugin 
(SOLR-5410) for master and 7.2.1.

What version of Solr are you using and which version of the plugin?

These should be available on maven central shortly: version 7.2-0.1

<dependency>
  <groupId>org.tallison.solr</groupId>
  <artifactId>solr-5410</artifactId>
  <version>7.2-0.1</version>
</dependency>


Or you can fork: https://github.com/tballison/lucene-addons/tree/7.2-0.1


-Original Message-
From: ivan [mailto:i...@presstoday.com] 
Sent: Wednesday, February 14, 2018 6:42 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr search word NOT followed by another word

Hi Timothy,

i'm trying to use your Parser, but i'm having some trouble with the versions of 
solr\lucene.
I'm trying to use version 6.4.1 but i'm facing a lot of incompatibilities with 
version 5. Is there any updated version of the plugin?




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr search word NOT followed by another word

2018-02-15 Thread Emir Arnautović
Hi,
I did not provide the right query. If you query with {!complexphrase 
df=name}"Leonardo -da -Vinci" it all works as expected. This matches all three 
docs.
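
From SolrJ that would be roughly (the URL and collection name are placeholders;
"name" is the field from the example):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ComplexPhraseExample {
    public static void main(String[] args) throws Exception {
        // The URL and collection name are placeholders; "name" is the field from the example.
        try (HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            SolrQuery query = new SolrQuery("{!complexphrase df=name}\"Leonardo -da -Vinci\"");
            QueryResponse response = client.query("mycollection", query);
            System.out.println("matches: " + response.getResults().getNumFound());
        }
    }
}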

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 15 Feb 2018, at 19:51, Allison, Timothy B.  wrote:
> 
> I've been away from the ComplexQueryParser for a while, and I was wrong when 
> I said in my earlier email that no currently included Solr parse generates a 
> SpanNotQuery.  
> 
> You're right, Emir, that the ComplexQueryParser does generate a SpanNotQuery, 
> and, y, I just tried this with 7.2.1, and it retrieves "Leonardo is the name 
> of Leonardo da Vinci".
> 
> However, if fails to retrieve :
> a) "Leonardo da is the name of Leonardo da Vinci"
> and
> b) "Leonardo Vinci is the name of Leonardo da Vinci"
> 
> because the SpanNot exclude is a SpanOr ("da" or "vinci") after the rewrite: 
> 
> spanNot(name:leonardo, spanNear([name:leonardo, spanOr([name:da, 
> name:vinci])], 0, true), 0, 0)
> 
> 
> 
> 
> 
> 
> 
> -Original Message-
> From: Emir Arnautović [mailto:emir.arnauto...@sematext.com] 
> Sent: Tuesday, February 13, 2018 11:23 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr search word NOT followed by another word
> 
> Hi Ivan,
> Which version of Solr do you use? I’ve just tried it on 6.5.1 and it returned 
> expected.
> 
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
> Elasticsearch Consulting Support Training - http://sematext.com/
> 
> 
> 
>> On 13 Feb 2018, at 16:08, ivan  wrote:
>> 
>> Hi Emir,
>> 
>> unfortunately that does not work, since i'm not getting a match for my 
>> third example ("Leonardo is the name of Leonardo da Vinci") because i 
>> have both "Leonardo" and "Leonardo da Vinci" in the same field. I'm 
>> fine with having "Leonardo da Vinci" as long as i have another 
>> "Leonardo" (NOT followed by da Vinci).
>> 
>> 
>> 
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> 



RE: Solr search word NOT followed by another word

2018-02-15 Thread Allison, Timothy B.
Nice.  Thank you!

-Original Message-
From: Emir Arnautović [mailto:emir.arnauto...@sematext.com] 
Sent: Thursday, February 15, 2018 2:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr search word NOT followed by another word

Hi,
I did not provide the right query. If you query as {!complexphrase 
df=name}”Leonardo -da -Vinci” all works as expected. This matches all three 
doc. 

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch 
Consulting Support Training - http://sematext.com/


Re: Reading data from Oracle

2018-02-15 Thread Shawn Heisey
On 2/15/2018 12:34 AM, LOPEZ-CORTES Mariano-ext wrote:
> We've done the following test: From a java program, we read chunks of data 
> from Oracle and inject to Solr (via Solrj).
>
> The problem : It is really really slow (1'5 nights).
>
> Is there one faster method to do that ?

Are you indexing with a single thread?  The way to speed up indexing is
to index with many threads or processes simultaneously.

Using ConcurrentUpdateSolrClient as Michal mentioned is one way to get
multi-threading, but if you go this route, your program will never know
about any indexing errors.  Errors will be logged, but your program
won't know about them.  This client is good for initial bulk indexing,
but when things become more automated, you're probably going to want
automatic detection and notification when there's a problem.

If you care about error handling, then you're going to have to handle
multiple threads or processes in your own program.  If you don't care
about error handling, then go ahead and use ConcurrentUpdateSolrClient. 
But if your database is the bottleneck, that will not make things faster.
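
As a rough sketch of handling the threads yourself (the URL, collection name,
thread count and the Oracle-reading helper are placeholders):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        // The URL, collection name, thread count and the Oracle read are placeholders.
        try (HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            ExecutorService pool = Executors.newFixedThreadPool(8);
            for (final List<SolrInputDocument> batch : readBatchesFromOracle()) {
                pool.submit(() -> {
                    try {
                        client.add("mycollection", batch);   // each worker sends its own batches
                    } catch (Exception e) {
                        // unlike ConcurrentUpdateSolrClient, errors surface here and can be handled
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            client.commit("mycollection");
        }
    }

    // Placeholder standing in for the JDBC code that builds batches of documents from Oracle rows.
    private static Iterable<List<SolrInputDocument>> readBatchesFromOracle() {
        return java.util.Collections.emptyList();
    }
}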

For something later in the thread:

One common problem with dataimport and large tables is that almost every
JDBC driver will read the entire result of the SELECT statement into
memory before providing that information to the program that did the
SELECT.  For large tables, this information can be larger than the Java
heap, and that will cause the program to encounter an OutOfMemoryError. 
To solve this, you will need to ask Oracle how to disable this behavior
with their JDBC driver.

For MySQL, the solution is to set the batchSize parameter in DIH to -1,
which results in DIH setting a JDBC fetch size of Integer.MIN_VALUE ...
which tells the MySQL driver to stream results instead of putting them
all into memory.  For Microsoft SQL server, you need a URL parameter for
the JDBC url, or simply to upgrade the JDBC driver to a newer version
that doesn't do this by default.  I have not been able to figure out how
to get the Oracle driver to do it.  Chances are that it will be a JDBC
url parameter.

https://wiki.apache.org/solr/DataImportHandlerFaq

Thanks,
Shawn



Re: Solr performance issue

2018-02-15 Thread Shawn Heisey
On 2/15/2018 2:00 AM, Srinivas Kashyap wrote:
> I have implemented 'SortedMapBackedCache' in my SqlEntityProcessor for the 
> child entities in data-config.xml. And i'm using the same for full-import 
> only. And in the beginning of my implementation, i had written delta-import 
> query to index the modified changes. But my requirement grew and i have 17 
> child entities for a single parent entity now. When doing delta-import for 
> huge data, the number of requests being made to datasource(database)  became 
> more and CPU utilization was 100% when concurrent users started modifying the 
> data. For this instead of calling delta-import which imports based on last 
> index time, I did full-import('SortedMapBackedCache' ) based on last index 
> time.
>
> Though the parent entity query would return only records that are modified, 
> the child entity queries pull all the data from the database and the indexing 
> happens 'in-memory' which is causing the JVM memory go out of memory.

Can you provide your DIH config file (with passwords redacted) and the
precise URL you are using to initiate dataimport?  Also, I would like to
know what field you have defined as your uniqueKey.  I may have more
questions about the data in your system, depending on what I see.

That cache implementation should only cache entries from the database
that are actually requested.  If your query is correctly defined, it
should not pull all records from the DB table.

> Is there a way to specify in the child query entity to pull the record 
> related to parent entity in the full-import mode.

If I am understanding your question correctly, this is one of the fairly
basic things that DIH does.  Look at this config example in the
reference guide:

https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#configuring-the-dih-configuration-file

In the entity named feature in that example config, the query string
uses ${item.ID} to reference the ID column from the parent entity, which
is item.

I should warn you that a cached entity does not always improve
performance.  This is particularly true if the lookup into the cache is
the information that goes to your uniqueKey field.  When the lookup is
by uniqueKey, every single row requested from the database will be used
exactly once, so there's not really any point to caching it.

Thanks,
Shawn



RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-15 Thread Howe, David

Hi Erick,

I have the full dump of the Solr index file sizes as well if that is of any 
help.  I have attached it below this message.

We don't have any deleted docs in our index, as we always build it from a brand 
new virtual machine with a brand new installation of Solr.

The ordering is definitely making a difference, as I can run the same indexing 
configuration over a table with the same data just in different orders and it 
produces these vastly different results.  I have been chasing this for a couple 
of weeks trying to work out what the difference is when we just add one extra 
field.  The difference that I have found is that the extra field causes the 
staging table population query to be optimised differently and to select the 
records in a different sequence.  When I force the records back to their 
original sequence, the index goes back to being small again.

I'm currently re-building my staging data to try and get it into the same order 
as before and including the extra field.  I will post the file sizes again when 
I have that result.

Regards,

David

total 14600404
-rw-r--r-- 1 solr solr 97 Feb 14 01:34 _7l.dii
-rw-r--r-- 1 solr solr   83831801 Feb 14 01:34 _7l.dim
-rw-r--r-- 1 solr solr 1431645451 Feb 14 01:33 _7l.fdt
-rw-r--r-- 1 solr solr 381994 Feb 14 01:33 _7l.fdx
-rw-r--r-- 1 solr solr   6370 Feb 14 01:34 _7l.fnm
-rw-r--r-- 1 solr solr   29353048 Feb 14 01:34 _7l.nvd
-rw-r--r-- 1 solr solr463 Feb 14 01:34 _7l.nvm
-rw-r--r-- 1 solr solr606 Feb 14 01:34 _7l.si
-rw-r--r-- 1 solr solr  734701117 Feb 14 01:34 _7l_Lucene50_0.doc
-rw-r--r-- 1 solr solr  335043096 Feb 14 01:34 _7l_Lucene50_0.pos
-rw-r--r-- 1 solr solr   34248274 Feb 14 01:34 _7l_Lucene50_0.tim
-rw-r--r-- 1 solr solr 624945 Feb 14 01:34 _7l_Lucene50_0.tip
-rw-r--r-- 1 solr solr  165958502 Feb 14 01:34 _7l_Lucene70_0.dvd
-rw-r--r-- 1 solr solr   2581 Feb 14 01:34 _7l_Lucene70_0.dvm
-rw-r--r-- 1 solr solr405 Feb 14 01:46 _9p.cfe
-rw-r--r-- 1 solr solr   38776749 Feb 14 01:46 _9p.cfs
-rw-r--r-- 1 solr solr452 Feb 14 01:46 _9p.si
-rw-r--r-- 1 solr solr 97 Feb 14 02:07 _cm.dii
-rw-r--r-- 1 solr solr   83111509 Feb 14 02:07 _cm.dim
-rw-r--r-- 1 solr solr 1419981112 Feb 14 02:02 _cm.fdt
-rw-r--r-- 1 solr solr 379544 Feb 14 02:02 _cm.fdx
-rw-r--r-- 1 solr solr   6370 Feb 14 02:07 _cm.fnm
-rw-r--r-- 1 solr solr   29049434 Feb 14 02:07 _cm.nvd
-rw-r--r-- 1 solr solr463 Feb 14 02:07 _cm.nvm
-rw-r--r-- 1 solr solr606 Feb 14 02:07 _cm.si
-rw-r--r-- 1 solr solr  728509370 Feb 14 02:07 _cm_Lucene50_0.doc
-rw-r--r-- 1 solr solr  332343997 Feb 14 02:07 _cm_Lucene50_0.pos
-rw-r--r-- 1 solr solr   34361884 Feb 14 02:07 _cm_Lucene50_0.tim
-rw-r--r-- 1 solr solr 658404 Feb 14 02:07 _cm_Lucene50_0.tip
-rw-r--r-- 1 solr solr  164612509 Feb 14 02:07 _cm_Lucene70_0.dvd
-rw-r--r-- 1 solr solr   2581 Feb 14 02:07 _cm_Lucene70_0.dvm
-rw-r--r-- 1 solr solr405 Feb 14 02:09 _fb.cfe
-rw-r--r-- 1 solr solr   44333425 Feb 14 02:09 _fb.cfs
-rw-r--r-- 1 solr solr452 Feb 14 02:09 _fb.si
-rw-r--r-- 1 solr solr 97 Feb 14 02:24 _h2.dii
-rw-r--r-- 1 solr solr   77079684 Feb 14 02:24 _h2.dim
-rw-r--r-- 1 solr solr 1304390074 Feb 14 02:22 _h2.fdt
-rw-r--r-- 1 solr solr 347494 Feb 14 02:22 _h2.fdx
-rw-r--r-- 1 solr solr   6370 Feb 14 02:24 _h2.fnm
-rw-r--r-- 1 solr solr   26756876 Feb 14 02:24 _h2.nvd
-rw-r--r-- 1 solr solr463 Feb 14 02:24 _h2.nvm
-rw-r--r-- 1 solr solr606 Feb 14 02:24 _h2.si
-rw-r--r-- 1 solr solr  669875920 Feb 14 02:24 _h2_Lucene50_0.doc
-rw-r--r-- 1 solr solr  305954906 Feb 14 02:24 _h2_Lucene50_0.pos
-rw-r--r-- 1 solr solr   32019733 Feb 14 02:24 _h2_Lucene50_0.tim
-rw-r--r-- 1 solr solr 619562 Feb 14 02:24 _h2_Lucene50_0.tip
-rw-r--r-- 1 solr solr  151772808 Feb 14 02:24 _h2_Lucene70_0.dvd
-rw-r--r-- 1 solr solr   2497 Feb 14 02:24 _h2_Lucene70_0.dvm
-rw-r--r-- 1 solr solr405 Feb 14 02:45 _mx.cfe
-rw-r--r-- 1 solr solr  277937779 Feb 14 02:45 _mx.cfs
-rw-r--r-- 1 solr solr452 Feb 14 02:45 _mx.si
-rw-r--r-- 1 solr solr 97 Feb 14 02:47 _n9.dii
-rw-r--r-- 1 solr solr   82335510 Feb 14 02:47 _n9.dim
-rw-r--r-- 1 solr solr 1400595065 Feb 14 02:46 _n9.fdt
-rw-r--r-- 1 solr solr 374259 Feb 14 02:46 _n9.fdx
-rw-r--r-- 1 solr solr   6370 Feb 14 02:47 _n9.fnm
-rw-r--r-- 1 solr solr   28775974 Feb 14 02:47 _n9.nvd
-rw-r--r-- 1 solr solr463 Feb 14 02:47 _n9.nvm
-rw-r--r-- 1 solr solr606 Feb 14 02:47 _n9.si
-rw-r--r-- 1 solr solr  719183309 Feb 14 02:46 _n9_Lucene50_0.doc
-rw-r--r-- 1 solr solr  328214265 Feb 14 02:46 _n9_Lucene50_0.pos
-rw-r--r-- 1 solr solr   34098919 Feb 14 02:46 _n9_Lucene50_0.tim
-rw-r--r-- 1 solr solr 654313 Feb 14 02:46 _n9_Lucene50_0.tip
-rw-r--r-- 1 solr solr  163220960 Feb 14 02:46 _n9_Lucene70_0.dvd
-rw-r--r-- 1 solr solr   2560 Feb 14 02:46 _n9_Lucene70_0.dvm
-rw-r--r-- 1 solr solr405 Feb 14 02:52 _ns.cfe
-rw-r--r--

Solr streaming expression - options for Full Outer Join

2018-02-15 Thread Ganesh Sethuraman
I am using Solr 7.2.1. I would like to perform a full outer join (emit documents
from both left and right, and combine them when they match) with Solr streaming
decorators on two collections and "update" the result into a new destination
collection. I see a "merge" decorator option exists, but it seems to return two
JSON documents for the same id field from these two collections instead of one
combined document. The leftOuterJoin seems to do this combining correctly,
returning documents with a matched "id" field as one document. But leftOuterJoin
is not exactly what I want; I want a full outer join. Because merge returns two
documents with the same id, only the second document ends up in the destination
collection, not both. Is there a way to achieve what I am trying to do? Any help
is appreciated. Just to give more details, here is what I am doing:

commit(destinationCollection,
  batchSize=1,
  update(destinationCollection,
    batchSize=2,
    merge(
      search(col1, q=id:5, fl="id, collection1_field1", sort="id asc", qt="/export"),
      search(col2, q=id:5, fl="id, collection2_field2", sort="id asc", qt="/export"),
      on="id asc")))


Merge response:
{
  "result-set": {
    "docs": [
      { "id": "5", "collection1_field1": 64 },
      { "id": "5", "collection2_field2": 0 },
      { "EOF": true, "RESPONSE_TIME": 17 }
    ]
  }
}

But I need just one document in the response, with the fields combined.


Solr running on Tomcat

2018-02-15 Thread GVK Prasad

I read some posts on setting up Solr to run on Tomcat, but all these posts are 
about Solr version 4.0 or earlier.
I am thinking of hosting Solr on Tomcat for scalability.
Any recommendations on this?

Prasad






Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-15 Thread Erick Erickson
This isn't terribly useful without a similar dump of "the other" index
directory. The point is to compare the different extensions for some
segment where the sum of all the files in that segment is roughly
equal. So if you have a listing of the old index around, that would
help.

bq: We don't have any deleted docs in our index, as we always build it
from a brand new virtual machine with a brand new installation of
Solr.

Well, that's an assumption I want to check. Here's the problem. It's
possible that the ordering bit you're talking about is really masking
indexing the same uniqueKey multiple times. Since indexing a doc
with the same uniqueKey just marks the old doc as deleted, the old
doc will take up room in your index until it's purged during segment
merging. This is a _really_ long shot mind you, I have a hard time
believing that this is the root cause here. It's worth checking
though. Even doing a q=*:* won't help since that doesn't count deleted
docs. Take a quick glance at the admin overview page for a core and
check, there is "maxDoc", "deletedDocs" and "numDocs". I expect
deletedDocs will be zero and numDocs and maxDoc will be your 14M, but
this problem is so odd that I'm covering as many  bases as I can think
of ;)

Now, ordering may appear to change things, but that could simply be
that the deleted docs don't happen to fall in segments that are
merged. Again, this is unlikely but possible.

The shortcut here would be to optimize afterwards. In the usual course
of events this should _not_ be necessary (or even desirable) unless
you do it every time you build your index for arcane reasons, see:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/.
But if you do optimize (forceMerge) and the size drops back to more
reasonable levels it would be a clue.

Ordering simply should not affect the final index size except for,
possibly, changing the number of deleted docs in the index largely
through chance. If you do see a dramatic difference, try the optimize
thing to check.

If simple ordering _does_ really make a difference (outside of number
of deleted docs)  my understanding of Solr is going to undergo a
revision. And we'll probably be raising a JIRA or two ;)

Now, what I really expect the issue to be is one of two things:
1> you have some options turned on now that weren't before, either
through some innocent-seeming change, a change in the internal
defaults etc.
2> your SQL with the extra field is behaving unexpectedly.

The proof is of course in the pudding...

Best,
Erick



On Thu, Feb 15, 2018 at 5:15 PM, Howe, David  wrote:
>
> Hi Erick,
>
> I have the full dump of the Solr index file sizes as well if that is of any 
> help.  I have attached it below this message.
>
> We don't have any deleted docs in our index, as we always build it from a 
> brand new virtual machine with a brand new installation of Solr.
>
> The ordering is definitely making a difference, as I can run the same 
> indexing configuration over a table with the same data just in different 
> orders and it produces these vastly different results.  I have been chasing 
> this for a couple of weeks trying to work out what the difference is when we 
> just add one extra field.  The difference that I have found is that the extra 
> field causes the staging table population query to be optimised differently 
> and to select the records in a different sequence.  When I force the records 
> back to their original sequence, the index goes back to being small again.
>
> I'm currently re-building my staging data to try and get it into the same 
> order as before and including the extra field.  I will post the file sizes 
> again when I have that result.
>
> Regards,
>
> David
>
> total 14600404
> -rw-r--r-- 1 solr solr 97 Feb 14 01:34 _7l.dii
> -rw-r--r-- 1 solr solr   83831801 Feb 14 01:34 _7l.dim
> -rw-r--r-- 1 solr solr 1431645451 Feb 14 01:33 _7l.fdt
> -rw-r--r-- 1 solr solr 381994 Feb 14 01:33 _7l.fdx
> -rw-r--r-- 1 solr solr   6370 Feb 14 01:34 _7l.fnm
> -rw-r--r-- 1 solr solr   29353048 Feb 14 01:34 _7l.nvd
> -rw-r--r-- 1 solr solr463 Feb 14 01:34 _7l.nvm
> -rw-r--r-- 1 solr solr606 Feb 14 01:34 _7l.si
> -rw-r--r-- 1 solr solr  734701117 Feb 14 01:34 _7l_Lucene50_0.doc
> -rw-r--r-- 1 solr solr  335043096 Feb 14 01:34 _7l_Lucene50_0.pos
> -rw-r--r-- 1 solr solr   34248274 Feb 14 01:34 _7l_Lucene50_0.tim
> -rw-r--r-- 1 solr solr 624945 Feb 14 01:34 _7l_Lucene50_0.tip
> -rw-r--r-- 1 solr solr  165958502 Feb 14 01:34 _7l_Lucene70_0.dvd
> -rw-r--r-- 1 solr solr   2581 Feb 14 01:34 _7l_Lucene70_0.dvm
> -rw-r--r-- 1 solr solr405 Feb 14 01:46 _9p.cfe
> -rw-r--r-- 1 solr solr   38776749 Feb 14 01:46 _9p.cfs
> -rw-r--r-- 1 solr solr452 Feb 14 01:46 _9p.si
> -rw-r--r-- 1 solr solr 97 Feb 14 02:07 _cm.dii
> -rw-r--r-- 1 solr solr   83111509 Feb 14 02:07 _cm.dim
> -rw-r--r-- 1 solr solr 1419981112 Feb 14 02:02 _cm.fdt
> -rw-r--r-- 1 solr solr 379544 Feb 14 02:02 _cm.f

Re: Solr running on Tomcat

2018-02-15 Thread Erick Erickson
Why do you think Solr on Tomcat == scalability?

Solr has not been distributed as a war file for some time, see:
https://wiki.apache.org/solr/WhyNoWar Just run it as a server.
Eventually it won't even use Jetty, but something like Netty etc.

Best,
Erick

On Thu, Feb 15, 2018 at 7:54 PM, GVK Prasad  wrote:
>
> I read some posts on setting up Solr to Run on Tomcat. But all these posts 
> are about Solr version 4.0 or earlier.
> I am thinking of hosting  Solr on Tomcat for scalability.
> Any recommendation on this.
>
> Prasad
>
>
>
>


In Place Updates not work as expected

2018-02-15 Thread mganeshs
All,

I have Solr documents (say 1M; in production it would be even more) which have a
lot of fields and are fairly large. We have a requirement where we need to
update a specific field or add a new field to such a document. Since we have to
do this for all 1M documents, it takes too long, and that's not acceptable.

So we thought of using "In Place Updates".

As per the documentation, we have made sure it meets these criteria:
---
*An atomic update operation is performed using this approach only when the
fields to be updated meet these three conditions:

are non-indexed (indexed="false"), non-stored (stored="false"), single
valued (multiValued="false") numeric docValues (docValues="true") fields;

the _version_ field is also a non-indexed, non-stored single valued
docValues field; and,

copy targets of updated fields, if any, are also non-indexed, non-stored
single valued numeric docValues fields.*
---
To check whether it's working as expected:
* First we tried to update a normal field, and it took around 1.5 hours to
update all 1M docs, as the complete document gets re-indexed.

* We also tried to update the docValues field, and it also took around 1.5
hours to complete for 1M docs.

Since in the second case we are updating a docValues field, and the complete
document shouldn't be re-indexed, shouldn't that take less time?

What could be going wrong? I am using Solr 6.5.1. Is this a bug or expected
behavior?
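
For reference, a rough SolrJ sketch of the kind of per-document update we are
sending (the URL, collection and field names are placeholders):

import java.util.Collections;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class InPlaceUpdateSketch {
    public static void main(String[] args) throws Exception {
        // The URL, collection and field names are placeholders.
        try (HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            // "set" on a non-indexed, non-stored, single-valued numeric docValues field
            doc.addField("popularity_dv", Collections.singletonMap("set", 42));
            client.add("mycollection", doc);
            client.commit("mycollection");
        }
    }
}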

Regards,




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html