Apache Solr Question

2016-11-03 Thread Chien Nguyen
Hi everyone! 
I'm a newbie with Apache Solr. I've read some documents about it, but I
can't answer some questions.
1. How many documents can Solr search at a moment?
2. Can Solr index media data?
3. What's the max size of a document that Solr can index?
Can you help me and explain these for me? Please! It's important to me.
Thank you so much!






Re: edismax

2016-11-03 Thread Rafael Merino García
Hi,
You were absolutely right, there was a *string* field defined in the qf
parameter...
Using the mm.autoRelax parameter did the trick
Thank you so much!
Regards
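
For reference, a minimal SolrJ sketch of the resulting request, assuming SolrJ 6.x; the core URL, qf fields and query text are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class EdismaxAutoRelaxExample {
  public static void main(String[] args) throws Exception {
    // Placeholder core URL and field names; adjust to the real schema.
    try (HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
      SolrQuery q = new SolrQuery("the quick fox");
      q.set("defType", "edismax");
      q.set("qf", "title_txt body_txt");
      q.set("mm", "100%");
      // Relax mm when a clause is dropped (e.g. by a stopword filter) from some
      // but not all qf fields, which is the failure mode described above.
      q.set("mm.autoRelax", "true");
      QueryResponse rsp = client.query(q);
      System.out.println("hits: " + rsp.getResults().getNumFound());
    }
  }
}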

On Wed, Nov 2, 2016 at 5:15 PM, Vincenzo D'Amore  wrote:

> Hi Rafael,
>
> I suggest checking all the fields present in your qf, looking for one (or
> more) where the stopwords filter is missing.
> Very likely there is a field in your qf where the stopword filter is
> missing.
>
> The issue you're experiencing is caused by an attempt to match a stopword
> on a "non-stopword-filtered" field, causing mm=100% to fail.
>
> I also suggest taking a look at the mm.autoRelax param for the edismax parser.
>
> Best regards,
> Vincenzo
>
> On Wed, Nov 2, 2016 at 4:07 PM, Rafael Merino García <
> rmer...@paradigmadigital.com> wrote:
>
> > Hi guys,
> >
> > I came across the following issue. I configured an edismax query parser
> > where *mm=100%*, and when the user types in a stopword, no result is being
> > returned (stopwords are filtered before indexing, but, somehow, either they
> > are not being filtered before searching or they are taken into account when
> > computing *mm*). Reading the documentation about the edismax parser (latest
> > version) I found the parameter *stopwords*, but changing it has no
> > effect...
> >
> > Thanks in advance
> >
> > Regards
> >
>
>
>
> --
> Vincenzo D'Amore
> email: v.dam...@gmail.com
> skype: free.dev
> mobile: +39 349 8513251
>


Re: Apache Solr Question

2016-11-03 Thread Rick Leir


On November 3, 2016 4:49:07 AM EDT, Chien Nguyen  wrote:
>Hi everyone! 
>I'm a newbie in using Apache Solr. 

Welcome!

> I've read some documents about it.
>But i
>can't answer some questions. 
>1. How many documents Solr can search at a moment??

I would like to say unlimited. But it depends on your hardware. Solr can index 
huge numbers of documents.

>2. Can Solr index the media data?? 

Meta data? Yes

>3. What's the max size of document that Solr can index??? 

Again, huge. You could read some intros and blogs on Solr, then come back and 
talk more. 

>Can you help me and explain it for me??? Please! It's important to me.
>Thank you so much!
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/Apache-Solr-Question-tp4304308.html
>Sent from the Solr - User mailing list archive at Nabble.com.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: High CPU Usage in export handler

2016-11-03 Thread Joel Bernstein
Are you doing heavy writes at the time?

How many concurrent reads are are happening?

What version of Solr are you using?

What is the field definition for the double, is it docValues?




Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Nov 3, 2016 at 12:56 AM, Ray Niu  wrote:

> Hello:
>We are using export handler in Solr Cloud to get some data, we only
> request for one field, which type is tdouble, it works well at the
> beginning, but recently we saw high CPU issue in all the solr cloud nodes,
> we took some thread dump and found following information:
>
>java.lang.Thread.State: RUNNABLE
>
> at java.lang.Thread.isAlive(Native Method)
>
> at
> org.apache.lucene.util.CloseableThreadLocal.purge(
> CloseableThreadLocal.java:115)
>
> - locked <0x0006e24d86a8> (a java.util.WeakHashMap)
>
> at
> org.apache.lucene.util.CloseableThreadLocal.maybePurge(
> CloseableThreadLocal.java:105)
>
> at
> org.apache.lucene.util.CloseableThreadLocal.get(
> CloseableThreadLocal.java:88)
>
> at
> org.apache.lucene.index.CodecReader.getNumericDocValues(
> CodecReader.java:143)
>
> at
> org.apache.lucene.index.FilterLeafReader.getNumericDocValues(
> FilterLeafReader.java:430)
>
> at
> org.apache.lucene.uninverting.UninvertingReader.getNumericDocValues(
> UninvertingReader.java:239)
>
> at
> org.apache.lucene.index.FilterLeafReader.getNumericDocValues(
> FilterLeafReader.java:430)
>
> Is this a known issue for export handler? As we only fetch up to 5000
> documents, it should not be data volume issue.
>
> Can anyone help on that? Thanks a lot.
>


Re: Apache Solr Question

2016-11-03 Thread Shawn Heisey
On 11/3/2016 2:49 AM, Chien Nguyen wrote:
> Hi everyone! I'm a newbie in using Apache Solr. I've read some
> documents about it. But i can't answer some questions. 

Second reply, so I'm aiming for more detail. 

> 1. How many documents Solr can search at a moment??

A *single* Solr index has Lucene's limitation of slightly more than 2
billion documents.  This is part of the problem solved by SolrCloud.  By
throwing multiple machines/shards at the problem, there is effectively
no limit to the size of a SolrCloud collection.  I have encountered
someone who has a collection with five billion documents in it.

That 2 billion document limit I mentioned, which is Java's
Integer.MAX_VALUE, is the ONLY hard limit that I know of in the
software, and only applies when the index is not sharded.

> 2. Can Solr index the media data??

I have no idea what you meant here, but if you mean metadata, Solr most
likely can handle it.  If you meant actual media, like an image, I
believe there is a binary field type that you can even store a full
source document in, but that is not normally the way Solr is used, and I
don't recommend it.

> 3. What's the max size of document that Solr can index??? 

I don't think there is a limit.  I think there are some limits on the
number and size of individual terms, but not on the total size of a
document.  If documents get particularly large and numerous, performance
might suffer, but I am not aware of any total size limitations.

Thanks,
Shawn



Re: Apache Solr Question

2016-11-03 Thread Doug Turnbull
For general search use cases, it's generally not a good idea to index giant
documents. A relevance score for an entire book is generally less
meaningful than if you can break it up into chapters or sections. Those
subdivisions are often much more useful to a user from a usability
standpoint for understanding not just that say a book is relevant but a
particular section in a book is relevant to their query.

Just my 2 cents
-Doug

On Thu, Nov 3, 2016 at 9:57 AM Shawn Heisey  wrote:

> On 11/3/2016 2:49 AM, Chien Nguyen wrote:
> > Hi everyone! I'm a newbie in using Apache Solr. I've read some
> > documents about it. But i can't answer some questions.
>
> Second reply, so I'm aiming for more detail.
>
> > 1. How many documents Solr can search at a moment??
>
> A *single* Solr index has Lucene's limitation of slightly more than 2
> billion documents.  This is part of the problem solved by SolrCloud.  By
> throwing multiple machines/shards at the problem, there is effectively
> no limit to the size of a SolrCloud collection.  I have encountered
> someone who has a collection with five billion documents in it.
>
> That 2 billion document limit I mentioned, which is Java's
> Integer.MAX_VALUE, is the ONLY hard limit that I know of in the
> software, and only applies when the index is not sharded.
>
> > 2. Can Solr index the media data??
>
> I have no idea what you meant here, but if you mean metadata, Solr most
> likely can handle it.  If you meant actual media, like an image, I
> believe there is a binary field type that you can even store a full
> source document in, but that is not normally the way Solr is used, and I
> don't recommend it.
>
> > 3. What's the max size of document that Solr can index???
>
> I don't think there is a limit.  I think there are some limits on the
> number and size of individual terms, but not on the total size of a
> document.  If documents get particularly large and numerous, performance
> might suffer, but I am not aware of any total size limitations.
>
> Thanks,
> Shawn
>
>


RE: Apache Solr Question

2016-11-03 Thread Davis, Daniel (NIH/NLM) [C]
Case in point - https://collections.nlm.nih.gov/ has one index (core) for 
documents and another index (core) for pages within the documents.
I think LOC (Library of Congress) does something similar, based on a presentation 
they gave at the Lucene/DC Exchange.

-Original Message-
From: Doug Turnbull [mailto:dturnb...@opensourceconnections.com] 
Sent: Thursday, November 03, 2016 10:26 AM
To: solr-user@lucene.apache.org
Subject: Re: Apache Solr Question

For general search use cases, it's generally not a good idea to index giant 
documents. A relevance score for an entire book is generally less meaningful 
than if you can break it up into chapters or sections. Those subdivisions are 
often much more useful to a user from a usability standpoint for understanding 
not just that say a book is relevant but a particular section in a book is 
relevant to their query.

Just my 2 cents
-Doug

On Thu, Nov 3, 2016 at 9:57 AM Shawn Heisey  wrote:

> On 11/3/2016 2:49 AM, Chien Nguyen wrote:
> > Hi everyone! I'm a newbie in using Apache Solr. I've read some 
> > documents about it. But i can't answer some questions.
>
> Second reply, so I'm aiming for more detail.
>
> > 1. How many documents Solr can search at a moment??
>
> A *single* Solr index has Lucene's limitation of slightly more than 2 
> billion documents.  This is part of the problem solved by SolrCloud.  
> By throwing multiple machines/shards at the problem, there is 
> effectively no limit to the size of a SolrCloud collection.  I have 
> encountered someone who has a collection with five billion documents in it.
>
> That 2 billion document limit I mentioned, which is Java's 
> Integer.MAX_VALUE, is the ONLY hard limit that I know of in the 
> software, and only applies when the index is not sharded.
>
> > 2. Can Solr index the media data??
>
> I have no idea what you meant here, but if you mean metadata, Solr 
> most likely can handle it.  If you meant actual media, like an image, 
> I believe there is a binary field type that you can even store a full 
> source document in, but that is not normally the way Solr is used, and 
> I don't recommend it.
>
> > 3. What's the max size of document that Solr can index???
>
> I don't think there is a limit.  I think there are some limits on the 
> number and size of individual terms, but not on the total size of a 
> document.  If documents get particularly large and numerous, 
> performance might suffer, but I am not aware of any total size limitations.
>
> Thanks,
> Shawn
>
>


Re: Apache Solr Question

2016-11-03 Thread Susheel Kumar
For media like images etc., there is the LIRE Solr plugin, which can be utilised.
I have used it in the past and it may meet your requirement. See
http://www.lire-project.net/

Thanks,
Susheel

On Thu, Nov 3, 2016 at 9:57 AM, Shawn Heisey  wrote:

> On 11/3/2016 2:49 AM, Chien Nguyen wrote:
> > Hi everyone! I'm a newbie in using Apache Solr. I've read some
> > documents about it. But i can't answer some questions.
>
> Second reply, so I'm aiming for more detail.
>
> > 1. How many documents Solr can search at a moment??
>
> A *single* Solr index has Lucene's limitation of slightly more than 2
> billion documents.  This is part of the problem solved by SolrCloud.  By
> throwing multiple machines/shards at the problem, there is effectively
> no limit to the size of a SolrCloud collection.  I have encountered
> someone who has a collection with five billion documents in it.
>
> That 2 billion document limit I mentioned, which is Java's
> Integer.MAX_VALUE, is the ONLY hard limit that I know of in the
> software, and only applies when the index is not sharded.
>
> > 2. Can Solr index the media data??
>
> I have no idea what you meant here, but if you mean metadata, Solr most
> likely can handle it.  If you meant actual media, like an image, I
> believe there is a binary field type that you can even store a full
> source document in, but that is not normally the way Solr is used, and I
> don't recommend it.
>
> > 3. What's the max size of document that Solr can index???
>
> I don't think there is a limit.  I think there are some limits on the
> number and size of individual terms, but not on the total size of a
> document.  If documents get particularly large and numerous, performance
> might suffer, but I am not aware of any total size limitations.
>
> Thanks,
> Shawn
>
>


Re: Apache Solr Question

2016-11-03 Thread Erick Erickson
bq: I have encountered someone who has a collection with five billion
documents in it...

I know of installations many times that size. Admittedly, when you start
getting into the 100s of billions you must plan carefully.

Erick

On Thu, Nov 3, 2016 at 7:44 AM, Susheel Kumar  wrote:
> For media like images etc, there is LIRE solr plugin which can be utilised.
> I have used in the past and may meet your requirement. See
> http://www.lire-project.net/
>
> Thanks,
> Susheel
>
> On Thu, Nov 3, 2016 at 9:57 AM, Shawn Heisey  wrote:
>
>> On 11/3/2016 2:49 AM, Chien Nguyen wrote:
>> > Hi everyone! I'm a newbie in using Apache Solr. I've read some
>> > documents about it. But i can't answer some questions.
>>
>> Second reply, so I'm aiming for more detail.
>>
>> > 1. How many documents Solr can search at a moment??
>>
>> A *single* Solr index has Lucene's limitation of slightly more than 2
>> billion documents.  This is part of the problem solved by SolrCloud.  By
>> throwing multiple machines/shards at the problem, there is effectively
>> no limit to the size of a SolrCloud collection.  I have encountered
>> someone who has a collection with five billion documents in it.
>>
>> That 2 billion document limit I mentioned, which is Java's
>> Integer.MAX_VALUE, is the ONLY hard limit that I know of in the
>> software, and only applies when the index is not sharded.
>>
>> > 2. Can Solr index the media data??
>>
>> I have no idea what you meant here, but if you mean metadata, Solr most
>> likely can handle it.  If you meant actual media, like an image, I
>> believe there is a binary field type that you can even store a full
>> source document in, but that is not normally the way Solr is used, and I
>> don't recommend it.
>>
>> > 3. What's the max size of document that Solr can index???
>>
>> I don't think there is a limit.  I think there are some limits on the
>> number and size of individual terms, but not on the total size of a
>> document.  If documents get particularly large and numerous, performance
>> might suffer, but I am not aware of any total size limitations.
>>
>> Thanks,
>> Shawn
>>
>>


Re: Poor Solr Cloud Query Performance against a Small Dataset

2016-11-03 Thread Dave Seltzer
Good tip Rick,

I'll dig in and make sure everything is set up correctly.

Thanks!

-D

Dave Seltzer 
Chief Systems Architect
TVEyes
(203) 254-3600 x222

On Wed, Nov 2, 2016 at 9:05 PM, Rick Leir  wrote:

> Here is a wild guess. Whenever I see a 5 second delay in networking, I
> think DNS timeouts. YMMV, good luck.
>
> cheers -- Rick
>
> On 2016-11-01 04:18 PM, Dave Seltzer wrote:
>
>> Hello!
>>
>> I'm trying to utilize Solr Cloud to help with a hash search problem. The
>> record set has only 4,300 documents.
>>
>> When I run my search against a single core I get results on the order of
>> 10ms. When I run the same search against Solr Cloud results take about
>> 5,000 ms.
>>
>> Is there something about this particular query which makes it perform
>> poorly in a Cloud environment? The query looks like this (linebreaks added
>> for readability):
>>
>> {!frange+l%3D5+u%3D25}sum(
>>  termfreq(hashTable_0,'225706351'),
>>  termfreq(hashTable_1,'17664000'),
>>  termfreq(hashTable_2,'86447642'),
>>  termfreq(hashTable_3,'134816033'),
>>
>
>
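
Decoded, that is a {!frange l=5 u=25} wrapper over a sum of termfreq() calls. A hedged SolrJ sketch of the same request, assuming SolrJ 6.x; the core URL is a placeholder and the remaining termfreq clauses from the original query stay elided:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class HashFrangeQuery {
  public static void main(String[] args) throws Exception {
    // Placeholder core URL; hash values are the ones shown in the message above.
    try (HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/hashes")) {
      String q = "{!frange l=5 u=25}sum("
          + "termfreq(hashTable_0,'225706351'),"
          + "termfreq(hashTable_1,'17664000'),"
          + "termfreq(hashTable_2,'86447642'),"
          + "termfreq(hashTable_3,'134816033'))";  // further clauses elided
      System.out.println(client.query(new SolrQuery(q)).getResults().getNumFound());
    }
  }
}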


Re: Posting files 405 http error

2016-11-03 Thread Pablo Anzorena
Thanks for the answer.

I checked the log and it wasn't logging anything.

The error I'm facing is way bizarre... I create a fresh new collection and
then index with no problem, but it keeps throwing this error if I copy the
collection from one SolrCloud to the other and then index.

Any clue on why is this happening?

2016-11-01 17:42 GMT-03:00 Erick Erickson :

> What does the solr log say? I'd tail the Solr log while
> sending the query, that'll do two things:
>
> 1> insure that your request is actually getting to the
> Solr you expect.
>
> 2> the details in the solr log are often much more helpful
> than what gets returned to the client.
>
> Best,
> Erick
>
> On Tue, Nov 1, 2016 at 1:37 PM, Pablo Anzorena 
> wrote:
> > Hey,
> >
> > I'm indexing a file with a delete query in xml format using the
> post.jar. I
> > have two solrclouds, which apparently have all the same configurations.
> The
> > thing is that I have no problem when indexing in one of them, but the
> other
> > keeps giving me this error:
> >
> > SimplePostTool version 5.0.0
> > Posting files to [base] url
> > http://solr2:8983/solr/mycollection/update?separator=| using
> content-type
> > application/xml...
> > POSTing file delete_file.unl.tmp to [base]
> > SimplePostTool: WARNING: Solr returned an error #405 (Method Not Allowed)
> > for url: http://solr2:8983/solr/mycollection/update?separator=|
> > SimplePostTool: WARNING: Response:
> > Error: HTTP method POST is not
> > supported by this URL
> > SimplePostTool: WARNING: IOException while reading response:
> > java.io.IOException: Server returned HTTP response code: 405 for URL:
> > http://solr2:8983/solr/mycollection/update?separator=|
> > 1 files indexed.
> > Time spent: 0:00:00.253
> >
> > Do I need some extra configuration to support for xml updates?
> >
> > Thanks!
>


RE: CachedSqlEntityProcessor with delta-import

2016-11-03 Thread Mohan, Sowmya
Thanks. We did implement delete by query on another core and thought of 
giving the delta import a try here. Looks like a differential index via 
full-import, plus deletes using delete by id/query, is the way to go. 
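
For reference, a minimal SolrJ sketch of the two delete styles discussed in the thread below, assuming SolrJ 6.x; the core URL, ids and query field are placeholders:

import java.util.Arrays;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class DeleteStyles {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
      // Delete by id: cheapest when the deleted keys are already known.
      client.deleteById(Arrays.asList("1234", "5678"));
      // Delete by query: needed when only a predicate identifies the documents.
      client.deleteByQuery("active_b:false");
      client.commit();
    }
  }
}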

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, October 25, 2016 12:31 PM
To: solr-user 
Subject: Re: CachedSqlEntityProcessor with delta-import

Why not use delete by id rather than query? It'll be more efficient

Probably not a big deal though.

On Tue, Oct 25, 2016 at 1:47 AM, Aniket Khare  wrote:
> Hi Sowmya,
>
> I my case I have implemeneted the data indexing suggested by James and 
> for deleting the reords I have created my own data indexing job which 
> will call the delete API periodically by passing the list of unique Id.
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers
>
> http://localhost:8983/solr/update?stream.body=<delete><query>id:1234</query></delete>&commit=true
>
> Thanks,
> Aniket S. Khare
>
> On Tue, Oct 25, 2016 at 1:32 AM, Mohan, Sowmya  wrote:
>
>> Thanks James. That's what I was using before. But I also wanted to 
>> perform deletes using deletedPkQuery and hence switched to delta 
>> imports. The problem with using deletedPkQuery with the full import 
>> is that dataimporter.last_index_time is no longer accurate.
>>
>> Below is an example of my deletedPkQuery. If run the full-import for 
>> a differential index, that would update the last index time. Running 
>> the delta import to remove the deleted records then wouldn't do 
>> anything since nothing changed since the last index time.
>>
>>
>>  deletedPkQuery="SELECT id
>> FROM content
>> WHERE active = 1 AND lastUpdate > 
>> '${dataimporter.last_index_time}'"
>>
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Dyer, James [mailto:james.d...@ingramcontent.com]
>> Sent: Friday, October 21, 2016 4:23 PM
>> To: solr-user@lucene.apache.org
>> Subject: RE: CachedSqlEntityProcessor with delta-import
>>
>> Sowmya,
>>
>> My memory is that the cache feature does not work with Delta Imports.  
>> In fact, I believe that nearly all DIH features except straight JDBC 
>> imports do not work with Delta Imports.  My advice is to not use the 
>> Delta Import feature at all as the same result can (often 
>> more-efficiently) be accomplished following the approach outlined here:
>> https://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
>>
>> James Dyer
>> Ingram Content Group
>>
>> -Original Message-
>> From: Mohan, Sowmya [mailto:sowmya.mo...@icf.com]
>> Sent: Tuesday, October 18, 2016 10:07 AM
>> To: solr-user@lucene.apache.org
>> Subject: CachedSqlEntityProcessor with delta-import
>>
>> Good morning,
>>
>> Can CachedSqlEntityProcessor be used with delta-import? In my setup 
>> when running a delta-import with CachedSqlEntityProcessor, the child 
>> entity values are not correctly updated for the parent record. I am on Solr 
>> 4.3.
>> Has anyone experienced this and if so how to resolve it?
>>
>> Thanks,
>> Sowmya.
>>
>>
>
>
> --
> Regards,
>
> Aniket S. Khare


Re: Posting files 405 http error

2016-11-03 Thread Shawn Heisey
On 11/3/2016 9:10 AM, Pablo Anzorena wrote:
> Thanks for the answer.
>
> I checked the log and it wasn't logging anything.
>
> The error i'm facing is way bizarre... I create a new fresh collection and
> then index with no problem, but it keeps throwing this error if i copy the
> collection from one solrcloud to the other and then index.
>
> Any clue on why is this happening?

Solr's source code doesn't seem to even have a 405 error, so I bet
what's happening is that you have Solr sitting behind a proxy or load
balancer, and that server doesn't like the request you sent, so it
rejects it and Solr never receives anything.

Here's an excerpt of code from the SolrException class in the master branch:

  /**
   * This list of valid HTTP Status error codes that Solr may return in
   * the case of a "Server Side" error.
   *
   * @since solr 1.2
   */
  public enum ErrorCode {
    BAD_REQUEST( 400 ),
    UNAUTHORIZED( 401 ),
    FORBIDDEN( 403 ),
    NOT_FOUND( 404 ),
    CONFLICT( 409 ),
    UNSUPPORTED_MEDIA_TYPE( 415 ),
    SERVER_ERROR( 500 ),
    SERVICE_UNAVAILABLE( 503 ),
    INVALID_STATE( 510 ),
    UNKNOWN(0);
    public final int code;

    private ErrorCode( int c )
    {
      code = c;
    }
    public static ErrorCode getErrorCode(int c){
      for (ErrorCode err : values()) {
        if(err.code == c) return err;
      }
      return UNKNOWN;
    }
  };

Thanks,
Shawn



Re: Problem with Password Decryption in Data Import Handler

2016-11-03 Thread Jamie Jackson
You were right, Fuad. There was a flaw in my script (inconsistent naming of
the `plain_db_pwd` variable).

Thanks for figuring that out.

For posterity, here's the fixed script:


encrypt_key=your_encryption_key
plain_db_pwd=your_db_password
cred_dir=/your/credentials/directory

cd "${cred_dir}
echo -n "${encrypt_key}" > encrypt.key
echo -n "${plain_db_pwd}" | openssl enc -aes-128-cbc -a -salt -k
"${encrypt_key}"
#==#

Then, in the DIH config:
<dataSource ... encryptKeyFile="/your/credentials/directory/encrypt.key"/>

I have another, semi-related, issue that I'll bring up in another thread.

Thanks,
Jamie


On Wed, Nov 2, 2016 at 6:26 PM, Fuad Efendi  wrote:

> Then I can only guess that in current configuration decrypted password is
> empty string.
>
> Try to manually replace some characters in encpwd.txt file to see if you
> get different errors; try to delete this file completely to see if you get
> different errors. Try to add new line in this file; try to change password
> in config file.
>
>
>
> On November 2, 2016 at 5:23:33 PM, Jamie Jackson (jamieja...@gmail.com)
> wrote:
>
> I should have mentioned that I verified connectivity with plain passwords:
>
> From the same machine that Solr's running on:
>
> solr@000650cbdd5e:/opt/solr$ mysql -uroot -pOakton153 -h local.mysite.com
> mysite -e "select 'foo' as bar;"
> +-+
> | bar |
> +-+
> | foo |
> +-+
>
> Also, if I add the plain-text password to the config, it connects fine:
>
>  driver="org.mariadb.jdbc.Driver"
> url="jdbc:mysql://local.mysite.com:3306/mysite"
> user="root"
> password="Oakton153"
> />
>
>
> So that is why I claim to have a problem with encryptKeyFile, specifically,
> because I've eliminated general connectivity/authentication problems.
>
> Thanks,
> Jamie
>
> On Wed, Nov 2, 2016 at 4:58 PM, Fuad Efendi  wrote:
>
> > In MySQL, this command will explicitly allow to connect from
> > remote ICZ2002912 host, check MySQL documentation:
> >
> > GRANT ALL ON mysite.* TO 'root'@'ICZ2002912' IDENTIFIED BY 'Oakton123';
> >
> >
> >
> > On November 2, 2016 at 4:41:48 PM, Fuad Efendi (f...@efendi.ca) wrote:
> >
> > This is the root of the problem:
> > "Access denied for user 'root'@'ICZ2002912' (using password: NO) “
> >
> >
> > First of all, ensure that plain (non-encrypted) password settings work
> for
> > you.
> >
> > Check that you can connect using MySQL client from ICZ2002912 to your
> > MySQL & Co. instance
> >
> > I suspect you need to allow MySQL & Co. to accept connections
> > from ICZ2002912. Plus, check DNS resolution, etc.
> >
> >
> > Thanks,
> >
> >
> > --
> > Fuad Efendi
> > (416) 993-2060
> > http://www.tokenizer.ca
> > Recommender Systems
> >
> >
> > On November 2, 2016 at 2:37:08 PM, Jamie Jackson (jamieja...@gmail.com)
> > wrote:
> >
> > I'm at a brick wall. Here's the latest status:
> >
> > Here are some sample commands that I'm using:
> >
> > *Create the encryptKeyFile and encrypted password:*
> >
> >
> > encrypter_password='this_is_my_encrypter_password'
> > plain_db_pw='Oakton153'
> >
> > cd /var/docker/solr_stage2/credentials/
> > echo -n "${encrypter_password}" > encpwd.txt
> > echo -n "${plain_db_pwd}" > plaindbpwd.txt
> > openssl enc -aes-128-cbc -a -salt -in plaindbpwd.txt -k
> > "${encrypter_password}"
> >
> > rm plaindbpwd.txt
> >
> > That generated this as the password, by the way:
> >
> > U2FsdGVkX19pBVTeZaSl43gFFAlrx+Th1zSg1GvlX9o=
> >
> > *Configure DIH configuration:*
> >
> > 
> >
> >  > driver="org.mariadb.jdbc.Driver"
> > url="jdbc:mysql://local.mysite.com:3306/mysite"
> > user="root"
> > password="U2FsdGVkX19pBVTeZaSl43gFFAlrx+Th1zSg1GvlX9o="
> > encryptKeyFile="/opt/solr/credentials/encpwd.txt"
> > />
> > ...
> >
> >
> > By the way, /var/docker/solr_stage2/credentials/ is mapped to
> > /opt/solr/credentials/ in the docker container, so that's why the paths
> > *seem* different (but aren't, really).
> >
> >
> > *Authentication error when data import is run:*
> >
> > Exception while processing: question document :
> > SolrInputDocument(fields:
> > []):org.apache.solr.handler.dataimport.DataImportHandlerException:
> > Unable to execute query: select 'foo' as bar; Processing
> > Document # 1
> > at org.apache.solr.handler.dataimport.DataImportHandlerException.
> > wrapAndThrow(DataImportHandlerException.java:69)
> > at org.apache.solr.handler.dataimport.JdbcDataSource$
> > ResultSetIterator.(JdbcDataSource.java:323)
> > at org.apache.solr.handler.dataimport.JdbcDataSource.
> > getData(JdbcDataSource.java:283)
> > at org.apache.solr.handler.dataimport.JdbcDataSource.
> > getData(JdbcDataSource.java:52)
> > at org.apache.solr.handler.dataimport.SqlEntityProcessor.
> > initQuery(SqlEntityProcessor.java:59)
> > at org.apache.solr.handler.dataimport.SqlEntityProcessor.
> > nextRow(SqlEntityProcessor.java:73)
> > at org.apache.solr.handler.dataimport.EntityProcessorWrappe

display searched for text in Solr 6

2016-11-03 Thread win harrington
I used solr/post to insert some *.txt files into Solr 6. I can search for words 
in Solr and it returns the id with the file name.
How do I display the text?
managed-schema has


Thank you.


Re: display searched for text in Solr 6

2016-11-03 Thread Binoy Dalal
Append the fields you want to display to the query using the fl parameter.
Eg. q=something&fl=_text_

On Thu, Nov 3, 2016 at 10:28 PM win harrington
 wrote:

> I used solr/post to insert some *.txt files intoSolr 6. I can search for
> words in Solr and itreturns the id with the file name.
> How do I display the text?
> managed-schema has
>  multiValued="true"/>
> 
> Thank you.
>
-- 
Regards,
Binoy Dalal


Re: display searched for text in Solr 6

2016-11-03 Thread win harrington
I inserted five /opt/solr/*.txt files for testing. Four of the files contain 
the word 'notice'. Solr finds 4 documents, but I can't see the text.
http://localhost:8983/solr/core1/select?fl=_text_&indent=on&q=notice&wt=json
"response":{"numFound":4,"start":0,"docs":[{},{},{},{}]}

On Thursday, November 3, 2016 1:02 PM, Binoy Dalal  
wrote:
 

 Append the fields you want to display to the query using the fl parameter.
Eg. q=something&fl=_text_

On Thu, Nov 3, 2016 at 10:28 PM win harrington
 wrote:

> I used solr/post to insert some *.txt files intoSolr 6. I can search for
> words in Solr and itreturns the id with the file name.
> How do I display the text?
> managed-schema has
>  multiValued="true"/>
> 
> Thank you.
>
-- 
Regards,
Binoy Dalal


   

UpdateProcessor as a batch

2016-11-03 Thread Markus Jelsma
Hi - I need to process a batch of documents on update, but I cannot seem to find 
a point where I can hook in and process a list of SolrInputDocuments, not in 
UpdateProcessor nor in UpdateHandler.

For now I let it go and implemented it on a per-document basis; it is fast, but 
I'd prefer batches. Is that possible at all?

Thanks,
Markus
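
For context, a minimal sketch of the per-document hook described above: an UpdateRequestProcessorFactory whose processAdd touches each SolrInputDocument as it flows through the chain. The class name and the field it sets are illustrative only:

import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class PerDocTagProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        doc.setField("processed_b", true);  // per-document work happens here
        super.processAdd(cmd);              // pass the doc on down the chain
      }
    };
  }
}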


Re: display searched for text in Solr 6

2016-11-03 Thread Binoy Dalal
Are you sure that the text is stored in the _text_ field? Try
q=*:*&fl=_text_
If you see stuff being printed then this field does have data, else this
field is empty. To check which fields have data, try using the schema
browser.
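
For reference, a hedged SolrJ sketch of the same check, assuming SolrJ 6.x and the core1 name from this thread; note that fl can only return fields that are stored (or docValues-backed), so an indexed-only _text_ field will come back empty even when it matches:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class ShowStoredText {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/core1")) {
      SolrQuery q = new SolrQuery("notice");
      q.setFields("id", "_text_");  // only stored (or docValues) fields are returned
      for (SolrDocument doc : client.query(q).getResults()) {
        System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("_text_"));
      }
    }
  }
}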

On Thu, Nov 3, 2016 at 10:43 PM win harrington
 wrote:

> I inserted five /opt/solr/*.txt files for testing. Four of the files
> contain the word 'notice'.Solr finds 4 documents, but I can't see the text.
>
> http://localhost:8983/solr/core1/select?fl=_text_&indent=on&q=notice&wt=json
> "response":{"numFound":4, "start":0, "docs"{{},{},{},{}}
>
> On Thursday, November 3, 2016 1:02 PM, Binoy Dalal <
> binoydala...@gmail.com> wrote:
>
>
>  Append the fields you want to display to the query using the fl parameter.
> Eg. q=something&fl=_text_
>
> On Thu, Nov 3, 2016 at 10:28 PM win harrington
>  wrote:
>
> > I used solr/post to insert some *.txt files intoSolr 6. I can search for
> > words in Solr and itreturns the id with the file name.
> > How do I display the text?
> > managed-schema has
> >  > multiValued="true"/>
> > 
> > Thank you.
> >
> --
> Regards,
> Binoy Dalal
>
>
>

-- 
Regards,
Binoy Dalal


Re: Posting files 405 http error

2016-11-03 Thread Pablo Anzorena
Thanks Shawn.

Actually there is no load balancer or proxy in the middle, but even if
there was, how would you explain that I can index if I create a completely
new collection?

I figured out how to fix it. What I'm doing is creating a new collection,
then unloading it (by unloading all the shards/replicas), then copying the
data directory from the collection in the other SolrCloud, and finally
creating the collection again. It's not the best solution, but it works;
nevertheless I still would like to know what's causing the problem...

It's worth mentioning that I'm not using Jetty, I'm using solr-undertow
https://github.com/kohesive/solr-undertow

2016-11-03 12:56 GMT-03:00 Shawn Heisey :

> On 11/3/2016 9:10 AM, Pablo Anzorena wrote:
> > Thanks for the answer.
> >
> > I checked the log and it wasn't logging anything.
> >
> > The error i'm facing is way bizarre... I create a new fresh collection
> and
> > then index with no problem, but it keeps throwing this error if i copy
> the
> > collection from one solrcloud to the other and then index.
> >
> > Any clue on why is this happening?
>
> Solr's source code doesn't seem to even have a 405 error, so I bet
> what's happening is that you have Solr sitting behind a proxy or load
> balancer, and that server doesn't like the request you sent, so it
> rejects it and Solr never receives anything.
>
> Here's an excerpt of code from the SolrException class in the master
> branch:
>
>   /**
>* This list of valid HTTP Status error codes that Solr may return in
>* the case of a "Server Side" error.
>*
>* @since solr 1.2
>*/
>   public enum ErrorCode {
> BAD_REQUEST( 400 ),
> UNAUTHORIZED( 401 ),
> FORBIDDEN( 403 ),
> NOT_FOUND( 404 ),
> CONFLICT( 409 ),
> UNSUPPORTED_MEDIA_TYPE( 415 ),
> SERVER_ERROR( 500 ),
> SERVICE_UNAVAILABLE( 503 ),
> INVALID_STATE( 510 ),
> UNKNOWN(0);
> public final int code;
>
> private ErrorCode( int c )
> {
>   code = c;
> }
> public static ErrorCode getErrorCode(int c){
>   for (ErrorCode err : values()) {
> if(err.code == c) return err;
>   }
>   return UNKNOWN;
> }
>   };
>
> Thanks,
> Shawn
>
>


Re: Posting files 405 http error

2016-11-03 Thread Erick Erickson
Wait. What were you doing originally? Just copying the entire
SOLR_HOME over or something?

Because one of the things each core carries along is a
"core.properties" file that identifies
1> the name of the core, something like collection_shard1_replica1
2> the name of the collection the core belongs to

So if you just copy a directory containing the core.properties file
from one place to another _and_ they're pointing to the same Zookeeper
then the behavior is undefined.

And if you _don't_ point to the same zookeeper, your copied collection
isn't registered with ZK so that's a weird state as well.

If your goal is to move things from one collection to another, here's
a possibility (assuming the nodes can all "see" each other).

1> index to your source collection
2> create a new destination collection
3a> use the "fetchindex" command to move the relevant indexes from the
source to the destination, see
https://cwiki.apache.org/confluence/display/solr/Index+Replication#IndexReplication-HTTPAPICommandsfortheReplicationHandler
3b> instead of <3a>, manually copy the data directory from the source
to each replica.
4> In either 3a> or 3b>, it's probably easier to create a leader-only
(replicationFactor=1) destination collection then use the ADDREPLICA
command to add replicas, that way they'll all sync automatically.

Best,
Erick
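
A hedged sketch of step 3a in plain Java, issuing the fetchindex command with a masterUrl parameter as described on the replication page linked above; host and core names are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class FetchIndex {
  public static void main(String[] args) throws Exception {
    // The destination replica pulls the index from the source core's replication handler.
    String masterUrl = URLEncoder.encode(
        "http://source-host:8983/solr/source_shard1_replica1/replication", "UTF-8");
    URL url = new URL("http://dest-host:8983/solr/dest_shard1_replica1"
        + "/replication?command=fetchindex&masterUrl=" + masterUrl);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);  // the handler replies with a small status document
      }
    }
  }
}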

On Thu, Nov 3, 2016 at 10:28 AM, Pablo Anzorena  wrote:
> Thanks Shawn.
>
> Actually there is no load balancer or proxy in the middle, but even if
> there was, how would you explain that I can index if a create a completely
> new collection?
>
> I figured out how to fix it. What I'm doing is creating a new collection,
> then unloading it (by unloading all the shards/replicas), then copy the
> data directory from the collection in the other solrcloud, and finally
> creating again the collection. It's not the best solution, but it works,
> nevertheless I still would like to know what's causing the problem...
>
> It's worth to mention that I'm not using jetty, I'm using solr-undertow
> https://github.com/kohesive/solr-undertow
>
> 2016-11-03 12:56 GMT-03:00 Shawn Heisey :
>
>> On 11/3/2016 9:10 AM, Pablo Anzorena wrote:
>> > Thanks for the answer.
>> >
>> > I checked the log and it wasn't logging anything.
>> >
>> > The error i'm facing is way bizarre... I create a new fresh collection
>> and
>> > then index with no problem, but it keeps throwing this error if i copy
>> the
>> > collection from one solrcloud to the other and then index.
>> >
>> > Any clue on why is this happening?
>>
>> Solr's source code doesn't seem to even have a 405 error, so I bet
>> what's happening is that you have Solr sitting behind a proxy or load
>> balancer, and that server doesn't like the request you sent, so it
>> rejects it and Solr never receives anything.
>>
>> Here's an excerpt of code from the SolrException class in the master
>> branch:
>>
>>   /**
>>* This list of valid HTTP Status error codes that Solr may return in
>>* the case of a "Server Side" error.
>>*
>>* @since solr 1.2
>>*/
>>   public enum ErrorCode {
>> BAD_REQUEST( 400 ),
>> UNAUTHORIZED( 401 ),
>> FORBIDDEN( 403 ),
>> NOT_FOUND( 404 ),
>> CONFLICT( 409 ),
>> UNSUPPORTED_MEDIA_TYPE( 415 ),
>> SERVER_ERROR( 500 ),
>> SERVICE_UNAVAILABLE( 503 ),
>> INVALID_STATE( 510 ),
>> UNKNOWN(0);
>> public final int code;
>>
>> private ErrorCode( int c )
>> {
>>   code = c;
>> }
>> public static ErrorCode getErrorCode(int c){
>>   for (ErrorCode err : values()) {
>> if(err.code == c) return err;
>>   }
>>   return UNKNOWN;
>> }
>>   };
>>
>> Thanks,
>> Shawn
>>
>>


Re: High CPU Usage in export handler

2016-11-03 Thread Ray Niu
Thanks Joel
here is the information you requested.
Are you doing heavy writes at the time?
we are writing very frequently, but not very heavily; we update
about 100 Solr documents per second.
How many concurrent reads are are happening?
the concurrent reads are about 1000-2000 per minute per node
What version of Solr are you using?
we are using solr 5.5.2
What is the field definition for the double, is it docValues?
the field definition is



2016-11-03 6:30 GMT-07:00 Joel Bernstein :

> Are you doing heavy writes at the time?
>
> How many concurrent reads are are happening?
>
> What version of Solr are you using?
>
> What is the field definition for the double, is it docValues?
>
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Nov 3, 2016 at 12:56 AM, Ray Niu  wrote:
>
> > Hello:
> >We are using export handler in Solr Cloud to get some data, we only
> > request for one field, which type is tdouble, it works well at the
> > beginning, but recently we saw high CPU issue in all the solr cloud
> nodes,
> > we took some thread dump and found following information:
> >
> >java.lang.Thread.State: RUNNABLE
> >
> > at java.lang.Thread.isAlive(Native Method)
> >
> > at
> > org.apache.lucene.util.CloseableThreadLocal.purge(
> > CloseableThreadLocal.java:115)
> >
> > - locked <0x0006e24d86a8> (a java.util.WeakHashMap)
> >
> > at
> > org.apache.lucene.util.CloseableThreadLocal.maybePurge(
> > CloseableThreadLocal.java:105)
> >
> > at
> > org.apache.lucene.util.CloseableThreadLocal.get(
> > CloseableThreadLocal.java:88)
> >
> > at
> > org.apache.lucene.index.CodecReader.getNumericDocValues(
> > CodecReader.java:143)
> >
> > at
> > org.apache.lucene.index.FilterLeafReader.getNumericDocValues(
> > FilterLeafReader.java:430)
> >
> > at
> > org.apache.lucene.uninverting.UninvertingReader.getNumericDocValues(
> > UninvertingReader.java:239)
> >
> > at
> > org.apache.lucene.index.FilterLeafReader.getNumericDocValues(
> > FilterLeafReader.java:430)
> >
> > Is this a known issue for export handler? As we only fetch up to 5000
> > documents, it should not be data volume issue.
> >
> > Can anyone help on that? Thanks a lot.
> >
>


Re: UpdateProcessor as a batch

2016-11-03 Thread Erick Erickson
Markus:

How are you indexing? SolrJ has a client.add(List)
form, and post.jar lets you add as many documents as you want in a
batch

Best,
Erick
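
For completeness, a minimal sketch of the client.add(List) form mentioned above, assuming SolrJ 6.x; the core URL and field names are placeholders:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchAdd {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
      List<SolrInputDocument> batch = new ArrayList<>();
      for (int i = 0; i < 1000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        doc.addField("title_t", "batched document " + i);
        batch.add(doc);
      }
      client.add(batch);  // one request carries the whole batch
      client.commit();
    }
  }
}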

On Thu, Nov 3, 2016 at 10:18 AM, Markus Jelsma
 wrote:
> Hi - i need to process a batch of documents on update but i cannot seem to 
> find a point where i can hook in and process a list of SolrInputDocuments, 
> not in UpdateProcessor nor in UpdateHandler.
>
> For now i let it go and implemented it on a per-document basis, it is fast, 
> but i'd prefer batches. Is that possible at all?
>
> Thanks,
> Markus


RE: UpdateProcessor as a batch

2016-11-03 Thread Markus Jelsma
Erick - in this case data can come from anywhere. There is one piece of code 
that all incoming documents, regardless of their origin, pass through: the 
update handler and update processors of Solr.

In my case that is the most convenient point to partially modify the documents, 
instead of moving that logic to separate places.

I've seen the ContentStream in SolrQueryRequest and I probably could tear 
incoming data apart and put it back together again, but that would not be as 
easy as working with already deserialized objects such as SolrInputDocument.

UpdateHandler doesn't seem to work on a list of documents; it looks like it 
works on incoming documents one at a time, not on a whole list. I've also looked 
at whether I could buffer a batch in UpdateProcessor, work on the documents, and 
release them, but that seems impossible.

Thanks, 
Markus
 
-Original message-
> From:Erick Erickson 
> Sent: Thursday 3rd November 2016 18:57
> To: solr-user 
> Subject: Re: UpdateProcessor as a batch
> 
> Markus:
> 
> How are you indexing? SolrJ has a client.add(List)
> form, and post.jar lets you add as many documents as you want in a
> batch
> 
> Best,
> Erick
> 
> On Thu, Nov 3, 2016 at 10:18 AM, Markus Jelsma
>  wrote:
> > Hi - i need to process a batch of documents on update but i cannot seem to 
> > find a point where i can hook in and process a list of SolrInputDocuments, 
> > not in UpdateProcessor nor in UpdateHandler.
> >
> > For now i let it go and implemented it on a per-document basis, it is fast, 
> > but i'd prefer batches. Is that possible at all?
> >
> > Thanks,
> > Markus
> 


Re: High CPU Usage in export handler

2016-11-03 Thread Ray Niu
The soft commit interval is 15 seconds and the hard commit interval is 10 minutes.

2016-11-03 11:11 GMT-07:00 Erick Erickson :

> Followup question: You say you're indexing 100 docs/second.  How often
> are you _committing_? Either
> soft commit
> or
> hardcommit with openSearcher=true
>
> ?
>
> Best,
> Erick
>
> On Thu, Nov 3, 2016 at 11:00 AM, Ray Niu  wrote:
> > Thanks Joel
> > here is the information you requested.
> > Are you doing heavy writes at the time?
> > we are doing write very frequently, but not very heavy, we will update
> > about 100 solr document per second.
> > How many concurrent reads are are happening?
> > the concurrent reads are about 1000-2000 per minute per node
> > What version of Solr are you using?
> > we are using solr 5.5.2
> > What is the field definition for the double, is it docValues?
> > the field definition is
> >  > docValues="true"/>
> >
> >
> > 2016-11-03 6:30 GMT-07:00 Joel Bernstein :
> >
> >> Are you doing heavy writes at the time?
> >>
> >> How many concurrent reads are are happening?
> >>
> >> What version of Solr are you using?
> >>
> >> What is the field definition for the double, is it docValues?
> >>
> >>
> >>
> >>
> >> Joel Bernstein
> >> http://joelsolr.blogspot.com/
> >>
> >> On Thu, Nov 3, 2016 at 12:56 AM, Ray Niu  wrote:
> >>
> >> > Hello:
> >> >We are using export handler in Solr Cloud to get some data, we only
> >> > request for one field, which type is tdouble, it works well at the
> >> > beginning, but recently we saw high CPU issue in all the solr cloud
> >> nodes,
> >> > we took some thread dump and found following information:
> >> >
> >> >java.lang.Thread.State: RUNNABLE
> >> >
> >> > at java.lang.Thread.isAlive(Native Method)
> >> >
> >> > at
> >> > org.apache.lucene.util.CloseableThreadLocal.purge(
> >> > CloseableThreadLocal.java:115)
> >> >
> >> > - locked <0x0006e24d86a8> (a java.util.WeakHashMap)
> >> >
> >> > at
> >> > org.apache.lucene.util.CloseableThreadLocal.maybePurge(
> >> > CloseableThreadLocal.java:105)
> >> >
> >> > at
> >> > org.apache.lucene.util.CloseableThreadLocal.get(
> >> > CloseableThreadLocal.java:88)
> >> >
> >> > at
> >> > org.apache.lucene.index.CodecReader.getNumericDocValues(
> >> > CodecReader.java:143)
> >> >
> >> > at
> >> > org.apache.lucene.index.FilterLeafReader.getNumericDocValues(
> >> > FilterLeafReader.java:430)
> >> >
> >> > at
> >> > org.apache.lucene.uninverting.UninvertingReader.getNumericDocValues(
> >> > UninvertingReader.java:239)
> >> >
> >> > at
> >> > org.apache.lucene.index.FilterLeafReader.getNumericDocValues(
> >> > FilterLeafReader.java:430)
> >> >
> >> > Is this a known issue for export handler? As we only fetch up to 5000
> >> > documents, it should not be data volume issue.
> >> >
> >> > Can anyone help on that? Thanks a lot.
> >> >
> >>
>


Re: Posting files 405 http error

2016-11-03 Thread Pablo Anzorena
When I manually copy one collection to another, I copy the core.properties
from the source to the destination with the name core.properties.unloaded
so there is no problem.

So the steps I'm doing are:
1> Index to my source collection.
2> Copy the directory of the source collection, excluding the
core.properties.
3> Copy the core.properties under the name of core.properties.unloaded to
the destination.
4> Create the collection in the destination.
5> Use the ADDREPLICA command to add replicas.

With these steps it throws the error.

They are very similar to those you mentioned, but instead you first create
the destination collection and then copy the data. The problem I face with
your approach is that unless I unload my collection, solr doesn't realize
there is data indexed.

2016-11-03 14:54 GMT-03:00 Erick Erickson :

> Wait. What were you doing originally? Just copying the entire
> SOLR_HOME over or something?
>
> Because one of the things each core carries along is a
> "core.properties" file that identifies
> 1> the name of the core, something like collection_shard1_replica1
> 2> the name of the collection the core belongs to
>
> So if you just copy a directory containing the core.properties file
> from one place to another _and_ they're pointing to the same Zookeeper
> then the behavior is undefined.
>
> And if you _don't_ point to the same zookeeper, your copied collection
> isn't registered with ZK so that's a weird state as well.
>
> If your goal is to move things from one collection to another, here's
> a possibility (assuming the nodes can all "see" each other).
>
> 1> index to your source collection
> 2> create a new destination collection
> 3a> use the "fetchindex" command to move the relevant indexes from the
> source to the destination, see
> https://cwiki.apache.org/confluence/display/solr/Index+
> Replication#IndexReplication-HTTPAPICommandsfortheReplicationHandler
> 3b> instead of <3a>, manually copy the data directory from the source
> to each replica.
> 4> In either 3a> or 3b>, it's probably easier to create a leader-only
> (replicationFactor=1) destination collection then use the ADDREPLICA
> command to add replicas, that way they'll all sync automatically.
>
> Best,
> Erick
>
> On Thu, Nov 3, 2016 at 10:28 AM, Pablo Anzorena 
> wrote:
> > Thanks Shawn.
> >
> > Actually there is no load balancer or proxy in the middle, but even if
> > there was, how would you explain that I can index if a create a
> completely
> > new collection?
> >
> > I figured out how to fix it. What I'm doing is creating a new collection,
> > then unloading it (by unloading all the shards/replicas), then copy the
> > data directory from the collection in the other solrcloud, and finally
> > creating again the collection. It's not the best solution, but it works,
> > nevertheless I still would like to know what's causing the problem...
> >
> > It's worth to mention that I'm not using jetty, I'm using solr-undertow
> > https://github.com/kohesive/solr-undertow
> >
> > 2016-11-03 12:56 GMT-03:00 Shawn Heisey :
> >
> >> On 11/3/2016 9:10 AM, Pablo Anzorena wrote:
> >> > Thanks for the answer.
> >> >
> >> > I checked the log and it wasn't logging anything.
> >> >
> >> > The error i'm facing is way bizarre... I create a new fresh collection
> >> and
> >> > then index with no problem, but it keeps throwing this error if i copy
> >> the
> >> > collection from one solrcloud to the other and then index.
> >> >
> >> > Any clue on why is this happening?
> >>
> >> Solr's source code doesn't seem to even have a 405 error, so I bet
> >> what's happening is that you have Solr sitting behind a proxy or load
> >> balancer, and that server doesn't like the request you sent, so it
> >> rejects it and Solr never receives anything.
> >>
> >> Here's an excerpt of code from the SolrException class in the master
> >> branch:
> >>
> >>   /**
> >>* This list of valid HTTP Status error codes that Solr may return in
> >>* the case of a "Server Side" error.
> >>*
> >>* @since solr 1.2
> >>*/
> >>   public enum ErrorCode {
> >> BAD_REQUEST( 400 ),
> >> UNAUTHORIZED( 401 ),
> >> FORBIDDEN( 403 ),
> >> NOT_FOUND( 404 ),
> >> CONFLICT( 409 ),
> >> UNSUPPORTED_MEDIA_TYPE( 415 ),
> >> SERVER_ERROR( 500 ),
> >> SERVICE_UNAVAILABLE( 503 ),
> >> INVALID_STATE( 510 ),
> >> UNKNOWN(0);
> >> public final int code;
> >>
> >> private ErrorCode( int c )
> >> {
> >>   code = c;
> >> }
> >> public static ErrorCode getErrorCode(int c){
> >>   for (ErrorCode err : values()) {
> >> if(err.code == c) return err;
> >>   }
> >>   return UNKNOWN;
> >> }
> >>   };
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
>


Re: UpdateProcessor as a batch

2016-11-03 Thread Erick Erickson
I _thought_ you'd been around long enough to know about the options I
mentioned ;).

Right. I'd guess you're in UpdateHandler.addDoc and there's really no
batching at that level that I know of. I'm pretty sure that even
indexing batches of 1,000 documents from, say, SolrJ go through this
method.

I don't think there's much to be gained by any batching at this level;
it pretty immediately tells Lucene to index the doc.

FWIW
Erick

On Thu, Nov 3, 2016 at 11:10 AM, Markus Jelsma
 wrote:
> Erick - in this case data can come from anywhere. There is one piece of code 
> all incoming documents, regardless of their origin, are passed thru, the 
> update handler and update processors of Solr.
>
> In my case that is the most convenient point to partially modify the 
> documents, instead of moving that logic to separate places.
>
> I've seen the ContentStream in SolrQueryResponse and i probably could tear 
> incoming data apart and put it back together again, but that would not be so 
> easy as working with already deserialized objects such as SolrInputDocument.
>
> UpdateHandler doesn't seem to work on a list of documents, it looked like it 
> works on incoming stuff, not a whole list. I've also looked if i could buffer 
> a batch in UpdateProcessor, work on them, and release them, but that seems 
> impossible.
>
> Thanks,
> Markus
>
> -Original message-
>> From:Erick Erickson 
>> Sent: Thursday 3rd November 2016 18:57
>> To: solr-user 
>> Subject: Re: UpdateProcessor as a batch
>>
>> Markus:
>>
>> How are you indexing? SolrJ has a client.add(List)
>> form, and post.jar lets you add as many documents as you want in a
>> batch
>>
>> Best,
>> Erick
>>
>> On Thu, Nov 3, 2016 at 10:18 AM, Markus Jelsma
>>  wrote:
>> > Hi - i need to process a batch of documents on update but i cannot seem to 
>> > find a point where i can hook in and process a list of SolrInputDocuments, 
>> > not in UpdateProcessor nor in UpdateHandler.
>> >
>> > For now i let it go and implemented it on a per-document basis, it is 
>> > fast, but i'd prefer batches. Is that possible at all?
>> >
>> > Thanks,
>> > Markus
>>


Re: High CPU Usage in export handler

2016-11-03 Thread Erick Erickson
Followup question: You say you're indexing 100 docs/second.  How often
are you _committing_? Either
soft commit
or
hardcommit with openSearcher=true

?

Best,
Erick

On Thu, Nov 3, 2016 at 11:00 AM, Ray Niu  wrote:
> Thanks Joel
> here is the information you requested.
> Are you doing heavy writes at the time?
> we are doing write very frequently, but not very heavy, we will update
> about 100 solr document per second.
> How many concurrent reads are are happening?
> the concurrent reads are about 1000-2000 per minute per node
> What version of Solr are you using?
> we are using solr 5.5.2
> What is the field definition for the double, is it docValues?
> the field definition is
>  docValues="true"/>
>
>
> 2016-11-03 6:30 GMT-07:00 Joel Bernstein :
>
>> Are you doing heavy writes at the time?
>>
>> How many concurrent reads are are happening?
>>
>> What version of Solr are you using?
>>
>> What is the field definition for the double, is it docValues?
>>
>>
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Thu, Nov 3, 2016 at 12:56 AM, Ray Niu  wrote:
>>
>> > Hello:
>> >We are using export handler in Solr Cloud to get some data, we only
>> > request for one field, which type is tdouble, it works well at the
>> > beginning, but recently we saw high CPU issue in all the solr cloud
>> nodes,
>> > we took some thread dump and found following information:
>> >
>> >java.lang.Thread.State: RUNNABLE
>> >
>> > at java.lang.Thread.isAlive(Native Method)
>> >
>> > at
>> > org.apache.lucene.util.CloseableThreadLocal.purge(
>> > CloseableThreadLocal.java:115)
>> >
>> > - locked <0x0006e24d86a8> (a java.util.WeakHashMap)
>> >
>> > at
>> > org.apache.lucene.util.CloseableThreadLocal.maybePurge(
>> > CloseableThreadLocal.java:105)
>> >
>> > at
>> > org.apache.lucene.util.CloseableThreadLocal.get(
>> > CloseableThreadLocal.java:88)
>> >
>> > at
>> > org.apache.lucene.index.CodecReader.getNumericDocValues(
>> > CodecReader.java:143)
>> >
>> > at
>> > org.apache.lucene.index.FilterLeafReader.getNumericDocValues(
>> > FilterLeafReader.java:430)
>> >
>> > at
>> > org.apache.lucene.uninverting.UninvertingReader.getNumericDocValues(
>> > UninvertingReader.java:239)
>> >
>> > at
>> > org.apache.lucene.index.FilterLeafReader.getNumericDocValues(
>> > FilterLeafReader.java:430)
>> >
>> > Is this a known issue for export handler? As we only fetch up to 5000
>> > documents, it should not be data volume issue.
>> >
>> > Can anyone help on that? Thanks a lot.
>> >
>>


Re: Problem with Password Decryption in Data Import Handler

2016-11-03 Thread William Bell
I cannot get it to work either.

Here are my steps. I took the key from the Patch in
https://issues.apache.org/jira/secure/attachment/12730862/SOLR-4392.patch.

echo U2FsdGVkX19Gz7q7/4jj3Wsin7801TlFbob1PBT2YEacbPEUARDiuV5zGSAwU4Sz7upXDEPIQPU48oY1fBWM6Q== > pass.enc

openssl aes-128-cbc -d -a -salt -in pass.enc

I typed: Password

enter aes-128-cbc decryption password:

SomeRandomEncryptedTextUsingAES128

I cannot find a test case in the latest v5.5.3 code. It seems like the openssl
command is wrong?

So it worked for that. Not sure if the code changed, but after doing this I
get in solr.log:


2016-11-03 12:06:20.139 INFO  (Thread-127) [   x:autosuggestfull]
o.a.s.u.p.LogUpdateProcessorFactory [autosuggestfull]  webapp=/solr
path=/dataimport
params={debug=false&optimize=false&indent=true&commit=false&clean=false&wt=json&command=full-import&entity=spec&verbose=false}
status=0 QTime=19{} 0 64

2016-11-03 12:06:20.140 ERROR (Thread-127) [   x:autosuggestfull]
o.a.s.h.d.DataImporter Full Import failed:java.lang.RuntimeException:
java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException: Error
decoding password Processing Document # 1

at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:270)

at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)

at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)

at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461)

Caused by: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException: Error
decoding password Processing Document # 1

at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:416)

at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)

at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)

... 3 more

Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
Error decoding password Processing Document # 1

at
org.apache.solr.handler.dataimport.JdbcDataSource.decryptPwd(JdbcDataSource.java:131)

at
org.apache.solr.handler.dataimport.JdbcDataSource.init(JdbcDataSource.java:74)

at
org.apache.solr.handler.dataimport.DataImporter.getDataSourceInstance(DataImporter.java:389)

at
org.apache.solr.handler.dataimport.ContextImpl.getDataSource(ContextImpl.java:100)

at
org.apache.solr.handler.dataimport.SqlEntityProcessor.init(SqlEntityProcessor.java:53)

at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:75)

at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:433)

at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)

... 5 more

Caused by: java.lang.IllegalStateException: Bad password, algorithm, mode
or padding; no salt, wrong number of iterations or corrupted ciphertext.

at org.apache.solr.util.CryptoKeys.decodeAES(CryptoKeys.java:249)

at org.apache.solr.util.CryptoKeys.decodeAES(CryptoKeys.java:195)

at
org.apache.solr.handler.dataimport.JdbcDataSource.decryptPwd(JdbcDataSource.java:129)

... 12 more

Caused by: javax.crypto.BadPaddingException: Given final block not properly
padded

at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:975)

at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:833)

at
com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:446)

at javax.crypto.Cipher.doFinal(Cipher.java:2165)

at org.apache.solr.util.CryptoKeys.decodeAES(CryptoKeys.java:245)

... 14 more


2016-11-03 12:06:20.140 INFO  (Thread-127) [   x:autosuggestfull]
o.a.s.u.DirectUpdateHandler2 start rollback{}

2016-11-03 12:06:20.140 INFO  (Thread-127) [   x:autosuggestfull]
o.a.s.u.DefaultSolrCoreState Rollback old IndexWriter...
core=autosuggestfull

2016-11-03 12:06:20.154 INFO  (Thread-127) [   x:autosuggestfull]
o.a.s.c.SolrDeletionPolicy SolrDeletionPolicy.onInit: commits: num=1



On Wed, Nov 2, 2016 at 12:21 PM, Jamie Jackson  wrote:

> I'm at a brick wall. Here's the latest status:
>
> Here are some sample commands that I'm using:
>
> *Create the encryptKeyFile and encrypted password:*
>
>
> encrypter_password='this_is_my_encrypter_password'
> plain_db_pw='Oakton153'
>
> cd /var/docker/solr_stage2/credentials/
> echo -n "${encrypter_password}" > encpwd.txt
> echo -n "${plain_db_pwd}" > plaindbpwd.txt
> openssl enc -aes-128-cbc -a -salt -in plaindbpwd.txt -k
> "${encrypter_password}"
>
> rm plaindbpwd.txt
>
> That generated this as the password, by the way:
>
> U2FsdGVkX19pBVTeZaSl43gFFAlrx+Th1zSg1GvlX9o=
>
> *Configure DIH configuration:*
>
> 
>
>  driver="org.mariadb.jdbc.Driver"
> url="jdbc:mysql://local.mysite.com:3306/mysite"
> user="root"
> password="U2FsdGVkX19pBVTeZa

Re: Problem with Password Decryption in Data Import Handler

2016-11-03 Thread William Bell
OK, it was the key file: echo without -n appends a trailing newline, which
corrupts the key read from encryptKeyFile. Writing it with

echo -n "${encrypt_key}" > encrypt.key

fixed it.



On Thu, Nov 3, 2016 at 12:20 PM, William Bell  wrote:

> I cannot get it to work either.
>
> Here are my steps. I took the key from the Patch in
> https://issues.apache.org/jira/secure/attachment/12730862/SOLR-4392.patch.
>
> echo U2FsdGVkX19Gz7q7/4jj3Wsin7801TlFbob1PBT2YEacbPE
> UARDiuV5zGSAwU4Sz7upXDEPIQPU48oY1fBWM6Q== > pass.enc
>
> openssl aes-128-cbc -d -a -salt -in pass.enc
>
> I typed: Password
>
> enter aes-128-cbc decryption password:
>
> SomeRandomEncryptedTextUsingAES128
>
> I cannot find a test case in the latest v5.5.3 code.? It seems like
> openssl command is wrong?
>
> So it worked for that. Not sure if the code changed, but after doing this
> I get in solr.log:
>
>
> 2016-11-03 12:06:20.139 INFO  (Thread-127) [   x:autosuggestfull]
> o.a.s.u.p.LogUpdateProcessorFactory [autosuggestfull]  webapp=/solr
> path=/dataimport params={debug=false&optimize=false&indent=true&commit=
> false&clean=false&wt=json&command=full-import&entity=spec&verbose=false}
> status=0 QTime=19{} 0 64
>
> 2016-11-03 12:06:20.140 ERROR (Thread-127) [   x:autosuggestfull]
> o.a.s.h.d.DataImporter Full Import failed:java.lang.RuntimeException:
> java.lang.RuntimeException: 
> org.apache.solr.handler.dataimport.DataImportHandlerException:
> Error decoding password Processing Document # 1
>
> at org.apache.solr.handler.dataimport.DocBuilder.execute(
> DocBuilder.java:270)
>
> at org.apache.solr.handler.dataimport.DataImporter.
> doFullImport(DataImporter.java:416)
>
> at org.apache.solr.handler.dataimport.DataImporter.
> runCmd(DataImporter.java:480)
>
> at org.apache.solr.handler.dataimport.DataImporter$1.run(
> DataImporter.java:461)
>
> Caused by: java.lang.RuntimeException: 
> org.apache.solr.handler.dataimport.DataImportHandlerException:
> Error decoding password Processing Document # 1
>
> at org.apache.solr.handler.dataimport.DocBuilder.
> buildDocument(DocBuilder.java:416)
>
> at org.apache.solr.handler.dataimport.DocBuilder.
> doFullDump(DocBuilder.java:329)
>
> at org.apache.solr.handler.dataimport.DocBuilder.execute(
> DocBuilder.java:232)
>
> ... 3 more
>
> Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
> Error decoding password Processing Document # 1
>
> at org.apache.solr.handler.dataimport.JdbcDataSource.
> decryptPwd(JdbcDataSource.java:131)
>
> at org.apache.solr.handler.dataimport.JdbcDataSource.
> init(JdbcDataSource.java:74)
>
> at org.apache.solr.handler.dataimport.DataImporter.
> getDataSourceInstance(DataImporter.java:389)
>
> at org.apache.solr.handler.dataimport.ContextImpl.
> getDataSource(ContextImpl.java:100)
>
> at org.apache.solr.handler.dataimport.SqlEntityProcessor.
> init(SqlEntityProcessor.java:53)
>
> at org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(
> EntityProcessorWrapper.java:75)
>
> at org.apache.solr.handler.dataimport.DocBuilder.
> buildDocument(DocBuilder.java:433)
>
> at org.apache.solr.handler.dataimport.DocBuilder.
> buildDocument(DocBuilder.java:414)
>
> ... 5 more
>
> Caused by: java.lang.IllegalStateException: Bad password, algorithm, mode
> or padding; no salt, wrong number of iterations or corrupted ciphertext.
>
> at org.apache.solr.util.CryptoKeys.decodeAES(CryptoKeys.java:249)
>
> at org.apache.solr.util.CryptoKeys.decodeAES(CryptoKeys.java:195)
>
> at org.apache.solr.handler.dataimport.JdbcDataSource.
> decryptPwd(JdbcDataSource.java:129)
>
> ... 12 more
>
> Caused by: javax.crypto.BadPaddingException: Given final block not
> properly padded
>
> at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:975)
>
> at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:833)
>
> at com.sun.crypto.provider.AESCipher.engineDoFinal(
> AESCipher.java:446)
>
> at javax.crypto.Cipher.doFinal(Cipher.java:2165)
>
> at org.apache.solr.util.CryptoKeys.decodeAES(CryptoKeys.java:245)
>
> ... 14 more
>
>
> 2016-11-03 12:06:20.140 INFO  (Thread-127) [   x:autosuggestfull]
> o.a.s.u.DirectUpdateHandler2 start rollback{}
>
> 2016-11-03 12:06:20.140 INFO  (Thread-127) [   x:autosuggestfull]
> o.a.s.u.DefaultSolrCoreState Rollback old IndexWriter...
> core=autosuggestfull
>
> 2016-11-03 12:06:20.154 INFO  (Thread-127) [   x:autosuggestfull]
> o.a.s.c.SolrDeletionPolicy SolrDeletionPolicy.onInit: commits: num=1
>
>
>
> On Wed, Nov 2, 2016 at 12:21 PM, Jamie Jackson 
> wrote:
>
>> I'm at a brick wall. Here's the latest status:
>>
>> Here are some sample commands that I'm using:
>>
>> *Create the encryptKeyFile and encrypted password:*
>>
>>
>> encrypter_password='this_is_my_encrypter_password'
>> plain_db_pw='Oakton153'
>>
>> cd /var/docker/solr_stage2/credentials/
>> echo -n "${encrypter_password}" > encpwd.txt
>> echo -n "${plain_db_pwd}" > plaindbpwd.txt
>> openssl enc -ae

RE: UpdateProcessor as a batch

2016-11-03 Thread Markus Jelsma
Hi - I believe I did not explain myself well enough.

Getting the data into Solr is not a problem; various sources index docs to Solr, 
all in proper batches, as everyone should. The thing is that I need to do some 
preprocessing before the documents are indexed. Normally, UpdateProcessors are the 
way to go; I've written quite a few of them and they work fine.

The problem is that I need to do a remote lookup for each document being indexed. 
Right now, I make an external connection for each doc being indexed in the 
current UpdateProcessor. This is still fast, but the remote backend supports 
batched lookups, which are faster.

This is why I'd love to be able to buffer documents in an UpdateProcessor and, 
once enough have accumulated, do a single remote lookup for all of them, do some 
processing, and let them be indexed.

Thanks,
Markus
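
For illustration only, the two call patterns being compared could be sketched
like this (RemoteLookup and its methods are hypothetical names, not a real API):

import java.io.IOException;
import java.util.List;
import java.util.Map;

/** Hypothetical remote-lookup API, sketched only to show the contrast. */
public interface RemoteLookup {

  /** Current approach: one round trip per document being indexed. */
  Map<String, Object> lookupOne(String docId) throws IOException;

  /** Desired approach: one round trip per buffered batch of documents. */
  Map<String, Map<String, Object>> lookupBatch(List<String> docIds) throws IOException;
}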

 
 
-Original message-
> From:Erick Erickson 
> Sent: Thursday 3rd November 2016 19:18
> To: solr-user 
> Subject: Re: UpdateProcessor as a batch
> 
> I _thought_ you'd been around long enough to know about the options I
> mentioned ;).
> 
> Right. I'd guess you're in UpdateHandler.addDoc and there's really no
> batching at that level that I know of. I'm pretty sure that even
> indexing batches of 1,000 documents from, say, SolrJ go through this
> method.
> 
> I don't think there's much to be gained by any batching at this level,
> it pretty immediately tells Lucene to index the doc.
> 
> FWIW
> Erick
> 
> On Thu, Nov 3, 2016 at 11:10 AM, Markus Jelsma
>  wrote:
> > Erick - in this case data can come from anywhere. There is one piece of 
> > code all incoming documents, regardless of their origin, are passed thru, 
> > the update handler and update processors of Solr.
> >
> > In my case that is the most convenient point to partially modify the 
> > documents, instead of moving that logic to separate places.
> >
> > I've seen the ContentStream in SolrQueryResponse and i probably could tear 
> > incoming data apart and put it back together again, but that would not be 
> > so easy as working with already deserialized objects such as 
> > SolrInputDocument.
> >
> > UpdateHandler doesn't seem to work on a list of documents, it looked like 
> > it works on incoming stuff, not a whole list. I've also looked if i could 
> > buffer a batch in UpdateProcessor, work on them, and release them, but that 
> > seems impossible.
> >
> > Thanks,
> > Markus
> >
> > -Original message-
> >> From:Erick Erickson 
> >> Sent: Thursday 3rd November 2016 18:57
> >> To: solr-user 
> >> Subject: Re: UpdateProcessor as a batch
> >>
> >> Markus:
> >>
> >> How are you indexing? SolrJ has a client.add(List)
> >> form, and post.jar lets you add as many documents as you want in a
> >> batch
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Nov 3, 2016 at 10:18 AM, Markus Jelsma
> >>  wrote:
> >> > Hi - i need to process a batch of documents on update but i cannot seem 
> >> > to find a point where i can hook in and process a list of 
> >> > SolrInputDocuments, not in UpdateProcessor nor in UpdateHandler.
> >> >
> >> > For now i let it go and implemented it on a per-document basis, it is 
> >> > fast, but i'd prefer batches. Is that possible at all?
> >> >
> >> > Thanks,
> >> > Markus
> >>
> 


child doc filter

2016-11-03 Thread Tim Williams
I'm using the BlockJoinQuery to query child docs and return the
parent. I'd like to have the equivalent of a filter that applies to
the child docs, and I don't see a way to do that with the BlockJoin stuff.
It looks like I could modify it to accept some childFilter param and
add a QueryWrapperFilter right after the child query is created [1], but
before I did that, I wanted to see if there's a built-in way to
achieve the same behavior?

Thanks,
--tim

[1] - 
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/search/join/BlockJoinParentQParser.java#L69


Re: UpdateProcessor as a batch

2016-11-03 Thread Erick Erickson
I thought we might be talking past each other...

I think you're into "roll your own" here. Anything that
accumulated docs for a while, did a batch lookup
on the external system, then passed on the docs
runs the risk of losing docs if the server is abnormally
shut down.

I guess ideally you'd like to augment the list coming in
rather than the docs once they're removed from the
incoming batch and passed on, but I admit I have no
clue where to do that. Possibly in an update chain? If
so, you'd need to be careful to only augment when
they'd reached their final shard leader or all at once
before distribution to shard leaders.

Is the expense for the external lookup doing the actual
lookups or establishing the connection? Would
having some kind of shared connection to the external
source be worthwhile?

FWIW,
Erick

On Thu, Nov 3, 2016 at 12:06 PM, Markus Jelsma
 wrote:
> Hi - i believe i did not explain myself well enough.
>
> Getting the data in Solr is not a problem, various sources index docs to 
> Solr, all in fine batches as everyone should do indeed. The thing is that i 
> need to do some preprocessing before it is indexed. Normally, 
> UpdateProcessors are the way to go. I've made quite a few of them and they 
> work fine.
>
> The problem is, i need to do a remote lookup for each document being indexed. 
> Right now, i make an external connection for each doc being indexed in the 
> current UpdateProcessor. This is still fast. But the remote backend supports 
> batched lookups, which are faster.
>
> This is why i'd love to be able to buffer documents in an UpdateProcessor, 
> and if there are enough, i do a remote lookup for all of them, do some 
> processing and let them be indexed.
>
> Thanks,
> Markus
>
>
>
> -Original message-
>> From:Erick Erickson 
>> Sent: Thursday 3rd November 2016 19:18
>> To: solr-user 
>> Subject: Re: UpdateProcessor as a batch
>>
>> I _thought_ you'd been around long enough to know about the options I
>> mentioned ;).
>>
>> Right. I'd guess you're in UpdateHandler.addDoc and there's really no
>> batching at that level that I know of. I'm pretty sure that even
>> indexing batches of 1,000 documents from, say, SolrJ go through this
>> method.
>>
>> I don't think there's much to be gained by any batching at this level,
>> it pretty immediately tells Lucene to index the doc.
>>
>> FWIW
>> Erick
>>
>> On Thu, Nov 3, 2016 at 11:10 AM, Markus Jelsma
>>  wrote:
>> > Erick - in this case data can come from anywhere. There is one piece of 
>> > code all incoming documents, regardless of their origin, are passed thru, 
>> > the update handler and update processors of Solr.
>> >
>> > In my case that is the most convenient point to partially modify the 
>> > documents, instead of moving that logic to separate places.
>> >
>> > I've seen the ContentStream in SolrQueryResponse and i probably could tear 
>> > incoming data apart and put it back together again, but that would not be 
>> > so easy as working with already deserialized objects such as 
>> > SolrInputDocument.
>> >
>> > UpdateHandler doesn't seem to work on a list of documents, it looked like 
>> > it works on incoming stuff, not a whole list. I've also looked if i could 
>> > buffer a batch in UpdateProcessor, work on them, and release them, but 
>> > that seems impossible.
>> >
>> > Thanks,
>> > Markus
>> >
>> > -Original message-
>> >> From:Erick Erickson 
>> >> Sent: Thursday 3rd November 2016 18:57
>> >> To: solr-user 
>> >> Subject: Re: UpdateProcessor as a batch
>> >>
>> >> Markus:
>> >>
>> >> How are you indexing? SolrJ has a client.add(List)
>> >> form, and post.jar lets you add as many documents as you want in a
>> >> batch
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Thu, Nov 3, 2016 at 10:18 AM, Markus Jelsma
>> >>  wrote:
>> >> > Hi - i need to process a batch of documents on update but i cannot seem 
>> >> > to find a point where i can hook in and process a list of 
>> >> > SolrInputDocuments, not in UpdateProcessor nor in UpdateHandler.
>> >> >
>> >> > For now i let it go and implemented it on a per-document basis, it is 
>> >> > fast, but i'd prefer batches. Is that possible at all?
>> >> >
>> >> > Thanks,
>> >> > Markus
>> >>
>>


Re: UpdateProcessor as a batch

2016-11-03 Thread mike st. john
Maybe introduce a distributed queue such as Apache Ignite, Hazelcast, or
even Redis. Read from the queue in batches, do your lookup, then index the
same batch.

Just a thought.

Mike St. John.

On Nov 3, 2016 3:58 PM, "Erick Erickson"  wrote:

> I thought we might be talking past each other...
>
> I think you're into "roll your own" here. Anything that
> accumulated docs for a while, did a batch lookup
> on the external system, then passed on the docs
> runs the risk of losing docs if the server is abnormally
> shut down.
>
> I guess ideally you'd like to augment the list coming in
> rather than the docs once they're removed from the
> incoming batch and passed on, but I admit I have no
> clue where to do that. Possibly in an update chain? If
> so, you'd need to be careful to only augment when
> they'd reached their final shard leader or all at once
> before distribution to shard leaders.
>
> Is the expense for the external lookup doing the actual
> lookups or establishing the connection? Would
> having some kind of shared connection to the external
> source be worthwhile?
>
> FWIW,
> Erick
>
> On Thu, Nov 3, 2016 at 12:06 PM, Markus Jelsma
>  wrote:
> > Hi - i believe i did not explain myself well enough.
> >
> > Getting the data in Solr is not a problem, various sources index docs to
> Solr, all in fine batches as everyone should do indeed. The thing is that i
> need to do some preprocessing before it is indexed. Normally,
> UpdateProcessors are the way to go. I've made quite a few of them and they
> work fine.
> >
> > The problem is, i need to do a remote lookup for each document being
> indexed. Right now, i make an external connection for each doc being
> indexed in the current UpdateProcessor. This is still fast. But the remote
> backend supports batched lookups, which are faster.
> >
> > This is why i'd love to be able to buffer documents in an
> UpdateProcessor, and if there are enough, i do a remote lookup for all of
> them, do some processing and let them be indexed.
> >
> > Thanks,
> > Markus
> >
> >
> >
> > -Original message-
> >> From:Erick Erickson 
> >> Sent: Thursday 3rd November 2016 19:18
> >> To: solr-user 
> >> Subject: Re: UpdateProcessor as a batch
> >>
> >> I _thought_ you'd been around long enough to know about the options I
> >> mentioned ;).
> >>
> >> Right. I'd guess you're in UpdateHandler.addDoc and there's really no
> >> batching at that level that I know of. I'm pretty sure that even
> >> indexing batches of 1,000 documents from, say, SolrJ go through this
> >> method.
> >>
> >> I don't think there's much to be gained by any batching at this level,
> >> it pretty immediately tells Lucene to index the doc.
> >>
> >> FWIW
> >> Erick
> >>
> >> On Thu, Nov 3, 2016 at 11:10 AM, Markus Jelsma
> >>  wrote:
> >> > Erick - in this case data can come from anywhere. There is one piece
> of code all incoming documents, regardless of their origin, are passed
> thru, the update handler and update processors of Solr.
> >> >
> >> > In my case that is the most convenient point to partially modify the
> documents, instead of moving that logic to separate places.
> >> >
> >> > I've seen the ContentStream in SolrQueryResponse and i probably could
> tear incoming data apart and put it back together again, but that would not
> be so easy as working with already deserialized objects such as
> SolrInputDocument.
> >> >
> >> > UpdateHandler doesn't seem to work on a list of documents, it looked
> like it works on incoming stuff, not a whole list. I've also looked if i
> could buffer a batch in UpdateProcessor, work on them, and release them,
> but that seems impossible.
> >> >
> >> > Thanks,
> >> > Markus
> >> >
> >> > -Original message-
> >> >> From:Erick Erickson 
> >> >> Sent: Thursday 3rd November 2016 18:57
> >> >> To: solr-user 
> >> >> Subject: Re: UpdateProcessor as a batch
> >> >>
> >> >> Markus:
> >> >>
> >> >> How are you indexing? SolrJ has a client.add(List<
> SolrInputDocument>)
> >> >> form, and post.jar lets you add as many documents as you want in a
> >> >> batch
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Thu, Nov 3, 2016 at 10:18 AM, Markus Jelsma
> >> >>  wrote:
> >> >> > Hi - i need to process a batch of documents on update but i cannot
> seem to find a point where i can hook in and process a list of
> SolrInputDocuments, not in UpdateProcessor nor in UpdateHandler.
> >> >> >
> >> >> > For now i let it go and implemented it on a per-document basis, it
> is fast, but i'd prefer batches. Is that possible at all?
> >> >> >
> >> >> > Thanks,
> >> >> > Markus
> >> >>
> >>
>


Re: UpdateProcessor as a batch

2016-11-03 Thread Alexandre Rafalovitch
How big a batch are we talking about?

Because I believe you could accumulate the docs in the first URP in
processAdd and then do the batch lookup and the actual processing of
them in processCommit.

URPs are daisy-chained, so as long as you are holding on to the docs,
the rest of the chain doesn't run.

Obviously you are relying on the commit here to trigger the final call.

Or you could do a two-collection sequence: index into the first
collection, query it for whatever you need for the batch lookup, and then
do a Collection-to-Collection enhanced copy.

Regards,
   Alex.

Solr Example reading group is starting November 2016, join us at
http://j.mp/SolrERG
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/
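
A minimal sketch of that accumulate-and-flush idea, against Solr's
UpdateRequestProcessor API and the hypothetical RemoteLookup interface sketched
earlier in the thread (the batch size and field names are made up, and this
ignores the distributed/shard-leader placement concerns Erick raised). Flushing
in finish() as well as processCommit() avoids relying on the commit alone:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.CommitUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

// Sketch only: buffer incoming docs, enrich them with one remote call per
// batch, then forward them down the processor chain.
public class BatchedLookupProcessor extends UpdateRequestProcessor {

  private static final int BATCH_SIZE = 100;                  // made-up value
  private final List<SolrInputDocument> buffer = new ArrayList<>();
  private final RemoteLookup lookup;                          // hypothetical client
  private SolrQueryRequest req;                               // request owning the buffered docs

  public BatchedLookupProcessor(RemoteLookup lookup, UpdateRequestProcessor next) {
    super(next);
    this.lookup = lookup;
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    req = cmd.getReq();
    // Copy the doc: some loaders reuse the AddUpdateCommand instance.
    buffer.add(cmd.getSolrInputDocument().deepCopy());
    if (buffer.size() >= BATCH_SIZE) {
      flush();
    }
  }

  @Override
  public void processCommit(CommitUpdateCommand cmd) throws IOException {
    flush();                        // don't leave buffered docs behind on commit
    super.processCommit(cmd);
  }

  @Override
  public void finish() throws IOException {
    flush();                        // ...or when the update request ends
    super.finish();
  }

  private void flush() throws IOException {
    if (buffer.isEmpty()) {
      return;
    }
    List<String> ids = new ArrayList<>();
    for (SolrInputDocument doc : buffer) {
      ids.add((String) doc.getFieldValue("id"));
    }
    // One remote round trip for the whole batch (hypothetical API).
    Map<String, Map<String, Object>> extra = lookup.lookupBatch(ids);
    for (SolrInputDocument doc : buffer) {
      Map<String, Object> fields = extra.get(doc.getFieldValue("id"));
      if (fields != null) {
        fields.forEach(doc::setField);
      }
      AddUpdateCommand out = new AddUpdateCommand(req);
      out.solrDoc = doc;
      super.processAdd(out);        // now the rest of the chain sees it
    }
    buffer.clear();
  }
}

The trade-off is the one Erick points out: anything buffered but not yet
flushed is lost if the node dies before finish() or a commit runs.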


On 4 November 2016 at 07:35, mike st. john  wrote:
> maybe introduce a distributed queue such as apache ignite,  hazelcast or
> even redis.   Read from the queue in batches, do your lookup then index the
> same batch.
>
> just a thought.
>
> Mike St. John.
>
> On Nov 3, 2016 3:58 PM, "Erick Erickson"  wrote:
>
>> I thought we might be talking past each other...
>>
>> I think you're into "roll your own" here. Anything that
>> accumulated docs for a while, did a batch lookup
>> on the external system, then passed on the docs
>> runs the risk of losing docs if the server is abnormally
>> shut down.
>>
>> I guess ideally you'd like to augment the list coming in
>> rather than the docs once they're removed from the
>> incoming batch and passed on, but I admit I have no
>> clue where to do that. Possibly in an update chain? If
>> so, you'd need to be careful to only augment when
>> they'd reached their final shard leader or all at once
>> before distribution to shard leaders.
>>
>> Is the expense for the external lookup doing the actual
>> lookups or establishing the connection? Would
>> having some kind of shared connection to the external
>> source be worthwhile?
>>
>> FWIW,
>> Erick
>>
>> On Thu, Nov 3, 2016 at 12:06 PM, Markus Jelsma
>>  wrote:
>> > Hi - i believe i did not explain myself well enough.
>> >
>> > Getting the data in Solr is not a problem, various sources index docs to
>> Solr, all in fine batches as everyone should do indeed. The thing is that i
>> need to do some preprocessing before it is indexed. Normally,
>> UpdateProcessors are the way to go. I've made quite a few of them and they
>> work fine.
>> >
>> > The problem is, i need to do a remote lookup for each document being
>> indexed. Right now, i make an external connection for each doc being
>> indexed in the current UpdateProcessor. This is still fast. But the remote
>> backend supports batched lookups, which are faster.
>> >
>> > This is why i'd love to be able to buffer documents in an
>> UpdateProcessor, and if there are enough, i do a remote lookup for all of
>> them, do some processing and let them be indexed.
>> >
>> > Thanks,
>> > Markus
>> >
>> >
>> >
>> > -Original message-
>> >> From:Erick Erickson 
>> >> Sent: Thursday 3rd November 2016 19:18
>> >> To: solr-user 
>> >> Subject: Re: UpdateProcessor as a batch
>> >>
>> >> I _thought_ you'd been around long enough to know about the options I
>> >> mentioned ;).
>> >>
>> >> Right. I'd guess you're in UpdateHandler.addDoc and there's really no
>> >> batching at that level that I know of. I'm pretty sure that even
>> >> indexing batches of 1,000 documents from, say, SolrJ go through this
>> >> method.
>> >>
>> >> I don't think there's much to be gained by any batching at this level,
>> >> it pretty immediately tells Lucene to index the doc.
>> >>
>> >> FWIW
>> >> Erick
>> >>
>> >> On Thu, Nov 3, 2016 at 11:10 AM, Markus Jelsma
>> >>  wrote:
>> >> > Erick - in this case data can come from anywhere. There is one piece
>> of code all incoming documents, regardless of their origin, are passed
>> thru, the update handler and update processors of Solr.
>> >> >
>> >> > In my case that is the most convenient point to partially modify the
>> documents, instead of moving that logic to separate places.
>> >> >
>> >> > I've seen the ContentStream in SolrQueryResponse and i probably could
>> tear incoming data apart and put it back together again, but that would not
>> be so easy as working with already deserialized objects such as
>> SolrInputDocument.
>> >> >
>> >> > UpdateHandler doesn't seem to work on a list of documents, it looked
>> like it works on incoming stuff, not a whole list. I've also looked if i
>> could buffer a batch in UpdateProcessor, work on them, and release them,
>> but that seems impossible.
>> >> >
>> >> > Thanks,
>> >> > Markus
>> >> >
>> >> > -Original message-
>> >> >> From:Erick Erickson 
>> >> >> Sent: Thursday 3rd November 2016 18:57
>> >> >> To: solr-user 
>> >> >> Subject: Re: UpdateProcessor as a batch
>> >> >>
>> >> >> Markus:
>> >> >>
>> >> >> How are you indexing? SolrJ has a client.add(List<
>> SolrInputDocument>)
>> >> >> form, and post.jar lets you add as many documents as you want in a
>> >> >

Re: UpdateProcessor as a batch

2016-11-03 Thread Joel Bernstein
This might be useful. In this scenario you load your content into Solr for
staging and perform your ETL from Solr to Solr:

http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html

Basically Solr becomes a text processing warehouse.
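
As a rough illustration of that pattern (collection names, fields, and the URL
are made up, and recent SolrJ versions take SolrParams here where older 6.x
releases took a Map), a batch copy from a staging collection into a target
collection could look something like this:

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.SolrStream;
import org.apache.solr.common.params.ModifiableSolrParams;

// Sketch: send a streaming expression to the /stream handler and drain the
// tuples so the wrapped update() stream runs to completion.
public class SolrToSolrEtl {
  public static void main(String[] args) throws Exception {
    String expr = "update(targetColl, batchSize=250,"
        + " search(stagingColl, q=\"*:*\", fl=\"id,body_t\","
        + " sort=\"id asc\", qt=\"/export\"))";

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("expr", expr);
    params.set("qt", "/stream");

    SolrStream stream = new SolrStream("http://localhost:8983/solr/stagingColl", params);
    try {
      stream.open();
      while (true) {
        Tuple tuple = stream.read();
        if (tuple.EOF) {
          break;                    // all batches have been written to targetColl
        }
      }
    } finally {
      stream.close();
    }
  }
}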

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Nov 3, 2016 at 5:05 PM, Alexandre Rafalovitch 
wrote:

> How big a batch we are talking about?
>
> Because I believe you could accumulate the docs in the first URP in
> the processAdd and then do the batch lookup and actually processing of
> them on processCommit.
>
> They are daisy chain, so as long as you are holding on to the chain,
> the rest of the URPs don't happen.
>
> Obviously you are relying on the commit here to trigger the final call.
>
> Or you could do a two collection sequence with indexing to first
> collection, querying for whatever you need to batch lookup and then
> doing Collection-to-Collection enhanced copy.
>
> Regards,
>Alex.
> 
> Solr Example reading group is starting November 2016, join us at
> http://j.mp/SolrERG
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 4 November 2016 at 07:35, mike st. john  wrote:
> > maybe introduce a distributed queue such as apache ignite,  hazelcast or
> > even redis.   Read from the queue in batches, do your lookup then index
> the
> > same batch.
> >
> > just a thought.
> >
> > Mike St. John.
> >
> > On Nov 3, 2016 3:58 PM, "Erick Erickson" 
> wrote:
> >
> >> I thought we might be talking past each other...
> >>
> >> I think you're into "roll your own" here. Anything that
> >> accumulated docs for a while, did a batch lookup
> >> on the external system, then passed on the docs
> >> runs the risk of losing docs if the server is abnormally
> >> shut down.
> >>
> >> I guess ideally you'd like to augment the list coming in
> >> rather than the docs once they're removed from the
> >> incoming batch and passed on, but I admit I have no
> >> clue where to do that. Possibly in an update chain? If
> >> so, you'd need to be careful to only augment when
> >> they'd reached their final shard leader or all at once
> >> before distribution to shard leaders.
> >>
> >> Is the expense for the external lookup doing the actual
> >> lookups or establishing the connection? Would
> >> having some kind of shared connection to the external
> >> source be worthwhile?
> >>
> >> FWIW,
> >> Erick
> >>
> >> On Thu, Nov 3, 2016 at 12:06 PM, Markus Jelsma
> >>  wrote:
> >> > Hi - i believe i did not explain myself well enough.
> >> >
> >> > Getting the data in Solr is not a problem, various sources index docs
> to
> >> Solr, all in fine batches as everyone should do indeed. The thing is
> that i
> >> need to do some preprocessing before it is indexed. Normally,
> >> UpdateProcessors are the way to go. I've made quite a few of them and
> they
> >> work fine.
> >> >
> >> > The problem is, i need to do a remote lookup for each document being
> >> indexed. Right now, i make an external connection for each doc being
> >> indexed in the current UpdateProcessor. This is still fast. But the
> remote
> >> backend supports batched lookups, which are faster.
> >> >
> >> > This is why i'd love to be able to buffer documents in an
> >> UpdateProcessor, and if there are enough, i do a remote lookup for all
> of
> >> them, do some processing and let them be indexed.
> >> >
> >> > Thanks,
> >> > Markus
> >> >
> >> >
> >> >
> >> > -Original message-
> >> >> From:Erick Erickson 
> >> >> Sent: Thursday 3rd November 2016 19:18
> >> >> To: solr-user 
> >> >> Subject: Re: UpdateProcessor as a batch
> >> >>
> >> >> I _thought_ you'd been around long enough to know about the options I
> >> >> mentioned ;).
> >> >>
> >> >> Right. I'd guess you're in UpdateHandler.addDoc and there's really no
> >> >> batching at that level that I know of. I'm pretty sure that even
> >> >> indexing batches of 1,000 documents from, say, SolrJ go through this
> >> >> method.
> >> >>
> >> >> I don't think there's much to be gained by any batching at this
> level,
> >> >> it pretty immediately tells Lucene to index the doc.
> >> >>
> >> >> FWIW
> >> >> Erick
> >> >>
> >> >> On Thu, Nov 3, 2016 at 11:10 AM, Markus Jelsma
> >> >>  wrote:
> >> >> > Erick - in this case data can come from anywhere. There is one
> piece
> >> of code all incoming documents, regardless of their origin, are passed
> >> thru, the update handler and update processors of Solr.
> >> >> >
> >> >> > In my case that is the most convenient point to partially modify
> the
> >> documents, instead of moving that logic to separate places.
> >> >> >
> >> >> > I've seen the ContentStream in SolrQueryResponse and i probably
> could
> >> tear incoming data apart and put it back together again, but that would
> not
> >> be so easy as working with already deserialized objects such as
> >> SolrInputDocument.
> >> >> >
> >> >> > UpdateHandler doesn't seem to work on a list of documents, it
> look

Re: child doc filter

2016-11-03 Thread Mikhail Khludnev
Hello Tim,

I think
http://blog-archive.griddynamics.com/2013/12/grandchildren-and-siblings-with-block.html
provides a few relevant examples. To summarize, it's worth using the nested
query syntax {!parent ... v=$foo} to nest a complex child clause. If you
need to exploit the filter cache, use the filter(foo:bar) syntax.
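
For example (field names are made up, and this assumes the documents were
indexed as parent/child blocks and a 6.x SolrJ client), the extra child-side
condition goes inside the nested child query, with filter() making it cacheable:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

// Sketch: query children, apply an extra child-side filter, return parents.
public class ChildFilterQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrClient solr =
        new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();
    try {
      SolrQuery q = new SolrQuery();
      // Parents whose children match the whole $childq clause;
      // which= selects the set of parent documents.
      q.setQuery("{!parent which=doc_type_s:parent v=$childq}");
      // The "real" child query plus an extra child-side condition; wrapping
      // the second clause in filter() lets it be cached in the filter cache.
      q.set("childq", "+child_text_t:shoes +filter(child_color_s:red)");

      QueryResponse rsp = solr.query(q);
      System.out.println("matching parents: " + rsp.getResults().getNumFound());
    } finally {
      solr.close();
    }
  }
}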

Regards

On Thu, Nov 3, 2016 at 10:13 PM, Tim Williams  wrote:

> I'm using the BlockJoinQuery to query child docs and return the
> parent.  I'd like to have the equivalent of a filter that applies to
> child docs and I don't see a way to do that with the BlockJoin stuffs.
> It looks like I could modify it to accept some childFilter param and
> add a QueryWrapperFilter right after the child query is created[1] but
> before I did that, I wanted to see if there's a built-in way to
> achieve the same behavior?
>
> Thanks,
> --tim
>
> [1] - https://github.com/apache/lucene-solr/blob/master/solr/
> core/src/java/org/apache/solr/search/join/BlockJoinParentQParser.java#L69
>



-- 
Sincerely yours
Mikhail Khludnev