Re: Using solr(cloud) as source-of-truth for data (with no backing external db)

2016-11-18 Thread Dorian Hoxha
@alex
That makes sense, but it can be ~fixed by just storing every field that you
need.

@Walter
Many of those things are missing from many nosql dbs yet they're used as
source of data.
As long as the backup is "point in time", meaning a consistent timestamp
across all shards, it ~should be OK for many use cases.

The one-line curl may need a patch so it can be disabled from config.
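
For reference, the kind of one-line wipe Walter mentions is a single call from SolrJ as well (URL and collection name are placeholders), which is why being able to disable it from config would matter:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class DeleteEverything {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and collection; deleteByQuery("*:*") matches every document.
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");
        client.deleteByQuery("*:*");
        client.commit();
        client.close();
    }
}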

On Thu, Nov 17, 2016 at 6:29 PM, Walter Underwood 
wrote:

> I agree, it is a bad idea.
>
> Solr is missing nearly everything you want in a repository, because it is
> not designed to be a repository.
>
> Does not have:
>
> * access control
> * transactions
> * transactional backup
> * dump and load
> * schema migration
> * versioning
>
> And so on.
>
> Also, I’m glad to share a one-line curl command that will delete all the
> documents
> in your collection.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Nov 17, 2016, at 1:20 AM, Alexandre Rafalovitch 
> wrote:
> >
> > I've heard of people doing it but it is not recommended.
> >
> > One of the biggest implementation breakthroughs is that - after the
> > initial learning curve - you will start mapping your input data to
> > signals. Those signals will not look very much like your original data
> > and therefore are not terribly suitable to be the source of it.
> >
> > We are talking copyFields, UpdateRequestProcessor pre-processing,
> > fields that are not stored, nested documents flattening,
> > denormalization, etc. Getting back from that to original shape of data
> > is painful.
> >
> > Regards,
> >   Alex.
> > 
> > Solr Example reading group is starting November 2016, join us at
> > http://j.mp/SolrERG
> > Newsletter and resources for Solr beginners and intermediates:
> > http://www.solr-start.com/
> >
> >
> > On 17 November 2016 at 18:46, Dorian Hoxha 
> wrote:
> >> Hi,
> >>
> >> Anyone use solr for source-of-data with no `normal` db (of course with
> >> normal backups/replication) ?
> >>
> >> Are there any drawbacks ?
> >>
> >> Thank You
>
>


Combined Dismax and Block Join Scoring on nested documents

2016-11-18 Thread Mike Allen
Apologies if I'm doing something incredibly stupid as I'm new to Solr. I am
having an issue with scoring child documents in a block join query when
including a dismax query. I'm actually a little unclear on whether or not
that's a complete oxymoron, combining dismax and block join.

 

Problem statement: Given a set of Product documents, which contain the
product names and descriptions and which have nested variant documents (see
below for an abridged example) carrying the boolean stock status
(in_stock) and the variant prices (list_price_gbp), I want to do a Dismax
query of, say, "skirt" on the product name (name) and sort the resulting
product documents by the minimum price (list_price_gbp) of their child
variant documents. Note that, although the abridged document doesn't show
them, there are a number of other arbitrary fields which may be used as
filter queries on the child documents, for example size or colour, which
will in effect change the "active" minimum price of a product. Hence,
denormalizing, or flattening, the documents is not really an option I want
to pursue.

 

An abridged example document returned by the Solr Admin Query console which
I am querying (the XML tags were stripped in the archive; field names are
inferred from the description above, and one field name could not be recovered):

<doc>
  <str name="id">12345</str>
  <str name="content_type">product</str>
  <str name="name">black flared skirt</str>
  <float name="...">40.0</float>
  <doc>
    <str name="id">12345abcd</str>
    <str name="productid">12345</str>
    <str name="content_type">variant</str>
    <float name="list_price_gbp">65.0</float>
    <bool name="in_stock">true</bool>
  </doc>
  <doc>
    <str name="id">12345fghi</str>
    <str name="productid">12345</str>
    <str name="content_type">variant</str>
    <float name="list_price_gbp">40.0</float>
    <bool name="in_stock">true</bool>
  </doc>
</doc>

 

So I am familiar with the block join score mode; setting aside the dismax
aspect for now, this query, using the Function Query {!func}list_price_gbp,
with score ascending, returns documents ordered correctly, with a £2.00
(cheapest) product first:

 

q={!parent which=content_type:product score=min}+(in_stock:(true)){!func}list_price_gbp
&doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))
&start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml

 

The "explain" for this is:

 

2.184 = Score based on 1 child docs in range from 26752 to 26752, best
match:

  2.184 = sum of:

1.8374416E-5 = weight(in_stock:T in 26752) [], result of:

  1.8374416E-5 = score(doc=26752,freq=1.0 = termFreq=1.0

), product of:

1.8374416E-5 = idf(docFreq=27211, docCount=27211)

1.0 = tfNorm, computed from:

  1.0 = termFreq=1.0

  1.2 = parameter k1

  0.0 = parameter b (norms omitted for field)

2.0 = FunctionQuery(float(list_price_gbp)), product of:

  2.0 = float(list_price_gbp)=2.0

  1.0 = boost

  1.0 = queryNorm

 

Even though this is doing what I want, I have a slight niggle that the
overall score is not just the result of the Function Query; however, as all
results get the same tiny fraction added, it doesn't matter.

 

However, when I prepend my dismax query:

 

q={!dismax v="skirt" qf="name"}+{!parent which=content_type:product score=min}+(in_stock:(true)){!func}list_price_gbp
&doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))
&start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml

 

The scoring is only dependent on the dismax scoring, where the "explain" for
this is:

 

2.7600822 = sum of:

  2.7600822 = weight(name:skirt in 13406) [], result of:

2.7600822 = score(doc=13406,freq=1.0 = termFreq=1.0

), product of:

  3.5851278 = idf(docFreq=103, docCount=3731)

  0.76987 = tfNorm, computed from:

1.0 = termFreq=1.0

1.2 = parameter k1

0.75 = parameter b

4.108818 = avgFieldLength

7.11 = fieldLength  



So in actual fact, with score ascending, it is ordering the results by least
matching first and the nested document list_price_gbp is irrelevant. I
strongly suspect I am being totally dumb and that this is expected behaviour
for an obvious reason that escapes me, apart from perhaps it's because the
two scoring methods are just plainly incompatible.

 

I have additionally tried just doing a lucene query instead:

 

q=+name:skirt +{!parent which=content_type:product score=min}(in_stock:(true)){!func}list_price_gbp
&doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))
&start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml

 

The "explain" of this indicates it's scoring products, for which
list_price_gbp simply does not exist, as the Function Query always returns
zero. 

 

3.6243963 = sum of:

  3.624396 = weight(name:skirt in 18113) [], result of:

3.624396 = score(doc=18113,freq=1.0 = termFreq=1.0

), product of:

  3.5851278 = idf(docFreq=103, docCount=3731)

Re: Using solr(cloud) as source-of-truth for data (with no backing external db)

2016-11-18 Thread Alexandre Rafalovitch
Sure. And the people do it. Especially for their first deployment. I
have some prototypes/proof-of-concepts like that myself.

Just later don't say you didn't ask and we didn't tell :-)

Regards,
Alex.

Solr Example reading group is starting November 2016, join us at
http://j.mp/SolrERG
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 18 November 2016 at 20:45, Dorian Hoxha  wrote:
> @alex
> That makes sense, but it can be ~fixed by just storing every field that you
> need.
>
> @Walter
> Many of those things are missing from many nosql dbs yet they're used as
> source of data.
> As long as the backup is "point in time", meaning consistent timestamp
> across all shards it ~should be ok for many usecases.
>
> The 1-line-curl may need a patch to be disabled from config.
>
> On Thu, Nov 17, 2016 at 6:29 PM, Walter Underwood 
> wrote:
>
>> I agree, it is a bad idea.
>>
>> Solr is missing nearly everything you want in a repository, because it is
>> not designed to be a repository.
>>
>> Does not have:
>>
>> * access control
>> * transactions
>> * transactional backup
>> * dump and load
>> * schema migration
>> * versioning
>>
>> And so on.
>>
>> Also, I’m glad to share a one-line curl command that will delete all the
>> documents
>> in your collection.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>> > On Nov 17, 2016, at 1:20 AM, Alexandre Rafalovitch 
>> wrote:
>> >
>> > I've heard of people doing it but it is not recommended.
>> >
>> > One of the biggest implementation breakthroughs is that - after the
>> > initial learning curve - you will start mapping your input data to
>> > signals. Those signals will not look very much like your original data
>> > and therefore are not terribly suitable to be the source of it.
>> >
>> > We are talking copyFields, UpdateRequestProcessor pre-processing,
>> > fields that are not stored, nested documents flattening,
>> > denormalization, etc. Getting back from that to original shape of data
>> > is painful.
>> >
>> > Regards,
>> >   Alex.
>> > 
>> > Solr Example reading group is starting November 2016, join us at
>> > http://j.mp/SolrERG
>> > Newsletter and resources for Solr beginners and intermediates:
>> > http://www.solr-start.com/
>> >
>> >
>> > On 17 November 2016 at 18:46, Dorian Hoxha 
>> wrote:
>> >> Hi,
>> >>
>> >> Anyone use solr for source-of-data with no `normal` db (of course with
>> >> normal backups/replication) ?
>> >>
>> >> Are there any drawbacks ?
>> >>
>> >> Thank You
>>
>>


json facet api and facet.threads

2016-11-18 Thread Michael Aleythe, Sternwald
Hi Everybody,

Can anyone point me in the right direction for using "facet.threads" with the
JSON faceting API? Does it only work if terms facets are exclusively used in
the query?

Best regards

Michael Aleythe
Java Developer | STERNWALD SYSTEMS GMBH




Re: Combined Dismax and Block Join Scoring on nested documents

2016-11-18 Thread Mikhail Khludnev
Hello Mike,
Structured queries in Solr are way cumbersome.
Start from:
q=+{!dismax v="skirt" qf="name"} +{!parent which=content_type:product
score=min v=childq}&childq=+in_stock:true^=0 {!func}list_price_gbp&...

Besides "explain", there is a parsed-query entry in the debug output that is more
useful for troubleshooting purposes.
Please also make sure that + is properly encoded as %2B so it gets through the
HTTP layer.
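
A rough SolrJ sketch of assembling that query, so the + signs and braces get encoded for you (URL and collection are placeholders, field names follow the example above, and the $ in v=$childq dereferences the childq request parameter):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BlockJoinDismaxSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/products");

        SolrQuery q = new SolrQuery();
        // Two mandatory clauses: the dismax match on the parent name, and the
        // block-join clause that takes the minimum score of the child docs.
        q.setQuery("+{!dismax v=\"skirt\" qf=\"name\"} "
                 + "+{!parent which=content_type:product score=min v=$childq}");
        q.set("childq", "+in_stock:true^=0 {!func}list_price_gbp");
        q.setSort("score", SolrQuery.ORDER.asc);
        q.set("debugQuery", "on");   // inspect the parsed query, not just the explain

        QueryResponse rsp = client.query(q);
        System.out.println("matches: " + rsp.getResults().getNumFound());
        client.close();
    }
}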

On Fri, Nov 18, 2016 at 2:14 PM, Mike Allen <
mike.al...@thecommercepartnership.com> wrote:

> Apologies if I'm doing something incredibly stupid as I'm new to Solr. I am
> having an issue with scoring child documents in a block join query when
> including a dismax query. I'm actually a little unclear on whether or not
> that's a complete oxymoron, combining dismax and block join.
>
>
>
> Problem statement: Given a set of Product documents - which contain the
> product names and descriptions - which contain nested variant documents
> (see
> below for abridged example) - which contain the boolean stock status
> (in_stock) and the variant prices (list_price_gbp) - I want to do a Dismax
> query of, say, "skirt" on the product name (name) and sort the resulting
> product documents by the minimum price (list_price_gbp) of their child
> variant documents. Note that, although the abridged document doesn't show
> them, there are a number of other arbitrary fields which may be used as
> filter queries on the child documents, for example size or colour, which
> will in effect change the "active" minimum price of a product. Hence,
> denormalizing, or flattening, the documents is not really an option I want
> to pursue.
>
>
>
> An abridged example document returned by the Solr Admin Query console which
> I am querying (XML tags stripped in the archive; field names inferred):
>
> <doc>
>   <str name="id">12345</str>
>   <str name="content_type">product</str>
>   <str name="name">black flared skirt</str>
>   <float name="...">40.0</float>
>   <doc>
>     <str name="id">12345abcd</str>
>     <str name="productid">12345</str>
>     <str name="content_type">variant</str>
>     <float name="list_price_gbp">65.0</float>
>     <bool name="in_stock">true</bool>
>   </doc>
>   <doc>
>     <str name="id">12345fghi</str>
>     <str name="productid">12345</str>
>     <str name="content_type">variant</str>
>     <float name="list_price_gbp">40.0</float>
>     <bool name="in_stock">true</bool>
>   </doc>
> </doc>
>
>
>
> So I am familiar with the block join score mode; setting aside the dismax
> aspect for now, this query, using the Function Query {!func}list_price_gbp,
> with score ascending, returns documents ordered correctly, with a £2.00
> (cheapest) product first:
>
>
>
> q={!parent which=content_type:product score=min}+(in_stock:(true)){!func}list_price_gbp
> &doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))
> &start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>
>
>
> The "explain" for this is:
>
>
>
> 2.184 = Score based on 1 child docs in range from 26752 to 26752, best
> match:
>
>   2.184 = sum of:
>
> 1.8374416E-5 = weight(in_stock:T in 26752) [], result of:
>
>   1.8374416E-5 = score(doc=26752,freq=1.0 = termFreq=1.0
>
> ), product of:
>
> 1.8374416E-5 = idf(docFreq=27211, docCount=27211)
>
> 1.0 = tfNorm, computed from:
>
>   1.0 = termFreq=1.0
>
>   1.2 = parameter k1
>
>   0.0 = parameter b (norms omitted for field)
>
> 2.0 = FunctionQuery(float(list_price_gbp)), product of:
>
>   2.0 = float(list_price_gbp)=2.0
>
>   1.0 = boost
>
>   1.0 = queryNorm
>
>
>
> Even though this is doing what I want, I have a slight niggle that the
> overall score is not just the result of the Function Query; however, as all
> results get the same tiny fraction added, it doesn't matter.
>
>
>
> However, when I prepend my dismax query:
>
>
>
> q={!dismax v="skirt" qf="name"}+{!parent which=content_type:product score=min}+(in_stock:(true)){!func}list_price_gbp
> &doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))
> &start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>
>
>
> The scoring is only dependent on the dismax scoring, where the "explain"
> for
> this is:
>
>
>
> 2.7600822 = sum of:
>
>   2.7600822 = weight(name:skirt in 13406) [], result of:
>
> 2.7600822 = score(doc=13406,freq=1.0 = termFreq=1.0
>
> ), product of:
>
>   3.5851278 = idf(docFreq=103, docCount=3731)
>
>   0.76987 = tfNorm, computed from:
>
> 1.0 = termFreq=1.0
>
> 1.2 = parameter k1
>
> 0.75 = parameter b
>
> 4.108818 = avgFieldLength
>
> 7.11 = fieldLength
>
>
>
> So in actual fact, with score ascending, it is ordering the results by
> least
> matching first and the nested document list_price_gbp is irrelevant. I
> strongly suspect I am being totally dumb and that this is expected
> behaviour
> for an obvious reason that escapes me, apart from perhaps it's because the
> two scoring methods are just plainly incompatible.

SolrJ bulk indexing documents - HttpSolrClient vs. ConcurrentUpdateSolrClient

2016-11-18 Thread Sebastian Riemer
Hi all,

I am looking to improve indexing speed when loading many documents as part of
an import. I am using the SolrJ client and currently I add the documents
one-by-one using HttpSolrClient and its method add(SolrInputDocument doc, int
commitWithinMs).

My first step would be to change that to use add(Collection<SolrInputDocument>
docs, int commitWithinMs) instead, which I expect would already improve
performance.
Does it matter which method I use? Besides the method taking a
Collection<SolrInputDocument> there is also one that takes an
Iterator<SolrInputDocument> ... and what about ConcurrentUpdateSolrClient?
Should I use it for bulk indexing instead of HttpSolrClient?

Currently we are on version 5.5.0 of Solr, and we don't run SolrCloud, i.e.
only one instance etc.
Indexing 39657 documents (which results in a core size of approx. 127MB) took
about 10 minutes with the one-by-one approach.

Best regards and thanks for any suggestions,

Sebastian Riemer



Re: SolrJ bulk indexing documents - HttpSolrClient vs. ConcurrentUpdateSolrClient

2016-11-18 Thread Shawn Heisey
On 11/18/2016 6:00 AM, Sebastian Riemer wrote:
> I am looking to improve indexing speed when loading many documents as part of 
> an import. I am using the SolrJ-Client and currently I add the documents 
> one-by-one using HttpSolrClient and  its method add(SolrInputDocument doc, 
> int commitWithinMs).

If you batch them (probably around 500 to 1000 at a time), indexing
speed will go up.  Below you have described the add methods used for
batching.

> My first step would be to change that to use
> add(Collection<SolrInputDocument> docs, int commitWithinMs) instead, which I
> expect would already improve performance.
> Does it matter which method I use? Besides the method taking a
> Collection<SolrInputDocument> there is also one that takes an
> Iterator<SolrInputDocument> ... and what about ConcurrentUpdateSolrClient?
> Should I use it for bulk indexing instead of HttpSolrClient?
>
> Currently we are on version 5.5.0 of solr, and we don't run SolrCloud, i.e. 
> only one instance etc.
> Indexing 39657 documents (which result in a core size of appr. 127MB) took 
> about 10 minutes with the one-by-one approach.

The concurrent client will send updates in parallel, without any
threading code in your own program, but there is one glaring
disadvantage -- indexing failures will be logged (via SLF4J), but your
program will NOT be informed about them, which means that the entire
Solr cluster could be down, and all your indexing requests will still
appear to succeed from your program's point of view.  Here's an issue I
filed on the problem.  It hasn't been fixed because there really isn't a
good solution.

https://issues.apache.org/jira/browse/SOLR-3284

The concurrent client swallows all exceptions that occur during add()
operations -- they are conducted in the background.  This might also
happen during delete operations, though I am unsure about that.  You
won't know about any problems unless those problems are still there when
your program tries an operation that can't happen in the background,
like commit or query.  If you're relying on automatic commits, your
indexing program might NEVER become aware of problems on the server end.

In a nutshell ... the concurrent client is great for initial bulk
loading (if and only if you don't need error detection), but not all
that useful for ongoing update activity that runs all the time.

If you set up multiple indexing threads in your own program, you can use
HttpSolrClient or CloudSolrClient with similar concurrent effectiveness
to the concurrent client, without sacrificing the ability to detect
errors during indexing.

Indexing 40K documents in batches should take very little time, and in
my opinion is not worth the disadvantages of the concurrent client, or
taking the time to write multi-threaded code.  If you reach the point
where you've got millions of documents, then you might want to consider
writing multi-threaded indexing code.
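
A minimal single-threaded sketch of that batched approach (URL, core name, field names, and batch size are illustrative):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchedIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");
        List<SolrInputDocument> batch = new ArrayList<>();

        for (int i = 0; i < 39657; i++) {          // stand-in for your import loop
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title", "document " + i);
            batch.add(doc);

            if (batch.size() == 1000) {            // send ~500-1000 docs per request
                client.add(batch, 60000);          // commitWithin 60 seconds
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            client.add(batch, 60000);              // flush the final partial batch
        }
        client.commit();
        client.close();
    }
}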

Thanks,
Shawn



Re: Bkd tree numbers/geo on solr 6.3 ?

2016-11-18 Thread Dorian Hoxha
Looks like it needs https://issues.apache.org/jira/browse/SOLR-8396 .

On Thu, Nov 17, 2016 at 2:41 PM, Dorian Hoxha 
wrote:

> Hi,
>
> I've read that Lucene 6 has a fancy bkd-tree implementation for numbers. But
> on the latest cwiki I only see TrieNumbers. Aren't they implemented, or did I
> miss something (they still mention "indexing multiple values for
> range-queries", which is the old way)?
>
> Thank You
>


Data Import Request Handler isolated into its own project - any suggestions?

2016-11-18 Thread Marek Ščevlík
Hello. My name is Marek Scevlik.



Currently I am working for a small company where we are interested in
implementing your Solr 6.3 search engine.



We are hoping to pull the Data Import Request Handler out of the original
source package into its own project and create a usable .jar file out of
it.



It should then serve as a tool that would allow us to connect to a remote
server and return data to our other application, which would use the returned
data.



What do you think? Would anything like this be possible? To isolate the
Data Import Request Handler into its own standalone project?



If we could achieve this, we wouldn't mind sharing this new feature with the
community.



I realize this is a first email and may lead to several hundred more, so for
a start my request is very simple and not highly detailed, but I am
sure you realize it may turn out to be quite complex.



So I wonder if anyone replies.



Thanks a lot for any replies and further info or guidance.





Thanks.

Regards Marek Scevlik


RE: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-18 Thread Davis, Daniel (NIH/NLM) [C]
Marek,

I've wanted to do something like this in the past as well.  However, a rewrite 
that supports the same XML syntax might be better.   There are several problems 
with the design of the Data Import Handler that make it not quite suitable:

- Not designed for Multi-threading
- Bad implementation of XPath

Another issue is that one of the big advantages of Data Import Handler goes 
away at this point, which is that it is hosted within Solr, and has a UI for 
testing within the Solr admin.

A better open-source Java solution might be to connect Solr with Apache Camel - 
http://camel.apache.org/solr.html.

If you are not tied absolutely to pure open-source, and freemium products will 
do, then you might look at Pentaho Spoon and Kettle.   Although Talend is much 
more established in the market, I find Pentaho's XML-based ETL a bit easier to 
integrate as a developer, and unit test and such.   Talend does better when you 
have a full infrastructure set up, but then the attention required to unit 
tests and Git integration seems over the top.

Another powerful way to get things done, depending on what you are indexing, is 
to use LogStash and couple that with Document processing chains.   Many of our 
projects benefit from having a single RDBMS view, perhaps a materialized view, 
that is used for the index.   LogStash does just fine here, pulling from the 
RDBMS and posting each row to Solr.  The hierarchical execution of Data Import 
Handler is very nice, but this can often be handled on the RDBMS side by 
creating a view, maybe using functions to provide some rows.   Many RDBMS 
systems also support federation and the import of XML from files, so that this 
brings XML processing into the picture.

Hoping this helps,

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH




-Original Message-
From: Marek Ščevlík [mailto:mscev...@codenameprojects.com] 
Sent: Friday, November 18, 2016 9:29 AM
To: solr-user@lucene.apache.org
Subject: Data Import Request Handler isolated into its own project - any 
suggestions?

Hello. My name is Marek Scevlik.



Currently I am working for a small company where we are interested in 
implementing your Solr 6.3 search engine.



We are hoping to take out from the original source package the Data Import 
Request Handler into its own project and create a usable .jar file out of it.



It should then serve as tool that would allow to connect to a remote server and 
return data for us to our other application that would use the returned data.



What do you think? Would anything like this possible? To isolate out the Data 
Import Request Handler into its own standalone project?



If we could achieve this we won’t mind to share with the community this new 
feature.



I realize this is a first email and may lead into several hundreds so for the 
start my request is very simple and not so high level detailed but I am sure 
you realize it may lead into being quite complex.



So I wonder if anyone replies.



Thanks a lot for any replies and further info or guidance.





Thanks.

Regards Marek Scevlik


Re: field set up help

2016-11-18 Thread Comcast
Perfect. Just had to wrap the PHP curl request URL with urlencode and it worked.
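
For reference, the same encoding step sketched in Java rather than PHP (host and collection are placeholders); it just shows what the encoded q parameter ends up looking like:

import java.net.URLEncoder;

public class EncodedPrefixQuery {
    public static void main(String[] args) throws Exception {
        String q = "{!prefix f=metatag.date}2016-10";
        String url = "http://localhost:8983/solr/mycollection/select"
                   + "?q=" + URLEncoder.encode(q, "UTF-8")
                   + "&wt=json";
        // Prints ...?q=%7B%21prefix+f%3Dmetatag.date%7D2016-10&wt=json
        System.out.println(url);
    }
}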

Sent from my iPhone

> On Nov 17, 2016, at 5:56 PM, Kris Musshorn  wrote:
> 
> This q={!prefix f=metatag.date}2016-10 returns zero records
> 
> -Original Message-
> From: KRIS MUSSHORN [mailto:mussho...@comcast.net] 
> Sent: Thursday, November 17, 2016 3:00 PM
> To: solr-user@lucene.apache.org
> Subject: Re: field set up help
> 
> so if the field was named metatag.date q={!prefix f=metatag.date}2016-10 
> 
> - Original Message -
> 
> From: "Erik Hatcher"  
> To: solr-user@lucene.apache.org 
> Sent: Thursday, November 17, 2016 2:46:32 PM 
> Subject: Re: field set up help 
> 
> Given what you’ve said, my hunch is you could make the query like this: 
> 
>q={!prefix f=field_name}2016-10 
> 
> tada!  ?! 
> 
> there’s nothing wrong with indexing dates as text like that, as long as your 
> queries are performantly possible.   And in the case of the query type you 
> mentioned, the text/string’ish indexing you’ve done is suited quite well to 
> prefix queries to grab dates by year, year-month, and year-month-day.   But 
> if you need to get more sophisticated with date queries 
> (DateRangeField is my new favorite), you can leverage 
> ParseDateFieldUpdateProcessorFactory without having to change the incoming 
> format. 
> 
>Erik 
> 
> 
> 
> 
>> On Nov 17, 2016, at 1:55 PM, KRIS MUSSHORN  wrote: 
>> 
>> 
>> I have a field in solr 5.4.1 that has values like: 
>> 2016-10-15 
>> 2016-09-10 
>> 2015-10-12 
>> 2010-09-02 
>> 
>> Yes it is a date being stored as text. 
>> 
>> I am getting the data onto solr via nutch and the metatag plug in. 
>> 
>> The data is coming directly from the website I am crawling and I am not able 
>> to change the data at the source to something more palpable. 
>> 
>> The field is set in solr to be of type TextField that is indexed, tokenized, 
>> stored, multivalued and norms are omitted. 
>> 
>> Both the index and query analysis chains contain just the whitespace 
>> tokenizer factory and the lowercase filter factory. 
>> 
>> I need to be able to query for 2016-10 and only match 2016-10-15. 
>> 
>> Any ideas on how to set this up? 
>> 
>> TIA 
>> 
>> Kris   
>> 
> 
> 
> 



Re: Detecting schema errors while adding documents

2016-11-18 Thread Shawn Heisey
On 11/16/2016 11:02 AM, Mike Thomsen wrote:
> We're stuck on Solr 4.10.3 (Cloudera bundle). Is there any way to detect
> with SolrJ when a document added to the index violated the schema? All we
> see when we look at the stacktrace for the SolrException that comes back is
> that it contains messages about an IOException when talking to the solr
> nodes. Solr is up and running, and the documents are only invalid because I
> added a Java statement to make a field invalid for testing purposes. When I
> remove that statement, the indexing happens just fine.
>
> Any way to do this? I seem to recall that at least in newer versions of
> Solr it would tell you more about the specific error.

What *exactly* are you trying to get SolrJ/Solr to tell you that it
isn't telling you?  Erick's response has information for one possible
scenario you might be describing.

Using the 4.10.3 client, trying to add a document with an unknown field,
I get very specific and relevant messages like the following from both
HttpSolrServer and CloudSolrServer:

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
ERROR: [doc=123] unknown field 'florj'
at
org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at
org.apache.solr.client.solrj.impl.LBHttpSolrServer.doRequest(LBHttpSolrServer.java:340)
at
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:301)
at
org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:659)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
at org.elyograg.Flubber.main(Flubber.java:44)

(this specific stacktrace came from using a 4.10.3 client with SolrCloud
running 4.2.1 -- so my CloudSolrServer object had to be configured to
use xml instead of javabin)

When I adjusted the code to send a collection of docs instead of a
single doc, with one good doc and one bad doc, I got the same message,
with the uniqueKey field value from the bad document.

For newer versions, there is an issue where the load balancing client
(used by the cloud client) wraps *any* problem in an exception that just
says "No live SolrServers available to handle this request" ... but that
doesn't seem to be a problem in SolrJ 4.10.3.  The problem was probably
introduced by the big changes for 5.0.

https://issues.apache.org/jira/browse/SOLR-7951

If you are running into SOLR-7951 (or any other bug), it will NOT be
fixed in any 4.x version.  Development on 4.x has ceased entirely. 
There's a good chance it won't even be fixed in 5.x, but only in a new
6.x version.  I have no idea when Cloudera might update the version of
Solr that they include.

Note that even on versions affected by SOLR-7951, you'd still be able to
see the actual problem exception, because it's still there, as the cause
of the outer exception.
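
A rough sketch of surfacing that cause chain with the 4.10.3 client (zkHost and collection are placeholders; "florj" is the deliberately unknown field from the trace above):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AddWithErrorReporting {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181/solr");
        server.setDefaultCollection("mycollection");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "123");
        doc.addField("florj", "value for an unknown field");   // deliberately bad

        try {
            server.add(doc);
            server.commit();
        } catch (Exception e) {
            // Walk the cause chain; even when the real error is wrapped
            // (as with SOLR-7951 in later versions), it is still there as a cause.
            for (Throwable t = e; t != null; t = t.getCause()) {
                System.err.println(t.getClass().getSimpleName() + ": " + t.getMessage());
            }
        } finally {
            server.shutdown();
        }
    }
}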

It's always possible that Cloudera has embedded a layer on top of Solr
or SolrJ that gets rid of the meaningful messages that Solr normally
returns.  We'll need the actual entire stacktrace and error message
you're seeing.

Thanks,
Shawn



Re: SolrJ bulk indexing documents - HttpSolrClient vs. ConcurrentUpdateSolrClient

2016-11-18 Thread Erick Erickson
Here's some numbers for batching improvements:

https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/

And I totally agree with Shawn that for 40K documents anything more
complex is probably overkill.

Best,
Erick

On Fri, Nov 18, 2016 at 6:02 AM, Shawn Heisey  wrote:
> On 11/18/2016 6:00 AM, Sebastian Riemer wrote:
>> I am looking to improve indexing speed when loading many documents as part 
>> of an import. I am using the SolrJ-Client and currently I add the documents 
>> one-by-one using HttpSolrClient and  its method add(SolrInputDocument doc, 
>> int commitWithinMs).
>
> If you batch them (probably around 500 to 1000 at a time), indexing
> speed will go up.  Below you have described the add methods used for
> batching.
>
>> My first step would be to change that to use
>> add(Collection<SolrInputDocument> docs, int commitWithinMs) instead, which I
>> expect would already improve performance.
>> Does it matter which method I use? Besides the method taking a
>> Collection<SolrInputDocument> there is also one that takes an
>> Iterator<SolrInputDocument> ... and what about ConcurrentUpdateSolrClient?
>> Should I use it for bulk indexing instead of HttpSolrClient?
>>
>> Currently we are on version 5.5.0 of solr, and we don't run SolrCloud, i.e. 
>> only one instance etc.
>> Indexing 39657 documents (which result in a core size of appr. 127MB) took 
>> about 10 minutes with the one-by-one approach.
>
> The concurrent client will send updates in parallel, without any
> threading code in your own program, but there is one glaring
> disadvantage -- indexing failures will be logged (via SLF4J), but your
> program will NOT be informed about them, which means that the entire
> Solr cluster could be down, and all your indexing requests will still
> appear to succeed from your program's point of view.  Here's an issue I
> filed on the problem.  It hasn't been fixed because there really isn't a
> good solution.
>
> https://issues.apache.org/jira/browse/SOLR-3284
>
> The concurrent client swallows all exceptions that occur during add()
> operations -- they are conducted in the background.  This might also
> happen during delete operations, though I am unsure about that.  You
> won't know about any problems unless those problems are still there when
> your program tries an operation that can't happen in the background,
> like commit or query.  If you're relying on automatic commits, your
> indexing program might NEVER become aware of problems on the server end.
>
> In a nutshell ... the concurrent client is great for initial bulk
> loading (if and only if you don't need error detection), but not all
> that useful for ongoing update activity that runs all the time.
>
> If you set up multiple indexing threads in your own program, you can use
> HttpSolrClient or CloudSolrClient with similar concurrent effectiveness
> to the concurrent client, without sacrificing the ability to detect
> errors during indexing.
>
> Indexing 40K documents in batches should take very little time, and in
> my opinion is not worth the disadvantages of the concurrent client, or
> taking the time to write multi-threaded code.  If you reach the point
> where you've got millions of documents, then you might want to consider
> writing multi-threaded indexing code.
>
> Thanks,
> Shawn
>


Best Way to Read A Nested Structure from Solr?

2016-11-18 Thread Jennifer Coston
Hello,

I am sure there have been many discussions on the best way to do this, but I am 
lost and need your advice. I have a nested Solr Document containing multiple 
levels of sub-documents. Here is a JSON example so you can see the full 
structure:

{
  "id": "Test Library",
  "description": "example of nested document",
  "content_type": "library",
  "authors": [{
  "id": "author1",
  "content_type": "author",
  "name": "First Author",
  "books": {
"id": "book1",
"content_type": "book",
"title": "title of book 1"
  },
  "shortStories": {
"id": "shortStory1",
"content_type": "shortStory",
"title": "title of short story 1"
  }
  },
  {
  "id": "author2",
  "content_type": "author",
  "name": "Second Author",
  "books": {
"id": "book1",
"content_type": "book",
"title": "title of book 1"
  },
  "shortStories": {
"id": "shortStory1",
"content_type": "shortStory",
"title": "title of short story 1"
  }
}]
}

I want to query for a document and retrieve the nested structure. I tried using
the ChildDocumentTransformerFactory but it flattened the result to just the
Library document with all other documents as its children:

{
  "id": "Test Library",
  "description": "example of nested document",
  "content_type": "library",
  "_childDocuments_":[
{"id": "author1",
  "content_type": "author",
  "name": "First Author"
},
{"id": "book1",
"content_type": "book",
"title": "title of book 1"
},
{
"id": "shortStory1",
"content_type": "shortStory",
"title": "title of short story 1"
},
{
 "id": "author2",
 "content_type": "author",
"name": "Second Author"
},
{
"id": "book1",
"content_type": "book",
"title": "title of book 1"
},
{
"id": "shortStory1",
"content_type": "shortStory",
"title": "title of short story 1"
}
   ]
}

Here are the query parameters I used:
q={!parent which='content_type:library'}
df=id
fl=*,[child parentFilter='content_type:library' childFilter='id:*']
wt=json
indent=true

What is the best way to read the nested structure from Solr? Do I need to do 
some sort of faceting?

Thank you,

Jennifer Coston

P.S. I am using Solr version 5.2.1


Index and search on PDF text using Solr

2016-11-18 Thread vascaino90
Hello, I'm new to Solr and I have a big problem.
I have many text documents in PDF format (more than 1) and I need to
create a site with these PDFs. On this site, I have to create a search by any
terms in these PDFs.
I don't have any idea how to start.
Can anyone help me?

Thank you so much.





Re: Index and search on PDF text using Solr

2016-11-18 Thread Erick Erickson
see the section in the Solr Reference Guide: "Uploading Data with Solr
Cell using Apache Tika" here:

https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

to get a start.

The basic idea is to use Apache Tika to parse the PDF file and then
stuff the data into Solr. There are a
lot of tweaks you'll need to do, particularly mapping the meta-data
fields to Solr fields, but the above
should get you started. Once you get that operating, you can refine
your approach.

I'm personally not a fan of doing all this on the Solr server in a
_production_ environment unless it's a
one-time operation; here's a writeup of why I think that and a model
Java program that'd allow you to
do this on a Java client. It uses some older Solr classes (i.e.
CloudSolrServer, not CloudSolrClient)
but it should give you a starting place if you want to do something
similar. It has both a database
bit and a Tika bit but the database bits can just be taken out,
there's nothing about parsing the files
with Tika that requires it.

https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
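
A bare-bones sketch of that client-side approach, Tika plus SolrJ (URL, file path, and field names are placeholders; your schema needs matching fields):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class PdfIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/pdfs");
        AutoDetectParser parser = new AutoDetectParser();

        Path pdf = Paths.get("/data/pdfs/example.pdf");              // placeholder path
        try (InputStream in = Files.newInputStream(pdf)) {
            BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
            Metadata meta = new Metadata();
            parser.parse(in, handler, meta, new ParseContext());     // Tika extracts the text

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", pdf.getFileName().toString());
            doc.addField("title", meta.get("title"));   // map whatever metadata you care about
            doc.addField("text", handler.toString());   // the extracted body text
            client.add(doc, 10000);
        }
        client.commit();
        client.close();
    }
}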

Best,
Erick

On Fri, Nov 18, 2016 at 10:14 AM, vascaino90  wrote:
> Hello, i'm new in Solr and i have a big problem.
> I have many text documents in PDF format (more than 1) and I need to
> create a site with this PDFs. In this site, I have to create a search by any
> terms in this PDFs.
> I don't have idea how to start.
> Anyone can help me?
>
> Thank you so much.
>
>
>


Re: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-18 Thread Alexandre Rafalovitch
Is your goal to still index into Solr? It was not clear.

If yes, then it has been discussed quite a bit. The challenge is that
DIH is integrated into AdminUI, which makes it easier to see the
progress and set some flags. Plus the required jars are loaded via
solrconfig.xml, just like all other extra libraries. So, contribution
back would need to take that into account.

If you are not ready to face that, it may make sense to look at other
libraries first. Apache Camel, Apache NiFi, Cloudera morphline, etc.
All of them can send data into Solr, though their version support
differ. For example Camel seems to need Solr 3.5 still. Somebody
updating their implementation to Solr 6.3 and contributing that back
to that project would do a lot of good.

Regards,
Alex.

Solr Example reading group is starting November 2016, join us at
http://j.mp/SolrERG
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 19 November 2016 at 01:29, Marek Ščevlík
 wrote:
> Hello. My name is Marek Scevlik.
>
>
>
> Currently I am working for a small company where we are interested in
> implementing your Solr 6.3 search engine.
>
>
>
> We are hoping to take out from the original source package the Data Import
> Request Handler into its own project and create a usable .jar file out of
> it.
>
>
>
> It should then serve as tool that would allow to connect to a remote server
> and return data for us to our other application that would use the returned
> data.
>
>
>
> What do you think? Would anything like this possible? To isolate out the
> Data Import Request Handler into its own standalone project?
>
>
>
> If we could achieve this we won’t mind to share with the community this new
> feature.
>
>
>
> I realize this is a first email and may lead into several hundreds so for
> the start my request is very simple and not so high level detailed but I am
> sure you realize it may lead into being quite complex.
>
>
>
> So I wonder if anyone replies.
>
>
>
> Thanks a lot for any replies and further info or guidance.
>
>
>
>
>
> Thanks.
>
> Regards Marek Scevlik


CloudSolrClient$RouteException: Cannot talk to ZooKeeper - Updates are disabled.

2016-11-18 Thread Chetas Joshi
Hi,

I have a SolrCloud (on HDFS) of 50 nodes and a ZK quorum of 5 nodes. The
SolrCloud is having difficulties talking to ZK when I am ingesting data
into the collections. At that time I am also running queries (that return
millions of docs). The ingest job is failing with the following exception:

org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error
from server at http://xxx/solr/collection1_shard15_replica1: Cannot talk to
ZooKeeper - Updates are disabled.

I think this is happening when the ingest job is trying to update the
clusterstate.json file but the query is reading from that file and thus has
some kind of a lock on that file. Are there any factors that will cause the
"READ" to acquire lock for a long time? Is my understanding correct? I am
using the cursor approach using SolrJ to get back results from Solr.

How often is the ZK updated with the latest cluster state and what
parameter governs that? Should I just increase the ZK client timeout so
that it retries connecting to the ZK for a longer period of time (right now
it is 15 seconds)?

Thanks!


Re: CloudSolrClient$RouteException: Cannot talk to ZooKeeper - Updates are disabled.

2016-11-18 Thread Erick Erickson
The clusterstate on Zookeeper shouldn't be changing
very often, only when nodes come and go.

bq: At that time I am also running queries (that return
millions of docs).

As in rows=millions? This is an anti-pattern; if that's true
then you're probably network saturated and the like. If
you mean your numFound is millions, then this is unlikely
to be a problem.

you say "clusterstate.json", which indicates you're on
4x? This has been changed to make a state.json for
each collection, so either you upgraded sometime and
didn't transform your ZK (there's a command to do that)
or can you upgrade?

What I'm guessing is that you have too much going on
somehow and you're overloading your system and
getting a timeout. So increasing the timeout
is definitely a possibility, or reducing the ingestion load
as a test.
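
On the SolrJ side, a minimal sketch of where those timeouts can be raised (values and zkHost are illustrative; the server-side equivalent is zkClientTimeout in solr.xml):

import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class ZkTimeoutSketch {
    public static void main(String[] args) {
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");
        client.setZkClientTimeout(30000);    // how long a ZK session may stay unresponsive
        client.setZkConnectTimeout(30000);   // how long to wait for the initial connection
        client.setDefaultCollection("collection1");
        // ... run updates/queries as usual ...
    }
}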

Best,
Erick

On Fri, Nov 18, 2016 at 4:51 PM, Chetas Joshi  wrote:
> Hi,
>
> I have a SolrCloud (on HDFS) of 50 nodes and a ZK quorum of 5 nodes. The
> SolrCloud is having difficulties talking to ZK when I am ingesting data
> into the collections. At that time I am also running queries (that return
> millions of docs). The ingest job is crying with the the following exception
>
> org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error
> from server at http://xxx/solr/collection1_shard15_replica1: Cannot talk to
> ZooKeeper - Updates are disabled.
>
> I think this is happening when the ingest job is trying to update the
> clusterstate.json file but the query is reading from that file and thus has
> some kind of a lock on that file. Are there any factors that will cause the
> "READ" to acquire lock for a long time? Is my understanding correct? I am
> using the cursor approach using SolrJ to get back results from Solr.
>
> How often is the ZK updated with the latest cluster state and what
> parameter governs that? Should I just increase the ZK client timeout so
> that it retries connecting to the ZK for a longer period of time (right now
> it is 15 seconds)?
>
> Thanks!


Re: CloudSolrClient$RouteException: Cannot talk to ZooKeeper - Updates are disabled.

2016-11-18 Thread Chetas Joshi
Thanks Erick.

The numFound is millions but I was also trying with rows= 1 Million. I will
reduce it to 500K.

I am sorry. It is state.json. I am using Solr 5.5.0

One of the things I am not able to understand is why my ingestion job is
complaining about "Cannot talk to ZooKeeper - Updates are disabled."

I have a spark streaming job that continuously ingests into Solr. My shards
are always up and running. The moment I start a query on SolrCloud it
starts running into this exception. However as you said ZK will only update
the state of the cluster when the shards go down. Then why is my job trying
to contact ZK when the cluster is up, and why is the exception about
updating ZK?


On Fri, Nov 18, 2016 at 5:11 PM, Erick Erickson 
wrote:

> The clusterstate on Zookeeper shouldn't be changing
> very often, only when nodes come and go.
>
> bq: At that time I am also running queries (that return
> millions of docs).
>
> As in rows=millions? This is an anti-pattern; if that's true
> then you're probably network saturated and the like. If
> you mean your numFound is millions, then this is unlikely
> to be a problem.
>
> you say "clusterstate.json", which indicates you're on
> 4x? This has been changed to make a state.json for
> each collection, so either you upgraded sometime and
> didn't transform you ZK (there's a command to do that)
> or can you upgrade?
>
> What I'm guessing is that you have too much going on
> somehow and you're overloading your system and
> getting a timeout. So increasing the timeout
> is definitely a possibility, or reducing the ingestion load
> as a test.
>
> Best,
> Erick
>
> On Fri, Nov 18, 2016 at 4:51 PM, Chetas Joshi 
> wrote:
> > Hi,
> >
> > I have a SolrCloud (on HDFS) of 50 nodes and a ZK quorum of 5 nodes. The
> > SolrCloud is having difficulties talking to ZK when I am ingesting data
> > into the collections. At that time I am also running queries (that return
> > millions of docs). The ingest job is crying with the the following
> exception
> >
> > org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error
> > from server at http://xxx/solr/collection1_shard15_replica1: Cannot
> talk to
> > ZooKeeper - Updates are disabled.
> >
> > I think this is happening when the ingest job is trying to update the
> > clusterstate.json file but the query is reading from that file and thus
> has
> > some kind of a lock on that file. Are there any factors that will cause
> the
> > "READ" to acquire lock for a long time? Is my understanding correct? I am
> > using the cursor approach using SolrJ to get back results from Solr.
> >
> > How often is the ZK updated with the latest cluster state and what
> > parameter governs that? Should I just increase the ZK client timeout so
> > that it retries connecting to the ZK for a longer period of time (right
> now
> > it is 15 seconds)?
> >
> > Thanks!
>


Re: CloudSolrClient$RouteException: Cannot talk to ZooKeeper - Updates are disabled.

2016-11-18 Thread Shawn Heisey
On 11/18/2016 6:50 PM, Chetas Joshi wrote:
> The numFound is millions but I was also trying with rows= 1 Million. I will 
> reduce it to 500K.
>
> I am sorry. It is state.json. I am using Solr 5.5.0
>
> One of the things I am not able to understand is why my ingestion job is
> complaining about "Cannot talk to ZooKeeper - Updates are disabled."
>
> I have a spark streaming job that continuously ingests into Solr. My shards 
> are always up and running. The moment I start a query on SolrCloud it starts 
> running into this exception. However as you said ZK will only update the 
> state of the cluster when the shards go down. Then why my job is trying to 
> contact ZK when the cluster is up and why is the exception about updating ZK?

SolrCloud and SolrJ (CloudSolrClient) both maintain constant connections
to all the zookeeper servers they are configured to use.  If zookeeper
quorum is lost, SolrCloud will go read-only -- no updating is possible. 
That is what is meant by "updates are disabled."

Solr and Lucene are optimized for very low rowcounts, typically two or
three digits.  Asking for hundreds of thousands of rows is problematic. 
The cursorMark feature is designed for efficient queries when paging
deeply into results, but it assumes your rows value is relatively small,
and that you will be making many queries to get a large number of
results, each of which will be fast and won't overload the server.
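
A rough sketch of that cursorMark loop in SolrJ (zkHost, collection, and rows value are illustrative; the sort must include the uniqueKey field):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorMarkSketch {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");
        client.setDefaultCollection("collection1");

        SolrQuery q = new SolrQuery("*:*");
        q.setRows(1000);                               // keep each page small
        q.setSort(SolrQuery.SortClause.asc("id"));     // "id" assumed to be the uniqueKey

        String cursorMark = CursorMarkParams.CURSOR_MARK_START;
        boolean done = false;
        while (!done) {
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
            QueryResponse rsp = client.query(q);
            // ... process rsp.getResults() ...
            String next = rsp.getNextCursorMark();
            done = cursorMark.equals(next);            // no progress means we are finished
            cursorMark = next;
        }
        client.close();
    }
}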

Since it appears you are having a performance issue, here's a few things
I have written on the topic:

https://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn