Re: Big SolrCloud cluster with a lot of collections

2015-08-16 Thread yura last
Thanks for your answers. Currently I have one machine (6 cores, 148 GB RAM, 2.5
TB HDD) and I index around 60 million documents per day - the index size is
around 26GB. I do have a customer-ID today and I use it for the queries. I don't
split the customers, but I get bad performance.
If I make a small collection for each customer, then I know to query only
those collections and I get better performance - the indexes are smaller and
Solr doesn't need to keep the other customers' data in memory. I checked
it and the performance is much better.
I do have 1 billion documents today but I can't index them - so it is a real
requirement for today to be able to index 1 billion and keep the data for 90
days. We want to grow and to support more customers, so I want to understand what
design I need for 10 billion per day.
I will think about whether I can split the customers into clusters and merge the
results myself - it is a good idea. Thanks for the advice.
What is better - 1 powerful machine or a few smaller ones? For example - one machine
with 12 cores, 256GB RAM and 2.5 TB, or 5 machines each with 4 cores, 32 GB and 0.5 TB?
Thanks, Yuri


 On Saturday, August 15, 2015 5:53 PM, Toke Eskildsen 
 wrote:
   

 yura last  wrote:
> Hi All, I am testing a SolrCloud with many collections. The version is 5.2.1
> and I installed 3 machines – each one with 4 cores and 8 GB RAM. Then I
> created collections with 3 shards and a replication factor of 2. That gives me 2
> cores per collection on each machine. I reached almost 900 collections
> and then the cluster got stuck and I couldn’t revive it.

That mirrors what others are reporting.

> As I understand, Solr has issues with many collections (thousands). If I
> use many more machines – will that give me the ability to create
> tens of thousands of collections, or is the limit a couple of thousand?

(Caveat: I have no real world experience with high collection count in Solr)

Adding more machines will not really help you as the problem with thousands of 
collections is not hardware power per se, but rather the coordination of them. 
You mention 180K collections below and with the current Solr architecture, I do 
not see that happening.

> I want to build a cluster that will handle 10 billion documents per day
> (currently I have 1 billion) and keep the data for 90 days.

Are those real requirements or something somebody hopes will come true some
years down the road? Technology has a habit of catching up, and while a 900
billion document setup is a challenge today, it will probably be a lot easier in 5
years.

While we are discussing this, it would help if you could also approximate the
index size in bytes. How large do you expect the sum of shards for 1 billion of
your documents to be? Likewise, which kind of queries do you expect? Grouping?
Faceting? All these things multiply.

Anyway, your requirements are in a league where there is not much collective
experience. You will definitely have to build a serious prototype or three to
get a proper idea of how much power you need: the standard advice for scaling
Solr does not make economic sense beyond a point. But you seem to have
started that process already with your current tests.

> I want to support 2000 customers, so I would like to split them into collections
> and also to split by days. (180,000 collections)

As 180,000 collections currently seems infeasible for a single SolrCloud, you 
should consider alternatives:

1) If your collections are independent, then build fully independent clusters 
of machines.

2) Don't use collections for dividing data between your customers. Use a field 
with a customer-ID or something like that.
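
A minimal sketch of option 2, assuming a shared collection named "shared" and a
customer_id field (both names are illustrative): each query is then restricted
with a filter query, which Solr also caches in the filterCache:

# "shared" and customer_id are made-up names
curl 'http://localhost:8983/solr/shared/query?q=*:*&fq=customer_id:12345&rows=10'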

> If I create big collections I will have performance issues with queries,
> and also most of the queries are for a specific customer.

Why would many smaller collections have better performance than fewer larger 
collections?

> (I also have cross-customer queries)

If you make independent setups, that could be solved by querying them
independently and doing the merging yourself.
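
For example (hosts, collection names and query fields are all hypothetical), a
cross-customer query could fan out to the independent setups and be merged
client-side:

# hypothetical hosts and per-customer collections
curl -s 'http://cluster-a:8983/solr/cust_100/query?q=payment+failed&rows=50&fl=id,score' > a.json
curl -s 'http://cluster-b:8983/solr/cust_200/query?q=payment+failed&rows=50&fl=id,score' > b.json
# merge the two result lists yourself, e.g. by score or a timestamp field, keep the top N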

- Toke Eskildsen


  

Re: Cache for percentiles facets

2015-08-16 Thread Håvard Wahl Kongsgård
Hi, just a general question as I was unable to find any old posts relating
to stats/percentile/facets performance/cache settings.

I have been using Solr since version 4.0, and am now using the latest, v. 5.2.1.

What I have done:

- Increased heap memory to 30GB

- Experimented with the cache settings

- Merged segments

- Used docvalues as filter

- Tried with a ramdrive for the index as well

- The field I calculate the percentile on is type int; there seems to be a big
performance difference between int and float/decimal etc.

The database consists of multiple sets with 5 million rows. I calculate facet
stats for a field, filtered by those sets. My fields are indexed, not stored.

The queries are basic:

curl http://localhost:8983/solr/demo/query -d 'rows=0&fq=set_id:id_of_set&q=*:*&
json.facet={
  by_something:{terms:{
    field:myfield,
    facet:{
      median_value:"percentile(myvalue_field,50)"
    }
  }}
}'


As a quick fix I created a cache in redis ;)
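
A minimal sketch of that kind of read-through cache, assuming redis-cli is
available locally and keying on the filter plus the facet definition (the key
layout and TTL are made up):

# key derived from the filter, field and percentile being asked for
key="pctl:$(echo -n 'set_id:id_of_set|myfield|myvalue_field|50' | md5sum | cut -d' ' -f1)"
cached=$(redis-cli GET "$key")
if [ -z "$cached" ]; then
  cached=$(curl -s http://localhost:8983/solr/demo/query -d 'rows=0&fq=set_id:id_of_set&q=*:*&json.facet={by_something:{terms:{field:myfield,facet:{median_value:"percentile(myvalue_field,50)"}}}}')
  redis-cli SETEX "$key" 3600 "$cached" > /dev/null  # keep for an hour
fi
echo "$cached"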

-Håvard


On Sat, Aug 15, 2015 at 10:26 PM, Erick Erickson 
wrote:

> You have to provide a lot more info about your problem, including
> what you've tried, what your data looks like, etc.
>
> You might review:
> http://wiki.apache.org/solr/UsingMailingLists
>
> Best,
> Erick
>
> On Sat, Aug 15, 2015 at 10:27 AM, Håvard Wahl Kongsgård
>  wrote:
> > Hi, I have tried various options to speed up percentile calculation for
> > facets. But the internal solr cache only speeds up my queries from 22 to
> > 19 sec.
> >
> > I'm using the new json facets http://yonik.com/json-facet-api/
> >
> > Any tips for caching stats?
> >
> >
> > -Håvard
>


Re: Big SolrCloud cluster with a lot of collections

2015-08-16 Thread Toke Eskildsen
yura last  wrote:
> I have one machine (6 cores, 148 GB RAM, 2.5 TB HDD) and I index
> around 60 million documents per day - the index size is around 26GB.

So 1 billion documents would be approximately 500GB.

...and 10 billion/day in 90 days would be 450TB.

> I do have a customer-ID today and I use it for the queries. I don't split
> the customers, but I get bad performance. If I make a small collection
> for each customer, then I know to query only those collections and I
> get better performance - the indexes are smaller and Solr doesn't
> need to keep the other customers' data in memory. I checked it
> and the performance is much better.

True when the number of concurrently active customers is low. How many customers
do you expect to be actively using the index at a time? If the answer is "most
of them", you should make sure that your tests reflect that.

If the answer is "relatively few", then your setup might scale well (if you 
create independent clouds to handle the many collection problem). First search 
for a customer will of course take a while.

> I do have 1 billion documents today but I can't index them

Why? Does it break down, take too long to index, or result in too slow searches?
Your current problems help a lot when talking about future scale.

> - so it is a real requirement for today to be able to index 1 billion and
> keep the data for 90 days.

To be clear: Would that be 1 billion indexed every 90 days, 1 billion each day in
90 days = 90 billion at any given time, or some third option?

> What is better - 1 powerful machine or a few smaller? For example
> - one machine with 12 cores and 256GB 2.5 TB or 5 machines
> each with 4 cores and 32 GB 0.5 TB?

Depends on what you do with your data. Most of the time, IO is the bottleneck
for Solr, and for those cases it is probably more bang-for-the-buck to buy
machines with 256GB of RAM (or maybe the 148GB you have currently), as it
minimizes the overhead per box.

- Toke Eskildsen


Re: Big SolrCloud cluster with a lot of collections

2015-08-16 Thread yura last
I expect that the number of concurrent customers will be low. Today I have 1
machine, so I don't have the capacity for all the data. Because of that I am
thinking of a new "cluster" solution. Today it is 1 billion each day for 90 days =
90 billion (around 45TB of data).
So I should prefer a lot of machines with plenty of RAM and not so much HDD - right?
Thanks, Yuri



Re: Big SolrCloud cluster with a lot of collections

2015-08-16 Thread Toke Eskildsen
yura last  wrote:
> I expect that the amount of concurrent customers will be low.
> Today I have 1 machine so I don't have the capacity for all
> the data.

You aim for 90 billion documents in the first go and want to prepare for 10
times that. Your current test setup is 60M documents, which means you are off
by a factor of 1000. You really need to test on a larger subset.

> Because of that I am thinking of a new "cluster" solution. Today it is 1 billion
> each day for 90 days = 90 billion (around 45TB of data).

> So I should prefer a lot of machines with plenty of RAM and not so much HDD - right?

We seem to be looking at non-trivial machines, so I think you should run more
tests at a larger scale, taking care to emulate the number of requests and the
number of concurrent customer requests you expect. If you are lucky, it works
well to swap in the data for the active customer and you will be able to get by
with relatively modest hardware.

We have had great success with buying relatively cheap (bang-for-the-buck)
machines with low memory (compared to index size) and local SSDs. With static
indexes (89 out of your 90 days would be static data, if I understand
correctly), one of our 256GB machines holds 6 billion documents in 20TB of
index data. You might want to investigate that option. Some details at
https://sbdevel.wordpress.com/net-archive-search/

- Toke Eskildsen


Re: Admin Login

2015-08-16 Thread Scott Derrick

Erick,

After Walter's reply I started thinking along the lines you mentioned and
realized the folly of doing that!


Scott


On 8/15/2015 9:57 PM, Erick Erickson wrote:

Scott:

You better not even let them access Solr directly.

http://server:port/solr/admin/collections?action=DELETE&name=collection

Try it sometime on a collection that's not important ;)

But as Walter said, that'd be similar to allowing end users
unrestricted access to
a SQL database; that Solr URL is akin to "drop database".

Or, if you've locked down the admin stuff,

http://solr:port/solr/collection/update?commit=true&stream.body=<delete><query>*:*</query></delete>

Best
Erick

On Sat, Aug 15, 2015 at 6:57 PM, Scott Derrick  wrote:

Walter,

actually that explains it perfectly!  I will move it behind my Apache server...

thanks,

Scott


On 8/15/2015 6:15 PM, Walter Underwood wrote:

No one runs a public-facing Solr server. Just like no one runs a
public-facing MySQL server.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Aug 15, 2015, at 4:15 PM, Scott Derrick  wrote:


I'm somewhat puzzled there is no built-in security.  I can't imagine
anybody is running a public-facing Solr server with the admin page wide
open?

I've searched and haven't found any solutions that work out of the box.

I've tried the solutions here to no avail.
https://wiki.apache.org/solr/SolrSecurity

and here.  http://wiki.eclipse.org/Jetty/Tutorial/Realms

The Solr security docs say to use the application server, and if I could
run it on my Tomcat server I would already be done.  But I'm told I can't do
that?

What solutions are people using?

Scott

--
Leave no stone unturned.
Euripides











Re: phonetic filter factory question

2015-08-16 Thread Jamie Johnson
Thanks, I didn't know you could do this; I'll check it out.
On Aug 15, 2015 12:54 PM, "Alexandre Rafalovitch" 
wrote:

> From the "teaching to fish" category of advice (since I don't know the
> actual answer).
>
> Did you try "Analysis" screen in the Admin UI? If you check "Verbose
> output" mark, you will see all the offsets and can easily confirm the
> detailed behavior for yourself.
>
> Regards,
>   Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 15 August 2015 at 12:22, Jamie Johnson  wrote:
> > The JavaDoc says that the PhoneticFilterFactory will "inject" tokens with
> > an offset of 0 into the stream.  I'm assuming this means an offset of 0
> > from the token that it is analyzing, is that right?  I am trying to
> > collapse some of my schema, I currently have a text field that I use for
> > general purpose text and another field with the PhoneticFilterFactory
> > applied for finding things that are similar phonetically, but if this
> does
> > inject at the current position then I could likely collapse these into a
> > single field.  As always thanks in advance!
> >
> > -Jamie
>
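
Alexandre's Analysis-screen check can also be scripted against Solr's field
analysis handler - a minimal sketch, assuming a field type named text_phonetic
(hypothetical) and the default /analysis/field handler:

# text_phonetic is an assumed field type name
curl 'http://localhost:8983/solr/collection1/analysis/field?analysis.fieldtype=text_phonetic&analysis.fieldvalue=Jamie&wt=json'

The response lists the tokens after each stage of the analysis chain, so the
positions of the injected phonetic tokens can be read off directly.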


Re: joins

2015-08-16 Thread Nagasharath
I have exactly the same requirement



> On 13-Aug-2015, at 2:12 pm, Kiran Sai Veerubhotla  wrote:
> 
> does solr support joins?
> 
> we have a use case where two collections have to be joined and the join has
> to be on the faceted results of the two collections. is this possible?


Query term matches

2015-08-16 Thread Scott Derrick

Is there a way to get the list of terms that matched in a query response?

I realize the q parameter is returned, but I'm looking for just the list 
of terms and not the operators.


Scott

--
To those leaning on the sustaining infinite, to-day is big with blessings.
Mary Baker Eddy


Re: Query term matches

2015-08-16 Thread Toke Eskildsen
Scott Derrick  wrote:
> Is there a way to get the list of terms that matched in a query response?

Add debug=query to your request:
https://wiki.apache.org/solr/CommonQueryParameters#debug
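
For example (core name and query are placeholders):

curl 'http://localhost:8983/solr/collection1/select?q=mary&debug=query&wt=json'

The response then carries a debug section, showing how the query was parsed,
alongside the normal results.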

You might also want to try
http://splainer.io/

- Toke Eskildsen


Re: Query term matches

2015-08-16 Thread Scott Derrick

with a query like

q=mar*

I tried debugQuery=true, but it just said:

rawquerystring": "mar*",
"querystring": "mar*",
"parsedquery": "_text_:mar*",
"parsedquery_toString": "_text_:mar*",

I already know that!

one document matches Mary;
another matches Mary and martyr.

I will look at splainer.io

Scott




--
No one can make you feel inferior without your consent.
Eleanor Roosevelt


Solr Cloud Security Question

2015-08-16 Thread Tarala, Magesh
I have a SolrCloud cluster with 3 nodes. I've added password protection following
the steps here:
http://stackoverflow.com/questions/28043957/how-to-set-apache-solr-admin-password

Now only one node is able to load the collections. The others are getting 401
Unauthorized errors when loading the collections.

Could anybody provide the instructions to configure security for solr cloud?

Thanks,
Magesh




No. of records mismatch

2015-08-16 Thread Pattabiraman, Meenakshisundaram
I did a dataimport with 'clean' set to false.
The DIH status upon completion was:

<str name="status">idle</str>
<str name="Total Requests made to DataSource">1</str>
<str name="Total Rows Fetched">6843427</str>
<str name="Total Documents Processed">6843427</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2015-08-16 16:50:54</str>
<str name="">Indexing completed. Added/Updated: 6843427 documents. Deleted 0 documents.</str>

Whereas when I query using 'query?q=*:*&rows=0', I get the following count
{
  "responseHeader":{
"status":0,
"QTime":1,
"params":{
  "q":"*:*",
  "rows":"0"}},
  "response":{"numFound":1616376,"start":0,"docs":[]
  }}

There is a difference of 5 million records. Can anyone help me understand the 
behavior? The logs look fine.
Thanks


Re: joins

2015-08-16 Thread Upayavira
You can do what are called "pseudo joins", which are equivalent to a
nested query in SQL. You get back data from one core, based upon
criteria in the other. You cannot (yet) merge the results to create a
composite document.
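
A minimal sketch of such a pseudo join (core names and fields are hypothetical):
query an orders core, keeping only orders whose id appears as order_id on
orderlines docs matching sku:X17:

# orders/orderlines and their fields are invented names
curl 'http://localhost:8983/solr/orders/query' --data-urlencode 'q=*:*' \
  --data-urlencode 'fq={!join from=order_id to=id fromIndex=orderlines}sku:X17'

Note that with fromIndex, the other core has to live on the same node as the
one being queried.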

Upayavira



Re: No. of records mismatch

2015-08-16 Thread Upayavira
You almost certainly have a non-unique ID field. Some documents are
overwritten during indexing. Try it with a clean index, and then review
the number of deleted documents (updates are a delete-then-insert
action). Deleted documents are calculated as maxDocs minus numDocs.
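
One way to check (core name is a placeholder) is the Luke handler, which
reports numDocs, maxDoc and deletedDocs for the index:

# core name is a placeholder
curl 'http://localhost:8983/solr/collection1/admin/luke?numTerms=0&wt=json'

A large gap between maxDoc and numDocs means many documents were overwritten.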

Upayavira



Re: Query term matches

2015-08-16 Thread Erick Erickson
This isn't going to be easy. Why do you need to know? Especially
with wildcards this'll be "challenging".

For the specific docs that are returned, highlighting will tell you _some_
of them. Why only some? Because usually only the best N snippets are
returned, say 3 (it's configurable). And it's still possible that four terms
beginning with "mar" were in the returned doc (or N+1...).

FWIW,
Erick



Re: joins

2015-08-16 Thread naga sharathrayapati
Is there any chance of this feature (merging the results to create a composite
document) coming out in the next release, 5.3?



Re: joins

2015-08-16 Thread Erick Erickson
bq: Is there any chance of this feature (merge the results to create a composite
document) coming out in the next release 5.3

In a word "no". And there aren't really any long-range plans either that I'm
aware of.

You could also explore streaming aggregation, if the need here is more
batch-oriented.

If at all possible, Solr is much more flexible if you can de-normalize your data
rather than try to make Solr work like an RDBMS. Of course it goes against
the training of all DB admins, but it's often the best option.

So have you explored denormalizing and do you know it's not a viable option?
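
As an illustration of denormalizing (all fields invented): instead of joining
order lines to orders at query time, copy the parent's fields onto each child
document at index time and query a single collection:

# the order-level values (customer, order_date) are repeated on every line item
curl 'http://localhost:8983/solr/demo/update?commit=true' -H 'Content-Type: application/json' -d '[
  {"id":"order-1-line-1","customer":"acme","order_date":"2015-08-16T00:00:00Z","sku":"X17","qty":2},
  {"id":"order-1-line-2","customer":"acme","order_date":"2015-08-16T00:00:00Z","sku":"B09","qty":1}
]'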

Best,
Erick

>>


Re: joins

2015-08-16 Thread naga sharathrayapati
https://issues.apache.org/jira/browse/SOLR-7090

I see this JIRA open in support of joins, which might solve the problem.



Re: Query term matches

2015-08-16 Thread Scott Derrick

I'm searching a collection of documents.

When I build my results page I provide a link to each document.  If the
user clicks the link I display the document with all the matched terms
highlighted.  I need to supply my highlighter with a list of words to highlight
in the doc.


I thought the highlighter might be able to return a list of hits "per"
document, since it is highlighting a fragment.
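
A hedged sketch of that approach (the _text_ field name is taken from the debug
output earlier in the thread; the parameters are standard highlighting knobs):
ask for many snippets and parse the marked-up terms out of the highlighting
section of the response:

# core name is a placeholder; snippet count is a guess
curl 'http://localhost:8983/solr/collection1/select?q=mar*&hl=true&hl.fl=_text_&hl.snippets=10&hl.highlightMultiTerm=true&wt=json'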


Scott




Re: Query term matches

2015-08-16 Thread Scott Derrick

splainer doesn't return anything beyond what the debug parameter can.




RE: No. of records mismatch

2015-08-16 Thread Pattabiraman, Meenakshisundaram
" You almost certainly have a non-unique ID field." 
Yes it is not absolutely unique but do not think it is at this 1 to 6 ratio.
 
"Try it with a clean index, and then review the number of deleted documents 
(updates are a delete then insert action) "
I tried on a new instance - same effect. I do not see any deletions. Is there a 
way to determine this from the logs to confirm that the behavior is due to 
non-uniqueness? This will serve as an assurance.
Thanks 

<str name="Total Rows Fetched">6843469</str>
<str name="Total Documents Processed">6843469</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2015-08-16 21:22:24</str>
<str name="">Indexing completed. Added/Updated: 6843469 documents. Deleted 0 documents.</str>
<str name="Committed">2015-08-16 22:31:47</str>

Whereas '*:*' 
"params":{
  "q":"*:*"}},
  "response":{"numFound":1143108,"start":0,"docs":[



xsl error

2015-08-16 Thread Scott Derrick

I'm using a dataimporthandler:

<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">html-config.xml</str>
  </lst>
</requestHandler>

I'm using the xsl attribute on all the entities, but this one is
throwing an exception.  This xsl is used in a production document
conversion process with no problems. (The entity definition and the
quote.xsl wrapper markup were stripped from the archived message.)

quote.xsl is a wrapper that calls into the real stuff, which is in
buildBookQuote.xsl.

the error is reported as

Caused by: javax.xml.transform.TransformerConfigurationException: 
solrres:/xslt/buildBookQuote.xsl: line 449: Required attribute 'test' is 
missing.


here is the code starting at line 449 (the XSLT markup itself was stripped
from the archived message):

there actually is no "parameter" named "test", though there is xsl code
that uses test="..."


I have other xsl scripts I'm using on other entities that make similar
calls with no problem.


any ideas?

Scott

--
Leave no stone unturned.
Euripides


Re: Solr Cloud Security Question

2015-08-16 Thread Shawn Heisey
On 8/16/2015 12:09 PM, Tarala, Magesh wrote:
> I have a solr cloud with 3 nodes. I've added password protection following 
> the steps here: 
> http://stackoverflow.com/questions/28043957/how-to-set-apache-solr-admin-password
> 
> Now only one node is able to load the collections. The others are getting 401 
> Unauthorized error when loading the collections.
> 
> Could anybody provide the instructions to configure security for solr cloud?

Authentication and SolrCloud do not work well together, unless it's
client-certificate-based authentication (SSL).  This is because there is
currently no way to tell Solr what user/pass to use when making requests
to another node.

This was one of the early issues trying to solve the problem with
user/pass authentication and inter-node requests:

https://issues.apache.org/jira/browse/SOLR-4470

That issue is now closed as a duplicate, because Solr 5.3 will have an
authentication/authorization framework, and basic authentication is one
of the first things that has been implemented using that framework:

https://issues.apache.org/jira/browse/SOLR-7692
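
For reference, the new framework is driven by a security.json stored in
ZooKeeper. A rough sketch of the 5.3 basic-auth setup (the credentials value,
a salted SHA-256 hash, is omitted here; see SOLR-7692 for the exact format):

# placeholder credentials; generate the real hash per SOLR-7692
cat > security.json <<'EOF'
{
  "authentication": {
    "class": "solr.BasicAuthPlugin",
    "credentials": { "solr": "<base64 sha256 hash> <base64 salt>" }
  },
  "authorization": { "class": "solr.RuleBasedAuthorizationPlugin" }
}
EOF
server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd putfile /security.json security.json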

The release process for 5.3 is underway now.  If all goes well, the
release will happen well before the end of August.

Thanks,
Shawn



RE: Solr Cloud Security Question

2015-08-16 Thread Tarala, Magesh
Thanks Shawn!

We are on 4.10.4 and will consider a 5.x upgrade shortly.





Re: No. of records mismatch

2015-08-16 Thread davidphilip cherian
Hi,

You should check whether there were deletions by navigating to the core admin
page in the Solr admin UI. Example url:
http://localhost:8983/solr/#/~cores/test_shard1_replica1 - check for
numDocs, maxDocs and deletedDocs. If numDocs remains equal to maxDocs, then
you have confirmed that there were no overwrites (as Upayavira suggested).
HTH
