Re: Upgrade SOLR version - facets performance regression

2017-01-27 Thread alessandro.benedetti
What kind of field are you faceting on?
What is its cardinality?
What is the field type?
Is it docValues-enabled?
Which facet algorithm are you using?
Which facet parameters?

Cheers



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Upgrade-SOLR-version-facets-perfomance-regression-tp4315027p4317513.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Documents issue

2017-01-27 Thread alessandro.benedetti
I may be wrong and don't have time to check the code in detail right now, but
I would say you need to define the default on the destination field as well.

The copyField directive takes as input the raw content of the source field
(which here is null) and passes that content to the destination field.

Properties and attributes of the source field should not be considered at
copyField time.
So what happens is simply that a null value is passed to the destination
field.

If you define the default on the destination field, it should work as
expected.

N.B. this is a shot in the dark; I am not sure whether you experienced a
different behaviour previously.
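A toy model of the behaviour described above (Python, purely illustrative, not Solr's actual code path): the copy carries the raw source value, and only the destination field's own default can fill a missing value.

```python
# Toy model (not Solr's actual code path) of the behaviour described above:
# copyField forwards the raw submitted value; defaults apply per field afterwards.
def index_document(doc, defaults, copy_fields):
    """defaults: {field: default or None}; copy_fields: [(source, dest)] pairs."""
    stored = dict(doc)
    for src, dest in copy_fields:
        stored[dest] = doc.get(src)  # the raw source content is copied, even if missing
    for field, default in defaults.items():
        if stored.get(field) is None and default is not None:
            stored[field] = default  # only the field's OWN default fills the gap
    return stored

# Default only on the source field: the destination stays empty.
out1 = index_document({}, {"price": 0.0, "price_copy": None}, [("price", "price_copy")])
# Default defined on the destination field as well: it gets filled.
out2 = index_document({}, {"price": 0.0, "price_copy": 0.0}, [("price", "price_copy")])
print(out1, out2)
```

The field and default names here are hypothetical; the point is only that the destination default, not the source default, decides the stored value.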

Cheers



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-alter-the-facet-query-limit-default-tp4315939p4317514.html
Sent from the Solr - User mailing list archive at Nabble.com.


[Benchmark SOLR] JETTY VS TOMCAT

2017-01-27 Thread Gerald Reinhart

Hello,

   We are migrating our platform
   from
- Solr 5.4.1 hosted in Tomcat
   to
- Solr 5.4.1 standalone (hosted by Jetty)

=> Jetty is 15% slower than Tomcat under the same conditions.


  Here are the details of the benchmarks:

  Context:
   - Index with 9,000,000 documents
   - Gatling replays queries extracted from real traffic
   - Server: R410 with 16 virtual CPUs and 96 GB of memory

  Results with 20 clients in parallel for 10 minutes:
   For Tomcat:
   - 165 queries per second
   - 120 ms mean response time

   For Jetty:
   - 139 queries per second
   - 142 ms mean response time
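For reference, the ~15% figure follows directly from the numbers above:

```python
# Sanity check of the reported gap between Tomcat and Jetty.
tomcat_qps, jetty_qps = 165.0, 139.0
tomcat_ms, jetty_ms = 120.0, 142.0

qps_drop = (tomcat_qps - jetty_qps) / tomcat_qps          # throughput loss
latency_increase = (jetty_ms - tomcat_ms) / tomcat_ms     # mean latency growth

print(f"QPS drop: {qps_drop:.1%}, latency increase: {latency_increase:.1%}")
```

So the throughput dropped by about 15.8% while mean latency grew by about 18.3%.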

We have checked:
  - server load => same
  - I/O wait => same
  - JVM memory usage => same
  - JVM GC settings => same

 For us, this is a blocker for the migration.

 Is this a known issue? (I found this:
http://www.asjava.com/jetty/jetty-vs-tomcat-performance-comparison/)

 How can we improve Jetty's performance? (We have already followed the
http://www.eclipse.org/jetty/documentation/9.2.21.v20170120/optimizing.html
recommendations)

Many thanks,


Gérald Reinhart


Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 158 Ter Rue du Temple 75003 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended exclusively
for their addressees. If you are not the intended recipient of this message,
please delete it and notify the sender.


Distributed IDF in inter collections distributed queries

2017-01-27 Thread alessandro.benedetti
Hi all,
I have been playing a bit with distributed IDF; I debugged and explored the
code quite a lot, and it is a nice feature in a sharded environment.

I tried to see what the behaviour is when we run a distributed query
across collections (...&collection=a,b,c).

Distributed IDF should work in this scenario as well, and the document
frequency calculated should reasonably involve a max doc count which is
the total count across all the shards of all the collections.

Using the ExactStatsCache, the global collection stats are properly
calculated (debugging, I can see the global stats are coherent with what I
expect).

But these stats are then lost and BM25 falls back to the local stats (with
the consequences we know).
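To see why falling back to local stats matters, compare the idf computed from local versus merged (global) stats. This sketch uses Lucene's BM25 idf formula with made-up numbers:

```python
import math

# Lucene's BM25 idf: log(1 + (N - df + 0.5) / (df + 0.5)),
# where N = doc count and df = document frequency of the term.
def bm25_idf(doc_count, doc_freq):
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Made-up numbers: a term rare in collection "a" alone, but common overall.
local_idf = bm25_idf(doc_count=1_000, doc_freq=5)         # one collection's stats
global_idf = bm25_idf(doc_count=30_000, doc_freq=12_000)  # merged across a,b,c

print(f"local idf:  {local_idf:.3f}")
print(f"global idf: {global_idf:.3f}")
```

With numbers like these, scoring against local stats inflates the term's weight dramatically compared to the correct global view.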

I will continue my investigation; has anyone faced this problem before?

The Solr version I am trying is 6.3.0.


Cheers



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Distributed-IDF-in-inter-collections-distributed-queries-tp4317519.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Distributed IDF in inter collections distributed queries

2017-01-27 Thread alessandro.benedetti
I have an update on this; I have identified at least two bugs:

1) Real score / debug score are not aligned.
When we issue a shard request with purpose '16388' (GET_TOP_IDS,
SET_TERM_STATS), we correctly pass the global collection stats and
calculate the real score.

When we issue a shard request with purpose '320' (GET_FIELDS, GET_DEBUG),
we don't pass the global collection stats, so the debug score calculation
and rendering do not match the real score.
This can be really confusing and is not easy to spot.

Proposed solution: pass the global stats for debugging as well.

2) Using the ExactStatsCache in a single-collection vs multi-collection
scenario doesn't play well with caching.
Specifically, if we first execute the multi-collection query, the cached
global stats will be the multi-collection ones, even if we then run a
single-collection query. And vice versa.

Proposed solution: the list of collections involved should affect the
ExactStatsCache hash key (and consequently the caching of the global
stats).
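A sketch of proposed solution 2 (hypothetical, not Solr's actual ExactStatsCache API): key the cached global stats by the term and the set of collections involved, so single- and multi-collection requests get distinct entries.

```python
# Hypothetical cache-key scheme, not Solr's API: include the (sorted) list of
# collections in the key so stats cached for one scope never leak into another.
def stats_cache_key(term, collections):
    return (term, tuple(sorted(collections)))

cache = {}
cache[stats_cache_key("foo", ["a", "b", "c"])] = {"docFreq": 12000, "docCount": 30000}
cache[stats_cache_key("foo", ["a"])] = {"docFreq": 5, "docCount": 1000}

# The single-collection query no longer reuses the multi-collection entry,
# and the key is insensitive to the order collections are listed in.
print(cache[stats_cache_key("foo", ["c", "b", "a"])])
```

Sorting the collection list before hashing makes `collection=a,b,c` and `collection=c,b,a` share one entry, which seems desirable since they describe the same corpus.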

I think we should raise two separate bugs in the Solr Jira.

What do you think?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Distributed-IDF-in-inter-collections-distributed-queries-tp4317519p4317531.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Distributed IDF in inter collections distributed queries

2017-01-27 Thread Markus Jelsma
On 1), https://issues.apache.org/jira/browse/SOLR-7759

 
 
-Original message-
> From:alessandro.benedetti 
> Sent: Friday 27th January 2017 13:22
> To: solr-user@lucene.apache.org
> Subject: Re: Distributed IDF in inter collections distributed queries


RE: Distributed IDF in inter collections distributed queries

2017-01-27 Thread alessandro.benedetti
Thanks Markus, I commented on the Jira issue with a very naive approach to
solving that.
It's a shot in the dark; I will double-check whether it makes sense at all :)

Cheers



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Distributed-IDF-in-inter-collections-distributed-queries-tp4317519p4317535.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: [Benchmark SOLR] JETTY VS TOMCAT

2017-01-27 Thread William Bell
Did you try setting your acceptor count, SelectChannelConnector.setAcceptors(int),
to a value between 1 and (number_of_cpu_cores - 1)?
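That rule of thumb, spelled out (the actual setting is the Java-side SelectChannelConnector.setAcceptors call; this is just the arithmetic):

```python
# The quoted rule of thumb: acceptors between 1 and (number_of_cpu_cores - 1).
def suggested_acceptors(cpu_cores):
    return max(1, cpu_cores - 1)

print(suggested_acceptors(16))  # the 16-vCPU server from the benchmark -> 15
```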

On Fri, Jan 27, 2017 at 3:22 AM, Gerald Reinhart  wrote:




-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: [Benchmark SOLR] JETTY VS TOMCAT

2017-01-27 Thread Yago Riveiro
Solr runs its tests with Jetty.

I ran into nasty bugs in Solr in the past with Tomcat.

My advice: speed is only one metric; robustness and reliability matter too.

--

/Yago Riveiro

On 27 Jan 2017 15:38 +, William Bell , wrote:


Re: Documents issue

2017-01-27 Thread Comcast
Why would this behavior have changed from one day to the next? I ran crawl and
index several times with no issues, changed the schema.xml definition of a couple
of fields, ran crawl and index again, and produced the dataset with missing copyFields.

Sent from my iPhone

> On Jan 27, 2017, at 4:07 AM, alessandro.benedetti  
> wrote:



Re: Documents issue

2017-01-27 Thread alessandro.benedetti
Hi Khris,
can you paste here the diff between the OK state and the KO state?
Has only the name of the destination field changed? (a replacement of '.'
with '_'?)
Do you have any dynamic field defined?
Cheers



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-alter-the-facet-query-limit-default-tp4315939p4317566.html
Sent from the Solr - User mailing list archive at Nabble.com.


Splitting large non-stored field

2017-01-27 Thread deansg
Hi,
Many of our documents contain a unique, non-indexed text field that contains
html-content which we display to our users (let's call it "html_content").
The reason we store this field in Solr in the first place is because of
Solr's highlighting capabilities: the query itself is against the non-html
version of the field. Unfortunately, sometimes the field is very large and
causes the users' browsers to crash when displayed. 

Currently, our proposed solution is to implement an update processor which
will split this field  into several other non-indexed fields (such as
"html_content1", "html_content2" and so on), so that the displaying website
will be able to query and show only a portion of the full content at any
given part.
Can anyone suggest a more elegant solution?

Again, we are storing the field in Solr because we need highlighting. Also,
the UI team cannot implement the "paging" of this field in their side
because their server cannot manage user sessions and should remain
stateless.
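A sketch of the splitting step such an update processor could perform (field names and chunk size are illustrative; this is not an existing Solr processor, and a real implementation would also need to avoid cutting inside an HTML tag or entity):

```python
# Split one oversized stored field into numbered chunk fields, plus a count
# field so the UI knows how many parts it can request.
def split_field(doc, field="html_content", chunk_size=100_000):
    value = doc.pop(field, None)
    if value is None:
        return doc
    chunks = [value[i:i + chunk_size] for i in range(0, len(value), chunk_size)]
    for n, chunk in enumerate(chunks, start=1):
        doc[f"{field}{n}"] = chunk           # html_content1, html_content2, ...
    doc[f"{field}_chunks"] = len(chunks)     # lets the UI know how many parts exist
    return doc

split_doc = split_field({"id": "1", "html_content": "x" * 250_000})
print(split_doc["html_content_chunks"], len(split_doc["html_content3"]))
```

The `_chunks` count field is an assumption of this sketch; it spares the stateless UI from probing for fields that may not exist.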



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Splitting-large-non-stored-field-tp4317570.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Documents issue

2017-01-27 Thread Comcast
The only change was a rename to remove periods from the field names.

Sent from my iPhone

> On Jan 27, 2017, at 11:53 AM, alessandro.benedetti  
> wrote:



Re: Does DIH queues up requests

2017-01-27 Thread William Bell
However, you can create multiple DIH configs under a core/collection. You
can run each of them in parallel and commit at the end.

SELECT *
 FROM existingtable
 WHERE column >= 1 AND column <= 2000;
SELECT *
 FROM existingtable
 WHERE column >= 2001 AND column <= 4000;


Something like that works for us to speed it up.
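A sketch of how such range slices could be generated (table and column names taken from the example above; bounds purely illustrative):

```python
# Carve a key range into per-DIH SQL slices like the two queries above.
def dih_slices(column, max_value, num_slices):
    step = max_value // num_slices
    slices = []
    for i in range(num_slices):
        lo = i * step + 1
        hi = (i + 1) * step if i < num_slices - 1 else max_value
        slices.append(
            f"SELECT * FROM existingtable WHERE {column} >= {lo} AND {column} <= {hi}"
        )
    return slices

for sql in dih_slices("column", 4000, 2):
    print(sql)
```

Each slice query then goes into its own DIH config so the imports can run concurrently.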

On Wed, Jan 25, 2017 at 4:01 PM, Davis, Daniel (NIH/NLM) [C] <
daniel.da...@nih.gov> wrote:

> DIH is not multi-threaded, so the idea of "queueing" up requests is a
> misnomer.   You might be better off using something other than
> DataImportHandler.
> Logstash can pull what it calls "events" from a database and then push
> them into Solr, and you get some of the same row-transformation
> capabilities that DataImportHandler has.
>
> This is also the bread and butter of ETL tools such as
> Kettle/Talend/MuleSoft/etc.
>
> That said, what I have done in the past is to take different streams of
> data and divide them into different requestHandlers, all using
> DataImportHandler.
> Each of these request handlers has its own context as to whether it is
> busy or not, and so each can be separately active/inactive.
>
>   <requestHandler name="..." class="solr.DataImportHandler">
>     <lst name="defaults">
>       <str name="config">health-topics-conf.xml</str>
>     </lst>
>   </requestHandler>
>
>   <requestHandler name="..." class="solr.DataImportHandler">
>     <lst name="defaults">
>       <str name="config">drugs-conf.xml</str>
>     </lst>
>   </requestHandler>
>
>
> Both of the above are XML imports, but for database imports I also once
> implemented a sort of multithreading by having 4 request handlers
> and 4 data-config files, each taking its own slice of data:
>
> data-config-0.xml
> ...
> <entity name="..."
> query="SELECT * FROM (SELECT t.*, Mod(RowNum, 4) threadid FROM
> my_data_view t) WHERE threadid = 0"
> transformer="TemplateTransformer,LogTransformer"
> logTemplate="topic thread 0">
> ...
>
> data-config-1.xml:
> ...
> <entity name="..."
> query="SELECT * FROM (SELECT t.*, Mod(RowNum, 4) threadid FROM
> my_data_view t) WHERE threadid = 1"
> transformer="TemplateTransformer,LogTransformer"
> logTemplate="topic thread 1" logLevel="debug">
> ...
>
> And so on...
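A quick check that the Mod(RowNum, 4) trick above yields four disjoint slices that together cover every row exactly once (toy RowNum values):

```python
# Simulate Mod(RowNum, 4) partitioning over stand-in row numbers.
rows = list(range(1, 101))
slices = {tid: [r for r in rows if r % 4 == tid] for tid in range(4)}

covered = sorted(r for s in slices.values() for r in s)
assert covered == rows  # every row lands in exactly one slice
print({tid: len(s) for tid, s in slices.items()})
```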
>
> -Original Message-
> From: William Bell [mailto:billnb...@gmail.com]
> Sent: Wednesday, January 25, 2017 5:39 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Does DIH queues up requests
>
> What we do is:
>
> Run a URL to delete *:*, but do not commit.
>
> 1. Kick off indexing on DIH1, clean=false, commit=false.
> 2. Kick off indexing on DIH2, clean=false, commit=false.
>
> Then we manually commit.
>
> On Wed, Jan 25, 2017 at 2:57 PM, Nkeet Shah 
> wrote:
>
> > Hi,
> > I have a multi-thread application that makes DIH request to perform
> > indexing. What I could not gather from the documentation is that does
> > DIH requests are queued up.
> >
> > In essence if a made a request to say DIH1 and it has accepted the
> > request and is working on the indexing. What would happen if another
> > request is made to the same DIH1. Will it be queued or rejected/
> >
> > Thanks
> > Ankit!
> >
> >
>
>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076
>



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: After migrating to SolrCloud

2017-01-27 Thread Chris Hostetter

That error means that some client talking to your server is attempting to 
use an antiquated HTTP protocol version, which was (evidently) supported 
by the jetty used in 3.6, but is no longer supported by the jetty used in 
6.2.

(some details: https://stackoverflow.com/a/32302263/689372 )

If it's happening once a second, that sounds like perhaps some sort of 
monitoring agent? or perhaps you have a load balancer with an antiquated 
health check mechanism?


NUANCE NOTE: even though the error specifically mentions "HTTP/0.9", it's 
possible that the problematic client is actually attempting to use 
HTTP/1.0; for a variety of esoteric reasons related to how broadly 
formatted HTTP/0.9 requests can be, some HTTP/1.0 requests will look like 
"valid" (but unsupported) HTTP/0.9 requests to the Jetty server -- hence 
that error message...

https://github.com/eclipse/jetty.project/issues/370
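A toy illustration of that nuance (NOT Jetty's actual parser): an HTTP/0.9 request line is just "GET <path>" with no version token, so a request line whose version is missing or malformed can look like a "valid" 0.9 request.

```python
# Naive version guesser: without a trailing "HTTP/x.y" token, a request line
# is indistinguishable from an HTTP/0.9 request.
def guess_http_version(request_line):
    parts = request_line.strip().split()
    if len(parts) >= 3 and parts[-1].startswith("HTTP/"):
        return parts[-1]
    return "HTTP/0.9"

print(guess_http_version("GET /solr/admin/ping HTTP/1.1"))  # -> HTTP/1.1
print(guess_http_version("GET /solr/admin/ping"))           # -> HTTP/0.9
```

This is why capturing the raw bytes of the offending request (e.g. with tcpdump) is usually the fastest way to identify the misbehaving client.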




: Date: Thu, 26 Jan 2017 08:49:06 -0700 (MST)
: From: marotosg 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: After migrating to SolrCloud
: 
: Hi All,
: I have migrated Solr from the older version 3.6 to SolrCloud 6.2 and all is good, but
: almost every second there are some WARN messages in the logs.
: 
: HttpParser
: bad HTTP parsed: 400 HTTP/0.9 not supported for
: HttpChannelOverHttp@16a84451{r=0,​c=false,​a=IDLE,​uri=null}
: 
: Anyone know where these are coming from?
: 
: Thanks
: 
: 
: 
: 
: --
: View this message in context: 
http://lucene.472066.n3.nabble.com/After-migrating-to-SolrCloud-tp4315943.html
: Sent from the Solr - User mailing list archive at Nabble.com.
: 

-Hoss
http://www.lucidworks.com/


Re: Streaming Expressions result-set fields not in order

2017-01-27 Thread Joel Bernstein
The issue is that fields are held in HashMaps internally, so field order is
not maintained. The thinking behind this was that field order was not so
important, as tuples are mainly accessed by key. But I think it's worth
looking into an approach for maintaining field order. Feel free to create a
Jira issue about this and update this thread with the issue number.
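Until field order is maintained server-side, one workaround is to reorder each tuple client-side after parsing (a sketch; the keys below are taken from the example in the quoted message):

```python
# Rebuild each tuple in the caller's desired field order, keeping any
# fields not in the list (e.g. EOF/RESPONSE_TIME markers) at the end.
def reorder_tuple(doc, field_order):
    ordered = {f: doc[f] for f in field_order if f in doc}
    ordered.update({f: v for f, v in doc.items() if f not in ordered})
    return ordered

doc = {"max(cost)": 256.33, "id": "01", "sum(cost)": 256.33, "count(*)": 1}
wanted = ["id", "sum(cost)", "max(cost)", "count(*)"]
print(list(reorder_tuple(doc, wanted)))
```

This relies on insertion-ordered dicts on the client side (guaranteed in Python 3.7+); a Java client would use a LinkedHashMap for the same effect.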

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Jan 25, 2017 at 9:59 AM, Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> I'm trying out the Streaming Expressions in Solr 6.3.0.
>
> Currently, I'm facing the issue of not being able to get the fields in the
> result-set to be displayed in the same order as what I put in the query.
>
> For example, when I execute this query:
>
>  http://localhost:8983/solr/collection1/stream?expr=facet(collection1,
>   q="*:*",
>   buckets="id,cost,quantity",
>   bucketSorts="cost desc",
>   bucketSizeLimit=100,
>   sum(cost),
>   sum(quantity),
>   min(cost),
>   min(quantity),
>   max(cost),
>   max(quantity),
>   avg(cost),
>   avg(quantity),
>   count(*))&indent=true
>
>
> I get the following in the result-set.
>
>{
>   "result-set":{"docs":[
> {
> "min(quantity)":12.21,
> "avg(quantity)":12.21,
> "sum(cost)":256.33,
> "max(cost)":256.33,
> "count(*)":1,
> "min(cost)":256.33,
> "cost":256.33,
> "avg(cost)":256.33,
> "quantity":12.21,
> "id":"01",
> "sum(quantity)":12.21,
> "max(quantity)":12.21},
> {
> "EOF":true,
> "RESPONSE_TIME":359}]}}
>
>
> The fields are displayed randomly all over the place, instead of in the order
> sum, min, max, avg as in the query. Is there anything I can do to get the
> fields in the result-set displayed in the same order as in the query?
>
>
> Regards,
> Edwin
>