Re: Solr vs ElasticSearch

2011-07-26 Thread Peter
Have a look:

http://stackoverflow.com/questions/2271600/elasticsearch-sphinx-lucene-solr-xapian-which-fits-for-which-usage

http://karussell.wordpress.com/2011/05/12/elasticsearch-vs-solr-lucene/

Regards,
Peter.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-vs-ElasticSearch-tp3009181p3200492.html
Sent from the Solr - User mailing list archive at Nabble.com.


Fundamental questions of how to build up solr for huge portals

2010-01-16 Thread Peter

Hello!

Our team wants to use Solr for a community portal built up out of 3 or
more sub-portals. We are unsure how we should build up the whole
architecture, because we have more than one portal and we want to make
them all connected and searchable by Solr. Could some experts help us with
these questions?


- What's the best way to use Solr to get the best performance for a huge
portal with >5000 users that might expand quickly?
- Which client should we use (Java, PHP, ...)? Right now the portal is mostly
PHP/MySQL based, but we want to make the Solr setup as good as it can be in all
ways (performance, accessibility, good programming practice, using the full
feature set of Lucene - tagging, faceting and so on...)
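
For reference, a minimal SolrJ sketch of the kind of Java client query being
asked about (the endpoint URL, core name "portal" and field names are
assumptions, and the API shown is the current SolrJ one, which is newer than
this thread):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class PortalSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical core name "portal"; adjust to the real endpoint.
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/portal").build();
        try {
            SolrQuery query = new SolrQuery("title:community");
            query.setFacet(true);            // enable faceting
            query.addFacetField("tag");      // facet on an assumed "tag" field
            query.setRows(10);
            QueryResponse response = client.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
        } finally {
            client.close();
        }
    }
}

A PHP client (or plain HTTP/JSON calls) works just as well; the choice mostly
comes down to which stack the portal code already lives in.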



We are thankful for any suggestions :)

Thanks,
Peter


Re: Solr High Availability

2015-12-16 Thread Peter Tan
Hi Jack,

Appreciate you helping me to clear this up.

For replicationFactor = 1, that means only one copy of each document is kept
in the cluster.

Currently, for our SolrCloud setup, we have two replicas (primary and
replica) for each shard (5 shards in total). This should already achieve HA,
correct?



On Tue, Dec 15, 2015 at 10:09 PM, Jack Krupansky 
wrote:

> There is no HA with a single replica for each shard. Replication factor
> must be at least 2 for HA.
>
> -- Jack Krupansky
>
> On Wed, Dec 16, 2015 at 12:38 AM, Peter Tan  wrote:
>
> > Hi Jack, What happens when there is only one replica setup?
> >
> > On Tue, Dec 15, 2015 at 9:32 PM, Jack Krupansky <
> jack.krupan...@gmail.com>
> > wrote:
> >
> > > Solr Cloud provides HA when you configure at least two replicas for
> each
> > > shard and have at least 3 zookeepers. That's it. No deck or detail
> > document
> > > is needed.
> > >
> > >
> > >
> > > -- Jack Krupansky
> > >
> > > On Tue, Dec 15, 2015 at 9:07 PM, 
> > > wrote:
> > >
> > > > Hi Team,
> > > >
> > > > Can you help me in understanding in achieving the Solr High
> > Availability
> > > .
> > > >
> > > > Appreciate you have a detail document or Deck on more details.
> > > >
> > > > Thank you
> > > > Viswanath Bharathi
> > > > Accenture | Delivery Centres for Technology in India
> > > > CDC 2, Chennai, India
> > > > Mobile: +91 9886259010
> > > > www.accenture.com | www.avanade.com
> > > >
> > > >
> > >
> >
>


Re: Solr High Availability

2015-12-16 Thread Peter Tan
Thanks for the response.

There were a few occurrences in our SolrCloud cluster where, when a primary
went down in a shard, the replica didn't get promoted, which eventually led
to downtime. We had to restart the ZooKeeper services (we have three ZooKeeper
nodes) to promote the replica to primary.

But I just want to make sure our setup is correct.
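
One way to sanity-check the setup (a sketch, with host, port and collection
name assumed) is to ask the Collections API for the cluster status and confirm
that every shard shows an active leader and an active replica on different
nodes:

http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=collection1&wt=json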

On Wed, Dec 16, 2015 at 10:01 AM, Upayavira  wrote:

> If you have two replicas (one leader/one replica) for each shard of your
> collection, and you ensure that no two replicas are on the same node,
> and you have three independent Zookeeper nodes, then yes, you should
> have HA.
>
> Upayavira
>
> On Wed, Dec 16, 2015, at 05:48 PM, Peter Tan wrote:
> > Hi Jack,
> >
> > Appreciate you helping me to clear this up.
> >
> > For replicationFactor = 1, that means only keeping one copy of document
> > in
> > the cluster.
> >
> > Currently, for our SolrCloud setup, we have two replicas (primary and
> > replica) per each shard (total of 5 shards).  This should achieve the HA
> > already, correct?
> >
> >
> >
> > On Tue, Dec 15, 2015 at 10:09 PM, Jack Krupansky
> > 
> > wrote:
> >
> > > There is no HA with a single replica for each shard. Replication factor
> > > must be at least 2 for HA.
> > >
> > > -- Jack Krupansky
> > >
> > > On Wed, Dec 16, 2015 at 12:38 AM, Peter Tan 
> wrote:
> > >
> > > > Hi Jack, What happens when there is only one replica setup?
> > > >
> > > > On Tue, Dec 15, 2015 at 9:32 PM, Jack Krupansky <
> > > jack.krupan...@gmail.com>
> > > > wrote:
> > > >
> > > > > Solr Cloud provides HA when you configure at least two replicas for
> > > each
> > > > > shard and have at least 3 zookeepers. That's it. No deck or detail
> > > > document
> > > > > is needed.
> > > > >
> > > > >
> > > > >
> > > > > -- Jack Krupansky
> > > > >
> > > > > On Tue, Dec 15, 2015 at 9:07 PM, <
> k.viswanath.bhara...@accenture.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Team,
> > > > > >
> > > > > > Can you help me in understanding in achieving the Solr High
> > > > Availability
> > > > > .
> > > > > >
> > > > > > Appreciate you have a detail document or Deck on more details.
> > > > > >
> > > > > > Thank you
> > > > > > Viswanath Bharathi
> > > > > > Accenture | Delivery Centres for Technology in India
> > > > > > CDC 2, Chennai, India
> > > > > > Mobile: +91 9886259010
> > > > > > www.accenture.com | www.avanade.com
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
>


Solr Timeouts during query aggregation

2015-12-23 Thread Peter Lee
Greetings,

I'm having a hard time locating information about how Solr handles timeouts (if
it DOES handle them) during a search across multiple shards in a SolrCloud
configuration.

I have found information, and have confirmed through empirical testing, on how
Solr handles timeouts on each shard when a query is made to a collection. What
I have NOT been able to find is information or settings related to the time it
takes Solr to aggregate the results returned from multiple shards before
returning the response to the user. Does Solr not have any sort of timeout on
this operation?

For clarity, I'll repeat my question and try to explain it in more detail.

If I send a query to a SolrCloud setup that has 6 shards, the query will be
sent to each of the 6 shards, and each will return some number of hits. The
answers from each of the shards are sent back to the server that originally
caught the query, and that original server must then aggregate the data from
all of the different shards to produce a single set of hits to return to the
user. I see how to use "timeAllowed" to limit the time of the search on each
shard...but I was wondering if there was a separate timeout for the
"aggregation" step just before the response is returned.

I am asking this question because our existing search technology has this
behavior and setting, and I am trying to determine if there is a related
feature in Solr. At this point, since I have not seen any documentation or
configuration settings for this feature, I am ready to take it as truth that
Solr does NOT include this functionality. However, I thought I should ask the
mailing list to see if I've missed something.

Thank you!

Peter S. Lee, Software Engineer Lead
ProQuest | 789 E. Eisenhower Parkway | Ann Arbor, MI, 48106-1346 USA
www.proquest.com



File based spell check with weights

2016-04-08 Thread Peter Lee
In an older wiki post (which I know is outdated...but still...) located here: 
https://wiki.apache.org/solr/FileBasedSpellChecker, the end of the first 
paragraph  (the intro) indicates that "...it isn't all that hard to create an 
index from a file and have it weight the terms."

I've been thinking about this, and it would be PERFECT for what we need 
(because we HAVE a dictionary from a different search engine WITH weights on it 
that have been crafted/tweaked over a long period of time). However, as I've 
read the spellcheck documentation, I'm really not certain what this person 
(Mark Bennett) was talking about.

Question #1: Has anyone pursued this possibility? If so, can you offer some 
insight as to how you approached this?

Question #2: I'd be grateful for someone's "hint" as to where to start looking 
to try to create this workaround if you have some experience with it. I'm 
guessing that the only way to do this is to dive into the File based spell 
check dictionary code and then clone/modify what is there. I was just wondering 
if I was missing something or if someone was aware of a quicker way to scale 
this technological hill. I realize I could dissect the format of the spellcheck 
dictionary and load it as I desire...but I'm a bit concerned about this as this 
is for a long-lived project that is expected to live a lot longer, and this 
approach seems to be a bit "brittle." It will break the first time Apache makes 
the slightest change to the structure of the spell check index.

If anyone has any insights to offer I'd appreciate it.

Thanks.

Peter S. Lee



Re: Basic auth

2015-07-22 Thread Peter Sturge
If you're using Jetty you can use the standard realms mechanism for Basic
Auth, and it works the same on Windows or UNIX. There are plenty of docs on
the Jetty site about getting this working, although it does vary somewhat
depending on the version of Jetty you're running (N.B. I would suggest
using Jetty 9, and not 8, as 8 is missing some key authentication classes).
If, when you execute a search query against your Solr instance, you get a
username and password popup, then Jetty's auth is set up. If you don't, then
something's wrong in the Jetty config.

It's worth noting that if you're doing distributed searches, Basic Auth on
its own will not work for you. This is because Solr sends distributed
requests to remote instances on behalf of the user, and it has no knowledge
of the web container's auth mechanics. We got 'round this by customizing
Solr to receive credentials and use them for authentication to remote
instances - SOLR-1861 is an old implementation for a previous release, and
there has been some significant refactoring of SearchHandler since then,
but the concept works well for distributed queries.

Thanks,
Peter



On Wed, Jul 22, 2015 at 11:18 AM, O. Klein  wrote:

> Steven White wrote
> > Thanks for updating the wiki page.  However, my issue remains, I cannot
> > get
> > Basic auth working.  Has anyone got it working, on Windows?
>
> Doesn't work for me on Linux either.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Basic-auth-tp4218053p4218519.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Basic auth

2015-07-22 Thread Peter Sturge
Try adding the "start" call in your jetty.xml:
Realm Name
/etc/realm.properties
5
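
The XML markup in the fragment above did not survive, but the entry it appears
to describe is the usual Jetty 9 HashLoginService bean from the Solr/Jetty
realm example (element names assumed from that example, values taken from the
fragment above):

<Call name="addBean">
  <Arg>
    <New class="org.eclipse.jetty.security.HashLoginService">
      <Set name="name">Realm Name</Set>
      <Set name="config"><SystemProperty name="jetty.home" default="."/>/etc/realm.properties</Set>
      <Set name="refreshInterval">5</Set>
    </New>
  </Arg>
</Call>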



On Wed, Jul 22, 2015 at 2:53 PM, O. Klein  wrote:

> Yeah I can't get it to work on Jetty 9 either on Linux.
>
> Just trying to password protect the admin pages.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Basic-auth-tp4218053p4218565.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Basic Auth (again)

2015-07-23 Thread Peter Sturge
Hi Steve,

What version of Jetty are you using?

Have you got a webdefault.xml in your etc folder?
If so, does it have an entry like this:

  <login-config>
    <auth-method>BASIC</auth-method>
    <realm-name>Realm Name as specified in jetty.xml</realm-name>
  </login-config>

It's been a few years since I set this up, but I believe you also need an
auth-constraint in webdefault.xml - this tells jetty which apps are using
which realms:

  <security-constraint>
    <web-resource-collection>
      <web-resource-name>A web application name</web-resource-name>
      <url-pattern>/*</url-pattern>
    </web-resource-collection>
    <auth-constraint>
      <role-name>default-role</role-name>
    </auth-constraint>
  </security-constraint>

Your realm.properties should then have user account entries for the role
similar to:

admin: some-cred, default-role


Hope this helps,
Peter


On Thu, Jul 23, 2015 at 7:41 PM, Steven White  wrote:

> (re-posting as new email thread to see if this will make it to the list)
>
>
> That didn't help.  I still get the same result and virtually no log to help
> me figure out where / what things are going wrong.
>
> Here is all that I see in C:\Solr\solr-5.2.1\server\logs\solr.log:
>
>   INFO  - 2015-07-23 05:29:12.065; [   ] org.eclipse.jetty.util.log.Log;
> Logging initialized @286ms
>   INFO  - 2015-07-23 05:29:12.231; [   ] org.eclipse.jetty.server.Server;
> jetty-9.2.10.v20150310
>   WARN  - 2015-07-23 05:29:12.240; [   ]
> org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
>   INFO  - 2015-07-23 05:29:12.255; [   ]
> org.eclipse.jetty.server.AbstractConnector; Started
> ServerConnector@5a5fae16
> {HTTP/1.1}{0.0.0.0:8983}
>   INFO  - 2015-07-23 05:29:12.256; [   ] org.eclipse.jetty.server.Server;
> Started @478ms
>
> Does anyone know where / what logs I should turn on to debug this?  Should
> I be posting this issue on the Jetty mailing list?
>
> Steve
>
>
> On Wed, Jul 22, 2015 at 10:34 AM, Peter Sturge 
>  wrote:
>
> > Try adding the "start" call in your jetty.xml:
> > Realm Name
> >  > default="."/>/etc/realm.properties
> > 5
> > 
>


Re: Basic Auth (again)

2015-07-23 Thread Peter Sturge
Hi Steve,

We've not yet moved to Solr 5, but we do use Jetty 9. In any case, Basic
Auth is a Jetty thing, not a Solr thing.
We do use this mechanism to great effect to secure things like index
writers and such, and it does work well once it's setup.
Jetty, as with all containers, is a bit fussy about everything being in its
place (sorry to state the obvious :-).

I see you've got a non-global url pattern - is this definitely
correct? In 100% of cases, Solr should be the only app running, so a global
url is standard practice.
Your Jetty's got Solr security-constraint set to /db/*, but your url is
http://localhost:8983/solr/ - you'll need a corresponding 
entry if you want to use /db/* (and the url will change accordingly to
http://localhost:8983/db/solr/)
To simplify things - even if just to get things working initially, can you
set it to a /* url-pattern and use default-role? You can always tweak it
later on.

I take it from your url that you're not using any sharding/multi-core
stuff. If you are using multi-core, include the core name in the url (e.g.
localhost:8983/solr/mycore/select?q=*:*).

You can also set the jetty-logging.properties file as described in:
http://www.eclipse.org/jetty/documentation/9.2.7.v20150116/configuring-logging.html
.
A 404 would suggest that Solr hasn't loaded, possibly due to missing
mappings in the xml. You can run netstat -a on your Windows box to see if
Solr is listening on port 8983.

Thanks,
Peter


On Thu, Jul 23, 2015 at 9:39 PM, Steven White  wrote:

> Hi Petter,
>
> I'm on Solr 5.2.1 which comes with Jetty 9.2.  I'm setting this up on
> Windows 2012 but will need to do the same on Linux too.
>
> I followed the step per this link:
> https://wiki.apache.org/solr/SolrSecurity#Jetty_realm_example very much to
> the book.  Here are the changes I made:
>
> File: C:\Solr\solr-5.2.1\server\etc\webdefault.xml
>
>   
> 
>   Solr authenticated
> application
>   /db/*
> 
>
>   db-role
> 
>   
>
> 
>   BASIC
>   Test Realm
> 
>
> File: E:\Solr\solr-5.2.1\server\etc\jetty.xml
>
> 
>   Test Realm
>default="."/>/etc/realm.properties
>   0
>
> 
>
> File: E:\Solr\solr-5.2.1\server\etc\realm.properties
>
> admin: admin, db-role
>
> I then restarted Solr.  After this, accessing http://localhost:8983/solr/
> gives me:
>
> HTTP ERROR: 404
>
> Problem accessing /solr/. Reason:
>
> Not Found
> Powered by Jetty://
>
> In a previous post, I asked if anyone has setup Solr 5.2.1 or any 5.x with
> Basic Auth and got it working, I have not heard back.  Either this feature
> is not tested or not in use.  If it is not in use, how do folks secure
> their Solr instance?
>
> Thanks
>
> Steve
>
> On Thu, Jul 23, 2015 at 2:52 PM, Peter Sturge 
> wrote:
>
> > Hi Steve,
> >
> > What version of Jetty are you using?
> >
> > Have you got a webdefault.xml in your etc folder?
> > If so, does it have an entry like this:
> >
> >   
> > BASIC
> > Realm Name as specified in jetty.xml
> >   
> >
> > It's been a few years since I set this up, but I believe you also need an
> > auth-constraint in webdefault.xml - this tells jetty which apps are using
> > which realms:
> >
> >   
> > 
> >   A web application name
> >   /*
> > 
> > 
> >   default-role
> > 
> >   
> >
> > Your realm.properties should then have user account entries for the role
> > similar to:
> >
> > admin: some-cred, default-role
> >
> >
> > Hope this helps,
> > Peter
> >
> >
> > On Thu, Jul 23, 2015 at 7:41 PM, Steven White 
> > wrote:
> >
> > > (re-posting as new email thread to see if this will make it to the
> list)
> > >
> > >
> > > That didn't help.  I still get the same result and virtually no log to
> > help
> > > me figure out where / what things are going wrong.
> > >
> > > Here is all that I see in C:\Solr\solr-5.2.1\server\logs\solr.log:
> > >
> > >   INFO  - 2015-07-23 05:29:12.065; [   ]
> org.eclipse.jetty.util.log.Log;
> > > Logging initialized @286ms
> > >   INFO  - 2015-07-23 05:29:12.231; [   ]
> org.eclipse.jetty.server.Server;
> > > jetty-9.2.10.v20150310
> > >   WARN  - 2015-07-23 05:29:12.240; [   ]
> > > org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> > >   INFO  -

Collapsing Query Parser returns one record per shard...was not expecting this...

2015-08-03 Thread Peter Lee
From my reading of the Solr docs (e.g.
https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
and https://cwiki.apache.org/confluence/display/solr/Result+Grouping), I've
been under the impression that these two methods (result grouping and
collapsing query parser) can both be used to eliminate duplicates from a
result set (in our case, we have a duplication field that contains a
'signature' that identifies duplicates. We use our own signature for a variety
of reasons that are tied to complex business requirements).

In a test environment I scattered 15 duplicate records (with another 10 unique 
records) across a test system running Solr Cloud (Solr version 5.2.1) that had 
4 shards and a replication factor of 2. I tried both result grouping and the 
collapsing query parser to remove duplicates. The result grouping worked as 
expected...the collapsing query parser did not.

My results using the collapsing query parser showed that Solr was in fact
including in the result set one of the duplicate records from each shard
(that is, I received FOUR duplicate records...and turning on debug showed that
each of the four records came from a unique shard)...when I was expecting Solr
to do the collapsing on the aggregated result and return only ONE of the
duplicated records across ALL shards. It appears that Solr is performing the
collapse on each individual shard, but then NOT performing the
operation on the aggregated results from each shard.

I have searched through the forums and checked the documentation as carefully 
as I can. I find no documentation or mention of this effect (one record being 
returned per shard) when using collapsing query parsing.

Is this a known behavior? Am I just doing something wrong? Am I missing some 
search parameter? Am I simply not understanding correctly how this is supposed 
to work?

For reference, I am including below the search url and the response I received. 
Any insights would be appreciated.

Query: 
http://172.26.250.150:8983/solr/AcaColl/select?q=*%3A*&wt=json&indent=true&rows=1000&fq={!collapse%20field=dupid_s}&debugQuery=true

Response (note that dupid_s = 900 is the duplicate value, and that I have added
comments in the output pointing out which shard each record came
from):

{
  "responseHeader":{
"status":0,
"QTime":31,
"params":{
  "debugQuery":"true",
  "indent":"true",
  "q":"*:*",
  "wt":"json",
  "fq":"{!collapse field=dupid_s}",
  "rows":"1000"}},
  "response":{"numFound":14,"start":0,"maxScore":1.0,"docs":[
  {
"storeid_s":"1002",
"dupid_s":"900", ***AcaColl_shard2_replica2***
"title_pqth":["Dupe Record #2"],
"_version_":1508241005512491008,
"indexTime_dt":"2015-07-31T19:25:09.914Z"},
  {
"storeid_s":"8020",
"dupid_s":"2005",
"title_pqth":["Unique Record #5"],
"_version_":1508241005539753984,
"indexTime_dt":"2015-07-31T19:25:09.94Z"},
  {
"storeid_s":"8023",
"dupid_s":"2008",
"title_pqth":["Unique Record #8"],
"_version_":1508241005540802560,
"indexTime_dt":"2015-07-31T19:25:09.94Z"},
  {
"storeid_s":"8024",
"dupid_s":"2009",
"title_pqth":["Unique Record #9"],
"_version_":1508241005541851136,
"indexTime_dt":"2015-07-31T19:25:09.94Z"},
  {
"storeid_s":"1007",
"dupid_s":"900", ***AcaColl_shard4_replica2***
"title_pqth":["Dupe Record #7"],
"_version_":1508241005515636736,
"indexTime_dt":"2015-07-31T19:25:09.91Z"},
  {
"storeid_s":"8016",
"dupid_s":"2001",
"title_pqth":["Unique Record #1"],
"_version_":1508241005526122496,
"indexTime_dt":"2015-07-31T19:25:09.91Z"},
  {
"storeid_s":"8019",
"dupid_s":"2004",
"title_pqth":["Unique Record #4"],
"_version_":1508241005528219648,
"indexTime_dt":"2015-07-31T19:25:09.91Z"},
  {
"storeid_s":"1003",
"dupid_s":"900", ***AcaColl_shard1_replica1***
"title_pqth":["Dupe Record #3"],
"_version_":1508241005515636736,
"indexTime_dt":"2015-07-31T19:25:09.917Z"},
  {
"storeid_s":"8017",
"dupid_s":"2002",
"title_pqth":["Unique Record #2"],
"_version_":1508241005518782464,
"indexTime_dt":"2015-07-31T19:25:09.917Z"},
  {
"storeid_s":"8018",
"dupid_s":"2003",
"title_pqth":["Unique Record #3"],
"_version_":1508241005519831040,
"indexTime_dt":"2015-07-31T19:25:09.917Z"},
  {
"storeid_s":"1001",
"dupid_s":"900", ***AcaColl_shard3_replica1***
"title_pqth":["Dupe Record #1"],
"_version_":1508241005511442432,
"indexTime_dt":"201

RE: Collapsing Query Parser returns one record per shard...was not expecting this...

2015-08-04 Thread Peter Lee
Joel,

Thank you for that  information.

I had not heard of composite ID routing, and found a post (by you) on the 
feature that was most instructive 
(https://lucidworks.com/blog/solr-cloud-document-routing/).

Thanks for clearing up the behavior of the collapsing query parser. Sadly, I 
doubt co-locating the records is going to be possible for us. The dupe field we 
use to eliminate duplicates at search time has the property that it can CHANGE 
over time in a way that is not really predictable as it changes based upon the 
other content in the system. If we went that route we'd have to put into place 
a complex mechanism to move/relocate records after they've been indexed...and I 
don't think that is going to be the solution for us.

On another note, I've been away from Solr since version 4.2 and am now
returning to version 5.2.1. Back in the day, grouping gave us a HORRIBLE
performance hit, even after we spent a lot of time trying to tune the system
for it. It appears from what we are seeing in testing that the
grouping performance has been greatly improved. I know it is and always will be
a computationally intensive task, but it is good news that it has seen such
performance improvements.

Again, thanks for the information and for the heads up regarding composite id 
routing. I'll have to take a closer look at that feature and see if we can take 
advantage of it in some way.
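
For reference, a sketch of what composite ID routing looks like for the
duplicate-detection case above (field names taken from the earlier example,
document IDs assumed): with the compositeId router, prefixing the unique ID
with the dupe signature and a '!' makes every document sharing that prefix
hash to the same shard, so the collapse can happen entirely shard-locally:

{ "id" : "900!1002", "dupid_s" : "900", "title_pqth" : "Dupe Record #2" }
{ "id" : "900!1007", "dupid_s" : "900", "title_pqth" : "Dupe Record #7" }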

Thank you.

-Original Message-
From: Joel Bernstein [mailto:joels...@gmail.com] 
Sent: Monday, August 03, 2015 10:51 PM
To: solr-user@lucene.apache.org
Subject: Re: Collapsing Query Parser returns one record per shard...was not 
expecting this...

One of things to keep in mind with Grouping is that if you are relying on an 
accurate group count (ngroups) then you will also have to collocate documents 
based on the grouping field.

The main advantage to the Collapsing qparser plugin is it provides fast field 
collapsing on high cardinality fields with an accurate group count.

If you don't need ngroups, then Grouping is usually just as fast if not faster.


Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Aug 3, 2015 at 10:14 PM, Joel Bernstein  wrote:

> Your findings are the expected behavior for the Collapsing qparser. 
> The Collapsing qparser requires records in the same collapsed field to 
> be located on the same shard. The typical approach for this is to use 
> composite Id routing to ensure that documents with the same collapse 
> field land on the same shard.
>
> We should make this clear in the documentation.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, Aug 3, 2015 at 4:20 PM, Peter Lee  wrote:
>
>> From my reading of the solr docs (e.g.
>> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+
>> Results and 
>> https://cwiki.apache.org/confluence/display/solr/Result+Grouping),
>> I've been under the impression that these two methods (result 
>> grouping and collapsing query parser) can both be used to eliminate 
>> duplicates from a result set (in our case, we have a duplication 
>> field that contains a 'signature' that identifies duplicates. We use 
>> our own signature for a variety of reasons that are tied to complex business 
>> requirements.).
>>
>> In a test environment I scattered 15 duplicate records (with another 
>> 10 unique records) across a test system running Solr Cloud (Solr 
>> version
>> 5.2.1) that had 4 shards and a replication factor of 2. I tried both 
>> result grouping and the collapsing query parser to remove duplicates. 
>> The result grouping worked as expected...the collapsing query parser did not.
>>
>> My results in using the collapsing query parser showed that Solr was 
>> in fact including into the result set one of the duplicate records 
>> from each shard (that is, I received FOUR duplicate records...and 
>> turning on debug showed that each of the four records came from a  
>> unique shard)...when I was expecting solr to do the collapsing on the 
>> aggregated result and return only ONE of the duplicated records 
>> across ALL shards. It appears that solr is performing the collapsing 
>> query parsing on each individual shard, but then NOT performing the 
>> operation on the aggregated results from each shard.
>>
>> I have searched through the forums and checked the documentation as 
>> carefully as I can. I find no documentation or mention of this effect 
>> (one record being returned per shard) when using collapsing query parsing.
>>
>> Is this a known behavior? Am I just doing something wrong? Am I 
>> missing some search parameter? Am I simply not understanding 
>> correctly how this is supposed to work?
>>
>> For refe

Grouping facets: Possible to get facet results for each Group?

2015-10-11 Thread Peter Sturge
Hello Solr Forum,

Been trying to coerce Group faceting to give some faceting back for each
group, but maybe this use case isn't catered for in Grouping? :

So the Use Case is this:
Let's say I do a grouped search that returns say, 9 distinct groups, and in
these groups are various numbers of unique field values that need faceting
- but the faceting needs to be within each group:


user:*&facet=true&facet.field=user&group=true&group.field=host&group.facet=true

This query gives back grouped facets for each 'host' value (i.e. the facet
counts are 'collapsed') - but the facet counts (unique values of 'user'
field) are aggregated for all the groups, not on a 'per-group' basis (i.e.
returned as 'global facets' - outside of the grouped results).
The results from the query above don't say which unique values for
'users' are in which group. If the number of doc hits is very large (can
easily be in the 100's of thousands) it's not practical to iterate through
the docs looking for unique values.
This Use Case necessitates the unique values within each group, rather than
the total doc hits.

Is this possible with grouping, or in conjunction with another module?

Many thanks,
+Peter


Fwd: Grouping facets: Possible to get facet results for each Group?

2015-10-12 Thread Peter Sturge
Hello Solr Forum,

Been trying to coerce Group faceting to give some faceting back for each
group, but maybe this use case isn't catered for in Grouping? :

So the Use Case is this:
Let's say I do a grouped search that returns say, 9 distinct groups, and in
these groups are various numbers of unique field values that need faceting
- but the faceting needs to be within each group:


user:*&facet=true&facet.field=user&group=true&group.field=host&group.facet=true

This query gives back grouped facets for each 'host' value (i.e. the facet
counts are 'collapsed') - but the facet counts (unique values of 'user'
field) are aggregated for all the groups, not on a 'per-group' basis (i.e.
returned as 'global facets' - outside of the grouped results).
The results from the query above don't say which unique values for
'users' are in which group. If the number of doc hits is very large (can
easily be in the 100's of thousands) it's not practical to iterate through
the docs looking for unique values.
This Use Case necessitates the unique values within each group, rather than
the total doc hits.

Is this possible with grouping, or in conjunction with another module?

Many thanks,
+Peter


Re: Grouping facets: Possible to get facet results for each Group?

2015-10-13 Thread Peter Sturge
Hi,
Thanks for your response.
I did have a look at pivots, and they could work in a way. We're still on
Solr 4.3, so I'll have to wait for sub-facets - but they sure look pretty
cool!
Peter


On Tue, Oct 13, 2015 at 12:30 PM, Alessandro Benedetti <
benedetti.ale...@gmail.com> wrote:

> Can you model your business domain with Solr nested Docs ? In the case you
> can use Yonik article about nested facets.
>
> Cheers
>
> On 13 October 2015 at 05:05, Alexandre Rafalovitch 
> wrote:
>
> > Could you use the new nested facets syntax?
> > http://yonik.com/solr-subfacets/
> >
> > Regards,
> >Alex.
> > 
> > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > http://www.solr-start.com/
> >
> > On 11 October 2015 at 09:51, Peter Sturge 
> wrote:
> > > Been trying to coerce Group faceting to give some faceting back for
> each
> > > group, but maybe this use case isn't catered for in Grouping? :
> > >
> > > So the Use Case is this:
> > > Let's say I do a grouped search that returns say, 9 distinct groups,
> and
> > in
> > > these groups are various numbers of unique field values that need
> > faceting
> > > - but the faceting needs to be within each group:
> >
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card - http://about.me/alessandro_benedetti
> Blog - http://alexbenedetti.blogspot.co.uk
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>


Re: Grouping facets: Possible to get facet results for each Group?

2015-10-14 Thread Peter Sturge
Yes, you are right about that - I've used pivots before and they do need to
be used judiciously.
Fortunately, we only ever use single-value fields, as it gives some good
advantages in a heavily sharded environment.
Our document structure is, by its very nature, always flat, so it could be
an impediment to nested facets, but I don't know enough about them to know
for sure.
Thanks,
Peter


On Wed, Oct 14, 2015 at 9:44 AM, Alessandro Benedetti <
benedetti.ale...@gmail.com> wrote:

> mmm let's say that nested facets are a subset of Pivot Facets.
> if pivot faceting works with the classic flat document structure, the sub
> facet are working with any nested structure.
> So be careful about pivot faceting in a flat document with multi valued
> fields, because you lose the relation across the different fields value.
>
> Cheers
>
> On 13 October 2015 at 18:06, Peter Sturge  wrote:
>
> > Hi,
> > Thanks for your response.
> > I did have a look at pivots, and they could work in a way. We're still on
> > Solr 4.3, so I'll have to wait for sub-facets - but they sure look pretty
> > cool!
> > Peter
> >
> >
> > On Tue, Oct 13, 2015 at 12:30 PM, Alessandro Benedetti <
> > benedetti.ale...@gmail.com> wrote:
> >
> > > Can you model your business domain with Solr nested Docs ? In the case
> > you
> > > can use Yonik article about nested facets.
> > >
> > > Cheers
> > >
> > > On 13 October 2015 at 05:05, Alexandre Rafalovitch  >
> > > wrote:
> > >
> > > > Could you use the new nested facets syntax?
> > > > http://yonik.com/solr-subfacets/
> > > >
> > > > Regards,
> > > >Alex.
> > > > 
> > > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > > > http://www.solr-start.com/
> > > >
> > > > On 11 October 2015 at 09:51, Peter Sturge 
> > > wrote:
> > > > > Been trying to coerce Group faceting to give some faceting back for
> > > each
> > > > > group, but maybe this use case isn't catered for in Grouping? :
> > > > >
> > > > > So the Use Case is this:
> > > > > Let's say I do a grouped search that returns say, 9 distinct
> groups,
> > > and
> > > > in
> > > > > these groups are various numbers of unique field values that need
> > > > faceting
> > > > > - but the faceting needs to be within each group:
> > > >
> > >
> > >
> > >
> > > --
> > > --
> > >
> > > Benedetti Alessandro
> > > Visiting card - http://about.me/alessandro_benedetti
> > > Blog - http://alexbenedetti.blogspot.co.uk
> > >
> > > "Tyger, tyger burning bright
> > > In the forests of the night,
> > > What immortal hand or eye
> > > Could frame thy fearful symmetry?"
> > >
> > > William Blake - Songs of Experience -1794 England
> > >
> >
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card - http://about.me/alessandro_benedetti
> Blog - http://alexbenedetti.blogspot.co.uk
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>


Re: Grouping facets: Possible to get facet results for each Group?

2015-10-15 Thread Peter Sturge
Great - can't wait to try this out! Many thanks for your help on pointing
me towards this new faceting feature.
Thanks,
Peter
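
For the use case from earlier in this thread (counting unique 'user' values
inside each 'host' group), a minimal JSON Facet API sketch along the lines of
those articles might look like this (the collection name and exact field names
are assumptions):

curl http://localhost:8983/solr/mycollection/query -d 'q=user:*&rows=0&json.facet={
  hosts : {
    type  : terms,
    field : host,
    facet : {
      users : { type : terms, field : user }
    }
  }
}'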


On Thu, Oct 15, 2015 at 10:04 AM, Alessandro Benedetti <
benedetti.ale...@gmail.com> wrote:

> It will not be an impediment, if you have a flat document with single
> valued field interested, you can use Pivot Facets and apply stats over the
> facets as well.
> Take a look to the modern Json faceting approach Yonik introduced.
> Since I start using it I strongly recommend it, it's amazingly clear to
> define your faceting structure, store it in a file in Json and use it at
> query time !
>
> I am a strong supporter of this approach, it is young but already powerful.
> Pretty sure it will help you.
>
> Cheers
>
> [1] http://yonik.com/json-facet-api/
> [2] http://yonik.com/solr-facet-functions/
> [3] http://yonik.com/solr-subfacets/
>
> On 14 October 2015 at 22:12, Peter Sturge  wrote:
>
> > Yes, you are right about that - I've used pivots before and they do need
> to
> > be used judiciously.
> > Fortunately, we only ever use single-value fields, as it gives some good
> > advantages in a heavily sharded environment.
> > Our document structure is, by it's very nature always flat, so it could
> be
> > an impediment to nested facets, but I don't know enough about them to
> know
> > for sure.
> > Thanks,
> > Peter
> >
> >
> > On Wed, Oct 14, 2015 at 9:44 AM, Alessandro Benedetti <
> > benedetti.ale...@gmail.com> wrote:
> >
> > > mmm let's say that nested facets are a subset of Pivot Facets.
> > > if pivot faceting works with the classic flat document structure, the
> sub
> > > facet are working with any nested structure.
> > > So be careful about pivot faceting in a flat document with multi valued
> > > fields, because you lose the relation across the different fields
> value.
> > >
> > > Cheers
> > >
> > > On 13 October 2015 at 18:06, Peter Sturge 
> > wrote:
> > >
> > > > Hi,
> > > > Thanks for your response.
> > > > I did have a look at pivots, and they could work in a way. We're
> still
> > on
> > > > Solr 4.3, so I'll have to wait for sub-facets - but they sure look
> > pretty
> > > > cool!
> > > > Peter
> > > >
> > > >
> > > > On Tue, Oct 13, 2015 at 12:30 PM, Alessandro Benedetti <
> > > > benedetti.ale...@gmail.com> wrote:
> > > >
> > > > > Can you model your business domain with Solr nested Docs ? In the
> > case
> > > > you
> > > > > can use Yonik article about nested facets.
> > > > >
> > > > > Cheers
> > > > >
> > > > > On 13 October 2015 at 05:05, Alexandre Rafalovitch <
> > arafa...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Could you use the new nested facets syntax?
> > > > > > http://yonik.com/solr-subfacets/
> > > > > >
> > > > > > Regards,
> > > > > >Alex.
> > > > > > 
> > > > > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > > > > > http://www.solr-start.com/
> > > > > >
> > > > > > On 11 October 2015 at 09:51, Peter Sturge <
> peter.stu...@gmail.com>
> > > > > wrote:
> > > > > > > Been trying to coerce Group faceting to give some faceting back
> > for
> > > > > each
> > > > > > > group, but maybe this use case isn't catered for in Grouping? :
> > > > > > >
> > > > > > > So the Use Case is this:
> > > > > > > Let's say I do a grouped search that returns say, 9 distinct
> > > groups,
> > > > > and
> > > > > > in
> > > > > > > these groups are various numbers of unique field values that
> need
> > > > > > faceting
> > > > > > > - but the faceting needs to be within each group:
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > --
> > > > >
> > > > > Benedetti Alessandro
> > > > > Visiting card - http://about.me/alessandro_benedetti
> > > > > Blog - http://alexbenedetti.blogspot.co.uk
> > > > >
> > > > > "Tyger, tyger burning bright
> > > > > In the forests of the night,
> > > > > What immortal hand or eye
> > > > > Could frame thy fearful symmetry?"
> > > > >
> > > > > William Blake - Songs of Experience -1794 England
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > --
> > >
> > > Benedetti Alessandro
> > > Visiting card - http://about.me/alessandro_benedetti
> > > Blog - http://alexbenedetti.blogspot.co.uk
> > >
> > > "Tyger, tyger burning bright
> > > In the forests of the night,
> > > What immortal hand or eye
> > > Could frame thy fearful symmetry?"
> > >
> > > William Blake - Songs of Experience -1794 England
> > >
> >
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card - http://about.me/alessandro_benedetti
> Blog - http://alexbenedetti.blogspot.co.uk
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>


Re: Solr High Availability

2015-12-15 Thread Peter Tan
Hi Jack, What happens when there is only one replica setup?

On Tue, Dec 15, 2015 at 9:32 PM, Jack Krupansky 
wrote:

> Solr Cloud provides HA when you configure at least two replicas for each
> shard and have at least 3 zookeepers. That's it. No deck or detail document
> is needed.
>
>
>
> -- Jack Krupansky
>
> On Tue, Dec 15, 2015 at 9:07 PM, 
> wrote:
>
> > Hi Team,
> >
> > Can you help me in understanding in achieving the Solr High Availability
> .
> >
> > Appreciate you have a detail document or Deck on more details.
> >
> > Thank you
> > Viswanath Bharathi
> > Accenture | Delivery Centres for Technology in India
> > CDC 2, Chennai, India
> > Mobile: +91 9886259010
> > www.accenture.com | www.avanade.com
> >
> >
>


File paths in Zookeeper managed config files

2015-06-10 Thread Peter Scholze

Hi all,

I'm using Zookeeper 3.4.6 in the context of SolrCloud 5. When uploading 
a config file containing the following, I get an "Invalid Path String" 
error.


words="/netapp/dokubase/seeval/dicts/stopwords/stopwords_de.txt" 
ignoreCase="true"/>


leads obviously to

Invalid path string 
\"/configs/newspapers//netapp/dokubase/seeval/dicts/stopwords/stopwords_de.txt\" 
caused by empty node name specified @20"


Is there any way to prevent Zookeeper from doing so?

Thanks in advance,
best regards

Peter



Streaming Expressions (/stream) StreamHandler java.lang.NullPointerException

2016-06-25 Thread Peter Sh
I've got an exception below running
curl --data-urlencode
'expr=search(EventsAndDCF,q="*:*",fl="AccessPath",sort="AccessPath
asc",qt="/export")' "http://localhost:8983/solr/EventsAndDCF/stream";
Solr responce:
{"result-set":{"docs":[
{"EXCEPTION":null,"EOF":true}]}}


My collection EventsAndDCF exists, and I can successfully run GET queries like:
http://localhost:8983/solr/EventsAndDCF/export?fl=AccessPath&q=*:*&sort=AccessPath
desc&wt=json

Solr version: 6.0.1. Single node



2016-06-25 21:15:44.147 ERROR (qtp1514322932-16) [   x:EventsAndDCF]
o.a.s.h.StreamHandler java.lang.NullPointerException
at
org.apache.solr.client.solrj.io.stream.expr.StreamExpressionParser.generateStreamExpression(StreamExpressionParser.java:46)
at
org.apache.solr.client.solrj.io.stream.expr.StreamExpressionParser.parse(StreamExpressionParser.java:37)
at
org.apache.solr.client.solrj.io.stream.expr.StreamFactory.constructStream(StreamFactory.java:178)
at
org.apache.solr.handler.StreamHandler.handleRequestBody(StreamHandler.java:164)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:155)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2053)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:652)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:229)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:184)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:518)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)
at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)
at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
at java.lang.Thread.run(Unknown Source)

2016-06-25 21:15:44.147 INFO  (qtp1514322932-16) [   x:EventsAndDCF]
o.a.s.c.S.Request [EventsAndDCF]  webapp=/solr path=/stream
params={'expr=search(EventsAndDCF,q%3D*:*,fl%3DAccessPath,sort%3DAccessPath+asc,qt%3D/export)'}
status=0 QTime=2
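
One detail worth checking (an observation, not a confirmed fix): the logged
request params above contain the literal single quotes, i.e. the parameter
Solr saw was 'expr=search(...)' rather than expr, which would leave the expr
parameter unset and explain the NullPointerException in
StreamExpressionParser. If the shell is not stripping the single quotes
(Windows cmd, for example, does not), the same request can be sent with
double quotes instead:

curl --data-urlencode "expr=search(EventsAndDCF,q=\"*:*\",fl=\"AccessPath\",sort=\"AccessPath asc\",qt=\"/export\")" "http://localhost:8983/solr/EventsAndDCF/stream"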


DIH problem with multiple (types of) resources

2016-11-14 Thread Peter Blokland
hi,

I'm porting an old data-import configuration from 4.x to 6.3.0. A minimal config
is this:


  

  

  


  http://site/nl/${page.pid}"; format="text">

  



  


  




when I try to do a full import with this, I get :

2016-11-14 12:31:52.173 INFO  (Thread-68) [   x:meulboek] 
o.a.s.u.p.LogUpdateProcessorFactory [meulboek]  webapp=/solr path=/dataimport 
params={core=meulboek&optimize=false&indent=on&commit=true&clean=true&wt=json&command=full-import&_=1479122291861&verbose=true}
 status=0 QTime=11{deleteByQuery=*:* 
(-1550976769832517632),add=[ed99517c-ece9-40c6-9682-c9ec74173241 
(1550976769976172544), 9283532a-2395-43eb-bcb8-fd30c5ebfd08 
(1550976770348417024), 87b75d5c-a12a-4538-bc29-ceb13d6a9d1c 
(1550976770455371776), 476b5da3-3752-4867-bdb3-4264403c5c2d 
(1550976770787770368), 71cdaadb-62ba-4753-ad1b-01ba7fd75bfa 
(1550976770875850752), 02f41269-4a28-4001-aab9-7b1feb51e332 
(1550976770954493952), 6216ec48-2abd-465b-8d6b-60907c7f49db 
(1550976771047817216), 4317b308-dc88-47e1-9240-0d7d94646de6 
(1550976771136946176), 159ee092-2f72-45f6-970e-9dfd6d635bdf 
(1550976771221880832), bdfa48c4-23e2-483f-9b63-e0c5753d60a5 
(1550976771336175616)]} 0 1465
2016-11-14 12:31:52.173 ERROR (Thread-68) [   x:meulboek] 
o.a.s.h.d.DataImporter Full Import failed:java.lang.RuntimeException: 
java.lang.RuntimeException: 
org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in 
invoking url null Processing Document # 11
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:270)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:475)
at 
org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:458)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: 
org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in 
invoking url null Processing Document # 11
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:416)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
... 4 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: 
Exception in invoking url null Processing Document # 11
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
at 
org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:89)
at 
org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:38)
at 
org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
at 
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
... 6 more
Caused by: java.net.MalformedURLException: no protocol: nullselect edition from 
editions
at java.net.URL.(URL.java:593)
at java.net.URL.(URL.java:490)
at java.net.URL.(URL.java:439)
at 
org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:81)
... 12 more


Note that this failure occurs with the second entity, and judging from this
line:

Caused by: java.net.MalformedURLException: no protocol: nullselect edition from 
editions

It seems Solr tries to use the datasource named "web" (the BinURLDataSource)
instead of the configured "db" datasource (the JdbcDataSource). Am I doing
something wrong, or is this a bug?

-- 
CUL8R, Peter.

www.desk.nl

Your excuse is: Communist revolutionaries taking over the server room and 
demanding all the computers in the building or they shoot the sysadmin. Poor 
misguided fools.


Re: DIH problem with multiple (types of) resources

2016-11-15 Thread Peter Blokland
hi,

On Tue, Nov 15, 2016 at 02:54:49AM +1100, Alexandre Rafalovitch wrote:

>> 
>> 
 
> Attribute names are case sensitive as far as I remember. Try
> 'dataSource' for the second definition.

oh wow... that's sneaky. in the old version the case didn't seem to matter,
but now it certainly does. thx :)
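
For reference, since the XML of the original config did not survive in the
archive, a sketch of the working shape (entity names, queries and the Tika
processor are assumptions pieced together from the error messages above) is
simply the two dataSource definitions plus the camelCase dataSource attribute
on every entity:

<dataConfig>
  <dataSource name="db"  type="JdbcDataSource"   driver="..." url="..." user="..." password="..."/>
  <dataSource name="web" type="BinURLDataSource"/>
  <document>
    <entity name="page" dataSource="db" query="select pid, ... from pages">
      <entity name="content" dataSource="web" processor="TikaEntityProcessor"
              url="http://site/nl/${page.pid}" format="text"/>
    </entity>
    <entity name="edition" dataSource="db" query="select edition from editions"/>
  </document>
</dataConfig>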

-- 
CUL8R, Peter.

www.desk.nl

Your excuse is: It is a layer 8 problem


Re: Multiple rollups/facets in one streaming aggregation?

2017-07-30 Thread Peter Shmukler
I need to improve the user experience of facet calculation.
Let's assume we've got time-partitioned collections:
Partition1, Partition2, Partition3 .....
AliasAllPartitions unifies all the partitions together.
Running facets on AliasAllPartitions is a very heavy synchronous operation,
and the user has to wait a long time for the first result.

My suggestion is to run partition after partition and return partial results
at certain points.
This can be relevant for any aggregate, faceting, or count-distinct function.
I actually only need an estimate of the facets, so I can use "Count-Min Sketch"
and HLL in order to keep memory consumption reasonable.
The interface could look like this:
CMSFacet(
list(
search(partition1,q=*:*,fl="author,name,price",qt="/export",sort="name
asc"),
search(partition2,q=*:*,fl="author,name,price",qt="/export",sort="name
asc"),
search(partition3,q=*:*,fl="author,name,price",qt="/export",sort="name
asc"),
search(partition4,q=*:*,fl="author,name,price",qt="/export",sort="name
asc"),
search(partition5,q=*:*,fl="author,name,price",qt="/export",sort="name asc")
),
bucketSizeLimit=150, sizeLimit=400,sum(price),min(price), CMScount(name)
)

Expected output:
{
  "result-set": {
"docs": [
  {
"min(price)": "215464",
"sum(price)": "23545846",
"CMScount(name)": {“rows”:149,”facet”:[{“A Clash of Kings28”:4},{“A
Clash of Kings16”:4},{“A Clash of Kings27”:4},{“A Clash of Kings15”:4},{“A
Clash of Kings26”:4},{“A Clash of Kings14”:4},{“A Clash of Kings25”:4},{“A
Clash of Kings19”:4},{“A Clash of Kings18”:4},{“A Clash of Kings29”:4},{“A
Game of Thrones18”:6},{“A Clash of Kings20”:4},{“A Clash of Kings13”:4},{“A
Clash of Kings24”:4},{“A Clash of Kings12”:4},{“A Clash of Kings23”:4},{“A
Clash of Kings22”:4},{“A Clash of Kings10”:4},{“A Clash of Kings21”:4},{“A
Clash of Kings5”:4},]}
  },
  {
"min(price)": "655464",
"sum(price)": "3584684646846",
"CMScount(name)": {“rows”:299,”facet”:[{“A Storm of Swords18”:8},{“A
Game of Thrones18”:8},{“A Game of Thrones28”:7},{“A Game of
Thrones27”:7},{“A Game of Thrones24”:5},{“A Game of Thrones3”:11},{“A Game
of Thrones4”:10},{“A Game of Thrones6”:8},{“A Storm of Swords20”:7},{“A Game
of Thrones8”:6},{“A Game of Thrones9”:7},{“A Storm of Swords11”:8},{“A Storm
of Swords22”:8},{“A Storm of Swords21”:10},{“A Storm of Swords13”:8},{“A
Storm of Swords24”:8},{“A Storm of Swords23”:13},{“A Storm of
Swords15”:7},{“A Storm of Swords26”:8},{“A Storm of Swords27”:7},]}
  },
  {
"min(price)": -214.87158,
"sum(price)": -40523.873622472,
"CMScount(name)": {“rows”:399,”facet”:[{“A Storm of
Swords18”:12},{“A Game of Thrones18”:12},{“A Game of Thrones28”:11},{“A Game
of Thrones27”:11},{“A Game of Thrones24”:15},{“A Game of Thrones3”:11},{“A
Game of Thrones4”:10},{“A Game of Thrones6”:12},{“A Storm of
Swords20”:7},{“A Game of Thrones8”:6},{“A Game of Thrones9”:7},{“A Storm of
Swords11”:12},{“A Storm of Swords22”:8},{“A Storm of Swords21”:10},{“A Storm
of Swords13”:12},{“A Storm of Swords24”:12},{“A Storm of Swords23”:13},{“A
Storm of Swords15”:7},{“A Storm of Swords26”:12},{“A Storm of
Swords27”:11},]}
  },
  {
"EOF": true,
"RESPONSE_TIME": 4381
  }
]
  }
}
I wrote some prototype for this functionality on base of Solr 7. 
I implemented class CMSFacetStream extends TupleStream implements
Expressible and class CMSMetric extends Metric.

My current issues:
-   I return result tuples as soon as I reach bucketSizeLimit, but I don't
see the partial results in the response.
-   How can I return a JSON object from the Metric class?
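
For comparison, the exact (non-approximate) version of the aggregation above
can be expressed with the built-in merge and rollup streams (collection and
field names taken from the example; this gives exact results but not the
partial/streamed behaviour the prototype is after):

rollup(
  merge(
    search(partition1, q="*:*", fl="name,price", qt="/export", sort="name asc"),
    search(partition2, q="*:*", fl="name,price", qt="/export", sort="name asc"),
    on="name asc"
  ),
  over="name",
  sum(price), min(price), count(*)
)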




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-rollups-facets-in-one-streaming-aggregation-tp4291952p4348260.html
Sent from the Solr - User mailing list archive at Nabble.com.


generate field name in query

2017-08-02 Thread Peter Kirk
Hi - is it possible to create a query (or fq) which generates the field to 
search on, based on whether or not the document has that field?

Eg. Search for documents with prices in the range 100 - 200, using either the 
field "price_owner_float" or "price_customer_float" (if a document has a field 
"price_owner_float" then use that, otherwise use the field 
"price_customer_float").

This gives a syntax error:
fq=if(exists(price_owner_float),price_owner_float,price_customer_float):[100 TO 
200]
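
One approach that may work here (a sketch, not verified against the schema in
question) is to move the whole expression into a function range query, since
{!frange} evaluates a function per document rather than a field name, keeping
documents whose computed value falls between l and u:

fq={!frange l=100 u=200}if(exists(price_owner_float),price_owner_float,price_customer_float)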

Thanks,
Peter




Re: MongoDb vs Solr

2017-08-05 Thread Peter Sturge
*And insults are not something I'd like to see in this mailing list, at all*
+1
Everyone is entitled to their opinion..

Solr can and does work extremely well as a database - it depends on your db
requirements.
For distributed/replicated search via REST API that is read heavy, Solr is
a great choice.

If you need joins or stored procedure-like functionality, don't choose any
of the mentioned ones - stick with SQL.

Security-wise, Solr is pretty much like all db access tools - you will need
a robust front-end to keep your data secure.
It's just that with an easy-to-use API like Solr, it's easier to
accidentally 'let it run free'. If you're using Solr for db rather than
search, you will need a secure front-end.

Joy and good will to all, regardless of what tool you choose!

Peter


On Sat, Aug 5, 2017 at 5:08 PM, Walter Underwood 
wrote:

> I read the seven year old slides just now. The Guardian was using Solr to
> deliver the content. Their repository (see slide 38) is an RDBMS.
>
> https://www.slideshare.net/matwall/no-sql-at-the-guardian
>
> In slide 37, part of “Is Solr a database?”, they note “Search index not
> really persistence”. To me, that means “not a database”.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Aug 5, 2017, at 4:59 AM, Dave  wrote:
> >
> > And to add to the conversation, 7 year old blog posts are not a reason
> to make decisions for your tech stack.
> >
> > And insults are not something I'd like to see in this mailing list, at
> all, so please do not repeat any such disrespect or condescending
> statements in your contributions to the mailing list that's supposed to
> serve as a source of help, which, you asked for.
> >
> >> On Aug 5, 2017, at 7:54 AM, Dave  wrote:
> >>
> >> Also I wouldn't really recommend mongodb at all, it should only to be
> used as a fast front end to an acid compliant relational db same with
> memcahed for example. If you're going to stick to open source, as I do, you
> should use the correct tool for the job.
> >>
> >>> On Aug 5, 2017, at 7:32 AM, GW  wrote:
> >>>
> >>> Insults for Walter only.. sorry..
> >>>
> >>>> On 5 August 2017 at 06:28, GW  wrote:
> >>>>
> >>>> For The Guardian, Solr is the new database | Lucidworks
> >>>> <https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&;
> cd=2&cad=rja&uact=8&ved=0ahUKEwiR1rn6_b_VAhVB7IMKHWGKBj4QFgguMAE&url=
> https%3A%2F%2Flucidworks.com%2F2010%2F04%2F29%2Ffor-the-
> guardian-solr-is-the-new-database%2F&usg=AFQjCNE6CwwFRMvNhgzvEZu-Sryu_
> vtL8A>
> >>>> https://lucidworks.com/2010/04/29/for-the-guardian-solr-
> >>>> is-the-new-database/
> >>>> Apr 29, 2010 - For The Guardian, *Solr* is the new *database*. I
> blogged
> >>>> a few days ago about how open search source is disrupting the
> relationship
> >>>> between ...
> >>>>
> >>>> You are arrogant and probably lame as a programmer.
> >>>>
> >>>> All offense intended
> >>>>
> >>>>> On 5 August 2017 at 06:23, GW  wrote:
> >>>>>
> >>>>> Watch their videos
> >>>>>
> >>>>> On 4 August 2017 at 23:26, Walter Underwood 
> >>>>> wrote:
> >>>>>
> >>>>>> MarkLogic can do many-to-many. I worked there six years ago. They
> use
> >>>>>> search engine index structure with generational updates, including
> segment
> >>>>>> level caches. With locking. Pretty good stuff.
> >>>>>>
> >>>>>> A many to many relationship is an intersection across posting lists,
> >>>>>> with transactions. Straightforward, but not easy to do it fast.
> >>>>>>
> >>>>>> The “Inside MarkLogic Server” paper does a good job of explaining
> the
> >>>>>> guts.
> >>>>>>
> >>>>>> Now, back to our regularly scheduled Solr presentations.
> >>>>>>
> >>>>>> wunder
> >>>>>> Walter Underwood
> >>>>>> wun...@wunderwood.org
> >>>>>> http://observer.wunderwood.org/  (my blog)
> >>>>>>
> >>>>>>
> >>>>>>> On Aug 4, 2017, at 8:13 PM, David Hastings 
> >>>>>> wrote:
> >>>>>>>
>

RE: SolrCloud - leader updates not updating followers

2017-08-08 Thread Peter Lancaster
Hi Erik,

Thanks for your quick reply. It's given me a few things to research and work on 
tomorrow.

In the meantime, in case it triggers any other thoughts,  just to say that our 
AutoCommit settings are

180
30


1


When I ingest data I don't see data appearing on the follower in the log.

It really seems like the data isn't being sent from the leader. As I said it 
could easily be something stupid that I've done along the way but I can't see 
what it is.

Thanks again,
Peter.

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 08 August 2017 18:23
To: solr-user 
Subject: Re: SolrCloud - leader updates not updating followers

This _better_ be a problem with your configuration or all my assumptions are 
false ;)

What are you autocommit settings? The documents should be forwarded to each 
replica from the leader during ingestion. However, they are not visible on the 
follower until a hard commit(openSearcher=true) or soft commit is triggered. 
Long blog on all this here:

https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
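
For illustration, a typical pattern in solrconfig.xml (inside <updateHandler>) that
makes updates visible on every replica within roughly 30 seconds looks like the
following - the numbers are example values, not a recommendation for this index:

    <autoCommit>
      <maxTime>600000</maxTime>           <!-- hard commit: flush/fsync for durability -->
      <openSearcher>false</openSearcher>  <!-- don't open a new searcher on hard commit -->
    </autoCommit>
    <autoSoftCommit>
      <maxTime>30000</maxTime>            <!-- soft commit: controls visibility -->
    </autoSoftCommit>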

An easy way to check that docs are being sent to the follower is to tail 
solr.log and send a doc to the leader. You should see the doc arrive at the 
follower. If you do see that then it's your autocommit settings.

NOTE: Solr promises eventual consistency. Due to the fact that autocommits will 
execute at different wall-clock times on the leaders and followers you can be 
out of sync by up to your autocommit interval.

You can force a commit by a URL like
"http://solr:port/solr/collection/update?commit=true"; that will commit on all 
replicas in a collection that may also be useful for seeing if it's an 
autocommit issue.

Best,
Erick

On Tue, Aug 8, 2017 at 9:49 AM, lancasp22  
wrote:
> Hi,
>
> I've recently created a solr cloud on solr 5.5.2 with a separate
> zookeeper cluster. I write to the cloud by posting to update/json and
> the documents appear fine in the leader.
>
> The problem I have is that new documents added to the cloud aren't
> then being automatically applied to the followers of each leader. If I
> query a core then I can get different counts depending on which
> version of the core the query ran over and the solr admin statistics
> page confirms that the followers have fewer documents and are behind the 
> leader.
>
> If I restart the solr core for a follower, it does recover quickly and
> brings itself up-to-date with the leader. Looking at the logs for the
> follower you see that on re-start it identifies the leader and gets
> the changes from that leader.
>
> I think when documents are added to the leader these should be pushed
> to the followers so maybe it's this push process that isn't being triggered.
>
> It’s quite possible that I’ve made a simple error in the set-up but
> I’m baffled as to what it is. Please can anyone advise on any
> configuration that I need to check that might be causing these symptoms.
>
> Thanks in anticipation,
> Peter Lancaster.
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SolrCloud-leader-updates-not-updati
> ng-followers-tp4349618.html Sent from the Solr - User mailing list
> archive at Nabble.com.




RE: Solr 6 and IDF

2017-08-08 Thread Peter Lancaster
Hi Webster,

If you're not worried about using the BM25 similarity then you should just be able to 
continue as you were before: provide your own similarity class that extends 
ClassicSimilarity and overrides the idf method to always return 1, then 
reference that in your schema,
e.g.
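A minimal sketch of such a class (assuming Solr 6.x / Lucene 6 APIs; the package and
class names here are hypothetical):

    package com.example.search;

    import org.apache.lucene.search.similarities.ClassicSimilarity;

    // Classic TF-IDF scoring with the IDF term neutralised.
    public class NoIdfSimilarity extends ClassicSimilarity {
      @Override
      public float idf(long docFreq, long docCount) {
        return 1.0f; // every term contributes equally, regardless of rarity
      }
    }

which can then be referenced globally or inside a particular fieldType, e.g.
<similarity class="com.example.search.NoIdfSimilarity"/> (a bare Similarity only
needs a no-arg constructor; otherwise wrap it in a SimilarityFactory).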


As far as I know you've been able to have different similarities per field in 
solr for a while now. https://wiki.apache.org/solr/SchemaXml#Similarity

Cheers,
Peter Lancaster.


-Original Message-
From: Webster Homer [mailto:webster.ho...@sial.com]
Sent: 08 August 2017 20:39
To: solr-user@lucene.apache.org
Subject: Solr 6 and IDF

Our most common use for Solr is searching for products, not text search. My 
company is in the process of migrating away from an Endeca search engine, and the 
goal, to keep the business happy, is to make sure that search results from the 
different engines are fairly similar. One area that we have found suppresses a 
result from being as good as it was in the old system is the idf.

We are using Solr 6. After moving to it, a lot of our results got better, but 
idf still seems to deaden some results. Given that our focus is product 
searching I really don't see a need for idf at all. Previous to Solr 6 you 
could suppress idf by providing a custom similarity class. Looking over the 
newer documentation a lot of things have improved, but I'm not sure I see a 
simple way to turn off idf in Solr 6's BM25 searcher.

How do I disable IDF in Solr 6?

We also do have needs for text searching so it would be nice if we could 
suppress IDF on a field or schema level

--






Building Solr index from AEM using an ELB

2017-08-09 Thread Wahlgren Peter
I am looking for lessons learned or problems seen when building a Solr index 
from AEM using a Solr cluster with content passing through an ELB.

Our configuration is AEM 6.1 indexing to a cluster of Solr servers running 
version 4.7.1. When building an index from a smaller data set of 4 million 
items, AEM sends the content in about 3 minutes and the index is built without 
issue. When building an index of 14 million items, AEM sends the content in 
about 9 minutes, and the Solr server error log records EofException errors. When 
a single Solr server is used and the ELB is bypassed, the index is built in about 
1.75 hours with no errors.

Thanks for your comments and suggestions.

Pete


if exists in an fq

2017-09-13 Thread Peter Kirk
Hi

I want to formulate an fq which filters on fields depending on what fields 
exist in each document.

For example:
if the document has a "field1" then use "field1:[1 TO 100]";
but if there is no "field1", then check if there is a "field2";
if there is a "field2" then use "field2:[1 TO 100];
but if there is no "field2", then use "field3:[1 TO 100].

Something like:
?q=*:*&fq=if(exists(field1),field1:[1 TO 100],if(exists(field2),field2:[1 TO 
100], field3:[1 TO 100]))


But this does not work.
Is it even possible?
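
One approach that might work is to push the whole conditional into a function range
query, so the if/exists picks the value and frange applies the bounds (a sketch,
assuming field1/field2/field3 are single-valued numeric fields):

fq={!frange l=1 u=100}if(exists(field1),field1,if(exists(field2),field2,field3))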

Thanks,
Peter


RE: Phrase suggester - field limit and order

2017-11-09 Thread Peter Lancaster
Hi,

The weight field in combination with the BlenderType will determine the order, 
so yes you can control the order.

I don't think you can return only the matched phrase, but I would guess that 
highlighting would enable you to pick off the phrase that was matched in your 
client.

Cheers,
Peter.


-Original Message-
From: ruby [mailto:rshoss...@gmail.com]
Sent: 09 November 2017 19:29
To: solr-user@lucene.apache.org
Subject: Phrase suggester - field limit and order

I'm using the BlendedInfixLookupFactory to get phrase suggestions. It returns 
the entire field content. I've tried the others and they do the same.

  <str name="name">AnalyzingInfixSuggester</str>
  <str name="lookupImpl">BlendedInfixLookupFactory</str>
  <str name="dictionaryImpl">DocumentDictionaryFactory</str>
  <str name="field">title</str>
  <str name="weightField">price</str>
  <str name="suggestAnalyzerFieldType">text_en</str>


Is there a way to only return a fraction of the phrase containing the matched 
phrase? Also is there a way to control in which order the suggestions are 
returned?

Thanks



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html




RE: Search suggester - threshold parameter

2017-11-17 Thread Peter Lancaster
Hi Ruby,

The documentation says that threshold is available for the 
HighFrequencyDictionaryFactory implementation. Since you're using 
DocumentDictionaryFactory I guess it will be ignored.

Cheers,
Peter.

-Original Message-
From: ruby [mailto:rshoss...@gmail.com]
Sent: 17 November 2017 15:41
To: solr-user@lucene.apache.org
Subject: Search suggester - threshold parameter

Do any of the phrase suggesters in Solr 6.1 honor the threshold parameter?

I made the following changes to enable phrase suggestions in my environment.
I played with different threshold values, but it looks like the parameter is not 
being used.


  
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="indexPath">suggester_fuzzy_dir</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="suggestAnalyzerFieldType">suggestType</str>
    <str name="buildOnStartup">false</str>
    <str name="buildOnCommit">false</str>
    <float name="threshold">0.005</float>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
    <str name="suggest.dictionary">mySuggester</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html




Re: Java profiler?

2017-12-06 Thread Peter Sturge
Hi,
We've been using JProfiler (www.ej-technologies.com) for years now.
Without a doubt, it is the most comprehensive and useful profiler for Java.
Works very well, supports remote profiling and includes some very neat heap
walking/gc profiling.
Peter


On Tue, Dec 5, 2017 at 3:21 PM, Walter Underwood 
wrote:

> Anybody have a favorite profiler to use with Solr? I’ve been asked to look
> at why our queries are slow at a detailed level.
>
> Personally, I think they are slow because they are so long, up to 40 terms.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>


Query/Field Index Analysis correct but returns no docs in search

2017-02-04 Thread Peter Liu
Hi all:
   I was using Solr 3.6 and tried to solve a recall problem today, but
encountered a weird issue.

   There's a doc with field value: 均匀肤色 (just treat that word as an opaque symbol
if you don't know it; I just want to describe the problem as exactly as
possible).


   And below is the analysis result (tokenization), shown as text (the original
inline screenshot is omitted):

Index Analyzer
均匀肤色 均匀 匀肤 肤色
均匀肤色 均匀 匀肤 肤色
均匀肤色 均匀 匀肤 肤色
Query Analyzer
均匀肤色
均匀肤色
均匀肤色
均匀肤色


The tokenization result indicates that the query should recall/hit the doc.
But the doc did not appear in the results when I searched with "均匀肤色". I tried
to simplify the qf/bf/fq/q and tested with a single field and a single
document, to make sure it was not caused by other problems, but failed.


It's knotty to debug because it only reproduces in the production
environment; I tried the same config/index/query but could not reproduce it in
the dev environment. I'm asking for help here in case you have met a similar
problem; any clues or debugging methods would really help. 😶


Re: Configurable collectors for custom ranking

2014-03-07 Thread Peter Keegan
Hi Joel,

Although I solved this issue with a custom CollectorFactory, I also have a
solution that uses a PostFilter and and optional ValueSource.
Could you take a look at SOLR-5831 and see if I've got this right?

Thanks,
Peter
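
For readers following along, a rough sketch of the value-source half of the approach
described in the quoted thread below: the PostFilter stashes the current (scaled)
score in the request context, and this ValueSource hands it back to whatever sort
function asks for it. The class names and the "scaledScore" context key are
hypothetical, and the signatures assume the Solr/Lucene 4.x APIs in use at the time.

    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.queries.function.FunctionValues;
    import org.apache.lucene.queries.function.ValueSource;
    import org.apache.lucene.queries.function.docvalues.FloatDocValues;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.request.SolrRequestInfo;

    // Mutable holder the PostFilter updates just before delegating collect().
    class ScoreHolder {
      volatile float score;
    }

    // Returns whatever score the PostFilter last placed in the request context,
    // so a sort like sum(score(), field(x)) sees the scaled score.
    public class CurrentScoreValueSource extends ValueSource {

      @Override
      public FunctionValues getValues(Map context, AtomicReaderContext readerContext)
          throws IOException {
        SolrQueryRequest req = SolrRequestInfo.getRequestInfo().getReq();
        final ScoreHolder holder = (ScoreHolder) req.getContext().get("scaledScore");
        return new FloatDocValues(this) {
          @Override
          public float floatVal(int doc) {
            return holder == null ? 0f : holder.score;
          }
        };
      }

      @Override
      public boolean equals(Object o) {
        return o instanceof CurrentScoreValueSource;
      }

      @Override
      public int hashCode() {
        return CurrentScoreValueSource.class.hashCode();
      }

      @Override
      public String description() {
        return "currentScore()";
      }
    }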



On Mon, Dec 23, 2013 at 6:37 PM, Joel Bernstein  wrote:

> Peter,
>
> You actually only need the current score being collected to be in the
> request context. So you don't need a map, you just need an object wrapper
> around a mutable float.
>
> If you have a page size of X, only the top X scores need to be held onto,
> because all the other scores wouldn't have made it into that page anyway so
> they might as well be 0. Because the QueryResultCache caches's a larger
> window then the page size you should keep enough scores so the cached
> docList is correct. But if you're only dealing with 150K of results you
> could just keep all the scores in a FloatArrayList and not worry about the
> keeping the top X scores in a priority queue.
>
> During the collect hang onto the docIds and scores and build your scaling
> info.
>
> During the finish iterate your docIds and scale the scores as you go.
>
> Set your scaled score into the object wrapper that is in the request
> context before you collect each document.
>
> When you call collect on the delegate collectors they will call the custom
> value source for each document to perform the sort. Your custom value
> source will return whatever the float value is in the request context at
> that time.
>
> If you're also going to run this postfilter when you're doing a standard
> rank by score you'll also need to send down a dummy scorer to the delegate
> collectors. Spend some time with the CollapsingQParserPlugin in trunk to
> see how the dummy scorer works.
>
> I'll be adding value source collapse criteria to the
> CollapsingQParserPlugin this week and it will have a similar interaction
> between a PostFilter and value source. So you may want to watch SOLR-5536
> to see an example of this.
>
> Joel
>
>
>
>
>
>
>
>
>
>
>
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Mon, Dec 23, 2013 at 4:03 PM, Peter Keegan  >wrote:
>
> > Hi Joel,
> >
> > Could you clarify what would be in the key,value Map added to the
> > SearchRequest context? It seems that all the docId/score tuples need to
> be
> > there, including the ones not in the 'top N ScoreDocs' PriorityQueue
> > (score=0). If so would the Map be something like:
> > "scaled_scores",Map ?
> >
> > Also, what is the reason for passing score=0 for documents that aren't in
> > the PriorityQueue? Will these docs get filtered out before a normal sort
> by
> > score?
> >
> > Thanks,
> > Peter
> >
> >
> > On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein 
> > wrote:
> >
> > > The sorting is going to happen in the lower level collectors. You need
> a
> > > value source that returns the score of the document being collected.
> > >
> > > Here is how you can make this happen:
> > >
> > > 1) Create an object in your PostFilter that simply holds the current
> > score.
> > > Place this object in the SearchRequest context map. Update object.score
> > as
> > > you pass the docs and scores to the lower collectors.
> > >
> > > 2) Create a values source that checks the SearchRequest context for the
> > > object that's holding the current score. Use this object to return the
> > > current score when called. For example if you give the value source a
> > > handle called "score" a compound function call will look like this:
> > > sum(score(), field(x))
> > >
> > > Joel
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Dec 12, 2013 at 9:58 AM, Peter Keegan  > > >wrote:
> > >
> > > > Regarding my original goal, which is to perform a math function using
> > the
> > > > scaled score and a field value, and sort on the result, how does this
> > fit
> > > > in? Must I implement another custom PostFilter with a higher cost
> than
> > > the
> > > > scale PostFilter?
> > > >
> > > > Thanks,
> > > > Peter
> > > >
> > > >
> > > > On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan <
> peterlkee...@gmail.com
> > > > >wrote:
> > > >
> > > > > Thanks very much for the guid

Solr special characters like '(' and '&'?

2014-04-08 Thread Peter Kirk
Hi

How to search for Solr special characters like '(' and '&'?

I am trying to execute searches for "products" in my Solr (3.6.1) index, based 
on the "categories" to which these products belong.
The categories are stored in a multistring field for the products, and are 
hierarchical, and are fed to the index like:
A
A|B
A|B|C

So this product would actually belong to category named "C", which is a child 
of "B", which is a child of !"A".

I am able to execute queries for simple category names like this (eg. 
fq=categories_string:A|B|C).

But some categories have Solr special characters in their names, like: "D (E & 
F)"
(Real example: "Power supplies (Battery and Solar)").

A query like fq=categories_string:A|B|D (E & F) simply fails.
But even when I try 
fq=categories_string:A|B|D%20\(E%20%26amp%3B%20F\)
(where I try to escape the special characters), it does not find the products in 
this category, and actually finds other unrelated categories.

What am I doing wrong?

Thanks,
Peter



RE: Solr special characters like '(' and '&'?

2014-04-08 Thread Peter Kirk
Thanks for the comments, and for the idea for the term query parser.
This seems to work well (except I still can't get '&' in a category name to 
work - I can get the (one and only) customer to change the category names).

I'll look into fixing the indexing side of things - could be an idea to strip 
out the "special characters".
I'm working on the search side of things.
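
(One thing worth checking for the '&' case: when the filter is sent as a raw GET
parameter, the '&' itself must be URL-encoded or the container splits the parameter
at that point - a sketch, combined with the term query parser suggested below:

fq={!term f=categories_string}A|B|D (E %26 F)

At minimum the '&' must become %26, and spaces are best sent as %20 or +. When the
request is built with SolrJ's SolrQuery.addFilterQuery(...), the client handles this
encoding.)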

/Peter


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 8. april 2014 16:15
To: solr-user@lucene.apache.org; Ahmet Arslan
Subject: Re: Solr special characters like '(' and '&'?

I'd seriously consider filtering these characters out when you index and 
search, this is quite likely very brittle. The same item, say from two 
different vendors, might have D (E & F) or D E & F. If you just stripped all of 
the non alpha-num characters you'd likely get less brittle results.

You know your problem domain better than I do though, so whatever makes most 
sense.

Best,
Erick

On Tue, Apr 8, 2014 at 6:55 AM, Ahmet Arslan  wrote:
> Hi Peter,
>
> TermQueryParser is useful in your case.
> q={!term f=categories_string}A|B|D (E & F)
>
>
>
> On Tuesday, April 8, 2014 4:37 PM, Peter Kirk  wrote:
> Hi
>
> How to search for Solr special characters like '(' and '&'?
>
> I am trying to execute searches for "products" in my Solr (3.6.1) index, 
> based on the "categories" to which these products belong.
> The categories are stored in a multistring field for the products, and are 
> hierarchical, and are fed to the index like:
> A
> A|B
> A|B|C
>
> So this product would actually belong to category named "C", which is a child 
> of "B", which is a child of !"A".
>
> I am able to execute queries for simple category names like this (eg. 
> fq=categories_string:A|B|C).
>
> But some categories have Solr special characters in their names, like: "D (E 
> & F)"
> (Real example: "Power supplies (Battery and Solar)").
>
> A query like fq=categories_string:A|B|D (E & F) simply fails.
> But even if I try
> fq=categories_string:A|B|D%20\(E%20%26amp%3B%20F\)
> (where I try to escape the special characters) does not find the products in 
> this category, and actually finds other unrelated categories.
>
> What am I doing wrong?
>
> Thanks,
> Peter
>


Distributed commits in CloudSolrServer

2014-04-15 Thread Peter Keegan
I have a SolrCloud index, 1 shard, with a leader and one replica, and 3
ZKs. The Solr indexes are behind a load balancer. There is one
CloudSolrServer client updating the indexes. The index schema includes 3
ExternalFileFields. When the CloudSolrServer client issues a hard commit, I
observe that the commits occur sequentially, not in parallel, on the leader
and replica. The duration of each commit is about a minute. Most of this
time is spent reloading the 3 ExternalFileField files. Because of the
sequential commits, there is a period of time (1 minute+) when the index
searchers will return different results, which can cause a bad user
experience. This will get worse as replicas are added to handle
auto-scaling. The goal is to keep all replicas in sync w.r.t. the user
queries.

My questions:

1. Is there a reason that the distributed commits are done in sequence, not
in parallel? Is there a way to change this behavior?

2. If instead, the commits were done in parallel by a separate client via a
GET to each Solr instance, how would this client get the host/port values
for each Solr instance from zookeeper? Are there any downsides to doing
commits this way?
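
(For what it's worth, a rough sketch of the ZooKeeper part in SolrJ - assuming the
Solr 4.x SolrJ API and reusing the existing CloudSolrServer instance; the collection
name is an example:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.cloud.ClusterState;
    import org.apache.solr.common.cloud.Replica;
    import org.apache.solr.common.cloud.Slice;
    import org.apache.solr.common.cloud.ZkCoreNodeProps;
    import org.apache.solr.common.cloud.ZkStateReader;

    public class ReplicaUrls {
      // Collects the core URL of every replica in the collection's active slices.
      public static List<String> coreUrls(CloudSolrServer cloudServer, String collection) {
        cloudServer.connect();
        ZkStateReader reader = cloudServer.getZkStateReader();
        ClusterState clusterState = reader.getClusterState();
        List<String> urls = new ArrayList<String>();
        for (Slice slice : clusterState.getActiveSlices(collection)) {
          for (Replica replica : slice.getReplicas()) {
            urls.add(new ZkCoreNodeProps(replica).getCoreUrl());
          }
        }
        return urls;
      }
    }

Each URL could then be handed to an HttpSolrServer for an explicit per-instance
commit.)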

Thanks,
Peter


Re: Distributed commits in CloudSolrServer

2014-04-16 Thread Peter Keegan
Are distributed commits also done in parallel across shards?

Peter


On Tue, Apr 15, 2014 at 3:50 PM, Mark Miller  wrote:

> Inline responses below.
> --
> Mark Miller
> about.me/markrmiller
>
> On April 15, 2014 at 2:12:31 PM, Peter Keegan (peterlkee...@gmail.com)
> wrote:
>
> I have a SolrCloud index, 1 shard, with a leader and one replica, and 3
> ZKs. The Solr indexes are behind a load balancer. There is one
> CloudSolrServer client updating the indexes. The index schema includes 3
> ExternalFileFields. When the CloudSolrServer client issues a hard commit,
> I
> observe that the commits occur sequentially, not in parallel, on the
> leader
> and replica. The duration of each commit is about a minute. Most of this
> time is spent reloading the 3 ExternalFileField files. Because of the
> sequential commits, there is a period of time (1 minute+) when the index
> searchers will return different results, which can cause a bad user
> experience. This will get worse as replicas are added to handle
> auto-scaling. The goal is to keep all replicas in sync w.r.t. the user
> queries.
>
> My questions:
>
> 1. Is there a reason that the distributed commits are done in sequence,
> not
> in parallel? Is there a way to change this behavior?
>
>
> The reason is that updates are currently done this way - it’s the only
> safe way to do it without solving some more problems. I don’t think you can
> easily change this. I think we should probably file a JIRA issue to track a
> better solution for commit handling. I think there are some complications
> because of how commits can be added on update requests, but its something
> we probably want to try and solve before tackling *all* updates to replicas
> in parallel with the leader.
>
>
>
> 2. If instead, the commits were done in parallel by a separate client via
> a
> GET to each Solr instance, how would this client get the host/port values
> for each Solr instance from zookeeper? Are there any downsides to doing
> commits this way?
>
> Not really, other than the extra management.
>
>
>
>
>
> Thanks,
> Peter
>


Re: Distributed commits in CloudSolrServer

2014-04-16 Thread Peter Keegan
>Are distributed commits also done in parallel across shards?
I meant 'sequentially' across shards.


On Wed, Apr 16, 2014 at 9:08 AM, Peter Keegan wrote:

> Are distributed commits also done in parallel across shards?
>
> Peter
>
>
> On Tue, Apr 15, 2014 at 3:50 PM, Mark Miller wrote:
>
>> Inline responses below.
>> --
>> Mark Miller
>> about.me/markrmiller
>>
>> On April 15, 2014 at 2:12:31 PM, Peter Keegan (peterlkee...@gmail.com)
>> wrote:
>>
>> I have a SolrCloud index, 1 shard, with a leader and one replica, and 3
>> ZKs. The Solr indexes are behind a load balancer. There is one
>> CloudSolrServer client updating the indexes. The index schema includes 3
>> ExternalFileFields. When the CloudSolrServer client issues a hard commit,
>> I
>> observe that the commits occur sequentially, not in parallel, on the
>> leader
>> and replica. The duration of each commit is about a minute. Most of this
>> time is spent reloading the 3 ExternalFileField files. Because of the
>> sequential commits, there is a period of time (1 minute+) when the index
>> searchers will return different results, which can cause a bad user
>> experience. This will get worse as replicas are added to handle
>> auto-scaling. The goal is to keep all replicas in sync w.r.t. the user
>> queries.
>>
>> My questions:
>>
>> 1. Is there a reason that the distributed commits are done in sequence,
>> not
>> in parallel? Is there a way to change this behavior?
>>
>>
>> The reason is that updates are currently done this way - it’s the only
>> safe way to do it without solving some more problems. I don’t think you can
>> easily change this. I think we should probably file a JIRA issue to track a
>> better solution for commit handling. I think there are some complications
>> because of how commits can be added on update requests, but its something
>> we probably want to try and solve before tackling *all* updates to replicas
>> in parallel with the leader.
>>
>>
>>
>> 2. If instead, the commits were done in parallel by a separate client via
>> a
>> GET to each Solr instance, how would this client get the host/port values
>> for each Solr instance from zookeeper? Are there any downsides to doing
>> commits this way?
>>
>> Not really, other than the extra management.
>>
>>
>>
>>
>>
>> Thanks,
>> Peter
>>
>
>


DataImportHandler atomic updates

2014-05-16 Thread Peter Pišljar
Hello,

I am trying to import data from my DB into Solr.
In the DB I have two tables:
- orders [order_id, user_id, created_at, store]
- order_items [order_id, item_id] (one-to-many relation)

I would like to import this into my Solr collection as:
- orders [user_id (unique), item_id (multivalued)]

I select all the orders from my table and process them, and
I select all the items for each order and process them.

Now the problem is that, without atomic updates, each order for user id
'10' will overwrite its previous orders (instead of just adding the item_ids).

So I tried to use a ScriptTransformer to set the "add" parameter for an atomic
update of the item_id field.

It's still not working correctly:
1) I don't get all the items, but there are more than just those from the last
order ... let's say the last 5 orders.
2) The first few items are saved as add:ID, the others are OK.
(Let's say I have 20 items (1...20) for my user_id '10'; something
like this ends up in the index: [ add:14, add:15, 16, 17, 18, 19, 20 ]. Note
that items 1...13 are missing, and items 14 and 15 have "add:" in front
of them.)

Here is my full script: http://pastebin.com/6EnW8Heg

Please note that I simplified the description above a little bit, but
everything still applies.
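
(For what it's worth, a sketch of a DIH alternative that avoids atomic updates
entirely - a nested child entity collects the item_ids into a multivalued field in
a single pass; the SQL and entity names are illustrative and assume one Solr
document per user:

    <entity name="users"
            query="SELECT DISTINCT user_id FROM orders">
      <entity name="items"
              query="SELECT oi.item_id
                       FROM order_items oi
                       JOIN orders o ON o.order_id = oi.order_id
                      WHERE o.user_id = '${users.user_id}'">
        <field column="item_id" name="item_id"/>
      </entity>
    </entity>
)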

i would really appreciate any kind of help.
---


Autoscaling Solr instances in AWS

2014-05-20 Thread Peter Keegan
We are running Solr 4.6.1 in AWS:
- 2 Solr instances (1 shard, 1 leader, 1 replica)
- 1 CloudSolrServer SolrJ client updating the index.
- 3 Zookeepers

The Solr instances are behind a load balanceer and also in an auto scaling
group. The ScaleUpPolicy will add up to 9 additional instances (replicas),
1 per minute. Later, the 9 replicas are terminated with the ScaleDownPolicy.

Problem: during the ScaleUpPolicy, when the Solr Leader is under heavy
query load, the SolrJ indexing client issues a commit which hangs and never
returns. Note that the index schema contains 3 ExternalFileFields wich slow
down the commit process. Here's the stack trace:

Thread 1959: (state = IN_NATIVE)
 - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[],
int, int, int) @bci=0 (Compiled frame; information may be imprecise)
 - java.net.SocketInputStream.read(byte[], int, int, int) @bci=79, line=150
(Compiled frame)
 - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=121
(Compiled frame)
 - org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer() @bci=71,
line=166 (Compiled frame)
 - org.apache.http.impl.io.SocketInputBuffer.fillBuffer() @bci=1, line=90
(Compiled frame)
 -
org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(org.apache.http.util.CharArrayBuffer)
@bci=137, line=281 (Compiled frame)
 -
org.apache.http.impl.conn.LoggingSessionInputBuffer.readLine(org.apache.http.util.CharArrayBuffer)
@bci=5, line=115 (Compiled frame)
 -
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(org.apache.http.io.SessionInputBuffer)
@bci=16, line=92 (Compiled frame)
 -
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(org.apache.http.io.SessionInputBuffer)
@bci=2, line=62 (Compiled frame)
 - org.apache.http.impl.io.AbstractMessageParser.parse() @bci=38, line=254
(Compiled frame)
 -
org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader()
@bci=8, line=289 (Compiled frame)
 -
org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader()
@bci=1, line=252 (Compiled frame)
 -
org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader()
@bci=6, line=191 (Compiled frame)
 -
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(org.apache.http.HttpRequest,
org.apache.http.HttpClientConnection, org.apache.http.protocol.HttpContext)
@bci=62, line=300 (Compiled frame)
 -
org.apache.http.protocol.HttpRequestExecutor.execute(org.apache.http.HttpRequest,
org.apache.http.HttpClientConnection, org.apache.http.protocol.HttpContext)
@bci=60, line=127 (Compiled frame)
 -
org.apache.http.impl.client.DefaultRequestDirector.tryExecute(org.apache.http.impl.client.RoutedRequest,
org.apache.http.protocol.HttpContext) @bci=198, line=717 (Compiled frame)
 -
org.apache.http.impl.client.DefaultRequestDirector.execute(org.apache.http.HttpHost,
org.apache.http.HttpRequest, org.apache.http.protocol.HttpContext)
@bci=597, line=522 (Compiled frame)
 -
org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.HttpHost,
org.apache.http.HttpRequest, org.apache.http.protocol.HttpContext)
@bci=344, line=906 (Compiled frame)
 -
org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.client.methods.HttpUriRequest,
org.apache.http.protocol.HttpContext) @bci=21, line=805 (Compiled frame)
 -
org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.client.methods.HttpUriRequest)
@bci=6, line=784 (Compiled frame)
 -
org.apache.solr.client.solrj.impl.HttpSolrServer.request(org.apache.solr.client.solrj.SolrRequest,
org.apache.solr.client.solrj.ResponseParser) @bci=1175, line=395 (Compiled
frame)
 -
org.apache.solr.client.solrj.impl.HttpSolrServer.request(org.apache.solr.client.solrj.SolrRequest)
@bci=17, line=199 (Compiled frame)
 -
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(org.apache.solr.client.solrj.impl.LBHttpSolrServer$Req)
@bci=132, line=285 (Compiled frame)
 -
org.apache.solr.client.solrj.impl.CloudSolrServer.request(org.apache.solr.client.solrj.SolrRequest)
@bci=838, line=640 (Compiled frame)
 -
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(org.apache.solr.client.solrj.SolrServer)
@bci=17, line=117 (Compiled frame)
 - org.apache.solr.client.solrj.SolrServer.commit(boolean, boolean)
@bci=16, line=168 (Interpreted frame)
 - org.apache.solr.client.solrj.SolrServer.commit() @bci=3, line=146
(Interpreted frame)

 The Solr leader log shows many connection timeout exceptions from the
other Solr replicas during this period. Some of these timeouts may have
been caused by replicas disappearing from the ScaleDownPolicy. From the
search client application's point of view, everything looked fine, but
indexing stopped until I restarted the SolrJ client.

 Does this look like a case where a timeout value needs to be increased
somewhere? If so, which one?

 Thanks,
 Peter


Custom QueryComponent to rewrite dismax query

2014-06-10 Thread Peter Keegan
We are using the 'edismax' query parser for its many benefits over the
standard Lucene parser. For queries with more than 5 or 6 keywords (which
is a lot for our typical user), the recall can be very high (sometimes
matching 75% or more of the documents). This high recall, when coupled with
some custom PostFilter scoring, is hurting the query performance.  I tried
varying the 'mm' (minimum match) parameter, but at values less than 100%,
the response time didn't improve much, and at 100%, there were often no
results, which is unacceptable.

So, I wrote a custom QueryComponent which rewrites the DisMax query.
Initially, the MinShouldMatch value is set to 100%. If the search returns 0
results, MinShouldMatch is set to 1 and the search is retried. This
improved the QPS throughput by about 2.5X. However, this only worked with
an unsharded index. With a sharded index, each shard returned only the
results from the first search (mm=100%). In the debugger, I could see 2
'response/ResultContext' NV-Pairs in the SolrQueryResponse object, so I
added code to remove the first pair if there were 2 pair present, which
fixed this problem. My question: is removing the extra ResultContext a
reasonable solution to this problem? It just seems a little brittle to me.

Thanks,
Peter


Question about solrcloud recovery process

2014-07-03 Thread Peter Keegan
I bring up a new Solr node with no index and watch the index being
replicated from the leader. The index size is 12G and the replication takes
about 6 minutes, according to the replica log (from 'Starting recovery
process' to 'Finished recovery process). However, shortly after the
replication begins, while the index files are being copied, I am able to
query the index on the replica and see q=*:* find all of the documents.
But, from the core admin screen, numDocs = 0, and in the cloud screen the
replica is in 'recovering' mode. How can this be?

Peter


Re: Question about solrcloud recovery process

2014-07-03 Thread Peter Keegan
No, we're not doing NRT. The search clients aren't using CloudSolrServer
and they are behind an AWS load balancer, which calls the Solr ping handler
(implemented with ClusterStateAwarePingRequestHandler) to determine when
the node is active. This ping handler also responds during the index copy,
which doesn't seem right. I'll have to figure out why it does this before
the replica is really active.

Peter


On Thu, Jul 3, 2014 at 9:36 AM, Mark Miller  wrote:

> I don’t know offhand about the num docs issue - are you doing NRT?
>
> As far as being able to query the replica, I’m not sure anyone ever got to
> making that fail if you directly query a node that is not active. It
> certainly came up, but I have no memory of anyone tackling it. Of course in
> many other cases, information is being pulled from zookeeper and recovering
> nodes are ignored. If this is the issue I think it is, it should only be an
> issue when you directly query recovery node.
>
> The CloudSolrServer client works around this issue as well.
>
> --
> Mark Miller
> about.me/markrmiller
>
> On July 3, 2014 at 8:42:48 AM, Peter Keegan (peterlkee...@gmail.com)
> wrote:
>
> I bring up a new Solr node with no index and watch the index being
> replicated from the leader. The index size is 12G and the replication takes
> about 6 minutes, according to the replica log (from 'Starting recovery
> process' to 'Finished recovery process). However, shortly after the
> replication begins, while the index files are being copied, I am able to
> query the index on the replica and see q=*:* find all of the documents.
> But, from the core admin screen, numDocs = 0, and in the cloud screen the
> replica is in 'recovering' mode. How can this be?
>
> Peter
>


Re: Question about solrcloud recovery process

2014-07-03 Thread Peter Keegan
Aha, you are right wrdrvf! The query is forwarded to any of the active
shards (I saw the query alternate between both of mine). Nice feature.
Also, looking at 'ClusterStateAwarePingRequestHandler' (which I downloaded
from www.manning.com/SolrinAction), it is checking zookeeper to see if the
logical shard is active, not the specific 'this' replica, which is in
'recovering' state. I'll post a patch once I figure out the zookeeper api.

Thanks,
Peter


On Thu, Jul 3, 2014 at 12:03 PM, wrdrvr  wrote:

> Try querying the recovering core with distrib=false, you should get the
> count
> of docs in it.
>
> Most likely, since the replica is recovering it is forwarding all queries
> to
> the active replica, this can be verified in the core logs.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Question-about-solrcloud-recovery-process-tp4145450p4145491.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Question about ReRankQuery

2014-07-23 Thread Peter Keegan
I'm looking at how 'ReRankQuery' works. If the main query has a Sort
criteria, it is only used to sort the first pass results. The QueryScorer
used in the second pass only reorders the ScoreDocs based on score and
docid, but doesn't use the original Sort fields. If the Sort criteria is
'score desc, myfield asc', I would expect 'myfield' to break score ties
from the second pass after rescoring.
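
(For concreteness, the kind of request being discussed looks like this - the
parameter values are only examples:

q=foo&sort=score desc,myfield asc&rq={!rerank reRankQuery=$rqq reRankDocs=200 reRankWeight=2}&rqq=bar

The question is whether 'myfield asc' should still break score ties among the top
200 documents after their scores are recomputed by the re-rank query.)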

Is this a bug or the intended behavior?

Thanks,
Peter


Re: Question about ReRankQuery

2014-07-23 Thread Peter Keegan
See http://heliosearch.org/solrs-new-re-ranking-feature/


On Wed, Jul 23, 2014 at 11:27 AM, Erick Erickson 
wrote:

> I'm having a little trouble understanding the use-case here. Why use
> re-ranking?
> Isn't this just combining the original query with the second query with an
> AND
> and using the original sort?
>
> At the end, you have your original list in it's original order, with
> (potentially) some
> documents removed that don't satisfy the secondary query.
>
> Or I'm missing the boat entirely.
>
> Best,
> Erick
>
>
> On Wed, Jul 23, 2014 at 6:31 AM, Peter Keegan 
> wrote:
>
> > I'm looking at how 'ReRankQuery' works. If the main query has a Sort
> > criteria, it is only used to sort the first pass results. The QueryScorer
> > used in the second pass only reorders the ScoreDocs based on score and
> > docid, but doesn't use the original Sort fields. If the Sort criteria is
> > 'score desc, myfield asc', I would expect 'myfield' to break score ties
> > from the second pass after rescoring.
> >
> > Is this a bug or the intended behavior?
> >
> > Thanks,
> > Peter
> >
>


Re: Question about ReRankQuery

2014-07-23 Thread Peter Keegan
> The ReRankingQParserPlugin uses the Lucene QueryRescorer, which only uses
the score from the re-rank query when re-ranking the top N documents.

Understood, but if the re-rank scores produce new ties, wouldn't you want
to resort them with the FieldSortedHitQueue?

Anyway, I was looking to reimplement the ScaleScoreQParser PostFilter
plugin with RankQuery, and would need to implement the behavior of the
DelegateCollector there for handling multiple sort fields.

Peter

On Wednesday, July 23, 2014, Joel Bernstein  wrote:

> The ReRankingQParserPlugin uses the Lucene QueryRescorer, which only uses
> the score from the re-rank query when re-ranking the top N documents.
>
> The ReRanklingQParserPlugin is built as a RankQuery plugin so you can swap
> in your own implementation. Patches are also welcome for the existing
> implementation.
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Wed, Jul 23, 2014 at 11:37 AM, Peter Keegan  >
> wrote:
>
> > See http://heliosearch.org/solrs-new-re-ranking-feature/
> >
> >
> > On Wed, Jul 23, 2014 at 11:27 AM, Erick Erickson <
> erickerick...@gmail.com >
> > wrote:
> >
> > > I'm having a little trouble understanding the use-case here. Why use
> > > re-ranking?
> > > Isn't this just combining the original query with the second query with
> > an
> > > AND
> > > and using the original sort?
> > >
> > > At the end, you have your original list in it's original order, with
> > > (potentially) some
> > > documents removed that don't satisfy the secondary query.
> > >
> > > Or I'm missing the boat entirely.
> > >
> > > Best,
> > > Erick
> > >
> > >
> > > On Wed, Jul 23, 2014 at 6:31 AM, Peter Keegan  >
> > > wrote:
> > >
> > > > I'm looking at how 'ReRankQuery' works. If the main query has a Sort
> > > > criteria, it is only used to sort the first pass results. The
> > QueryScorer
> > > > used in the second pass only reorders the ScoreDocs based on score
> and
> > > > docid, but doesn't use the original Sort fields. If the Sort criteria
> > is
> > > > 'score desc, myfield asc', I would expect 'myfield' to break score
> ties
> > > > from the second pass after rescoring.
> > > >
> > > > Is this a bug or the intended behavior?
> > > >
> > > > Thanks,
> > > > Peter
> > > >
> > >
> >
>


ExternalFileFieldReloader and commit

2014-08-05 Thread Peter Keegan
When there are multiple 'external file field' files available, Solr will
reload the last one (lexicographically) with a commit, but only if changes
were made to the index. Otherwise, it skips the reload and logs: "No
uncommitted changes. Skipping IW.commit."  Has anyone else noticed this? It
seems like a bug to me. (yes, I do have firstSearcher and newSearcher event
listeners in solrconfig.xml)

Peter


Re: ExternalFileFieldReloader and commit

2014-08-06 Thread Peter Keegan
I entered SOLR-6326 <https://issues.apache.org/jira/browse/SOLR-6326>

thanks,
Peter


On Tue, Aug 5, 2014 at 6:50 PM, Koji Sekiguchi  wrote:

> Hi Peter,
>
> It seems like a bug to me, too. Please file a JIRA ticket if you can
> so that someone can take it.
>
> Koji
> --
> http://soleami.com/blog/comparing-document-classification-functions-of-
> lucene-and-mahout.html
>
>
> (2014/08/05 22:34), Peter Keegan wrote:
>
>> When there are multiple 'external file field' files available, Solr will
>> reload the last one (lexicographically) with a commit, but only if changes
>> were made to the index. Otherwise, it skips the reload and logs: "No
>> uncommitted changes. Skipping IW.commit."  Has anyone else noticed this?
>> It
>> seems like a bug to me. (yes, I do have firstSearcher and newSearcher
>> event
>> listeners in solrconfig.xml)
>>
>> Peter
>>
>>
>
>
>


Re: ExternalFileFieldReloader and commit

2014-08-06 Thread Peter Keegan
The use case is:

1. A SolrJ client updates the main index (and replicas) and issues a commit
at regular intervals.
2. Another component updates the external files at other intervals.

Usually, the commits result in a new searcher which triggers the
org.apache.solr.schema.ExternalFileFieldReloader, but only if there were
changes to the main index.

Using ReloadCacheRequestHandler in (2) above would result in the loss of
index/replica synchronization provided by the commit in (1), and reloading
the core is slow and overkill. I think it would be easier to have the SolrJ
client in (1) always update a dummy document during each commit interval to
force a new searcher.
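
(A sketch of that workaround in SolrJ - the document id and the timestamp field
name are made up, and 'solrServer' stands for the existing CloudSolrServer:

    SolrInputDocument dummy = new SolrInputDocument();
    dummy.addField("id", "dummy-reload-doc");              // fixed id, so it only ever overwrites itself
    dummy.addField("timestamp_dt", new java.util.Date());  // changes every cycle, so the commit has real changes
    solrServer.add(dummy);
    solrServer.commit();

The touched document guarantees the hard commit sees uncommitted changes, so a new
searcher - and with it the ExternalFileFieldReloader listener - fires on every
commit interval.)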

Thanks,
Peter


On Wed, Aug 6, 2014 at 8:43 AM, Mikhail Khludnev  wrote:

> Peter,
>
> Providing SOLR-6326 is about a bug in ExternalFileFieldReloader, I'm asking
> here:
> Did you try to use
> org.apache.solr.search.function.FileFloatSource.ReloadCacheRequestHandler ?
> Let's me know if you need help with it.
> As a workaround you can reload the core via REST or click a button at
> SolrAdmin, your questions are welcome.
>
>
>
> On Wed, Aug 6, 2014 at 4:02 PM, Peter Keegan 
> wrote:
>
> > I entered SOLR-6326 <https://issues.apache.org/jira/browse/SOLR-6326>
> >
> > thanks,
> > Peter
> >
> >
> > On Tue, Aug 5, 2014 at 6:50 PM, Koji Sekiguchi 
> wrote:
> >
> > > Hi Peter,
> > >
> > > It seems like a bug to me, too. Please file a JIRA ticket if you can
> > > so that someone can take it.
> > >
> > > Koji
> > > --
> > >
> http://soleami.com/blog/comparing-document-classification-functions-of-
> > > lucene-and-mahout.html
> > >
> > >
> > > (2014/08/05 22:34), Peter Keegan wrote:
> > >
> > >> When there are multiple 'external file field' files available, Solr
> will
> > >> reload the last one (lexicographically) with a commit, but only if
> > changes
> > >> were made to the index. Otherwise, it skips the reload and logs: "No
> > >> uncommitted changes. Skipping IW.commit."  Has anyone else noticed
> this?
> > >> It
> > >> seems like a bug to me. (yes, I do have firstSearcher and newSearcher
> > >> event
> > >> listeners in solrconfig.xml)
> > >>
> > >> Peter
> > >>
> > >>
> > >
> > >
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
>  
>


Re: Solr 4.4.0 on Ubuntu 10.04 with Jetty 6.1 from package Repository

2013-10-14 Thread Peter Schmidt
I downloaded the Linux 64bit version jdk-7u40-linux-x64.tar.gz


2013/10/11 Guido Medina 

> Then I think you downloaded the wrong JDK 7 (32bits JDK?), if you are
> running JDK 7 64bits the -server flag should be recognized. According to
> the stackoverflow link you mentioned before.
>
> Guido.
>
>
> On 11/10/13 15:48, Peter Schmidt wrote:
>
>> no it is 64bit and just a development VM. In production the solr will use
>> multicore, also 64bit and some gb ram.
>>
>>
>> 2013/10/11 Guido Medina 
>>
>>  If your single core is at 32bits use Oracle JDK 7u25 or Ubuntu Open JDK
>>> 7,
>>> the JDK 7u40 for 32bits will corrupt indexes as stated on the lucene bug
>>> report.
>>>
>>> Guido.
>>>
>>>
>>> On 11/10/13 15:13, Peter Schmidt wrote:
>>>
>>>  Oh, i got it 
>>> http://stackoverflow.com/a/5273166/326905<http://stackoverflow.com/a/**5273166/326905>
>>>> <http://**stackoverflow.com/a/5273166/**326905<http://stackoverflow.com/a/5273166/326905>
>>>> >
>>>>
>>>>
>>>> "at least 2 cores and at least 2 GB physical memory"
>>>>
>>>> Until know i'm using a VM with single core and 1GB RAM.
>>>>
>>>> So this will be later for production :)
>>>>
>>>> Thank you Guido.
>>>>
>>>>
>>>> 2013/10/11 Peter Schmidt 
>>>>
>>>>   Strange. When i add "-server" to the arguments, i got everytime the
>>>> error
>>>>
>>>>> on jetty startup
>>>>>
>>>>>
>>>>> Invalid option -server
>>>>> Cannot parse command line arguments
>>>>>
>>>>>
>>>>> 2013/10/11 Guido Medina 
>>>>>
>>>>>   It is JVM parameter, example:
>>>>>
>>>>>> JAVA_OPTIONS="-Djava.awt.**headless=true -Dfile.encoding=UTF-8
>>>>>>
>>>>>> -server
>>>>>>
>>>>>> -Xms256m -Xmx256m"
>>>>>>
>>>>>> If you want to concatenate more JVM parameters you do it like this:
>>>>>> JAVA_OPTIONS="-Dsolr.solr.**home=/usr/share/solr $JAVA_OPTIONS"
>>>>>>
>>>>>>
>>>>>>
>>>>>> Take a good look at the format,
>>>>>>
>>>>>> Guido.
>>>>>>
>>>>>>
>>>>>> On 11/10/13 13:37, Peter Schmidt wrote:
>>>>>>
>>>>>>   @Guido: Itried it before and than i thought you marked just the
>>>>>> server
>>>>>>
>>>>>>> options
>>>>>>>
>>>>>>> Because the -sever causes a:
>>>>>>>
>>>>>>> sudo service jetty start
>>>>>>> * Starting Jetty servlet engine.
>>>>>>> jetty
>>>>>>> Invalid option -server
>>>>>>> Cannot parse command line arguments
>>>>>>>
>>>>>>> Or should i substitute server with ...?
>>>>>>>
>>>>>>> Options with -server:
>>>>>>>
>>>>>>>
>>>>>>> JAVA_OPTIONS="-Djava.awt.**headless=true -Dfile.encoding=UTF-8
>>>>>>>
>>>>>>> -server
>>>>>>>
>>>>>>> -Xms256m -Xmx256m -XX:+UseG1GC -XX:MaxGCPauseMillis=50
>>>>>>> -XX:+OptimizeStringConcat -XX:+UseStringCache
>>>>>>> -Dsolr.solr.home=/usr/share/**solr $JAVA_OPTIONS"
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2013/10/11 Guido Medina 
>>>>>>>
>>>>>>>Remember the "-server" which for Java webapps or dedicated Java
>>>>>>> services
>>>>>>>
>>>>>>>  will improve things.
>>>>>>>>
>>>>>>>> Guido.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 11/10/13 12:26, Peter Schmidt wrote:
>>>>>>>>
>>>>>>>>I can report that jetty is running now with this options:
>>>>>>>>
>>>>>>>>  JAVA_

Re: Solr 4.4.0 on Ubuntu 10.04 with Jetty 6.1 from package Repository

2013-10-14 Thread Peter Schmidt
It is necessary to configure the update-alternatives for Oracle Java JDK 7.
Afterwards I can use the -server flag.


2013/10/14 Peter Schmidt 

> I downloaded the Linux 64bit version jdk-7u40-linux-x64.tar.gz
>
>
> 2013/10/11 Guido Medina 
>
>> Then I think you downloaded the wrong JDK 7 (32bits JDK?), if you are
>> running JDK 7 64bits the -server flag should be recognized. According to
>> the stackoverflow link you mentioned before.
>>
>> Guido.
>>
>>
>> On 11/10/13 15:48, Peter Schmidt wrote:
>>
>>> no it is 64bit and just a development VM. In production the solr will use
>>> multicore, also 64bit and some gb ram.
>>>
>>>
>>> 2013/10/11 Guido Medina 
>>>
>>>  If your single core is at 32bits use Oracle JDK 7u25 or Ubuntu Open JDK
>>>> 7,
>>>> the JDK 7u40 for 32bits will corrupt indexes as stated on the lucene bug
>>>> report.
>>>>
>>>> Guido.
>>>>
>>>>
>>>> On 11/10/13 15:13, Peter Schmidt wrote:
>>>>
>>>>  Oh, i got it 
>>>> http://stackoverflow.com/a/5273166/326905<http://stackoverflow.com/a/**5273166/326905>
>>>>> <http://**stackoverflow.com/a/5273166/**326905<http://stackoverflow.com/a/5273166/326905>
>>>>> >
>>>>>
>>>>>
>>>>> "at least 2 cores and at least 2 GB physical memory"
>>>>>
>>>>> Until know i'm using a VM with single core and 1GB RAM.
>>>>>
>>>>> So this will be later for production :)
>>>>>
>>>>> Thank you Guido.
>>>>>
>>>>>
>>>>> 2013/10/11 Peter Schmidt 
>>>>>
>>>>>   Strange. When i add "-server" to the arguments, i got everytime the
>>>>> error
>>>>>
>>>>>> on jetty startup
>>>>>>
>>>>>>
>>>>>> Invalid option -server
>>>>>> Cannot parse command line arguments
>>>>>>
>>>>>>
>>>>>> 2013/10/11 Guido Medina 
>>>>>>
>>>>>>   It is JVM parameter, example:
>>>>>>
>>>>>>> JAVA_OPTIONS="-Djava.awt.**headless=true -Dfile.encoding=UTF-8
>>>>>>>
>>>>>>> -server
>>>>>>>
>>>>>>> -Xms256m -Xmx256m"
>>>>>>>
>>>>>>> If you want to concatenate more JVM parameters you do it like this:
>>>>>>> JAVA_OPTIONS="-Dsolr.solr.**home=/usr/share/solr $JAVA_OPTIONS"
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Take a good look at the format,
>>>>>>>
>>>>>>> Guido.
>>>>>>>
>>>>>>>
>>>>>>> On 11/10/13 13:37, Peter Schmidt wrote:
>>>>>>>
>>>>>>>   @Guido: Itried it before and than i thought you marked just the
>>>>>>> server
>>>>>>>
>>>>>>>> options
>>>>>>>>
>>>>>>>> Because the -sever causes a:
>>>>>>>>
>>>>>>>> sudo service jetty start
>>>>>>>> * Starting Jetty servlet engine.
>>>>>>>> jetty
>>>>>>>> Invalid option -server
>>>>>>>> Cannot parse command line arguments
>>>>>>>>
>>>>>>>> Or should i substitute server with ...?
>>>>>>>>
>>>>>>>> Options with -server:
>>>>>>>>
>>>>>>>>
>>>>>>>> JAVA_OPTIONS="-Djava.awt.**headless=true -Dfile.encoding=UTF-8
>>>>>>>>
>>>>>>>> -server
>>>>>>>>
>>>>>>>> -Xms256m -Xmx256m -XX:+UseG1GC -XX:MaxGCPauseMillis=50
>>>>>>>> -XX:+OptimizeStringConcat -XX:+UseStringCache
>>>>>>>> -Dsolr.solr.home=/usr/share/**solr $JAVA_OPTIONS"
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2013/10/11 Guido Medina 
>>>>>>>>
>>>>>>>>

Re: Solr 4.4.0 on Ubuntu 10.04 with Jetty 6.1 from package Repository

2013-10-14 Thread Peter Schmidt
But the flag is not listed under Dashboard->Args in the Solr Admin interface.


2013/10/14 Peter Schmidt 

> It is necessary to configure the update-alternatives for Oracle Java JDK
> 7. Afterwards i can use the -server flag
>
>
> 2013/10/14 Peter Schmidt 
>
>> I downloaded the Linux 64bit version jdk-7u40-linux-x64.tar.gz
>>
>>
>> 2013/10/11 Guido Medina 
>>
>>> Then I think you downloaded the wrong JDK 7 (32bits JDK?), if you are
>>> running JDK 7 64bits the -server flag should be recognized. According to
>>> the stackoverflow link you mentioned before.
>>>
>>> Guido.
>>>
>>>
>>> On 11/10/13 15:48, Peter Schmidt wrote:
>>>
>>>> no it is 64bit and just a development VM. In production the solr will
>>>> use
>>>> multicore, also 64bit and some gb ram.
>>>>
>>>>
>>>> 2013/10/11 Guido Medina 
>>>>
>>>>  If your single core is at 32bits use Oracle JDK 7u25 or Ubuntu Open
>>>>> JDK 7,
>>>>> the JDK 7u40 for 32bits will corrupt indexes as stated on the lucene
>>>>> bug
>>>>> report.
>>>>>
>>>>> Guido.
>>>>>
>>>>>
>>>>> On 11/10/13 15:13, Peter Schmidt wrote:
>>>>>
>>>>>  Oh, i got it 
>>>>> http://stackoverflow.com/a/5273166/326905<http://stackoverflow.com/a/**5273166/326905>
>>>>>> <http://**stackoverflow.com/a/5273166/**326905<http://stackoverflow.com/a/5273166/326905>
>>>>>> >
>>>>>>
>>>>>>
>>>>>> "at least 2 cores and at least 2 GB physical memory"
>>>>>>
>>>>>> Until know i'm using a VM with single core and 1GB RAM.
>>>>>>
>>>>>> So this will be later for production :)
>>>>>>
>>>>>> Thank you Guido.
>>>>>>
>>>>>>
>>>>>> 2013/10/11 Peter Schmidt 
>>>>>>
>>>>>>   Strange. When i add "-server" to the arguments, i got everytime the
>>>>>> error
>>>>>>
>>>>>>> on jetty startup
>>>>>>>
>>>>>>>
>>>>>>> Invalid option -server
>>>>>>> Cannot parse command line arguments
>>>>>>>
>>>>>>>
>>>>>>> 2013/10/11 Guido Medina 
>>>>>>>
>>>>>>>   It is JVM parameter, example:
>>>>>>>
>>>>>>>> JAVA_OPTIONS="-Djava.awt.**headless=true -Dfile.encoding=UTF-8
>>>>>>>>
>>>>>>>> -server
>>>>>>>>
>>>>>>>> -Xms256m -Xmx256m"
>>>>>>>>
>>>>>>>> If you want to concatenate more JVM parameters you do it like this:
>>>>>>>> JAVA_OPTIONS="-Dsolr.solr.**home=/usr/share/solr $JAVA_OPTIONS"
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Take a good look at the format,
>>>>>>>>
>>>>>>>> Guido.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 11/10/13 13:37, Peter Schmidt wrote:
>>>>>>>>
>>>>>>>>   @Guido: Itried it before and than i thought you marked just the
>>>>>>>> server
>>>>>>>>
>>>>>>>>> options
>>>>>>>>>
>>>>>>>>> Because the -sever causes a:
>>>>>>>>>
>>>>>>>>> sudo service jetty start
>>>>>>>>> * Starting Jetty servlet engine.
>>>>>>>>> jetty
>>>>>>>>> Invalid option -server
>>>>>>>>> Cannot parse command line arguments
>>>>>>>>>
>>>>>>>>> Or should i substitute server with ...?
>>>>>>>>>
>>>>>>>>> Options with -server:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> JAVA_OPTIONS="-Djava.awt.**headless=true -Dfile.encoding=UTF-8
>>>>>>>>>
>>>>>>>>> -server
>

Re: Solr 4.4.0 on Ubuntu 10.04 with Jetty 6.1 from package Repository

2013-10-14 Thread Peter Schmidt
But it is used: it's in the JAVA_OPTIONS listed by 'service jetty check'



2013/10/14 Peter Schmidt 

> But the flag is not listed under the Dashboard->Args in Solr Admin
> Interface.
>
>
> 2013/10/14 Peter Schmidt 
>
>> It is necessary to configure the update-alternatives for Oracle Java JDK
>> 7. Afterwards i can use the -server flag
>>
>>
>> 2013/10/14 Peter Schmidt 
>>
>>> I downloaded the Linux 64bit version jdk-7u40-linux-x64.tar.gz
>>>
>>>
>>> 2013/10/11 Guido Medina 
>>>
>>>> Then I think you downloaded the wrong JDK 7 (32bits JDK?), if you are
>>>> running JDK 7 64bits the -server flag should be recognized. According to
>>>> the stackoverflow link you mentioned before.
>>>>
>>>> Guido.
>>>>
>>>>
>>>> On 11/10/13 15:48, Peter Schmidt wrote:
>>>>
>>>>> no it is 64bit and just a development VM. In production the solr will
>>>>> use
>>>>> multicore, also 64bit and some gb ram.
>>>>>
>>>>>
>>>>> 2013/10/11 Guido Medina 
>>>>>
>>>>>  If your single core is at 32bits use Oracle JDK 7u25 or Ubuntu Open
>>>>>> JDK 7,
>>>>>> the JDK 7u40 for 32bits will corrupt indexes as stated on the lucene
>>>>>> bug
>>>>>> report.
>>>>>>
>>>>>> Guido.
>>>>>>
>>>>>>
>>>>>> On 11/10/13 15:13, Peter Schmidt wrote:
>>>>>>
>>>>>>  Oh, i got it 
>>>>>> http://stackoverflow.com/a/5273166/326905<http://stackoverflow.com/a/**5273166/326905>
>>>>>>> <http://**stackoverflow.com/a/5273166/**326905<http://stackoverflow.com/a/5273166/326905>
>>>>>>> >
>>>>>>>
>>>>>>>
>>>>>>> "at least 2 cores and at least 2 GB physical memory"
>>>>>>>
>>>>>>> Until know i'm using a VM with single core and 1GB RAM.
>>>>>>>
>>>>>>> So this will be later for production :)
>>>>>>>
>>>>>>> Thank you Guido.
>>>>>>>
>>>>>>>
>>>>>>> 2013/10/11 Peter Schmidt 
>>>>>>>
>>>>>>>   Strange. When i add "-server" to the arguments, i got everytime
>>>>>>> the error
>>>>>>>
>>>>>>>> on jetty startup
>>>>>>>>
>>>>>>>>
>>>>>>>> Invalid option -server
>>>>>>>> Cannot parse command line arguments
>>>>>>>>
>>>>>>>>
>>>>>>>> 2013/10/11 Guido Medina 
>>>>>>>>
>>>>>>>>   It is JVM parameter, example:
>>>>>>>>
>>>>>>>>> JAVA_OPTIONS="-Djava.awt.**headless=true -Dfile.encoding=UTF-8
>>>>>>>>>
>>>>>>>>> -server
>>>>>>>>>
>>>>>>>>> -Xms256m -Xmx256m"
>>>>>>>>>
>>>>>>>>> If you want to concatenate more JVM parameters you do it like this:
>>>>>>>>> JAVA_OPTIONS="-Dsolr.solr.**home=/usr/share/solr
>>>>>>>>> $JAVA_OPTIONS"
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Take a good look at the format,
>>>>>>>>>
>>>>>>>>> Guido.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 11/10/13 13:37, Peter Schmidt wrote:
>>>>>>>>>
>>>>>>>>>   @Guido: Itried it before and than i thought you marked just the
>>>>>>>>> server
>>>>>>>>>
>>>>>>>>>> options
>>>>>>>>>>
>>>>>>>>>> Because the -sever causes a:
>>>>>>>>>>
>>>>>>>>>> sudo service jetty start
>>>>>>>>>> * Starting Jetty servlet engine.
>>>>>>>>>> jetty
>>>>>>>>>> Invalid option -server
>

Re: limiting deep pagination

2013-10-17 Thread Peter Keegan
Yes, right now this constraint could be implemented in either the web app
or Solr. I see now that many of the QTimes on these queries are <10 ms
(probably due to caching), so I'm a bit less concerned.
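
For reference, here is a minimal sketch of the Solr-side variant (written against Solr 4.x-era APIs; the class name and the 10,000-document cut-off are only illustrative, not anything that ships with Solr). It is a SearchComponent that rejects the request in prepare() when start + rows exceeds the limit:

import java.io.IOException;

import org.apache.solr.common.SolrException;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class PaginationLimitComponent extends SearchComponent {

  // Hypothetical cut-off: reject requests that page deeper than this.
  private static final int MAX_WINDOW = 10000;

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    SolrParams params = rb.req.getParams();
    int start = params.getInt(CommonParams.START, 0);
    int rows = params.getInt(CommonParams.ROWS, 10);
    if (start + rows > MAX_WINDOW) {
      throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
          "start + rows must not exceed " + MAX_WINDOW);
    }
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // Nothing to do at process time; the check runs in prepare().
  }

  @Override
  public String getDescription() {
    return "Rejects deep pagination requests";
  }

  @Override
  public String getSource() {
    return null;
  }
}

Declared in solrconfig.xml and listed under the handler's first-components, the check runs before the query component ever executes; the same guard is just as easy to apply in the web app before the request is sent to Solr.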


On Wed, Oct 16, 2013 at 2:13 AM, Furkan KAMACI wrote:

> I just wonder that: Don't you implement a custom API that interacts with
> Solr and limits such kinds of requests? (I know that you are asking about
> how to do that in Solr but I handle such situations at my custom search
> APIs and want to learn what fellows do)
>
>
> On Wednesday, 9 October 2013, Michael Sokolov <
> msoko...@safaribooksonline.com> wrote:
> > On 10/8/13 6:51 PM, Peter Keegan wrote:
> >>
> >> Is there a way to configure Solr 'defaults/appends/invariants' such that
> >> the product of the 'start' and 'rows' parameters doesn't exceed a given
> >> value? This would be to prevent deep pagination.  Or would this require
> a
> >> custom requestHandler?
> >>
> >> Peter
> >>
> > Just wondering -- isn't it the sum that you should be concerned about
> rather than the product?  Actually I think what we usually do is limit both
> independently, with slightly different concerns, since e.g. start=1,
> rows=1000 causes memory problems if you have large fields in your results,
> where start=1000, rows=1 may not actually be a problem
> >
> > -Mike
> >
>


Re: Solr timeout after reboot

2013-10-21 Thread Peter Keegan
Have you tried this old trick to warm the FS cache?
cat ...//data/index/* >/dev/null

Peter


On Mon, Oct 21, 2013 at 5:31 AM, michael.boom  wrote:

> Thank you, Otis!
>
> I've integrated the SPM on my Solr instances and now I have access to
> monitoring data.
> Could you give me some hints on which metrics should I watch?
>
> Below I've added my query configs. Is there anything I could tweak here?
>
> 
> 1024
>
>   size="1000"
>  initialSize="1000"
>  autowarmCount="0"/>
>
>   size="1000"
>  initialSize="1000"
>  autowarmCount="0"/>
>
> size="1000"
>initialSize="1000"
>autowarmCount="0"/>
>
>
>  size="1000"
> initialSize="1000"
> autowarmCount="0" />
>
>
> true
>
>20
>
>100
>
> 
>   
> 
>   active:true
> 
>   
> 
>
> false
>
> 10
>
>   
>
>
>
> -
> Thanks,
> Michael
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-timeout-after-reboot-tp4096408p4096780.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr timeout after reboot

2013-10-21 Thread Peter Keegan
I found this warming to be especially necessary after starting an instance
of those m3.xlarge servers, else the response times for the first minutes
was terrible.

Peter


On Mon, Oct 21, 2013 at 8:39 AM, François Schiettecatte <
fschietteca...@gmail.com> wrote:

> To put the file data into file system cache which would make for faster
> access.
>
> François
>
>
> On Oct 21, 2013, at 8:33 AM, michael.boom  wrote:
>
> > Hmm, no, I haven't...
> >
> > What would be the effect of this ?
> >
> >
> >
> > -
> > Thanks,
> > Michael
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-timeout-after-reboot-tp4096408p4096809.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>
>


fq with { or } in Solr 4.3.1

2013-10-23 Thread Peter Kirk
Hi

If I do a search like 
/search?q=catid:{123}

I get the results I expect.

But if I do
/search?q=*:*&fq=catid{123}

I get an error from Solr like:
org.apache.solr.search.SyntaxError: Cannot parse 'catid:{123}': Encountered " 
"}" "} "" at line 1, column 58. Was expecting one of: "TO" ...  
...  ...


Can I not use { or } in an fq?

Thanks,
Peter


RE: fq with { or } in Solr 4.3.1

2013-10-23 Thread Peter Kirk
Sorry, that was just a typo.

/ search?q=*:*&fq=catid:{123}

Gives me the error. 

I think that { and } must be used in ranges for fq, and that's why I can't use 
them directly like this.

/Peter



-Original Message-
From: Upayavira [mailto:u...@odoko.co.uk] 
Sent: 23. oktober 2013 10:52
To: solr-user@lucene.apache.org
Subject: Re: fq with { or } in Solr 4.3.1

Missing a colon before the curly bracket in the fq?

On Wed, Oct 23, 2013, at 09:42 AM, Peter Kirk wrote:
> Hi
> 
> If I do a search like 
> /search?q=catid:{123}
> 
> I get the results I expect.
> 
> But if I do
> /search?q=*:*&fq=catid{123}
> 
> I get an error from Solr like:
> org.apache.solr.search.SyntaxError: Cannot parse 'catid:{123}':
> Encountered " "}" "} "" at line 1, column 58. Was expecting one of: "TO"
> ...  ...  ...
> 
> 
> Can I not use { or } in an fq?
> 
> Thanks,
> Peter


SV: fq with { or } in Solr 4.3.1

2013-10-23 Thread Peter Kirk
Thanks.

The data for the "catid" comes from another system, and is actually a string 
with a start { and an end }.

I was confused that it works in a q parameter but not fq.

I think the easiest for me, is simply to strip the start and end characters 
when I feed to the index.
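
For what it's worth, a tiny sketch of that stripping step on the feeding side (assuming a SolrJ-based indexer; the class, field and variable names here are only illustrative):

import org.apache.solr.common.SolrInputDocument;

public class CatIdFeeder {

  // "{123}" from the external system becomes "123" before it is indexed.
  static String stripBraces(String rawCatId) {
    return rawCatId.replaceAll("^\\{|\\}$", "");
  }

  static SolrInputDocument toDoc(String rawCatId) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("catid", stripBraces(rawCatId));
    return doc;
  }
}

Escaping at query time (catid:\{123\} or catid:"{123}") also works, as Jack explains in the message quoted below.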


Thanks


From: Jack Krupansky 
Sent: 23 October 2013 12:59
To: solr-user@lucene.apache.org
Subject: Re: fq with { or } in Solr 4.3.1

Are you using the edismax query parser? It traps the syntax error and then
escapes or ignores special characters.

Curly braces are used for exclusive range queries (square brackets are
inclusive ranges). The proper syntax is "{term1 TO term2}".

So, what were your intentions with "catid:{123}"? If you are simply trying
to pass the braces as literal characters for a string field, either escape
them with backslash or enclose the entire term in quotes:

catid:\{123\}

catid:"{123}"

-- Jack Krupansky

-Original Message-
From: Peter Kirk
Sent: Wednesday, October 23, 2013 4:57 AM
To: solr-user@lucene.apache.org
Subject: RE: fq with { or } in Solr 4.3.1

Sorry, that was just a typo.

/ search?q=*:*&fq=catid:{123}

Gives me the error.

I think that { and } must be used in ranges for fq, and that's why I can't
use them directly like this.

/Peter



-Original Message-
From: Upayavira [mailto:u...@odoko.co.uk]
Sent: 23. oktober 2013 10:52
To: solr-user@lucene.apache.org
Subject: Re: fq with { or } in Solr 4.3.1

Missing a colon before the curly bracket in the fq?

On Wed, Oct 23, 2013, at 09:42 AM, Peter Kirk wrote:
> Hi
>
> If I do a search like
> /search?q=catid:{123}
>
> I get the results I expect.
>
> But if I do
> /search?q=*:*&fq=catid{123}
>
> I get an error from Solr like:
> org.apache.solr.search.SyntaxError: Cannot parse 'catid:{123}':
> Encountered " "}" "} "" at line 1, column 58. Was expecting one of: "TO"
> ...  ...  ...
>
>
> Can I not use { or } in an fq?
>
> Thanks,
> Peter



How to reinitialize a solrcloud replica

2013-10-25 Thread Peter Keegan
I'm running 4.3 in solrcloud mode and trying to test index recovery, but
it's failing.
I have one shard, 2 replicas:
Leader: 10.159.8.105
Replica: 10.159.6.73

To test, I stopped the replica, deleted the 'data' directory and restarted
solr. Here is the replica's logging:

INFO  - 2013-10-25 12:19:40.773; org.apache.solr.cloud.ZkController; We are
http://10.159.6.73:8983/solr/collection/ and leader is
http://10.159.8.105:8983/solr/collection/
INFO  - 2013-10-25 12:19:40.774; org.apache.solr.cloud.ZkController; No
LogReplay needed for core=collection baseURL=http://10.159.6.73:8983/solr
INFO  - 2013-10-25 12:19:40.774; org.apache.solr.cloud.ZkController; Core
needs to recover:collection
INFO  - 2013-10-25 12:19:40.774;
org.apache.solr.update.DefaultSolrCoreState; Running recovery - first
canceling any ongoing recovery
INFO  - 2013-10-25 12:19:40.778; org.apache.solr.cloud.RecoveryStrategy;
Starting recovery process.  core=collection recoveringAfterStartup=true
...
ERROR - 2013-10-25 12:20:25.281; org.apache.solr.common.SolrException;
Error while trying to recover.
core=collection:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
I was asked to wait on state recovering for 10.159.6.73:8983_solr but I
still do not see the requested state. I see state: down live:true
...
ERROR - 2013-10-25 12:20:25.281; org.apache.solr.cloud.RecoveryStrategy;
Recovery failed - trying again... (5) core=collection
ERROR - 2013-10-25 12:20:25.281; org.apache.solr.common.SolrException;
Recovery failed - interrupted. core=collection
ERROR - 2013-10-25 12:20:25.282; org.apache.solr.common.SolrException;
Recovery failed - I give up. core=collection
INFO  - 2013-10-25 12:20:25.282; org.apache.solr.cloud.ZkController;
publishing core=collection state=recovery_failed

Here is the Leader's logging:

INFO  - 2013-10-25 12:19:40.883;
org.apache.solr.handler.admin.CoreAdminHandler; Going to wait for
coreNodeName: 10.159.6.73:8983_solr_collection, state: recovering,
checkLive: true, onlyIfLeader: true
INFO  - 2013-10-25 12:19:55.886;
org.apache.solr.common.cloud.ZkStateReader; Updating cloud state from
ZooKeeper...
ERROR - 2013-10-25 12:20:25.277; org.apache.solr.common.SolrException;
org.apache.solr.common.SolrException: I was asked to wait on state
recovering for 10.159.6.73:8983_solr but I still do not see the requested
state. I see state: down live:true
(repeats every minute)

Is it valid to simply delete the 'data' directory, or does a znode have to
be modified, too?
What's the right way to reinitialize and re-synch a core?

Peter


Re: How to get similarity score between 0 and 1 not relative score

2013-11-01 Thread Peter Keegan
There's another use case for scaling the score. Suppose I want to compute a
custom score based on the weighted sum of:

- product(0.75, relevance score)
- product(0.25, value from another field)

For this to work, both fields must have values between 0-1, for example.
Toby's example using the scale function seems to work, but you have to use
fq to eliminate results with score=0. It seems this is somewhat expensive,
since the scaling can't be done until all results have been collected to
get the max score. Then, are the results resorted? I haven't looked
closely, yet.

Peter


Peter




On Thu, Oct 31, 2013 at 7:48 PM, Toby Lazar  wrote:

> I think you are looking for something like this, though you can omit the fq
> section:
>
>
>
> http://localhost:8983/solr/collection/select?abc=text:bob&q={!func}scale(product(query($abc),1),0,1)&fq={!frange l=0.9}$q
>
> Also, I don't understand all the fuss about normalized scores.  In the
> linked example, I can see an interest in searching for "apple bannana",
> "zzz yyy xxx qqq kkk ttt rrr 111", etc. and wanting only close matches for
> that point in time.  Would this be a good use for this approach?  I
> understand that the results can change if the documents in the index
> change.
>
> Thanks,
>
> Toby
>
>
>
> On Thu, Oct 31, 2013 at 12:56 AM, Anshum Gupta  >wrote:
>
> > Hi Susheel,
> >
> > Have a look at this:
> > http://wiki.apache.org/lucene-java/ScoresAsPercentages
> >
> > You may really want to reconsider doing that.
> >
> >
> >
> >
> > On Thu, Oct 31, 2013 at 9:41 AM, sushil sharma  > >wrote:
> >
> > > Hi,
> > >
> > > We have a requirement where user would like to see a score (between 0
> to
> > > 1) which can tell how close the input search string is with result
> > string.
> > > So if input was very close but not exact matach, score could be .90
> etc.
> > >
> > > I do understand that we can get score from solr & divide by highest
> score
> > > but that will always show 1 even if we match was not exact.
> > >
> > > Regards,
> > > Susheel
> >
> >
> >
> >
> > --
> >
> > Anshum Gupta
> > http://www.anshumgupta.net
> >
>


Re: Data Import Handler

2013-11-06 Thread Peter Keegan
I've done this by adding an attribute to the entity element (e.g.
myconfig="myconfig.xml"), and reading it in the 'init' method with
context.getResolvedEntityAttribute("myconfig").
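
A rough sketch of that approach (the class name and the "myconfig" attribute are invented for illustration; it assumes the stock DIH JdbcDataSource and Context APIs from Solr 4.x). The custom data source loads the properties file named by the entity attribute and merges its url/user/password entries into the init properties before the JDBC connection is set up:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.DataImportHandlerException;
import org.apache.solr.handler.dataimport.JdbcDataSource;

public class PropertiesJdbcDataSource extends JdbcDataSource {

  @Override
  public void init(Context context, Properties initProps) {
    // "myconfig" is the extra attribute placed on the <entity> element in
    // data-config.xml; the file it names is assumed to hold url/user/password.
    String configFile = context.getResolvedEntityAttribute("myconfig");
    if (configFile != null) {
      Properties external = new Properties();
      FileInputStream in = null;
      try {
        in = new FileInputStream(configFile);
        external.load(in);
      } catch (IOException e) {
        throw new DataImportHandlerException(DataImportHandlerException.SEVERE,
            "Could not load " + configFile, e);
      } finally {
        if (in != null) {
          try { in.close(); } catch (IOException ignored) { }
        }
      }
      initProps.putAll(external); // e.g. url, user, password
    }
    super.init(context, initProps);
  }
}

The entity would then point at this class through its dataSource definition and carry myconfig="myconfig.xml", keeping the credentials out of the data config file itself.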

Peter


On Wed, Nov 6, 2013 at 8:25 AM, Ramesh  wrote:

> Hi Folks,
>
>
>
> Can anyone suggest me how can customize dataconfig.xml file
>
> I want to provide database details like( db_url,uname,password ) from my
> own
> properties file instead of the dataconfig.xml file
>
>


Function query matching

2013-11-07 Thread Peter Keegan
Why does this function query return docs that don't match the embedded
query?
select?qq=text:news&q={!func}sum(query($qq),0)


Re: Function query matching

2013-11-07 Thread Peter Keegan
I'm trying to use a normalized score in a query as I described in a recent
thread titled "Re: How to get similarity score between 0 and 1 not relative
score"

I'm using this query:
select?qq={!edismax v='news' qf='title^2
body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq={!frange
l=0.001}$q

Is there another way to accomplish this using dismax boosting?



On Thu, Nov 7, 2013 at 12:55 PM, Jason Hellman <
jhell...@innoventsolutions.com> wrote:

> You can, of course, use a function range query:
>
> select?q=text:news&fq={!frange l=0 u=100}sum(x,y)
>
>
> http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/search/FunctionRangeQParserPlugin.html
>
> This will give you a bit more flexibility to meet your goal.
>
> On Nov 7, 2013, at 7:26 AM, Erik Hatcher  wrote:
>
> > Function queries score (all) documents, but don't filter them.  All
> documents effectively match a function query.
> >
> >   Erik
> >
> > On Nov 7, 2013, at 1:48 PM, Peter Keegan  wrote:
> >
> >> Why does this function query return docs that don't match the embedded
> >> query?
> >> select?qq=text:news&q={!func}sum(query($qq),0)
> >
>
>


Re: Function query matching

2013-11-11 Thread Peter Keegan
I replaced the frange filter with the following filter and got the correct
no. of results and it was 3X faster:

select?qq={!edismax v='news' qf='title^2
body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq={!edismax
v='news' qf='title^2 body'}

Then, I tried to simplify the query with parameter substitution, but 'fq'
didn't parse correctly:

select?qq={!edismax v='news' qf='title^2
body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq=$qq

What is the proper syntax?

Thanks,
Peter


On Thu, Nov 7, 2013 at 2:16 PM, Peter Keegan  wrote:

> I'm trying to use a normalized score in a query as I described in a
> recent thread titled "Re: How to get similarity score between 0 and 1 not
> relative score"
>
> I'm using this query:
> select?qq={!edismax v='news' qf='title^2
> body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq={!frange
> l=0.001}$q
>
> Is there another way to accomplish this using dismax boosting?
>
>
>
> On Thu, Nov 7, 2013 at 12:55 PM, Jason Hellman <
> jhell...@innoventsolutions.com> wrote:
>
>> You can, of course, use a function range query:
>>
>> select?q=text:news&fq={!frange l=0 u=100}sum(x,y)
>>
>>
>> http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/search/FunctionRangeQParserPlugin.html
>>
>> This will give you a bit more flexibility to meet your goal.
>>
>> On Nov 7, 2013, at 7:26 AM, Erik Hatcher  wrote:
>>
>> > Function queries score (all) documents, but don't filter them.  All
>> documents effectively match a function query.
>> >
>> >   Erik
>> >
>> > On Nov 7, 2013, at 1:48 PM, Peter Keegan 
>> wrote:
>> >
>> >> Why does this function query return docs that don't match the embedded
>> >> query?
>> >> select?qq=text:news&q={!func}sum(query($qq),0)
>> >
>>
>>
>


Re: Function query matching

2013-11-11 Thread Peter Keegan
Thanks


On Mon, Nov 11, 2013 at 11:46 AM, Yonik Seeley wrote:

> On Mon, Nov 11, 2013 at 11:39 AM, Peter Keegan 
> wrote:
> > fq=$qq
> >
> > What is the proper syntax?
>
> fq={!query v=$qq}
>
> -Yonik
> http://heliosearch.com -- making solr shine
>


Re: Function query matching

2013-11-27 Thread Peter Keegan
Hi,

So, this query does just what I want, but it's typically 3 times slower
than the edismax query  without the functions:

select?qq={!edismax v='news' qf='title^2 body'}&scaledQ=scale(product(
query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),
product(0.25,field(myfield)))&fq={!query v=$qq}

Is there any way to speed this up? Would writing a custom function query
that compiled all the function queries together be any faster?

Thanks,
Peter


On Mon, Nov 11, 2013 at 1:31 PM, Peter Keegan wrote:

> Thanks
>
>
> On Mon, Nov 11, 2013 at 11:46 AM, Yonik Seeley wrote:
>
>> On Mon, Nov 11, 2013 at 11:39 AM, Peter Keegan 
>> wrote:
>> > fq=$qq
>> >
>> > What is the proper syntax?
>>
>> fq={!query v=$qq}
>>
>> -Yonik
>> http://heliosearch.com -- making solr shine
>>
>
>


Re: Function query matching

2013-11-27 Thread Peter Keegan
Although the 'scale' is a big part of it, here's a closer breakdown. Here
are 4 queries with increasing functions, and their response times (caching
turned off in solrconfig):

100 msec:
select?q={!edismax v='news' qf='title^2 body'}

135 msec:
select?qq={!edismax v='news' qf='title^2
body'}&q={!func}product(field(myfield),query($qq))&fq={!query v=$qq}

200 msec:
select?qq={!edismax v='news' qf='title^2
body'}&q={!func}sum(product(0.75,query($qq)),product(0.25,field(myfield)))&fq={!query
v=$qq}

320 msec:
select?qq={!edismax v='news' qf='title^2
body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq={!query
v=$qq}

Btw, that no-op product is necessary, else you get this exception:

org.apache.lucene.search.BooleanQuery$BooleanWeight cannot be cast to
org.apache.lucene.queries.function.valuesource.ScaleFloatFunction$ScaleInfo

thanks,

peter



On Wed, Nov 27, 2013 at 1:30 PM, Chris Hostetter
wrote:

>
> : So, this query does just what I want, but it's typically 3 times slower
> : than the edismax query  without the functions:
>
> that's because the scale() function is inherently slow (it has to
> compute the min & max value for every document in order to know how to
> scale them)
>
> what you are seeing is the price you have to pay to get that query with a
> "normalized" 0-1 value.
>
> (you might be able to save a little bit of time by eliminating that
> no-Op multiply by 1: "product(query($qq),1)" ... but i doubt you'll even
> notice much of a change given that scale function.
>
> : Is there any way to speed this up? Would writing a custom function query
> : that compiled all the function queries together be any faster?
>
> If you can find a faster implementation for scale() then by all means let
> us know, and we can fold it back into Solr.
>
>
> -Hoss
>


Re: Function query matching

2013-11-29 Thread Peter Keegan
Instead of using a function query, could I use the edismax query (plus some
low cost filters not shown in the example) and implement the
scale/sum/product computation in a PostFilter? Is the query's maxScore
available there?

Thanks,
Peter


On Wed, Nov 27, 2013 at 1:58 PM, Peter Keegan wrote:

> Although the 'scale' is a big part of it, here's a closer breakdown. Here
> are 4 queries with increasing functions, and their response times (caching
> turned off in solrconfig):
>
> 100 msec:
> select?q={!edismax v='news' qf='title^2 body'}
>
> 135 msec:
> select?qq={!edismax v='news' qf='title^2
> body'}q={!func}product(field(myfield),query($qq)&fq={!query v=$qq}
>
> 200 msec:
> select?qq={!edismax v='news' qf='title^2
> body'}q={!func}sum(product(0.75,query($qq)),product(0.25,field(myfield&fq={!query
> v=$qq}
>
> 320 msec:
> select?qq={!edismax v='news' qf='title^2
> body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq={!query
> v=$qq}
>
> Btw, that no-op product is necessary, else you get this exception:
>
> org.apache.lucene.search.BooleanQuery$BooleanWeight cannot be cast to 
> org.apache.lucene.queries.function.valuesource.ScaleFloatFunction$ScaleInfo
>
> thanks,
>
> peter
>
>
>
> On Wed, Nov 27, 2013 at 1:30 PM, Chris Hostetter  > wrote:
>
>>
>> : So, this query does just what I want, but it's typically 3 times slower
>> : than the edismax query  without the functions:
>>
>> that's because the scale() function is inhernetly slow (it has to
>> compute the min & max value for every document in order to know how to
>> scale them)
>>
>> what you are seeing is the price you have to pay to get that query with a
>> "normalized" 0-1 value.
>>
>> (you might be able to save a little bit of time by eliminating that
>> no-Op multiply by 1: "product(query($qq),1)" ... but i doubt you'll even
>> notice much of a chnage given that scale function.
>>
>> : Is there any way to speed this up? Would writing a custom function query
>> : that compiled all the function queries together be any faster?
>>
>> If you can find a faster implementation for scale() then by all means let
>> us konw, and we can fold it back into Solr.
>>
>>
>> -Hoss
>>
>
>


Re: Function query matching

2013-12-02 Thread Peter Keegan
I'm pursuing this possible PostFilter solution. I can see how to collect
all the hits and recompute the scores in a PostFilter, after all the hits
have been collected (for scaling). Now, I can't see how to get the custom
doc/score values back into the main query's HitQueue. Any advice?

Thanks,
Peter


On Fri, Nov 29, 2013 at 9:18 AM, Peter Keegan wrote:

> Instead of using a function query, could I use the edismax query (plus
> some low cost filters not shown in the example) and implement the
> scale/sum/product computation in a PostFilter? Is the query's maxScore
> available there?
>
> Thanks,
> Peter
>
>
> On Wed, Nov 27, 2013 at 1:58 PM, Peter Keegan wrote:
>
>> Although the 'scale' is a big part of it, here's a closer breakdown. Here
>> are 4 queries with increasing functions, and their response times (caching
>> turned off in solrconfig):
>>
>> 100 msec:
>> select?q={!edismax v='news' qf='title^2 body'}
>>
>> 135 msec:
>> select?qq={!edismax v='news' qf='title^2
>> body'}q={!func}product(field(myfield),query($qq)&fq={!query v=$qq}
>>
>> 200 msec:
>> select?qq={!edismax v='news' qf='title^2
>> body'}q={!func}sum(product(0.75,query($qq)),product(0.25,field(myfield&fq={!query
>> v=$qq}
>>
>> 320 msec:
>>  select?qq={!edismax v='news' qf='title^2
>> body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq={!query
>> v=$qq}
>>
>> Btw, that no-op product is necessary, else you get this exception:
>>
>> org.apache.lucene.search.BooleanQuery$BooleanWeight cannot be cast to 
>> org.apache.lucene.queries.function.valuesource.ScaleFloatFunction$ScaleInfo
>>
>> thanks,
>>
>> peter
>>
>>
>>
>> On Wed, Nov 27, 2013 at 1:30 PM, Chris Hostetter <
>> hossman_luc...@fucit.org> wrote:
>>
>>>
>>> : So, this query does just what I want, but it's typically 3 times slower
>>> : than the edismax query  without the functions:
>>>
>>> that's because the scale() function is inhernetly slow (it has to
>>> compute the min & max value for every document in order to know how to
>>> scale them)
>>>
>>> what you are seeing is the price you have to pay to get that query with a
>>> "normalized" 0-1 value.
>>>
>>> (you might be able to save a little bit of time by eliminating that
>>> no-Op multiply by 1: "product(query($qq),1)" ... but i doubt you'll even
>>> notice much of a chnage given that scale function.
>>>
>>> : Is there any way to speed this up? Would writing a custom function
>>> query
>>> : that compiled all the function queries together be any faster?
>>>
>>> If you can find a faster implementation for scale() then by all means let
>>> us konw, and we can fold it back into Solr.
>>>
>>>
>>> -Hoss
>>>
>>
>>
>


Re: Function query matching

2013-12-06 Thread Peter Keegan
I added some timing logging to IndexSearcher and ScaleFloatFunction and
compared a simple DisMax query with a DisMax query wrapped in the scale
function. The index size was 500K docs, 61K docs match the DisMax query.
The simple DisMax query took 33 ms, the function query took 89 ms. What I
found was:

1. The scale query only normalized the scores once (in
ScaleInfo.createScaleInfo) and added 33 ms to the Qtime.  Subsequent calls
to ScaleFloatFunction.getValues bypassed 'createScaleInfo' and added ~0 time.

2. The FunctionQuery 'nextDoc' iterations added 16 ms over the DisMax
'nextDoc' iterations.

Here's the breakdown:

Simple DisMax query:
weight.scorer: 3 ms (get term enum)
scorer.score: 23 ms (nextDoc iterations)
other: 3 ms
Total: 33 ms

DisMax wrapped in ScaleFloatFunction:
weight.scorer: 39 ms (get scaled values)
scorer.score: 39 ms (nextDoc iterations)
other: 11 ms
Total: 89 ms

Even with any improvements to 'scale', all function queries will add a
linear increase to the Qtime as index size increases, since they match all
docs.

Trey: I'd be happy to test any patch that you find improves the speed.



On Mon, Dec 2, 2013 at 11:21 PM, Trey Grainger  wrote:

> We're working on the same problem with the combination of the
> scale(query(...)) combination, so I'd like to share a bit more information
> that may be useful.
>
> *On the scale function:*
> Even though the scale query has to calculate the scores for all documents,
> it is actually doing this work twice for each ValueSource (once to
> calculate the min and max values, and then again when actually scoring the
> documents), which is inefficient.
>
> To solve the problem, we're in the process of putting a cache inside the
> scale function to remember the values for each document when they are
> initially computed (to find the min and max) so that the second pass can
> just use the previously computed values for each document.  Our theory is
> that most of the extra time due to the scale function is really just the
> result of doing duplicate work.
>
> No promises this won't be overly costly in terms of memory utilization, but
> we'll see what we get in terms of speed improvements and will share the
> code if it works out well.  Alternate implementation suggestions (or
> criticism of a cache like this) are also welcomed.
>
>
> *On the NoOp product function: scale(prod(1, query(...))):*
> We do the same thing, which ultimately is just an unnecessary waste of a
> loop through all documents to do an extra multiplication step.  I just
> debugged the code and uncovered the problem.  There is a Map (called
> context) that is passed through to each value source to store intermediate
> state, and both the query and scale functions are passing the ValueSource
> for the query function in as the KEY to this Map (as opposed to using some
> composite key that makes sense in the current context).  Essentially, these
> lines are overwriting each other:
>
> Inside ScaleFloatFunction: context.put(this.source, scaleInfo);
>  //this.source refers to the QueryValueSource, and the scaleInfo refers to
> a ScaleInfo object
> Inside QueryValueSource: context.put(this, w); //this refers to the same
> QueryValueSource from above, and the w refers to a Weight object
>
> As such, when the ScaleFloatFunction later goes to read the ScaleInfo from
> the context Map, it unexpectedly pulls the Weight object out instead and
> thus the invalid case exception occurs.  The NoOp multiplication works
> because it puts an "different" ValueSource between the query and the
> ScaleFloatFunction such that this.source (in ScaleFloatFunction) != this
> (in QueryValueSource).
>
> This should be an easy fix.  I'll create a JIRA ticket to use better key
> names in these functions and push up a patch.  This will eliminate the need
> for the extra NoOp function.
>
> -Trey
>
>
> On Mon, Dec 2, 2013 at 12:41 PM, Peter Keegan  >wrote:
>
> > I'm persuing this possible PostFilter solution, I can see how to collect
> > all the hits and recompute the scores in a PostFilter, after all the hits
> > have been collected (for scaling). Now, I can't see how to get the custom
> > doc/score values back into the main query's HitQueue. Any advice?
> >
> > Thanks,
> > Peter
> >
> >
> > On Fri, Nov 29, 2013 at 9:18 AM, Peter Keegan  > >wrote:
> >
> > > Instead of using a function query, could I use the edismax query (plus
> > > some low cost filters not shown in the example) and implement the
> > > scale/sum/product computation in a PostFilter? Is the query's maxScore
> > > available there?
> > >
> > > Thanks,
> > >

Re: Function query matching

2013-12-06 Thread Peter Keegan
In my previous posting, I said:

  "Subsequent calls to ScaleFloatFuntion.getValues bypassed
'createScaleInfo and  added ~0 time."

These subsequent calls are for the remaining segments in the index reader
(21 segments).

Peter



On Fri, Dec 6, 2013 at 2:10 PM, Peter Keegan  wrote:

> I added some timing logging to IndexSearcher and ScaleFloatFunction and
> compared a simple DisMax query with a DisMax query wrapped in the scale
> function. The index size was 500K docs, 61K docs match the DisMax query.
> The simple DisMax query took 33 ms, the function query took 89 ms. What I
> found was:
>
> 1. The scale query only normalized the scores once (in
> ScaleInfo.createScaleInfo) and added 33 ms to the Qtime.  Subsequent calls
> to ScaleFloatFuntion.getValues bypassed 'createScaleInfo and  added ~0 time.
>
> 2. The FunctionQuery 'nextDoc' iterations added 16 ms over the DisMax
> 'nextDoc' iterations.
>
> Here's the breakdown:
>
> Simple DisMax query:
> weight.scorer: 3 ms (get term enum)
> scorer.score: 23 ms (nextDoc iterations)
> other: 3 ms
> Total: 33 ms
>
> DisMax wrapped in ScaleFloatFunction:
> weight.scorer: 39 ms (get scaled values)
> scorer.score: 39 ms (nextDoc iterations)
> other: 11 ms
> Total: 89 ms
>
> Even with any improvements to 'scale', all function queries will add a
> linear increase to the Qtime as index size increases, since they match all
> docs.
>
> Trey: I'd be happy to test any patch that you find improves the speed.
>
>
>
> On Mon, Dec 2, 2013 at 11:21 PM, Trey Grainger  wrote:
>
>> We're working on the same problem with the combination of the
>> scale(query(...)) combination, so I'd like to share a bit more information
>> that may be useful.
>>
>> *On the scale function:*
>> Even thought the scale query has to calculate the scores for all
>> documents,
>> it is actually doing this work twice for each ValueSource (once to
>> calculate the min and max values, and then again when actually scoring the
>> documents), which is inefficient.
>>
>> To solve the problem, we're in the process of putting a cache inside the
>> scale function to remember the values for each document when they are
>> initially computed (to find the min and max) so that the second pass can
>> just use the previously computed values for each document.  Our theory is
>> that most of the extra time due to the scale function is really just the
>> result of doing duplicate work.
>>
>> No promises this won't be overly costly in terms of memory utilization,
>> but
>> we'll see what we get in terms of speed improvements and will share the
>> code if it works out well.  Alternate implementation suggestions (or
>> criticism of a cache like this) are also welcomed.
>>
>>
>> *On the NoOp product function: scale(prod(1, query(...))):*
>> We do the same thing, which ultimately is just an unnecessary waste of a
>> loop through all documents to do an extra multiplication step.  I just
>> debugged the code and uncovered the problem.  There is a Map (called
>> context) that is passed through to each value source to store intermediate
>> state, and both the query and scale functions are passing the ValueSource
>> for the query function in as the KEY to this Map (as opposed to using some
>> composite key that makes sense in the current context).  Essentially,
>> these
>> lines are overwriting each other:
>>
>> Inside ScaleFloatFunction: context.put(this.source, scaleInfo);
>>  //this.source refers to the QueryValueSource, and the scaleInfo refers to
>> a ScaleInfo object
>> Inside QueryValueSource: context.put(this, w); //this refers to the same
>> QueryValueSource from above, and the w refers to a Weight object
>>
>> As such, when the ScaleFloatFunction later goes to read the ScaleInfo from
>> the context Map, it unexpectedly pulls the Weight object out instead and
>> thus the invalid case exception occurs.  The NoOp multiplication works
>> because it puts an "different" ValueSource between the query and the
>> ScaleFloatFunction such that this.source (in ScaleFloatFunction) != this
>> (in QueryValueSource).
>>
>> This should be an easy fix.  I'll create a JIRA ticket to use better key
>> names in these functions and push up a patch.  This will eliminate the
>> need
>> for the extra NoOp function.
>>
>> -Trey
>>
>>
>> On Mon, Dec 2, 2013 at 12:41 PM, Peter Keegan > >wrote:
>>
>> > I'm persuing this possible PostFilter solution, I can see how to collect
&g

Configurable collectors for custom ranking

2013-12-06 Thread Peter Keegan
I looked at SOLR-4465 and SOLR-5045, where it appears that there is a goal
to be able to do custom sorting and ranking in a PostFilter. So far, it
looks like only custom aggregation can be implemented in PostFilter (5045).
Custom sorting/ranking can be done in a pluggable collector (4465), but
this patch is no longer in dev.

Is there any other dev. being done on adding custom sorting (after
collection) via a plugin?

Thanks,
Peter


Re: Function query matching

2013-12-07 Thread Peter Keegan
  >But for your specific goal Peter: Yes, if the whole point of a function
  >you have is to wrap generated a "scaled" score of your base $qq, ...

Thanks for the confirmation, Chris. So, to do this efficiently, I think I
need to implement a custom Collector that performs the scaling (and other
math) after collecting the matching dismax query docs. I started a separate
thread asking about the state of configurable collectors.

Thanks,
Peter
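
To make that concrete, here is a rough sketch of such a collector (not anything that exists in Solr; it assumes the Lucene/Solr 4.5-era Collector and DelegatingCollector APIs, and the ScalingCollector/DummyScorer names are invented). A PostFilter's getFilterCollector(IndexSearcher) would return it, passing in searcher.getTopReaderContext().leaves(); it buffers doc ids and raw scores during collection, then in finish() rescales them to 0..1 and replays the hits into the normal collector chain through a dummy scorer:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Scorer;
import org.apache.solr.search.DelegatingCollector;

public class ScalingCollector extends DelegatingCollector {

  private final List<AtomicReaderContext> leaves; // from getTopReaderContext().leaves()
  private final List<Integer> docIds = new ArrayList<Integer>(); // global doc ids, in order
  private final List<Float> rawScores = new ArrayList<Float>();
  private float min = Float.MAX_VALUE;
  private float max = -Float.MAX_VALUE;
  private Scorer matchScorer;
  private int base;

  public ScalingCollector(List<AtomicReaderContext> leaves) {
    this.leaves = leaves;
  }

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    this.matchScorer = scorer; // keep the real scorer; nothing is passed down yet
  }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    this.base = context.docBase; // the delegate only hears about readers in finish()
  }

  @Override
  public void collect(int doc) throws IOException {
    float score = matchScorer.score();
    min = Math.min(min, score);
    max = Math.max(max, score);
    docIds.add(base + doc);
    rawScores.add(score);
  }

  @Override
  public void finish() throws IOException {
    DummyScorer dummy = new DummyScorer();
    float range = (max > min) ? (max - min) : 1f;
    int leaf = -1, leafBase = 0, leafMax = 0;
    for (int i = 0; i < docIds.size(); i++) { // docs were collected in increasing order
      int globalDoc = docIds.get(i);
      while (leaf + 1 < leaves.size() && globalDoc >= leafMax) {
        AtomicReaderContext ctx = leaves.get(++leaf);
        leafBase = ctx.docBase;
        leafMax = ctx.docBase + ctx.reader().maxDoc();
        delegate.setNextReader(ctx);
        delegate.setScorer(dummy);
      }
      // scaled 0..1 score; the 0.75/0.25 weighted sum with the other
      // field's value would be computed here as well
      dummy.score = (rawScores.get(i) - min) / range;
      dummy.docId = globalDoc - leafBase;
      delegate.collect(dummy.docId);
    }
    if (delegate instanceof DelegatingCollector) {
      ((DelegatingCollector) delegate).finish();
    }
  }

  // Minimal scorer that just reports a precomputed score to the delegate.
  private static class DummyScorer extends Scorer {
    float score;
    int docId;
    DummyScorer() { super(null); }
    @Override public float score() { return score; }
    @Override public int freq() { return 1; }
    @Override public int docID() { return docId; }
    @Override public int nextDoc() { return 0; }
    @Override public int advance(int target) { return 0; }
    @Override public long cost() { return 0; }
  }
}

Buffering every hit keeps the min/max computation to a single pass over the matching documents only, which is essentially the approach Joel Bernstein outlines later in this thread.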


On Sat, Dec 7, 2013 at 1:45 AM, Chris Hostetter wrote:

>
> I had to do a double take when i read this sentence...
>
> : Even with any improvements to 'scale', all function queries will add a
> : linear increase to the Qtime as index size increases, since they match
> all
> : docs.
>
> ...because that smelled like either a bug in your methodology, or a bug in
> Solr.  To convince myself there wasn't a bug in Solr, i wrote a test case
> (i'll commit tomorow, bunch of churn in svn right now making "ant
> precommit" unhappy) to prove that when wrapping boost functions arround
> queries, Solr will only evaluate the functions for docs matching the
> wrapped query -- so there is no linear increase as the index size
> increases, just the (neccessary) libera increase as the number of
> *matching* docs grows. (for most functions anyway -- as mentioned "scale"
> is special).
>
> BUT! ... then i remembered how this thread started, and your goal of
> "scaling" the scores from a wrapped query.
>
> I want to be clear for 99% of the people reading this, if you find
> yourself writting a query structure like this...
>
>   q={!func}..functions involving wrapping $qq ...
>  qq={!edismax ...lots of stuff but still only matching subset of the
> index...}
>  fq={!query v=$qq}
>
> ...Try to restructure the match you want to do into the form of a
> multiplier
>
>   q={!boost b=$b v=$qq}
>   b=...functions producing a score multiplier...
>  qq={!edismax ...lots of stuff but still only matching subset of the
> index...}
>
> Because the later case is much more efficient and Solr will only compute
> the function values for hte docs it needs to (that match the wrapped $qq
> query)
>
> But for your specific goal Peter: Yes, if the whole point of a function
> you have is to wrap generated a "scaled" score of your base $qq, then the
> function (wrapping the scale(), wrapping the query()) is going to have to
> be evaluated for every doc -- that will definitely be linear based on the
> size of the index.
>
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: Configurable collectors for custom ranking

2013-12-10 Thread Peter Keegan
Hi Joel,

This is related to another thread on function query matching (
http://lucene.472066.n3.nabble.com/Function-query-matching-td4099807.html#a4105513).
The patch in SOLR-4465 will allow me to extend TopDocsCollector and perform
the 'scale' function on only the documents matching the main dismax query.
As you mention, it is a slightly intrusive design and requires that I
manage my own PriorityQueue (and a local duplicate of HitQueue), but should
work. I think a better design would hide the PQ from the plugin.

Thanks,
Peter


On Sun, Dec 8, 2013 at 5:32 PM, Joel Bernstein  wrote:

> Hi Peter,
>
> I've been meaning to revisit configurable ranking collectors, but I haven't
> yet had a chance. It's on the shortlist of things I'd like to tackle
> though.
>
>
>
> On Fri, Dec 6, 2013 at 4:17 PM, Peter Keegan 
> wrote:
>
> > I looked at SOLR-4465 and SOLR-5045, where it appears that there is a
> goal
> > to be able to do custom sorting and ranking in a PostFilter. So far, it
> > looks like only custom aggregation can be implemented in PostFilter
> (5045).
> > Custom sorting/ranking can be done in a pluggable collector (4465), but
> > this patch is no longer in dev.
> >
> > Is there any other dev. being done on adding custom sorting (after
> > collection) via a plugin?
> >
> > Thanks,
> > Peter
> >
>
>
>
> --
> Joel Bernstein
> Search Engineer at Heliosearch
>


Re: Configurable collectors for custom ranking

2013-12-10 Thread Peter Keegan
Quick question:
In the context of a custom collector, how does one get the values of a
field of type 'ExternalFileField'?

Thanks,
Peter


On Tue, Dec 10, 2013 at 1:18 PM, Peter Keegan wrote:

> Hi Joel,
>
> This is related to another thread on function query matching (
> http://lucene.472066.n3.nabble.com/Function-query-matching-td4099807.html#a4105513).
> The patch in SOLR-4465 will allow me to extend TopDocsCollector and perform
> the 'scale' function on only the documents matching the main dismax query.
> As you mention, it is a slightly intrusive design and requires that I
> manage my own PriorityQueue (and a local duplicate of HitQueue), but should
> work. I think a better design would hide the PQ from the plugin.
>
> Thanks,
> Peter
>
>
> On Sun, Dec 8, 2013 at 5:32 PM, Joel Bernstein  wrote:
>
>> Hi Peter,
>>
>> I've been meaning to revisit configurable ranking collectors, but I
>> haven't
>> yet had a chance. It's on the shortlist of things I'd like to tackle
>> though.
>>
>>
>>
>> On Fri, Dec 6, 2013 at 4:17 PM, Peter Keegan 
>> wrote:
>>
>> > I looked at SOLR-4465 and SOLR-5045, where it appears that there is a
>> goal
>> > to be able to do custom sorting and ranking in a PostFilter. So far, it
>> > looks like only custom aggregation can be implemented in PostFilter
>> (5045).
>> > Custom sorting/ranking can be done in a pluggable collector (4465), but
>> > this patch is no longer in dev.
>> >
>> > Is there any other dev. being done on adding custom sorting (after
>> > collection) via a plugin?
>> >
>> > Thanks,
>> > Peter
>> >
>>
>>
>>
>> --
>> Joel Bernstein
>> Search Engineer at Heliosearch
>>
>
>


Re: Configurable collectors for custom ranking

2013-12-11 Thread Peter Keegan
Hi Joel,

I thought about using a PostFilter, but the problem is that the 'scale'
function must be done after all matching docs have been scored but before
adding them to the PriorityQueue that sorts just the rows to be returned.
Doing the 'scale' function wrapped in a 'query' is proving to be too slow
when it visits every document in the index.

In the Collector, I can see how to get the field values like this:
indexSearcher.getSchema().getField("field(myfield").getType().getValueSource(SchemaField,
QParser).getValues()

But, 'getValueSource' needs a QParser, which isn't available.
And I can't create a QParser without a SolrQueryRequest, which isn't
available.

Thanks,
Peter


On Wed, Dec 11, 2013 at 1:48 PM, Joel Bernstein  wrote:

> Peter,
>
> It sounds like you could achieve what you want to do in a PostFilter rather
> then extending the TopDocsCollector. Is there a reason why a PostFilter
> won't work for you?
>
> Joel
>
>
> On Tue, Dec 10, 2013 at 3:24 PM, Peter Keegan  >wrote:
>
> > Quick question:
> > In the context of a custom collector, how does one get the values of a
> > field of type 'ExternalFileField'?
> >
> > Thanks,
> > Peter
> >
> >
> > On Tue, Dec 10, 2013 at 1:18 PM, Peter Keegan  > >wrote:
> >
> > > Hi Joel,
> > >
> > > This is related to another thread on function query matching (
> > >
> >
> http://lucene.472066.n3.nabble.com/Function-query-matching-td4099807.html#a4105513
> > ).
> > > The patch in SOLR-4465 will allow me to extend TopDocsCollector and
> > perform
> > > the 'scale' function on only the documents matching the main dismax
> > query.
> > > As you mention, it is a slightly intrusive design and requires that I
> > > manage my own PriorityQueue (and a local duplicate of HitQueue), but
> > should
> > > work. I think a better design would hide the PQ from the plugin.
> > >
> > > Thanks,
> > > Peter
> > >
> > >
> > > On Sun, Dec 8, 2013 at 5:32 PM, Joel Bernstein 
> > wrote:
> > >
> > >> Hi Peter,
> > >>
> > >> I've been meaning to revisit configurable ranking collectors, but I
> > >> haven't
> > >> yet had a chance. It's on the shortlist of things I'd like to tackle
> > >> though.
> > >>
> > >>
> > >>
> > >> On Fri, Dec 6, 2013 at 4:17 PM, Peter Keegan 
> > >> wrote:
> > >>
> > >> > I looked at SOLR-4465 and SOLR-5045, where it appears that there is
> a
> > >> goal
> > >> > to be able to do custom sorting and ranking in a PostFilter. So far,
> > it
> > >> > looks like only custom aggregation can be implemented in PostFilter
> > >> (5045).
> > >> > Custom sorting/ranking can be done in a pluggable collector (4465),
> > but
> > >> > this patch is no longer in dev.
> > >> >
> > >> > Is there any other dev. being done on adding custom sorting (after
> > >> > collection) via a plugin?
> > >> >
> > >> > Thanks,
> > >> > Peter
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Joel Bernstein
> > >> Search Engineer at Heliosearch
> > >>
> > >
> > >
> >
>
>
>
> --
> Joel Bernstein
> Search Engineer at Heliosearch
>


Re: Configurable collectors for custom ranking

2013-12-11 Thread Peter Keegan
From the Collector context, I suppose I can access the FileFloatSource
directly like this, although it's not generic:

SchemaField field = indexSearcher.getSchema().getField(fieldName);
dataDir = indexSearcher.getSchema().getResourceLoader().getDataDir();
ExternalFileField eff = (ExternalFileField)field.getType();
fieldValues = eff.getFileFloatSource(field, dataDir);

And then read the values in 'setNextReader'

Peter


On Wed, Dec 11, 2013 at 2:05 PM, Peter Keegan wrote:

> Hi Joel,
>
> I thought about using a PostFilter, but the problem is that the 'scale'
> function must be done after all matching docs have been scored but before
> adding them to the PriorityQueue that sorts just the rows to be returned.
> Doing the 'scale' function wrapped in a 'query' is proving to be too slow
> when it visits every document in the index.
>
> In the Collector, I can see how to get the field values like this:
> indexSearcher.getSchema().getField("field(myfield").getType().getValueSource(SchemaField,
> QParser).getValues()
>
> But, 'getValueSource' needs a QParser, which isn't available.
> And I can't create a QParser without a SolrQueryRequest, which isn't
> available.
>
> Thanks,
> Peter
>
>
> On Wed, Dec 11, 2013 at 1:48 PM, Joel Bernstein wrote:
>
>> Peter,
>>
>> It sounds like you could achieve what you want to do in a PostFilter
>> rather
>> then extending the TopDocsCollector. Is there a reason why a PostFilter
>> won't work for you?
>>
>> Joel
>>
>>
>> On Tue, Dec 10, 2013 at 3:24 PM, Peter Keegan > >wrote:
>>
>> > Quick question:
>> > In the context of a custom collector, how does one get the values of a
>> > field of type 'ExternalFileField'?
>> >
>> > Thanks,
>> > Peter
>> >
>> >
>> > On Tue, Dec 10, 2013 at 1:18 PM, Peter Keegan > > >wrote:
>> >
>> > > Hi Joel,
>> > >
>> > > This is related to another thread on function query matching (
>> > >
>> >
>> http://lucene.472066.n3.nabble.com/Function-query-matching-td4099807.html#a4105513
>> > ).
>> > > The patch in SOLR-4465 will allow me to extend TopDocsCollector and
>> > perform
>> > > the 'scale' function on only the documents matching the main dismax
>> > query.
>> > > As you mention, it is a slightly intrusive design and requires that I
>> > > manage my own PriorityQueue (and a local duplicate of HitQueue), but
>> > should
>> > > work. I think a better design would hide the PQ from the plugin.
>> > >
>> > > Thanks,
>> > > Peter
>> > >
>> > >
>> > > On Sun, Dec 8, 2013 at 5:32 PM, Joel Bernstein 
>> > wrote:
>> > >
>> > >> Hi Peter,
>> > >>
>> > >> I've been meaning to revisit configurable ranking collectors, but I
>> > >> haven't
>> > >> yet had a chance. It's on the shortlist of things I'd like to tackle
>> > >> though.
>> > >>
>> > >>
>> > >>
>> > >> On Fri, Dec 6, 2013 at 4:17 PM, Peter Keegan > >
>> > >> wrote:
>> > >>
>> > >> > I looked at SOLR-4465 and SOLR-5045, where it appears that there
>> is a
>> > >> goal
>> > >> > to be able to do custom sorting and ranking in a PostFilter. So
>> far,
>> > it
>> > >> > looks like only custom aggregation can be implemented in PostFilter
>> > >> (5045).
>> > >> > Custom sorting/ranking can be done in a pluggable collector (4465),
>> > but
>> > >> > this patch is no longer in dev.
>> > >> >
>> > >> > Is there any other dev. being done on adding custom sorting (after
>> > >> > collection) via a plugin?
>> > >> >
>> > >> > Thanks,
>> > >> > Peter
>> > >> >
>> > >>
>> > >>
>> > >>
>> > >> --
>> > >> Joel Bernstein
>> > >> Search Engineer at Heliosearch
>> > >>
>> > >
>> > >
>> >
>>
>>
>>
>> --
>> Joel Bernstein
>> Search Engineer at Heliosearch
>>
>
>


Re: Configurable collectors for custom ranking

2013-12-11 Thread Peter Keegan
This is what I was looking for, but the DelegatingCollector 'finish' method
doesn't exist in 4.3.0 :(   Can this be patched in and are there any other
PostFilter dependencies on 4.5?

Thanks,
Peter


On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein  wrote:

> Here is one approach to use in a postfilter
>
> 1) In the collect() method call score for each doc. Use the scores to
> create your scaleInfo.
> 2) Keep a bitset of the hits and a priorityQueue of your top X ScoreDocs.
> 3) Don't delegate any documents to lower collectors in the collect()
> method.
> 4) In the finish method create a score mapping (use the hppc
> IntFloatOpenHashMap) with your top X docIds pointing to their score, using
> the priorityQueue created in step 2. Then iterate the bitset (also created
> in step 2) sending down each doc to the lower collectors, retrieving and
> scaling the score from the score map. If the document is not in the score
> map then send down 0.
>
> You'll have setup a dummy scorer to feed to lower collectors. The
> CollapsingQParserPlugin has an example of how to do this.
>
>
>
>
> On Wed, Dec 11, 2013 at 2:05 PM, Peter Keegan  >wrote:
>
> > Hi Joel,
> >
> > I thought about using a PostFilter, but the problem is that the 'scale'
> > function must be done after all matching docs have been scored but before
> > adding them to the PriorityQueue that sorts just the rows to be returned.
> > Doing the 'scale' function wrapped in a 'query' is proving to be too slow
> > when it visits every document in the index.
> >
> > In the Collector, I can see how to get the field values like this:
> >
> >
> indexSearcher.getSchema().getField("field(myfield").getType().getValueSource(SchemaField,
> > QParser).getValues()
> >
> > But, 'getValueSource' needs a QParser, which isn't available.
> > And I can't create a QParser without a SolrQueryRequest, which isn't
> > available.
> >
> > Thanks,
> > Peter
> >
> >
> > On Wed, Dec 11, 2013 at 1:48 PM, Joel Bernstein 
> > wrote:
> >
> > > Peter,
> > >
> > > It sounds like you could achieve what you want to do in a PostFilter
> > rather
> > > then extending the TopDocsCollector. Is there a reason why a PostFilter
> > > won't work for you?
> > >
> > > Joel
> > >
> > >
> > > On Tue, Dec 10, 2013 at 3:24 PM, Peter Keegan  > > >wrote:
> > >
> > > > Quick question:
> > > > In the context of a custom collector, how does one get the values of
> a
> > > > field of type 'ExternalFileField'?
> > > >
> > > > Thanks,
> > > > Peter
> > > >
> > > >
> > > > On Tue, Dec 10, 2013 at 1:18 PM, Peter Keegan <
> peterlkee...@gmail.com
> > > > >wrote:
> > > >
> > > > > Hi Joel,
> > > > >
> > > > > This is related to another thread on function query matching (
> > > > >
> > > >
> > >
> >
> http://lucene.472066.n3.nabble.com/Function-query-matching-td4099807.html#a4105513
> > > > ).
> > > > > The patch in SOLR-4465 will allow me to extend TopDocsCollector and
> > > > perform
> > > > > the 'scale' function on only the documents matching the main dismax
> > > > query.
> > > > > As you mention, it is a slightly intrusive design and requires
> that I
> > > > > manage my own PriorityQueue (and a local duplicate of HitQueue),
> but
> > > > should
> > > > > work. I think a better design would hide the PQ from the plugin.
> > > > >
> > > > > Thanks,
> > > > > Peter
> > > > >
> > > > >
> > > > > On Sun, Dec 8, 2013 at 5:32 PM, Joel Bernstein  >
> > > > wrote:
> > > > >
> > > > >> Hi Peter,
> > > > >>
> > > > >> I've been meaning to revisit configurable ranking collectors, but
> I
> > > > >> haven't
> > > > >> yet had a chance. It's on the shortlist of things I'd like to
> tackle
> > > > >> though.
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Fri, Dec 6, 2013 at 4:17 PM, Peter Keegan <
> > peterlkee...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> > I looked at SOLR-4465 and SOLR-5045, where it appears that there
> > is
> > > a
> > > > >> goal
> > > > >> > to be able to do custom sorting and ranking in a PostFilter. So
> > far,
> > > > it
> > > > >> > looks like only custom aggregation can be implemented in
> > PostFilter
> > > > >> (5045).
> > > > >> > Custom sorting/ranking can be done in a pluggable collector
> > (4465),
> > > > but
> > > > >> > this patch is no longer in dev.
> > > > >> >
> > > > >> > Is there any other dev. being done on adding custom sorting
> (after
> > > > >> > collection) via a plugin?
> > > > >> >
> > > > >> > Thanks,
> > > > >> > Peter
> > > > >> >
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Joel Bernstein
> > > > >> Search Engineer at Heliosearch
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Joel Bernstein
> > > Search Engineer at Heliosearch
> > >
> >
>
>
>
> --
> Joel Bernstein
> Search Engineer at Heliosearch
>


Re: Configurable collectors for custom ranking

2013-12-11 Thread Peter Keegan
Thanks very much for the guidance. I'd be happy to donate a working
solution.

Peter


On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein  wrote:

> SOLR-5020 has the commit info, it's mainly changes to SolrIndexSearcher I
> believe. They might apply to 4.3.
> I think as long you have the finish method that's all you'll need. If you
> can get this working it would be excellent if you could donate back the
> Scale PostFilter.
>
>
> On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan  >wrote:
>
> > This is what I was looking for, but the DelegatingCollector 'finish'
> method
> > doesn't exist in 4.3.0 :(   Can this be patched in and are there any
> other
> > PostFilter dependencies on 4.5?
> >
> > Thanks,
> > Peter
> >
> >
> > On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein 
> > wrote:
> >
> > > Here is one approach to use in a postfilter
> > >
> > > 1) In the collect() method call score for each doc. Use the scores to
> > > create your scaleInfo.
> > > 2) Keep a bitset of the hits and a priorityQueue of your top X
> ScoreDocs.
> > > 3) Don't delegate any documents to lower collectors in the collect()
> > > method.
> > > 4) In the finish method create a score mapping (use the hppc
> > > IntFloatOpenHashMap) with your top X docIds pointing to their score,
> > using
> > > the priorityQueue created in step 2. Then iterate the bitset (also
> > created
> > > in step 2) sending down each doc to the lower collectors, retrieving
> and
> > > scaling the score from the score map. If the document is not in the
> score
> > > map then send down 0.
> > >
> > > You'll have setup a dummy scorer to feed to lower collectors. The
> > > CollapsingQParserPlugin has an example of how to do this.
> > >
> > >
> > >
> > >
> > > On Wed, Dec 11, 2013 at 2:05 PM, Peter Keegan  > > >wrote:
> > >
> > > > Hi Joel,
> > > >
> > > > I thought about using a PostFilter, but the problem is that the
> 'scale'
> > > > function must be done after all matching docs have been scored but
> > before
> > > > adding them to the PriorityQueue that sorts just the rows to be
> > returned.
> > > > Doing the 'scale' function wrapped in a 'query' is proving to be too
> > slow
> > > > when it visits every document in the index.
> > > >
> > > > In the Collector, I can see how to get the field values like this:
> > > >
> > > >
> > >
> >
> indexSearcher.getSchema().getField("field(myfield").getType().getValueSource(SchemaField,
> > > > QParser).getValues()
> > > >
> > > > But, 'getValueSource' needs a QParser, which isn't available.
> > > > And I can't create a QParser without a SolrQueryRequest, which isn't
> > > > available.
> > > >
> > > > Thanks,
> > > > Peter
> > > >
> > > >
> > > > On Wed, Dec 11, 2013 at 1:48 PM, Joel Bernstein 
> > > > wrote:
> > > >
> > > > > Peter,
> > > > >
> > > > > It sounds like you could achieve what you want to do in a
> PostFilter
> > > > rather
> > > > > then extending the TopDocsCollector. Is there a reason why a
> > PostFilter
> > > > > won't work for you?
> > > > >
> > > > > Joel
> > > > >
> > > > >
> > > > > On Tue, Dec 10, 2013 at 3:24 PM, Peter Keegan <
> > peterlkee...@gmail.com
> > > > > >wrote:
> > > > >
> > > > > > Quick question:
> > > > > > In the context of a custom collector, how does one get the values
> > of
> > > a
> > > > > > field of type 'ExternalFileField'?
> > > > > >
> > > > > > Thanks,
> > > > > > Peter
> > > > > >
> > > > > >
> > > > > > On Tue, Dec 10, 2013 at 1:18 PM, Peter Keegan <
> > > peterlkee...@gmail.com
> > > > > > >wrote:
> > > > > >
> > > > > > > Hi Joel,
> > > > > > >
> > > > > > > This is related to another thread on function query matching (
> > > > > > >
> > > > > >
> > > > >

Re: Configurable collectors for custom ranking

2013-12-12 Thread Peter Keegan
Regarding my original goal, which is to perform a math function using the
scaled score and a field value, and sort on the result, how does this fit
in? Must I implement another custom PostFilter with a higher cost than the
scale PostFilter?

Thanks,
Peter


On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan wrote:

> Thanks very much for the guidance. I'd be happy to donate a working
> solution.
>
> Peter
>
>
> On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein wrote:
>
>> SOLR-5020 has the commit info, it's mainly changes to SolrIndexSearcher I
>> believe. They might apply to 4.3.
>> I think as long you have the finish method that's all you'll need. If you
>> can get this working it would be excellent if you could donate back the
>> Scale PostFilter.
>>
>>
>> On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan > >wrote:
>>
>> > This is what I was looking for, but the DelegatingCollector 'finish'
>> method
>> > doesn't exist in 4.3.0 :(   Can this be patched in and are there any
>> other
>> > PostFilter dependencies on 4.5?
>> >
>> > Thanks,
>> > Peter
>> >
>> >
>> > On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein 
>> > wrote:
>> >
>> > > Here is one approach to use in a postfilter
>> > >
>> > > 1) In the collect() method call score for each doc. Use the scores to
>> > > create your scaleInfo.
>> > > 2) Keep a bitset of the hits and a priorityQueue of your top X
>> ScoreDocs.
>> > > 3) Don't delegate any documents to lower collectors in the collect()
>> > > method.
>> > > 4) In the finish method create a score mapping (use the hppc
>> > > IntFloatOpenHashMap) with your top X docIds pointing to their score,
>> > using
>> > > the priorityQueue created in step 2. Then iterate the bitset (also
>> > created
>> > > in step 2) sending down each doc to the lower collectors, retrieving
>> and
>> > > scaling the score from the score map. If the document is not in the
>> score
>> > > map then send down 0.
>> > >
>> > > You'll have setup a dummy scorer to feed to lower collectors. The
>> > > CollapsingQParserPlugin has an example of how to do this.
>> > >
>> > >
>> > >
>> > >
>> > > On Wed, Dec 11, 2013 at 2:05 PM, Peter Keegan > > > >wrote:
>> > >
>> > > > Hi Joel,
>> > > >
>> > > > I thought about using a PostFilter, but the problem is that the
>> 'scale'
>> > > > function must be done after all matching docs have been scored but
>> > before
>> > > > adding them to the PriorityQueue that sorts just the rows to be
>> > returned.
>> > > > Doing the 'scale' function wrapped in a 'query' is proving to be too
>> > slow
>> > > > when it visits every document in the index.
>> > > >
>> > > > In the Collector, I can see how to get the field values like this:
>> > > >
>> > > >
>> > >
>> >
>> indexSearcher.getSchema().getField("field(myfield").getType().getValueSource(SchemaField,
>> > > > QParser).getValues()
>> > > >
>> > > > But, 'getValueSource' needs a QParser, which isn't available.
>> > > > And I can't create a QParser without a SolrQueryRequest, which isn't
>> > > > available.
>> > > >
>> > > > Thanks,
>> > > > Peter
>> > > >
>> > > >
>> > > > On Wed, Dec 11, 2013 at 1:48 PM, Joel Bernstein > >
>> > > > wrote:
>> > > >
>> > > > > Peter,
>> > > > >
>> > > > > It sounds like you could achieve what you want to do in a
>> PostFilter
>> > > > rather
>> > > > > then extending the TopDocsCollector. Is there a reason why a
>> > PostFilter
>> > > > > won't work for you?
>> > > > >
>> > > > > Joel
>> > > > >
>> > > > >
>> > > > > On Tue, Dec 10, 2013 at 3:24 PM, Peter Keegan <
>> > peterlkee...@gmail.com
>> > > > > >wrote:
>> > > > >
>> > > > > > Quick question:
>> > > > > > In the context of a 

Re: Configurable collectors for custom ranking

2013-12-12 Thread Peter Keegan
This is pretty cool, and worthy of adding to Solr in Action (v2) and the
other books. With function queries, flexible filter processing and caching,
custom collectors, and post filters, there's a lot of flexibility here.

Btw, the query times using a custom collector to scale/recompute scores are
excellent (will have to see how they compare to your outlined solution).

Thanks,
Peter


On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein  wrote:

> The sorting is going to happen in the lower level collectors. You need a
> value source that returns the score of the document being collected.
>
> Here is how you can make this happen:
>
> 1) Create an object in your PostFilter that simply holds the current score.
> Place this object in the SearchRequest context map. Update object.score as
> you pass the docs and scores to the lower collectors.
>
> 2) Create a values source that checks the SearchRequest context for the
> object that's holding the current score. Use this object to return the
> current score when called. For example if you give the value source a
> handle called "score" a compound function call will look like this:
> sum(score(), field(x))
>
> Joel
>
>
>
>
>
>
>
>
>
>
> On Thu, Dec 12, 2013 at 9:58 AM, Peter Keegan  >wrote:
>
> > Regarding my original goal, which is to perform a math function using the
> > scaled score and a field value, and sort on the result, how does this fit
> > in? Must I implement another custom PostFilter with a higher cost than
> the
> > scale PostFilter?
> >
> > Thanks,
> > Peter
> >
> >
> > On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan  > >wrote:
> >
> > > Thanks very much for the guidance. I'd be happy to donate a working
> > > solution.
> > >
> > > Peter
> > >
> > >
> > > On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein  > >wrote:
> > >
> > >> SOLR-5020 has the commit info, it's mainly changes to
> SolrIndexSearcher
> > I
> > >> believe. They might apply to 4.3.
> > >> I think as long you have the finish method that's all you'll need. If
> > you
> > >> can get this working it would be excellent if you could donate back
> the
> > >> Scale PostFilter.
> > >>
> > >>
> > >> On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan  > >> >wrote:
> > >>
> > >> > This is what I was looking for, but the DelegatingCollector 'finish'
> > >> method
> > >> > doesn't exist in 4.3.0 :(   Can this be patched in and are there any
> > >> other
> > >> > PostFilter dependencies on 4.5?
> > >> >
> > >> > Thanks,
> > >> > Peter
> > >> >
> > >> >
> > >> > On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein  >
> > >> > wrote:
> > >> >
> > >> > > Here is one approach to use in a postfilter
> > >> > >
> > >> > > 1) In the collect() method call score for each doc. Use the scores
> > to
> > >> > > create your scaleInfo.
> > >> > > 2) Keep a bitset of the hits and a priorityQueue of your top X
> > >> ScoreDocs.
> > >> > > 3) Don't delegate any documents to lower collectors in the
> collect()
> > >> > > method.
> > >> > > 4) In the finish method create a score mapping (use the hppc
> > >> > > IntFloatOpenHashMap) with your top X docIds pointing to their
> score,
> > >> > using
> > >> > > the priorityQueue created in step 2. Then iterate the bitset (also
> > >> > created
> > >> > > in step 2) sending down each doc to the lower collectors,
> retrieving
> > >> and
> > >> > > scaling the score from the score map. If the document is not in
> the
> > >> score
> > >> > > map then send down 0.
> > >> > >
> > >> > > You'll have setup a dummy scorer to feed to lower collectors. The
> > >> > > CollapsingQParserPlugin has an example of how to do this.
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Wed, Dec 11, 2013 at 2:05 PM, Peter Keegan <
> > peterlkee...@gmail.com
> > >> > > >wrote:
> > >> > >
> > >> > > > Hi Joel,
> > >> > > >
> > >> >

Re: Configurable collectors for custom ranking

2013-12-19 Thread Peter Keegan
In order to size the PriorityQueue, the result window size for the query is
needed. This is computed in SolrIndexSearcher and available via
QueryCommand.getSupersetMaxDoc(), but it doesn't seem to be reachable from the
PostFilter through either the SolrParams or the SolrQueryRequest. Is there a
way to get this precomputed value, or do I have to duplicate the logic from
SolrIndexSearcher?

Thanks,
Peter
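
If the precomputed value turns out to be unreachable, duplicating the rounding
is small; the following is a sketch that approximates what SolrIndexSearcher
does in 4.x, assuming 'params' is the request's SolrParams and 'windowSize' is
the queryResultWindowSize value from solrconfig.xml:

    int start = params.getInt(CommonParams.START, 0);
    int rows = params.getInt(CommonParams.ROWS, 10);
    int requested = start + rows;
    // Round up to the next multiple of the query result window, since the
    // query result cache stores a superset window of the requested page.
    int supersetMaxDoc = (requested < windowSize)
            ? windowSize
            : ((requested - 1) / windowSize + 1) * windowSize;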


On Thu, Dec 12, 2013 at 1:53 PM, Joel Bernstein  wrote:

> Thanks, I agree this powerful stuff. One of the reasons that I haven't
> gotten back to pluggable collectors is that I've been using PostFilters
> instead.
>
> When you start doing stuff with scores in postfilters you'll run into the
> bug in SOLR-5416. This will effect you when you use facets in combination
> with the QueryResultCache or tag and exclude faceting.
>
> The patch in SOLR-5416 resolves this issue. You'll just need your
> PostFilter to implement ScoreFilter and the SolrIndexSearcher will know how
> to handle things.
>
> The DelegatingCollector.finish() method is so new, these kinds of bugs are
> still being cleaned out of the system. SOLR-5416 should be in Solr 4.7.
>
>
>
>
>
>
>
>
>
> On Thu, Dec 12, 2013 at 12:54 PM, Peter Keegan  >wrote:
>
> > This is pretty cool, and worthy of adding to Solr in Action (v2) and the
> > other books. With function queries, flexible filter processing and
> caching,
> > custom collectors, and post filters, there's a lot of flexibility here.
> >
> > Btw, the query times using a custom collector to scale/recompute scores
> is
> > excellent (will have to see how it compares to your outlined solution).
> >
> > Thanks,
> > Peter
> >
> >
> > On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein 
> > wrote:
> >
> > > The sorting is going to happen in the lower level collectors. You need
> a
> > > value source that returns the score of the document being collected.
> > >
> > > Here is how you can make this happen:
> > >
> > > 1) Create an object in your PostFilter that simply holds the current
> > score.
> > > Place this object in the SearchRequest context map. Update object.score
> > as
> > > you pass the docs and scores to the lower collectors.
> > >
> > > 2) Create a values source that checks the SearchRequest context for the
> > > object that's holding the current score. Use this object to return the
> > > current score when called. For example if you give the value source a
> > > handle called "score" a compound function call will look like this:
> > > sum(score(), field(x))
> > >
> > > Joel
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Dec 12, 2013 at 9:58 AM, Peter Keegan  > > >wrote:
> > >
> > > > Regarding my original goal, which is to perform a math function using
> > the
> > > > scaled score and a field value, and sort on the result, how does this
> > fit
> > > > in? Must I implement another custom PostFilter with a higher cost
> than
> > > the
> > > > scale PostFilter?
> > > >
> > > > Thanks,
> > > > Peter
> > > >
> > > >
> > > > On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan <
> peterlkee...@gmail.com
> > > > >wrote:
> > > >
> > > > > Thanks very much for the guidance. I'd be happy to donate a working
> > > > > solution.
> > > > >
> > > > > Peter
> > > > >
> > > > >
> > > > > On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein <
> joels...@gmail.com
> > > > >wrote:
> > > > >
> > > > >> SOLR-5020 has the commit info, it's mainly changes to
> > > SolrIndexSearcher
> > > > I
> > > > >> believe. They might apply to 4.3.
> > > > >> I think as long you have the finish method that's all you'll need.
> > If
> > > > you
> > > > >> can get this working it would be excellent if you could donate
> back
> > > the
> > > > >> Scale PostFilter.
> > > > >>
> > > > >>
> > > > >> On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan <
> > peterlkee...@gmail.com
> > > > >> >wrote:
> > > > >>
> > > > >> > This is what I was looking for, but 

Re: Configurable collectors for custom ranking

2013-12-19 Thread Peter Keegan
I implemented the PostFilter approach described by Joel. Just iterating
over the OpenBitSet, even without the scaling or the HashMap lookup, added
30ms to the query time, which kinda surprised me. There were about 150K hits
out of a total of 500K documents. Is OpenBitSet the best way to do this?

Thanks,
Peter
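
For context, the finish-time loop being timed looks roughly like this (a sketch
only: 'hits' is the OpenBitSet built in collect(), 'scores' is the hppc
IntFloatOpenHashMap of top-N docs, and 'dummyScorer'/'delegate' follow the
CollapsingQParserPlugin pattern; the names and scale() helper are illustrative):

    int doc = hits.nextSetBit(0);
    while (doc != -1) {
        // hppc maps return 0 for missing keys, which matches the "send down 0" rule
        dummyScorer.score = scale(scores.get(doc));
        delegate.collect(doc);               // pass the doc to the lower collectors
        doc = hits.nextSetBit(doc + 1);
    }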


On Thu, Dec 19, 2013 at 9:51 AM, Peter Keegan wrote:

> In order to size the PriorityQueue, the result window size for the query
> is needed. This has been computed in the SolrIndexSearcher and available
> in: QueryCommand.getSupersetMaxDoc(), but doesn't seem to be available for
> the PostFilter in either the SolrParms or SolrQueryRequest. Is there a way
> to get this precomputed value or do I have to duplicate the logic from
> SolrIndexSearcher?
>
> Thanks,
> Peter
>
>
> On Thu, Dec 12, 2013 at 1:53 PM, Joel Bernstein wrote:
>
>> Thanks, I agree this powerful stuff. One of the reasons that I haven't
>> gotten back to pluggable collectors is that I've been using PostFilters
>> instead.
>>
>> When you start doing stuff with scores in postfilters you'll run into the
>> bug in SOLR-5416. This will effect you when you use facets in combination
>> with the QueryResultCache or tag and exclude faceting.
>>
>> The patch in SOLR-5416 resolves this issue. You'll just need your
>> PostFilter to implement ScoreFilter and the SolrIndexSearcher will know
>> how
>> to handle things.
>>
>> The DelegatingCollector.finish() method is so new, these kinds of bugs are
>> still being cleaned out of the system. SOLR-5416 should be in Solr 4.7.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Thu, Dec 12, 2013 at 12:54 PM, Peter Keegan > >wrote:
>>
>> > This is pretty cool, and worthy of adding to Solr in Action (v2) and the
>> > other books. With function queries, flexible filter processing and
>> caching,
>> > custom collectors, and post filters, there's a lot of flexibility here.
>> >
>> > Btw, the query times using a custom collector to scale/recompute scores
>> is
>> > excellent (will have to see how it compares to your outlined solution).
>> >
>> > Thanks,
>> > Peter
>> >
>> >
>> > On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein 
>> > wrote:
>> >
>> > > The sorting is going to happen in the lower level collectors. You
>> need a
>> > > value source that returns the score of the document being collected.
>> > >
>> > > Here is how you can make this happen:
>> > >
>> > > 1) Create an object in your PostFilter that simply holds the current
>> > score.
>> > > Place this object in the SearchRequest context map. Update
>> object.score
>> > as
>> > > you pass the docs and scores to the lower collectors.
>> > >
>> > > 2) Create a values source that checks the SearchRequest context for
>> the
>> > > object that's holding the current score. Use this object to return the
>> > > current score when called. For example if you give the value source a
>> > > handle called "score" a compound function call will look like this:
>> > > sum(score(), field(x))
>> > >
>> > > Joel
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Thu, Dec 12, 2013 at 9:58 AM, Peter Keegan > > > >wrote:
>> > >
>> > > > Regarding my original goal, which is to perform a math function
>> using
>> > the
>> > > > scaled score and a field value, and sort on the result, how does
>> this
>> > fit
>> > > > in? Must I implement another custom PostFilter with a higher cost
>> than
>> > > the
>> > > > scale PostFilter?
>> > > >
>> > > > Thanks,
>> > > > Peter
>> > > >
>> > > >
>> > > > On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan <
>> peterlkee...@gmail.com
>> > > > >wrote:
>> > > >
>> > > > > Thanks very much for the guidance. I'd be happy to donate a
>> working
>> > > > > solution.
>> > > > >
>> > > > > Peter
>> > > > >
>> > > > >
>> > > > > On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein <
>> joels...@gmail.com
>> > >

Re: Configurable collectors for custom ranking

2013-12-23 Thread Peter Keegan
Hi Joel,

Could you clarify what would be in the key/value Map added to the
SearchRequest context? It seems that all the docId/score tuples need to be
there, including the ones not in the 'top N ScoreDocs' PriorityQueue
(score=0). If so, would the entry be something like:
"scaled_scores" -> a Map of docId to scaled score?

Also, what is the reason for passing score=0 for documents that aren't in
the PriorityQueue? Will these docs get filtered out before a normal sort by
score?

Thanks,
Peter


On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein  wrote:

> The sorting is going to happen in the lower level collectors. You need a
> value source that returns the score of the document being collected.
>
> Here is how you can make this happen:
>
> 1) Create an object in your PostFilter that simply holds the current score.
> Place this object in the SearchRequest context map. Update object.score as
> you pass the docs and scores to the lower collectors.
>
> 2) Create a values source that checks the SearchRequest context for the
> object that's holding the current score. Use this object to return the
> current score when called. For example if you give the value source a
> handle called "score" a compound function call will look like this:
> sum(score(), field(x))
>
> Joel
>
>
>
>
>
>
>
>
>
>
> On Thu, Dec 12, 2013 at 9:58 AM, Peter Keegan  >wrote:
>
> > Regarding my original goal, which is to perform a math function using the
> > scaled score and a field value, and sort on the result, how does this fit
> > in? Must I implement another custom PostFilter with a higher cost than
> the
> > scale PostFilter?
> >
> > Thanks,
> > Peter
> >
> >
> > On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan  > >wrote:
> >
> > > Thanks very much for the guidance. I'd be happy to donate a working
> > > solution.
> > >
> > > Peter
> > >
> > >
> > > On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein  > >wrote:
> > >
> > >> SOLR-5020 has the commit info, it's mainly changes to
> SolrIndexSearcher
> > I
> > >> believe. They might apply to 4.3.
> > >> I think as long you have the finish method that's all you'll need. If
> > you
> > >> can get this working it would be excellent if you could donate back
> the
> > >> Scale PostFilter.
> > >>
> > >>
> > >> On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan  > >> >wrote:
> > >>
> > >> > This is what I was looking for, but the DelegatingCollector 'finish'
> > >> method
> > >> > doesn't exist in 4.3.0 :(   Can this be patched in and are there any
> > >> other
> > >> > PostFilter dependencies on 4.5?
> > >> >
> > >> > Thanks,
> > >> > Peter
> > >> >
> > >> >
> > >> > On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein  >
> > >> > wrote:
> > >> >
> > >> > > Here is one approach to use in a postfilter
> > >> > >
> > >> > > 1) In the collect() method call score for each doc. Use the scores
> > to
> > >> > > create your scaleInfo.
> > >> > > 2) Keep a bitset of the hits and a priorityQueue of your top X
> > >> ScoreDocs.
> > >> > > 3) Don't delegate any documents to lower collectors in the
> collect()
> > >> > > method.
> > >> > > 4) In the finish method create a score mapping (use the hppc
> > >> > > IntFloatOpenHashMap) with your top X docIds pointing to their
> score,
> > >> > using
> > >> > > the priorityQueue created in step 2. Then iterate the bitset (also
> > >> > created
> > >> > > in step 2) sending down each doc to the lower collectors,
> retrieving
> > >> and
> > >> > > scaling the score from the score map. If the document is not in
> the
> > >> score
> > >> > > map then send down 0.
> > >> > >
> > >> > > You'll have setup a dummy scorer to feed to lower collectors. The
> > >> > > CollapsingQParserPlugin has an example of how to do this.
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Wed, Dec 11, 2013 at 2:05 PM, Peter Keegan <
> > peterlkee...@gmail.com
> > >> > > >wrote:
> > 

Re: Configurable collectors for custom ranking

2013-12-26 Thread Peter Keegan
In my case, the final function call looks something like this:
sum(product($k1,score()),product($k2,field(x)))
This means that all the scores would have to be scaled and passed down, not
just the top N, because even a low score could be offset by a high value in
'field(x)'.

Thanks,
Peter
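
Joel's outline (quoted below) relies on a custom value source that returns
whatever score the PostFilter has placed in the request context. A minimal
sketch of that piece against the 4.x APIs could look like this; the class
names, the "scoreHolder" context key, and the solrconfig registration name
are assumptions, not existing Solr code:

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.FloatDocValues;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.FunctionQParser;
import org.apache.solr.search.SyntaxError;
import org.apache.solr.search.ValueSourceParser;

// Registered in solrconfig.xml, e.g.:
//   <valueSourceParser name="score" class="com.example.ScoreValueSourceParser"/>
// so that sort=sum(product($k1,score()),product($k2,field(x))) desc can resolve score().
public class ScoreValueSourceParser extends ValueSourceParser {

    public static final String CTX_KEY = "scoreHolder";

    // The scale PostFilter is expected to put one of these into the request
    // context and update 'score' before delegating each document in finish().
    public static class ScoreHolder {
        public float score;
    }

    @Override
    public ValueSource parse(FunctionQParser fp) throws SyntaxError {
        final SolrQueryRequest req = fp.getReq();
        return new ValueSource() {
            @Override
            public FunctionValues getValues(Map context, AtomicReaderContext readerContext)
                    throws IOException {
                // Look the holder up lazily so the PostFilter has had a chance to install it
                final ScoreHolder holder = (ScoreHolder) req.getContext().get(CTX_KEY);
                return new FloatDocValues(this) {
                    @Override
                    public float floatVal(int doc) {
                        return holder == null ? 0f : holder.score;
                    }
                };
            }

            @Override
            public boolean equals(Object o) {
                return o == this;
            }

            @Override
            public int hashCode() {
                return System.identityHashCode(this);
            }

            @Override
            public String description() {
                return "current score from the scale PostFilter";
            }
        };
    }
}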


On Mon, Dec 23, 2013 at 6:37 PM, Joel Bernstein  wrote:

> Peter,
>
> You actually only need the current score being collected to be in the
> request context. So you don't need a map, you just need an object wrapper
> around a mutable float.
>
> If you have a page size of X, only the top X scores need to be held onto,
> because all the other scores wouldn't have made it into that page anyway so
> they might as well be 0. Because the QueryResultCache caches's a larger
> window then the page size you should keep enough scores so the cached
> docList is correct. But if you're only dealing with 150K of results you
> could just keep all the scores in a FloatArrayList and not worry about the
> keeping the top X scores in a priority queue.
>
> During the collect hang onto the docIds and scores and build your scaling
> info.
>
> During the finish iterate your docIds and scale the scores as you go.
>
> Set your scaled score into the object wrapper that is in the request
> context before you collect each document.
>
> When you call collect on the delegate collectors they will call the custom
> value source for each document to perform the sort. Your custom value
> source will return whatever the float value is in the request context at
> that time.
>
> If you're also going to run this postfilter when you're doing a standard
> rank by score you'll also need to send down a dummy scorer to the delegate
> collectors. Spend some time with the CollapsingQParserPlugin in trunk to
> see how the dummy scorer works.
>
> I'll be adding value source collapse criteria to the
> CollapsingQParserPlugin this week and it will have a similar interaction
> between a PostFilter and value source. So you may want to watch SOLR-5536
> to see an example of this.
>
> Joel
>
>
>
>
>
>
>
>
>
>
>
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Mon, Dec 23, 2013 at 4:03 PM, Peter Keegan  >wrote:
>
> > Hi Joel,
> >
> > Could you clarify what would be in the key,value Map added to the
> > SearchRequest context? It seems that all the docId/score tuples need to
> be
> > there, including the ones not in the 'top N ScoreDocs' PriorityQueue
> > (score=0). If so would the Map be something like:
> > "scaled_scores",Map ?
> >
> > Also, what is the reason for passing score=0 for documents that aren't in
> > the PriorityQueue? Will these docs get filtered out before a normal sort
> by
> > score?
> >
> > Thanks,
> > Peter
> >
> >
> > On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein 
> > wrote:
> >
> > > The sorting is going to happen in the lower level collectors. You need
> a
> > > value source that returns the score of the document being collected.
> > >
> > > Here is how you can make this happen:
> > >
> > > 1) Create an object in your PostFilter that simply holds the current
> > score.
> > > Place this object in the SearchRequest context map. Update object.score
> > as
> > > you pass the docs and scores to the lower collectors.
> > >
> > > 2) Create a values source that checks the SearchRequest context for the
> > > object that's holding the current score. Use this object to return the
> > > current score when called. For example if you give the value source a
> > > handle called "score" a compound function call will look like this:
> > > sum(score(), field(x))
> > >
> > > Joel
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Dec 12, 2013 at 9:58 AM, Peter Keegan  > > >wrote:
> > >
> > > > Regarding my original goal, which is to perform a math function using
> > the
> > > > scaled score and a field value, and sort on the result, how does this
> > fit
> > > > in? Must I implement another custom PostFilter with a higher cost
> than
> > > the
> > > > scale PostFilter?
> > > >
> > > > Thanks,
> > > > Peter
> > > >
> > > >
> > > > On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan <
> peterlkee...@gmail.com
> > > > >wrote:
> >

how to include result ordinal in response

2014-01-03 Thread Peter Keegan
Is there a simple way to output the result number (ordinal) with each
returned document using the 'fl' parameter? This would be useful when
visually comparing the results from 2 queries.

Thanks,
Peter


Re: how to include result ordinal in response

2014-01-04 Thread Peter Keegan
Thank you both. The DocTransformer solution was very simple:

import java.io.IOException;

import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.transform.DocTransformer;
import org.apache.solr.response.transform.TransformerFactory;

public class PositionAugmenterFactory extends TransformerFactory {

    @Override
    public DocTransformer create(String field, SolrParams params,
                                 SolrQueryRequest req) {
        // 'field' is the display label requested in the fl parameter
        return new PositionAugmenter(field);
    }

    class PositionAugmenter extends DocTransformer {
        final String name;
        int position;

        public PositionAugmenter(String display) {
            this.name = display;
            this.position = 1;   // 1-based ordinal within the returned results
        }

        @Override
        public String getName() {
            return name;
        }

        @Override
        public void transform(SolrDocument doc, int docid) throws IOException {
            // Add the running result number to each returned document
            doc.setField(name, position++);
        }
    }
}
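
For anyone wiring this up: a TransformerFactory like this is registered in
solrconfig.xml (the name and package below are placeholders):

    <transformer name="position" class="com.example.PositionAugmenterFactory"/>

and then requested with fl=*,[position]. The added field is labeled
"[position]" by default, or it can be renamed with an alias such as
fl=*,pos:[position].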

@Jack: fl=[docid] is similar to using the uniqueKey, but still hard to
compare visually (for me).

The fields are not returned in the same order as specified in the 'fl'
parameter. Can the order be overridden?

Thanks,
Peter




On Fri, Jan 3, 2014 at 6:58 PM, Jack Krupansky wrote:

> Or just use the internal document ID: fl=*,[docid]
>
> Granted, the docID may change if a segment merge occurs and earlier
> documents have been deleted, but it may be sufficient for your purposes.
>
> -- Jack Krupansky
>
> -Original Message- From: Upayavira
> Sent: Friday, January 03, 2014 5:58 PM
> To: solr-user@lucene.apache.org
> Subject: Re: how to include result ordinal in response
>
>
> On Fri, Jan 3, 2014, at 10:00 PM, Peter Keegan wrote:
>
>> Is there a simple way to output the result number (ordinal) with each
>> returned document using the 'fl' parameter? This would be useful when
>> visually comparing the results from 2 queries.
>>
>
> I'm not aware of a simple way.
>
> If you're competent in Java, this could be a neat new DocTransformer
> component. You'd say:
>
> fl=*,[position]
>
> and you'd get a new field in your search results.
>
> Cruder ways would be to use XSLT to add it to an XML output, or a
> velocity template, but the DocTransformer approach would create
> something that could be of ongoing use.
>
> Upayavira
>


Re: Function query matching

2014-01-06 Thread Peter Keegan
: The bottom line for Peter is still the same: using scale() wrapped arround
: a function/query does involve a computing hte results for every document,
: and that is going to scale linearly as the size of hte index grows -- but
: it it is *only* because of the scale function.

Another problem with this approach is that the scale() function will likely
generate incorrect values because it is computed before any filters are applied.
If the filters drop high-scoring docs, the scaled values will never include the
'maxTarget' value (and may not include the 'minTarget' value, either).

Peter
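
As a concrete illustration of the interaction (the query below is illustrative,
not from the thread):

    q={!func}scale(query($qq),0,1)
    qq=+text:solr +category:books
    fq=price:[0 TO 50]

scale() computes its min/max over every document matching $qq, including
high-scoring docs that the price filter later removes, so the best surviving
document can end up scaled well below the intended maxTarget of 1.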


On Sat, Dec 7, 2013 at 2:30 PM, Chris Hostetter wrote:

>
> (This is why i shouldn't send emails just before going to bed.)
>
> I woke up this morning realizing that of course I was completley wrong
> when i said this...
>
> : I want to be clear for 99% of the people reading this, if you find
> : yourself writting a query structure like this...
> :
> :   q={!func}..functions involving wrapping $qq ...
> ...
> : ...Try to restructure the match you want to do into the form of a
> : multiplier
> ...
> : Because the later case is much more efficient and Solr will only compute
> : the function values for hte docs it needs to (that match the wrapped $qq
> : query)
>
> The reason i was wrong...
>
> Even though function queries do by default match all documents, and even
> if the main query is a function query (ie: "q={!func}..."), if there is
> an "fq" that filters down the set of documents, then the (main) function
> query will only be calculated for the documents that match the filter.
>
> It was trivial to ammend the test i mentioned last night to show this (and
> i feel silly for not doing that last night and stoping myself from saying
> something foolish)...
>
>   https://svn.apache.org/viewvc?view=revision&revision=r1548955
>
> The bottom line for Peter is still the same: using scale() wrapped arround
> a function/query does involve a computing hte results for every document,
> and that is going to scale linearly as the size of hte index grows -- but
> it it is *only* because of the scale function.
>
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: Zookeeper as Service

2014-01-09 Thread Peter Keegan
There's also: http://www.tanukisoftware.com/


On Thu, Jan 9, 2014 at 11:18 AM, Nazik Huq  wrote:

>
>
> From your email I gather your main concern is starting zookeeper on server
> startups.
>
> You may want to look at these non-native service oriented options too:
> Create a script (cmd or bat) to start ZK on server bootup. This method
> may not restart ZK if ZK crashes (not the server).
> Create a C# command line program that starts on server bootup (see above) that
> uses the .NET System.Diagnostics.Process.Start method to start ZK on
> server start and monitor the ZK process via a loop. Restart when the ZK process
> crashes or "hangs". I prefer this method. There might be a Java equivalent of
> this. There are many examples available on the web.
> Cheers,
> @nazik_huq
>
>
>
> On Thursday, January 9, 2014 10:07 AM, Charlie Hull 
> wrote:
>
> On 09/01/2014 09:44, Karthikeyan.Kannappan wrote:
>
> > I am hosting in windows OS
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/Zookeeper-as-Service-tp4110396p4110413.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
> There are various ways to 'servicify' (yes that may not be an actual
> word) executable applications on Windows. The venerable SrvAny is one
> such option as is the newer
>  nssm.exe (Non-Sucking Service Manager).
>
> Bear in mind that a Windows Service doesn't operate quite the same way
> with regard to stdout and stderr which may mean any error messages end
> up in a black hole, with you simply
>  getting something unhelpful 'service
> failed to start' error messages from Windows itself if something goes
> wrong. The 'working directory' is another thing that needs careful
> setting up.
>
> Cheers
>
> Charlie
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk
>


leading wildcard characters

2014-01-10 Thread Peter Keegan
How do you disable leading wildcards in 4.X? The setAllowLeadingWildcard
method is there in the parser, but nothing references the getter. Also, the
Edismax parser always enables it and provides no way to override.

Thanks,
Peter


Re: leading wildcard characters

2014-01-10 Thread Peter Keegan
Removing ReversedWildcardFilterFactory  had no effect.


On Fri, Jan 10, 2014 at 10:48 AM, Ahmet Arslan  wrote:

> Hi Peter,
>
> Can you remove any occurrence of ReversedWildcardFilterFactory in
> schema.xml? (even if you don't use it)
>
> Ahmet
>
>
>
> On Friday, January 10, 2014 3:34 PM, Peter Keegan 
> wrote:
> How do you disable leading wildcards in 4.X? The setAllowLeadingWildcard
> method is there in the parser, but nothing references the getter. Also, the
> Edismax parser always enables it and provides no way to override.
>
> Thanks,
> Peter
>
>


How to override rollback behavior in DIH

2014-01-14 Thread Peter Keegan
I have a custom data import handler that creates an ExternalFileField from
a source that is different from the main index. If the import fails (in my
case, a connection refused in URLDataSource), I don't want to roll back any
uncommitted changes to the main index. However, this seems to be the
default behavior. Is there a way to override the IndexWriter rollback?

Thanks,
Peter


Re: leading wildcard characters

2014-01-14 Thread Peter Keegan
I created SOLR-5630.
Although WildcardQuery is much, much faster now with AutomatonQuery, it can
still result in slow queries when used with multiple keywords. From my
testing, I think I will need to disable all WildcardQuerys and allow only
PrefixQuery.

Peter


On Sat, Jan 11, 2014 at 4:17 AM, Ahmet Arslan  wrote:

> Hi Peter,
>
> Yes you are correct. There is no way to disable it.
>
> Weird thing is javadoc says default is false but it is enabled by default
> in SolrQueryParserBase.
> boolean allowLeadingWildcard = true;
>
>
>
> http://search-lucene.com/jd/solr/solr-core/org/apache/solr/parser/SolrQueryParserBase.html#setAllowLeadingWildcard(boolean)
>
>
> There is an effort for making such (allowLeadingWilcard,fuzzyMinSim,
> fuzzyPrefixLength) properties configurable :
> https://issues.apache.org/jira/browse/SOLR-218
>
> But this one is somehow old. Since its description is stale, do you want
> to open a new one?
>
> Ahmet
>
>
> On Friday, January 10, 2014 6:12 PM, Peter Keegan 
> wrote:
> Removing ReversedWildcardFilterFactory  had no effect.
>
>
>
> On Fri, Jan 10, 2014 at 10:48 AM, Ahmet Arslan  wrote:
>
> > Hi Peter,
> >
> > Can you remove any occurrence of ReversedWildcardFilterFactory in
> > schema.xml? (even if you don't use it)
> >
> > Ahmet
> >
> >
> >
> > On Friday, January 10, 2014 3:34 PM, Peter Keegan <
> peterlkee...@gmail.com>
> > wrote:
> > How do you disable leading wildcards in 4.X? The setAllowLeadingWildcard
> > method is there in the parser, but nothing references the getter. Also,
> the
> > Edismax parser always enables it and provides no way to override.
> >
> > Thanks,
> > Peter
> >
> >
>
>


Re: How to override rollback behavior in DIH

2014-01-17 Thread Peter Keegan
Following up on this a bit - my main index is updated by a SolrJ client in
another process. If the DIH fails, the SolrJ client is never informed of
the index rollback, and any pending updates are lost. For now, I've made
sure that the DIH processor never throws an exception, but this makes it a
bit harder to detect the failure via the admin interface.

Thanks,
Peter


On Tue, Jan 14, 2014 at 11:12 AM, Peter Keegan wrote:

> I have a custom data import handler that creates an ExternalFileField from
> a source that is different from the main index. If the import fails (in my
> case, a connection refused in URLDataSource), I don't want to roll back any
> uncommitted changes to the main index. However, this seems to be the
> default behavior. Is there a way to override the IndexWriter rollback?
>
> Thanks,
> Peter
>


Re: How to override rollback behavior in DIH

2014-01-17 Thread Peter Keegan
I'm actually doing the 'skip' on every successful call to 'nextRow' with
this trick:
  row.put("$externalfield",null); // DocBuilder.addFields will skip fields
starting with '$'
because I'm only creating ExternalFileFields. However, an error could also
occur in the 'init' call, so exceptions have to be caught there, too.
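
A rough sketch of that shape in the custom EntityProcessor (only the '$'-prefix
trick is from this thread; the writeExternalFile() helper, the 'done' flag, and
the logging are illustrative assumptions):

    @Override
    public Map<String, Object> nextRow() {
        if (done) {
            return null;                      // end of data for this entity
        }
        done = true;
        try {
            writeExternalFile();              // read the URLDataSource and write the EFF file
        } catch (Exception e) {
            // Swallow the failure so DIH never rolls back the main index
            LOG.error("External field import failed; skipping", e);
        }
        Map<String, Object> row = new HashMap<String, Object>();
        // DocBuilder.addFields skips fields starting with '$', so nothing is
        // written to the main index; the row only marks the import as successful.
        row.put("$externalfield", null);
        return row;
    }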

Thanks,
Peter


On Fri, Jan 17, 2014 at 10:19 AM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> Can you try using onError=skip on your entities which use this data source?
>
> It's been some time since I looked at the code so I don't know if this
> works with data source. Worth a try I guess.
>
> On Fri, Jan 17, 2014 at 7:20 PM, Peter Keegan 
> wrote:
> > Following up on this a bit - my main index is updated by a SolrJ client
> in
> > another process. If the DIH fails, the SolrJ client is never informed of
> > the index rollback, and any pending updates are lost. For now, I've made
> > sure that the DIH processor never throws an exception, but this makes it
> a
> > bit harder to detect the failure via the admin interface.
> >
> > Thanks,
> > Peter
> >
> >
> > On Tue, Jan 14, 2014 at 11:12 AM, Peter Keegan  >wrote:
> >
> >> I have a custom data import handler that creates an ExternalFileField
> from
> >> a source that is different from the main index. If the import fails (in
> my
> >> case, a connection refused in URLDataSource), I don't want to roll back
> any
> >> uncommitted changes to the main index. However, this seems to be the
> >> default behavior. Is there a way to override the IndexWriter rollback?
> >>
> >> Thanks,
> >> Peter
> >>
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: How to override rollback behavior in DIH

2014-01-17 Thread Peter Keegan
Hmm, this does get a bit complicated, and I'm not even doing any writes
with the DIH SolrWriter. In retrospect, using a DIH to create only EFFs
doesn't buy much except for the integration into the Solr Admin UI.  Thanks
for the pointer to 3671, James.

Peter


On Fri, Jan 17, 2014 at 10:59 AM, Dyer, James
wrote:

> Peter,
>
> I think you can override org.apache.solr.handler.dataimport.SolrWriter to
> have a custom (no-op) rollback method.  Your new writer should implement
> org.apache.solr.handler.dataimport.DIHWriter.  You can specify the
> "writerImpl" request parameter to specify the new class.
>
> Unfortunately, it isn't actually this easy because your new writer is
> going to have to know what to do for all the other methods.  That is, there
> is no easy way to tell it how to write/commit/etc to Solr.  The default
> SolrWriter has a lot of hardcoded parameters it gets sent on construction
> in DataImportHandler#handleRequestBody.  You would have to somehow
> duplicate this construction on your own custom class.  See SOLR-3671 for an
> explanation of this dilemma.
>
> James Dyer
> Ingram Content Group
> (615) 213-4311
>
>
> -Original Message-
> From: pkeegan01...@gmail.com [mailto:pkeegan01...@gmail.com] On Behalf Of
> Peter Keegan
> Sent: Friday, January 17, 2014 7:51 AM
> To: solr-user@lucene.apache.org
> Subject: Re: How to override rollback behavior in DIH
>
> Following up on this a bit - my main index is updated by a SolrJ client in
> another process. If the DIH fails, the SolrJ client is never informed of
> the index rollback, and any pending updates are lost. For now, I've made
> sure that the DIH processor never throws an exception, but this makes it a
> bit harder to detect the failure via the admin interface.
>
> Thanks,
> Peter
>
>
> On Tue, Jan 14, 2014 at 11:12 AM, Peter Keegan  >wrote:
>
> > I have a custom data import handler that creates an ExternalFileField
> from
> > a source that is different from the main index. If the import fails (in
> my
> > case, a connection refused in URLDataSource), I don't want to roll back
> any
> > uncommitted changes to the main index. However, this seems to be the
> > default behavior. Is there a way to override the IndexWriter rollback?
> >
> > Thanks,
> > Peter
> >
>
>


Getting index schema in SolrCloud mode

2014-02-03 Thread Peter Keegan
I'm indexing data with a SolrJ client via SolrServer. Currently, I parse
the schema returned from an HttpGet on:
localhost:8983/solr/collection/schema/fields

What is the recommended way to read the schema with CloudSolrServer? Can it
be done with a single HttpGet to a ZK server?

Thanks,
Peter
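
One way that avoids hard-coding a host is to send the schema request through
the cloud-aware client itself (a sketch against the 4.x SolrJ API; the request
is still answered by a Solr node that CloudSolrServer selects via ZooKeeper,
not by ZooKeeper directly, and the zkHost string and collection name are
placeholders):

    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");

    // Route a request to the /schema/fields handler through the cloud client
    QueryRequest req = new QueryRequest(new ModifiableSolrParams());
    req.setPath("/schema/fields");
    NamedList<Object> rsp = server.request(req);
    System.out.println(rsp.get("fields"));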

