Facets for fields in subdocuments with block join, is it possible?

2014-02-11 Thread Henning Ivan Solberg

Hello,

I'm testing block join in Solr 4.6.1 and wondering: is it possible to 
get facets for fields in subdocuments, with the hit counts based on ROOT 
documents?


See example below:


ROOT
testing 123
title
GRP

khat
7000
purchase


cannabis
500
sale



My query looks like this:

solrQuery.setQuery("text:testing");
solrQuery.setFilterQueries("{!parent 
which=\"dokumentPart:ROOT\"}field3:khat");

solrQuery.setFacet(true);
solrQuery.addFacetField("group","field5");

This does not give me any facets for the subdocument fields, so I'm 
thinking: could a solution be to execute a second query to get the 
facets for the subdocuments, joining from parent to child with a {!child 
of=} query like this:


solrQuery.setQuery("{!child of=\"dokumentPart:ROOT\"}text:testing");
solrQuery.setFilterQueries("field3:khat");
solrQuery.setFacet(true);
solrQuery.addFacetField("field5","field4", "field3");

The problem with this method is that the facet counts will be based on 
subdocuments and not ROOT/parent documents...


Is there a silver bullet for this kind of requirement?

Yours faithfully

Henning Solberg



Re: Group.Facet issue in Sharded Solr Setup

2014-02-11 Thread rks_lucene
Quick follow-up on my question below: is anyone using group.facet in a
sharded Solr setup?

Based on further testing, the group.facet counts don't seem reliable at all
for less popular items in the facet list.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Group-Facet-issue-in-Sharded-Solr-Setup-tp4116077p4116635.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Facets for fields in subdocuments with block join, is it possible?

2014-02-11 Thread Mikhail Khludnev
Hello Henning,

There is no open-source facet component for the child level of block-join.
There isn't even an open JIRA for this.

Don't think it helps.
On 11.02.2014 12:22, "Henning Ivan Solberg" 
wrote:

> Hello,
>
> I'm testing block join in solr 4.6.1 and wondering, is it possible to get
> facets for fields in subdocuments with number of hits based on ROOT
> documents?
>
> See example below:
>
> 
> ROOT
> testing 123
> title
> GRP
> 
> khat
> 7000
> purchase
> 
> 
> cannabis
> 500
> sale
> 
> 
>
> My query looks like this:
>
> solrQuery.setQuery("text:testing");
> solrQuery.setFilterQueries("{!parent which=\"dokumentPart:ROOT\"}
> field3:khat");
> solrQuery.setFacet(true);
> solrQuery.addFacetField("group","field5");
>
> This does not give me any facets for the subdocument fields, so I'm
> thinking: could a solution be to execute a second query to get the facets
> for the subdocuments, joining from parent to child with a {!child of=} query
> like this:
>
> solrQuery.setQuery("{!child of=\"dokumentPart:ROOT\"}text:testing");
> solrQuery.setFilterQueries("field3:khat");
> solrQuery.setFacet(true);
> solrQuery.addFacetField("field5","field4", "field3");
>
> The problem with this method is that the facet count will be based on sub
> documents and not ROOT/parent documents...
>
> Is there a silver bullet for this kind of requirement?
>
> Yours faithfully
>
> Henning Solberg
>
>


Set up embedded Solr container and cores programmatically to read their configs from the classpath

2014-02-11 Thread Robert Krüger
Hi,

I have an application with an embedded Solr instance (and I want to
keep it embedded) and so far I have been setting up my Solr
installation programmatically using folder paths to specify where the
specific container or core configs are.

I have used the CoreContainer methods createAndLoad and create using
File arguments and this works fine. However, now I want to change this
so that all configuration files are loaded from certain locations
using the classloader but I have not been able to get this to work.

E.g. I want to have my solr config located in the classpath at

my/base/package/solr/conf

and the core configs at

my/base/package/solr/cores/core1/conf,
my/base/package/solr/cores/core2/conf

etc..

Is this possible at all? Looking through the source code it seems that
specifying classpath resources in such a qualified way is not
supported but I may be wrong.

I could get this to work for the container by supplying my own
implementation of SolrResourceLoader that allows a base path to be
specified for the resources to be loaded (I first thought that would
happen already when specifying instanceDir accordingly, but looking at
the code it does not; for resources loaded through the classloader,
instanceDir is not prepended). However, I am then stuck with the
loading of the cores' resources, as the respective code (see
org.apache.solr.core.CoreContainer#createFromLocal) instantiates a
SolrResourceLoader internally.

Thanks for any help with this (be it a clarification that it is not possible).

Robert


How to Learn Linked Configuration for SolrCloud at Zookeeper

2014-02-11 Thread Furkan KAMACI
Hi;

I've written code that can update a file in Zookeeper for SolrCloud.
Currently I have many configurations in Zookeeper for SolrCloud. I want to
update the synonyms.txt file, so I need to know the currently linked
configuration (I will update synonyms.txt under the appropriate
configuration folder). How can I find it?

Thanks;
Furkan KAMACI


Re: How to Learn Linked Configuration for SolrCloud at Zookeeper

2014-02-11 Thread Alan Woodward
For a particular collection or core?  There should be a collection.configName 
property specified for the core or collection which tells you which ZK config 
directory is being used.
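If you want to look directly in ZK, it is stored as JSON on the collection's
znode, something like this (names made up):

/collections/mycollection  ->  {"configName":"myconf"}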

Alan Woodward
www.flax.co.uk


On 11 Feb 2014, at 11:49, Furkan KAMACI wrote:

> Hi;
> 
> I've written a code that I can update a file to Zookeeper for SlorCloud.
> Currently I have many configurations at Zookeeper for SolrCloud. I want to
> update synonyms.txt file so I should know the currently linked
> configuration (I will update the synonyms.txt file under appropriate
> configuration folder) How can I learn it?
> 
> Thanks;
> Furkan KAMACI



Re: How to Learn Linked Configuration for SolrCloud at Zookeeper

2014-02-11 Thread Furkan KAMACI
I am looking for it for a particular collection.


2014-02-11 13:55 GMT+02:00 Alan Woodward :

> For a particular collection or core?  There should be a
> collection.configName property specified for the core or collection which
> tells you which ZK config directory is being used.
>
> Alan Woodward
> www.flax.co.uk
>
>
> On 11 Feb 2014, at 11:49, Furkan KAMACI wrote:
>
> > Hi;
> >
> > I've written a code that I can update a file to Zookeeper for SlorCloud.
> > Currently I have many configurations at Zookeeper for SolrCloud. I want
> to
> > update synonyms.txt file so I should know the currently linked
> > configuration (I will update the synonyms.txt file under appropriate
> > configuration folder) How can I learn it?
> >
> > Thanks;
> > Furkan KAMACI
>
>


Re: How to Learn Linked Configuration for SolrCloud at Zookeeper

2014-02-11 Thread Furkan KAMACI
Hi;

OK, I've checked the source code and implemented this (CONFIGNAME_PROP and
CONFIGS_ZKNODE are the constants from org.apache.solr.cloud.ZkController):

public String readConfigName(SolrZkClient zkClient, String collection)
        throws KeeperException, InterruptedException {

    String configName = null;
    String path = ZkStateReader.COLLECTIONS_ZKNODE + "/" + collection;

    LOGGER.info("Load collection config from: " + path);
    byte[] data = zkClient.getData(path, null, null, true);

    if (data != null) {
        ZkNodeProps props = ZkNodeProps.load(data);
        configName = props.getStr(CONFIGNAME_PROP);
    }

    if (configName != null
            && !zkClient.exists(CONFIGS_ZKNODE + "/" + configName, true)) {
        LOGGER.error("Specified config does not exist in ZooKeeper: " + configName);
        throw new ZooKeeperException(SolrException.ErrorCode.SERVER_ERROR,
                "Specified config does not exist in ZooKeeper: " + configName);
    }
    return configName;
}

So, I can get the linked configuration name.

Thanks;
Furkan KAMACI


2014-02-11 13:57 GMT+02:00 Furkan KAMACI :

> I am looking it for a particular collection.
>
>
> 2014-02-11 13:55 GMT+02:00 Alan Woodward :
>
> For a particular collection or core?  There should be a
>> collection.configName property specified for the core or collection which
>> tells you which ZK config directory is being used.
>>
>> Alan Woodward
>> www.flax.co.uk
>>
>>
>> On 11 Feb 2014, at 11:49, Furkan KAMACI wrote:
>>
>> > Hi;
>> >
>> > I've written a code that I can update a file to Zookeeper for SlorCloud.
>> > Currently I have many configurations at Zookeeper for SolrCloud. I want
>> to
>> > update synonyms.txt file so I should know the currently linked
>> > configuration (I will update the synonyms.txt file under appropriate
>> > configuration folder) How can I learn it?
>> >
>> > Thanks;
>> > Furkan KAMACI
>>
>>
>


Re: Lowering query time

2014-02-11 Thread Joel Cohen
I'd like to thank you all for lending a hand on my query time problem with
SolrCloud. By switching to a single-shard setup with replicas, I've reduced
my query time to 18 msec. My full ingestion of 300k+ documents went down
from 2 hours 50 minutes to 1 hour 40 minutes. There are some code changes
that are going in that should help a bit as well. Big thanks to everyone
that had suggestions.


On Tue, Feb 4, 2014 at 8:11 PM, Alexandre Rafalovitch wrote:

> I suspect faceting is the issue here. The actual query you have shown
> seem to bring back a single document (or a single set of document for
> a product):
> fq=id:(320403401)
>
> On the other hand, you are asking for 4 field facets:
> facet.field=q_virtualCategory_ss
> facet.field=q_brand_s
> facet.field=q_color_s
> facet.field=q_category_ss
> AND 2 range facets, both clustered/grouped:
> facet.range=daysSinceStart_i
> facet.range=activePrice_l (e.g. f.activePrice_l.facet.range.gap=5000)
>
> And for all facets you have asked to bring back ALL of the results:
> facet.limit=-1
>
> Plus, you are doing a complex sort:
> sort=popularity_i desc,popularity_i desc
>
> So, you are probably spending quite a bit of time counting (especially
> in a shared setup) and then quite a bit more sending the response
> back.
>
> I would check the size of the result document (HTTP result) and see
> how large it is. Maybe you don't need all of the stuff that's coming
> back. I assume you are not actually querying Solr from the client's
> machine (that is I hope it is inside your data centre close to your
> web server), otherwise I would say to look at automatic content
> compression as well to minimize on-wire document size.
>
> Finally, if your documents have many stored fields (store=true in
> schema.xml) but you only return small subsets of them during search,
> you could look into using enableLazyFieldLoading flag in the
> solrconfig.
>
> Regards,
>Alex.
> P.s. As others said, you don't seem to have too many documents.
> Perhaps you want replication instead of sharding for improved
> performance.
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)
>
>
> On Wed, Feb 5, 2014 at 6:31 AM, Alexey Kozhemiakin
>  wrote:
> > Btw "timing" for distributed requests are broken at this moment, it
> doesn't combine values from requests to shards.  I'm working on a patch.
> >
> > https://issues.apache.org/jira/browse/SOLR-3644
> >
> > -Original Message-
> > From: Jack Krupansky [mailto:j...@basetechnology.com]
> > Sent: Tuesday, February 04, 2014 22:00
> > To: solr-user@lucene.apache.org
> > Subject: Re: Lowering query time
> >
> > Add the debug=true parameter to some test queries and look at the
> "timing"
> > section to see which search components are taking the time.
> Traditionally, highlighting for large documents was a top culprit.
> >
> > Are you returning a lot of data or field values? Sometimes reducing the
> amount of data processed can help. Any multivalued fields with lots of
> values?
> >
> > -- Jack Krupansky
> >
> > -Original Message-
> > From: Joel Cohen
> > Sent: Tuesday, February 4, 2014 1:43 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Lowering query time
> >
> > 1. We are faceting. I'm not a developer so I'm not quite sure how we're
> doing it. How can I measure?
> > 2. I'm not sure how we'd force this kind of document partitioning. I can
> see how my shards are partitioned by looking at the clusterstate.json from
> Zookeeper, but I don't have a clue on how to get documents into specific
> shards.
> >
> > Would I be better off with fewer shards given the small size of my
> indexes?
> >
> >
> > On Tue, Feb 4, 2014 at 12:32 PM, Yonik Seeley 
> wrote:
> >
> >> On Tue, Feb 4, 2014 at 12:12 PM, Joel Cohen 
> >> wrote:
> >> > I'm trying to get the query time down to ~15 msec. Anyone have any
> >> > tuning recommendations?
> >>
> >> I guess it depends on what the slowest part of the query currently is.
> >>  If you are faceting, it's often that.
> >> Also, it's often a big win if you can somehow partition documents such
> >> that requests can normally be serviced from a single shard.
> >>
> >> -Yonik
> >> http://heliosearch.org - native off-heap filters and fieldcache for
> >> solr
> >>
> >
> >
> >
> > --
> >
> > joel cohen, senior system engineer
> >
> > e joel.co...@bluefly.com p 212.944.8000 x276 bluefly, inc. 42 w. 39th
> st. new york, ny 10018 www.bluefly.com <
> http://www.bluefly.com/?referer=autosig> | *fly since
> > 2013...*
> >
>



-- 

joel cohen, senior system engineer

e joel.co...@bluefly.com p 212.944.8000 x276
bluefly, inc. 42 w. 39th st. new york, ny 10018
www.bluefly.com  | *fly since
2013...*


Urgent Help. Best Way to have multiple OR Conditions for same field in SOLR

2014-02-11 Thread rajeev.nadgauda
Hi,

I am new to SOLR. We have CRM data for contacts and companies, which number
in the millions, and we have switched to SOLR for fast search results.

PROBLEM: We have large inclusion and exclusion lists with names of companies
or contacts.
Ex: Include or Exclude: "company A" & "Company B" & "Company C" & ... &
"Company n", where assume n = 1;

What would be the best way to do this kind of a query using SOLR?

WHAT I HAVE TRIED:
Setting "q" ==> field_name: ("companyA" OR "companyB" ... OR "Company n");
This works only for a list of 400 odd.

Looking forward for assistance on this.

Thank You,
Rajeev.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Urgent-Help-Best-Way-to-have-multiple-OR-Conditions-for-same-field-in-SOLR-tp4116681.html
Sent from the Solr - User mailing list archive at Nabble.com.


solr-query with NOT and OR operator

2014-02-11 Thread Johannes Siegert

Hi,

My Solr request contains the following filter query:

fq=((-(field1:value1)))+OR+(field2:value2).

I expect Solr to deliver documents matching ((-(field1:value1))) and 
documents matching (field2:value2).


But Solr delivers only the documents that match (field2:value2). 
I do receive several documents if I request only ((-(field1:value1))).


Thanks!

Johannes


Re: solr-query with NOT and OR operator

2014-02-11 Thread Mikhail Khludnev
http://wiki.apache.org/solr/CommonQueryParameters#debugQuery
and
http://wiki.apache.org/solr/CommonQueryParameters#explainOther
usually help so much


On Tue, Feb 11, 2014 at 7:57 PM, Johannes Siegert <
johannes.sieg...@marktjagd.de> wrote:

> Hi,
>
> my solr-request contains the following filter-query:
>
> fq=((-(field1:value1)))+OR+(field2:value2).
>
> I expect solr deliver documents matching to ((-(field1:value1))) and
> documents matching to (field2:value2).
>
> But solr deliver only documents, that are the result of (field2:value2). I
> receive several documents, if I request only for ((-(field1:value1))).
>
> Thanks!
>
> Johannes
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: Tf-Idf for a specific query

2014-02-11 Thread David Miller
Hi Erick,

Slower queries for getting facets can be tolerated, as long as they don't
affect those without facets. The requirement is for a separate query which
can get me both term vector and facet counts.

One issue I am facing is that for a search query I only want the term
vectors and facet counts, but not the results/docs. If I set rows=0,
then term vectors are not returned. Could you suggest some way to achieve
this?

It would also be helpful to have a way to get the aggregate TF of a term
(across all docs matched by the query).
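For now the best workaround I can see is a small non-zero rows value with a
trimmed field list; a hedged SolrJ sketch, where the /tvrh handler name and
the field names are just placeholders of mine:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TermVectorsWithFacets {

    public static QueryResponse query(SolrServer solr) throws Exception {
        SolrQuery q = new SolrQuery("text:testing");
        q.setRequestHandler("/tvrh"); // a handler with TermVectorComponent wired in
        q.setRows(10);                // non-zero, otherwise term vectors are skipped
        q.setFields("id");            // trim the returned docs to almost nothing
        q.set("tv", true);
        q.set("tv.tf", true);         // per-document term frequencies
        q.setFacet(true);
        q.addFacetField("category");  // facet counts in the same round trip
        return solr.query(q);
    }
}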

Regards,
David






On Sat, Feb 8, 2014 at 10:49 AM, Erick Erickson wrote:

> David:
>
> If you're, say, faceting on fields with lots of unique values, this
> will be quite expensive.
> No idea whether you can tolerate slower queries or not, just sayin'
>
> Erick
>
> On Fri, Feb 7, 2014 at 5:35 PM, David Miller 
> wrote:
> > Thanks Mikhai,
> >
> > It seems that, this was what I was looking for. Being new to this, I
> wasn't
> > aware of such a use of facets.
> >
> > Now I can probably combine the term vectors and facets to fit my
> scenario.
> >
> > Regards,
> > Dave
> >
> >
> > On Fri, Feb 7, 2014 at 2:43 PM, Mikhail Khludnev <
> mkhlud...@griddynamics.com
> >> wrote:
> >
> >> David,
> >>
> >> I can imagine that "DF for resultset" is facets!
> >>
> >>
> >> On Fri, Feb 7, 2014 at 11:26 PM, David Miller  >> >wrote:
> >>
> >> > Hi Mikhail,
> >> >
> >> > The DF seems to be based on the entire document set. What I require is
> >> > based on a the results of a single query.
> >> >
> >> > Suppose my Solr query returns a set of 50K documents from a superset
> of
> >> > 10Million documents, I require to calculate the DF just based on the
> 50K
> >> > documents. But currently it seems to be calculated on the entire doc
> set.
> >> >
> >> > So, is there any way to get the DF or IDF just on basis of the docs
> >> > returned by the query?
> >> >
> >> > Regards,
> >> > Dave
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Fri, Feb 7, 2014 at 5:15 AM, Mikhail Khludnev <
> >> > mkhlud...@griddynamics.com
> >> > > wrote:
> >> >
> >> > > Hello Dave
> >> > > you can get DF from http://wiki.apache.org/solr/TermsComponent(invert
> >> > it
> >> > > yourself)
> >> > > then, for certain term you can get number of occurrences per
> document
> >> by
> >> > > http://wiki.apache.org/solr/FunctionQuery#tf
> >> > >
> >> > >
> >> > >
> >> > > On Fri, Feb 7, 2014 at 3:58 AM, David Miller <
> davthehac...@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Hi Guys..
> >> > > >
> >> > > > I require to obtain Tf-idf score from Solr for a certain set of
> >> > > documents.
> >> > > > But the catch is that, I needs the IDF (or DF) to be calculated on
> >> the
> >> > > > documents returned by the specific query and not the entire
> corpus.
> >> > > >
> >> > > > Please provide me some hint on whether Solr has this feature or
> if I
> >> > can
> >> > > > use the Lucene Api directly to achieve this.
> >> > > >
> >> > > >
> >> > > > Thanks in advance,
> >> > > > Dave
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Sincerely yours
> >> > > Mikhail Khludnev
> >> > > Principal Engineer,
> >> > > Grid Dynamics
> >> > >
> >> > > 
> >> > >  
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Sincerely yours
> >> Mikhail Khludnev
> >> Principal Engineer,
> >> Grid Dynamics
> >>
> >> 
> >>  
> >>
>


Re: solr-query with NOT and OR operator

2014-02-11 Thread Jack Krupansky
With so many parentheses in there, I wonder what you are really trying to 
do... Try expressing your query in simple English first so that we can 
understand your goal.


But generally, a purely negative nested query must have a *:* term to apply 
the exclusion against:


fq=((*:* -(field1:value1)))+OR+(field2:value2).
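A quick illustration of the difference, clause by clause:

(-field1:value1) OR (field2:value2)      <- the pure-negative left clause matches nothing
(*:* -field1:value1) OR (field2:value2)  <- left clause is "everything except field1:value1"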

-- Jack Krupansky

-Original Message- 
From: Johannes Siegert

Sent: Tuesday, February 11, 2014 10:57 AM
To: solr-user@lucene.apache.org
Subject: solr-query with NOT and OR operator

Hi,

my solr-request contains the following filter-query:

fq=((-(field1:value1)))+OR+(field2:value2).

I expect solr deliver documents matching to ((-(field1:value1))) and
documents matching to (field2:value2).

But solr deliver only documents, that are the result of (field2:value2).
I receive several documents, if I request only for ((-(field1:value1))).

Thanks!

Johannes 



Re: Lowering query time

2014-02-11 Thread Erick Erickson
Hmmm, I'm still a little puzzled BTW. 300K documents, unless they're
huge, shouldn't be taking 100 minutes. I can index 11M documents on
my laptop (Wikipedia dump) in 45 minutes, for instance... Of course
that's a single core, not cloud and not replicas...

So possibly it's on the data acquisition side? Is your Solr CPU pegged?

YMMV of course.

Erick


On Tue, Feb 11, 2014 at 6:40 AM, Joel Cohen  wrote:

> I'd like to thank you for lending a hand on my query time problem with
> SolrCloud. By switching to a single shard with replicas setup, I've reduced
> my query time to 18 msec. My full ingestion of 300k+ documents went down
> from 2 hours 50 minutes to 1 hour 40 minutes. There are some code changes
> that are going in that should help a bit as well. Big thanks to everyone
> that had suggestions.
>
>
> On Tue, Feb 4, 2014 at 8:11 PM, Alexandre Rafalovitch  >wrote:
>
> > I suspect faceting is the issue here. The actual query you have shown
> > seem to bring back a single document (or a single set of document for
> > a product):
> > fq=id:(320403401)
> >
> > On the other hand, you are asking for 4 field facets:
> > facet.field=q_virtualCategory_ss
> > facet.field=q_brand_s
> > facet.field=q_color_s
> > facet.field=q_category_ss
> > AND 2 range facets, both clustered/grouped:
> > facet.range=daysSinceStart_i
> > facet.range=activePrice_l (e.g. f.activePrice_l.facet.range.gap=5000)
> >
> > And for all facets you have asked to bring back ALL of the results:
> > facet.limit=-1
> >
> > Plus, you are doing a complex sort:
> > sort=popularity_i desc,popularity_i desc
> >
> > So, you are probably spending quite a bit of time counting (especially
> > in a shared setup) and then quite a bit more sending the response
> > back.
> >
> > I would check the size of the result document (HTTP result) and see
> > how large it is. Maybe you don't need all of the stuff that's coming
> > back. I assume you are not actually querying Solr from the client's
> > machine (that is I hope it is inside your data centre close to your
> > web server), otherwise I would say to look at automatic content
> > compression as well to minimize on-wire document size.
> >
> > Finally, if your documents have many stored fields (store=true in
> > schema.xml) but you only return small subsets of them during search,
> > you could look into using enableLazyFieldLoading flag in the
> > solrconfig.
> >
> > Regards,
> >Alex.
> > P.s. As others said, you don't seem to have too many documents.
> > Perhaps you want replication instead of sharding for improved
> > performance.
> > Personal website: http://www.outerthoughts.com/
> > LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> > - Time is the quality of nature that keeps events from happening all
> > at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> > book)
> >
> >
> > On Wed, Feb 5, 2014 at 6:31 AM, Alexey Kozhemiakin
> >  wrote:
> > > Btw "timing" for distributed requests are broken at this moment, it
> > doesn't combine values from requests to shards.  I'm working on a patch.
> > >
> > > https://issues.apache.org/jira/browse/SOLR-3644
> > >
> > > -Original Message-
> > > From: Jack Krupansky [mailto:j...@basetechnology.com]
> > > Sent: Tuesday, February 04, 2014 22:00
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Lowering query time
> > >
> > > Add the debug=true parameter to some test queries and look at the
> > "timing"
> > > section to see which search components are taking the time.
> > Traditionally, highlighting for large documents was a top culprit.
> > >
> > > Are you returning a lot of data or field values? Sometimes reducing the
> > amount of data processed can help. Any multivalued fields with lots of
> > values?
> > >
> > > -- Jack Krupansky
> > >
> > > -Original Message-
> > > From: Joel Cohen
> > > Sent: Tuesday, February 4, 2014 1:43 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Lowering query time
> > >
> > > 1. We are faceting. I'm not a developer so I'm not quite sure how we're
> > doing it. How can I measure?
> > > 2. I'm not sure how we'd force this kind of document partitioning. I
> can
> > see how my shards are partitioned by looking at the clusterstate.json
> from
> > Zookeeper, but I don't have a clue on how to get documents into specific
> > shards.
> > >
> > > Would I be better off with fewer shards given the small size of my
> > indexes?
> > >
> > >
> > > On Tue, Feb 4, 2014 at 12:32 PM, Yonik Seeley 
> > wrote:
> > >
> > >> On Tue, Feb 4, 2014 at 12:12 PM, Joel Cohen 
> > >> wrote:
> > >> > I'm trying to get the query time down to ~15 msec. Anyone have any
> > >> > tuning recommendations?
> > >>
> > >> I guess it depends on what the slowest part of the query currently is.
> > >>  If you are faceting, it's often that.
> > >> Also, it's often a big win if you can somehow partition documents such
> > >> that requests can normally be serviced from a single shard.
> > >>
> > >> -Yonik
> > >> http://heli

Re: Urgent Help. Best Way to have multiple OR Conditions for same field in SOLR

2014-02-11 Thread Erick Erickson
Right, 10K Boolean clauses are not very efficient. You actually can
up the limit here, but still...

Consider a "post filter", here's a place to start:
http://lucene.apache.org/solr/4_3_1/solr-core/org/apache/solr/search/PostFilter.html
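The skeleton looks roughly like this -- a sketch against the Solr/Lucene 4.x
APIs, not compiled; the field name "company_s" and the class are made up,
and you'd still need a small QParserPlugin to create it from a request:

import java.io.IOException;
import java.util.Set;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.util.BytesRef;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class CompanyListFilter extends ExtendedQueryBase implements PostFilter {

    private final Set<String> allowed; // the big include/exclude list

    public CompanyListFilter(Set<String> allowed) {
        this.allowed = allowed;
        setCache(false); // post filters must not be cached
        setCost(100);    // cost >= 100 makes Solr run this after the main query
    }

    @Override
    public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
        return new DelegatingCollector() {
            private SortedDocValues values;

            @Override
            public void setNextReader(AtomicReaderContext context) throws IOException {
                super.setNextReader(context);
                // per-segment ordinals for the company field
                values = FieldCache.DEFAULT.getTermsIndex(context.reader(), "company_s");
            }

            @Override
            public void collect(int doc) throws IOException {
                int ord = values.getOrd(doc);
                if (ord >= 0) {
                    BytesRef term = new BytesRef();
                    values.lookupOrd(ord, term);
                    if (allowed.contains(term.utf8ToString())) {
                        super.collect(doc); // keep docs whose company is in the list
                    }
                }
            }
        };
    }
}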

Best,
Erick


On Tue, Feb 11, 2014 at 6:47 AM, rajeev.nadgauda <
rajeev.nadga...@leadenrich.com> wrote:

> HI,
>
> I am new to SOLR , we have CRM data for Contacts and Companies which are in
> millions, we have switched to SOLR for fast search results.
>
> PROBLEM: We have large inclusion and exclusion lists with names of
> companies
> or contacts.
> Ex: Include or Exclude : "company A" & "Company B" & "Company C"  &
> "Company n"  where assume  n = 1;
>
> What would be the best way to do this kind of a query using SOLR.
>
> WHAT I HAVE TRIED:
> Setting "q" ==> field_name: ("companyA" OR "companyB" . OR "Company
> n");
> This works only for a list of 400 odd.
>
> Looking forward for assistance on this.
>
> Thank You,
> Rajeev.
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Urgent-Help-Best-Way-to-have-multiple-OR-Conditions-for-same-field-in-SOLR-tp4116681.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: solr-query with NOT and OR operator

2014-02-11 Thread Erick Erickson
Solr/Lucene is not strictly Boolean logic; this trips up a lot
of people.

Excellent blog on the subject here:
http://searchhub.org/dev/2011/12/28/why-not-and-or-and-not/

Best,
Erick


On Tue, Feb 11, 2014 at 8:22 AM, Jack Krupansky wrote:

> With so many parentheses in there, I wonder what you are really trying to
> do Try expressing your query in simple English first so that we can
> understand your goal.
>
> But generally, a purely negative nested query must have a *:* term to
> apply the exclusion against:
>
> fq=((*:* -(field1:value1)))+OR+(field2:value2).
>
> -- Jack Krupansky
>
> -Original Message- From: Johannes Siegert
> Sent: Tuesday, February 11, 2014 10:57 AM
> To: solr-user@lucene.apache.org
> Subject: solr-query with NOT and OR operator
>
>
> Hi,
>
> my solr-request contains the following filter-query:
>
> fq=((-(field1:value1)))+OR+(field2:value2).
>
> I expect solr deliver documents matching to ((-(field1:value1))) and
> documents matching to (field2:value2).
>
> But solr deliver only documents, that are the result of (field2:value2).
> I receive several documents, if I request only for ((-(field1:value1))).
>
> Thanks!
>
> Johannes
>


Re: Is 'optimize' necessary for a 45-segment Solr 4.6 index?

2014-02-11 Thread Shawn Heisey
On 2/11/2014 3:27 AM, Jäkel, Guido wrote:
> Dear Shawn,
>
>> On 2/9/2014 11:41 PM, Arun Rangarajan wrote:
>>> I have a 28 GB Solr 4.6 index with 45 segments. Optimize failed with an
>>> 'out of memory' error. Is optimize really necessary, since I read that
>>> lucene is able to handle multiple segments well now?
> It seems I am currently running into the same problem while migrating from Solr 1.4 
> to Solr 4.6.1.
>
> I run into OOM problems -- after running a full, fresh re-index of our 
> catalogue data -- while optimizing an ~80GB core on a 16GB JVM. After about 
> one hour the heap "explodes" within a minute while "create compound file 
> _5b2.cfs". How do I deal with this? Might it happen because there are too many 
> small (about 30 @ 1..4GB) segments before the optimize? It seems that they are 
> limited to this size by the defaults of the TieredMergePolicy. And, of 
> course: Is optimize "deprecated"?
>
> Because it takes about 1h to reach the "point of problems", any hints or 
> explanations will be helpful for me to save a lot of time!

Replying to a privately sent email on this thread:

I can't be sure that there are no memory leaks in Solr's program code,
but it is a rare thing, and I'm running 4.6.1 on a large system with a
smaller heap than yours without problems, so a memory leak is unlikely.
My setup DOES do index optimizes.

I have two guesses.  It could be either or both.  They are similar but
not identical.  There might be something else entirely, but these are
the most likely:

One guess is that you don't have enough RAM, leading to a performance
issue that compounds itself.  Adding the optimize pushes the system over
a threshold, everything slows down enough that the system tries to do
too much simultaneously, and it uses all the heap.

Assuming there's nothing else running on the machine, with an 80GB index
and a 16GB heap, a perfectly ideal server for this index would have 96GB
of RAM.  You might be able to get really good performance with 48GB, but
more would be better.  If it were me, I don't think I'd try it with less
than 64GB.

http://wiki.apache.org/solr/SolrPerformanceProblems#RAM

The other guess is that your Solr config and your request/index
characteristics are resulting in a lot of heap usage, so when you add an
optimize on top of it, 16GB is not enough.

http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap

Thanks,
Shawn



Re: solr-query with NOT and OR operator

2014-02-11 Thread Johannes Siegert

Hi Jack,

thanks!

fq=((*:* -(field1:value1)))+OR+(field2:value2).

This is the solution.

Johannes

Am 11.02.2014 17:22, schrieb Jack Krupansky:
With so many parentheses in there, I wonder what you are really trying 
to do Try expressing your query in simple English first so that we 
can understand your goal.


But generally, a purely negative nested query must have a *:* term to 
apply the exclusion against:


fq=((*:* -(field1:value1)))+OR+(field2:value2).

-- Jack Krupansky

-Original Message- From: Johannes Siegert
Sent: Tuesday, February 11, 2014 10:57 AM
To: solr-user@lucene.apache.org
Subject: solr-query with NOT and OR operator

Hi,

my solr-request contains the following filter-query:

fq=((-(field1:value1)))+OR+(field2:value2).

I expect solr deliver documents matching to ((-(field1:value1))) and
documents matching to (field2:value2).

But solr deliver only documents, that are the result of (field2:value2).
I receive several documents, if I request only for ((-(field1:value1))).

Thanks!

Johannes


--
Johannes Siegert
Softwareentwickler

Telefon:  0351 - 418 894 -73
Fax:  0351 - 418 894 -99
E-Mail:   johannes.sieg...@marktjagd.de
Xing: https://www.xing.com/profile/Johannes_Siegert2

Webseite: http://www.marktjagd.de
Blog: http://blog.marktjagd.de
Facebook: http://www.facebook.com/marktjagd
Twitter:  http://twitter.com/Marktjagd
__

Marktjagd GmbH | Schützenplatz 14 | D - 01067 Dresden

Geschäftsführung: Jan Großmann
Sitz Dresden | Amtsgericht Dresden | HRB 28678



Re: Is 'optimize' necessary for a 45-segment Solr 4.6 index?

2014-02-11 Thread Arun Rangarajan
Dear Shawn,
Thanks for your reply. For now, I did the merges in steps with the maxSegments
param (using HOST:PORT/CORE/update?optimize=true&maxSegments=10). First I
merged the 45 segments to 10, and then from 10 to 5. (Merging from 5 to 2
again caused an out-of-memory exception.) Now I have a 5-segment index with
all segments roughly equal in size. I will try using that and see if it is
good enough for us.
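For reference, the same stepped optimize can be issued from SolrJ (a small
sketch with placeholder host/core names):

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class SteppedOptimize {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://HOST:PORT/solr/CORE");
        server.optimize(true, true, 10); // waitFlush, waitSearcher, maxSegments
        server.optimize(true, true, 5);  // second pass: 10 segments down to 5
        server.shutdown();
    }
}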


On Sun, Feb 9, 2014 at 11:22 PM, Shawn Heisey  wrote:

> On 2/9/2014 11:41 PM, Arun Rangarajan wrote:
> > I have a 28 GB Solr 4.6 index with 45 segments. Optimize failed with an
> > 'out of memory' error. Is optimize really necessary, since I read that
> > lucene is able to handle multiple segments well now?
>
> I have had indexes with more than 45 segments, because of the merge
> settings that I use.  My large index shards are about 16GB at the
> moment.  Out of memory errors are very rare because I use a fairly large
> heap, at 6GB for a machine that hosts three of these large shards.  When
> I was still experimenting with my memory settings, I did see occasional
> out of memory errors during normal segment merging.
>
> Increasing your heap size is pretty much required at this point.  I've
> condensed some very basic information about heap sizing here:
>
> http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap
>
> As for whether optimizing on 4.x is necessary: I do not have any hard
> numbers for you, but I can tell you that an optimized index does seem
> noticeably faster than one that is freshly built and has a large
> number of relatively large segments.
>
> I optimize my index shards on a schedule, but it is relatively
> infrequent -- one large shard per night.  Most of the time what I have
> is one really large segment and a bunch of super-small segments, and
> that does not seem to suffer from performance issues compared to a fully
> optimized index.  The situation is different right after a fresh
> rebuild, which produces a handful of very large segments and a bunch of
> smaller segments of varying sizes.
>
> Interesting but probably irrelevant details:
>
> Although I don't use mergeFactor any more, the TieredMergePolicy
> settings that I use are equivalent to a mergeFactor of 35.  I chose this
> number back in the 1.4.1 days because it resulted in synchronicity
> between merges and lucene segment names when LogByteSizeMergePolicy was
> still in use.  Segments _0 through _z would be merged into segment _10,
> and so on.
>
> Thanks,
> Shawn
>
>


handleSelect=true with SolrCloud

2014-02-11 Thread Jeff Wartes

I'm working on a port of a Solr service to SolrCloud. (Targeting v4.6.0 at 
present.) The old query style relied on using /solr/select?qt=foo to select the 
proper requestHandler. I know handleSelect=true is deprecated now, but it'd be 
pretty handy for testing to be able to be backwards compatible, at least until 
some time after the initial release.

So in my SolrCloud configuration, I set <requestDispatcher handleSelect="true"> 
and deleted the /select requestHandler as suggested here: 
http://wiki.apache.org/solr/SolrRequestHandler#Old_handleSelect.3Dtrue_Resolution_.28qt_param.29

However, my /solr/collection1/select?qt=foo query throws an "unknown handler: 
null" error with this configuration. Has anyone successfully tried 
handleSelect=true with the collections API?

Thanks.




boost group doclist members

2014-02-11 Thread David Santamauro


Without falling into the x/y problem area, I'll explain what I want to 
do: I would like to group my result set by a field, f1, and within each 
group I'd like to boost the score of the "most appropriate" member of 
the group so it appears first in the doc list.


The "most appropriate" member is defined by the content of other fields 
(e.g., f2, f3). So basically, I'd like to boost based on the values in 
fields f2 and f3.


If there is a better way to achieve this, I'm all ears. But I was 
thinking this could be achieved by using a function query as the sort 
spec for group.sort.


Example content:


<doc>
  <field name="f1">4181770</field>
  <field name="f2">x_val</field>
  <field name="f3">100</field>
</doc>
<doc>
  <field name="f1">4181770</field>
  <field name="f2">y_val</field>
  <field name="f3">100</field>
</doc>
<doc>
  <field name="f1">4181770</field>
  <field name="f2">z_val</field>
  <field name="f3">100</field>
</doc>

All 3 of the above documents will be grouped into a doclist with 
groupValue=4181770. My question is then: how do I make the document 
with f2=y_val appear first in the doclist? I've been playing with


group.field=f1
group.sort=query({!dismax qf=f2 bq=f2:y_val^100}) asc

... but I'm getting:
org.apache.solr.common.SolrException: Can't determine a Sort Order (asc 
or desc) in sort spec 'query({!dismax qf=f2 bq=f2:y_val^100.0}) asc', 
pos=14.


Can anyone point me to some examples of this?
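One variant I have not verified: dereferencing the query into its own
parameter, so the sort parser only has to find the trailing asc/desc:

group.field=f1
group.sort=query($sq) asc
sq={!dismax qf=f2 bq=f2:y_val^100}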

thanks

David



Re: handleSelect=true with SolrCloud

2014-02-11 Thread Shawn Heisey

On 2/11/2014 10:21 AM, Jeff Wartes wrote:

> I'm working on a port of a Solr service to SolrCloud. (Targeting v4.6.0 at 
> present.) The old query style relied on using /solr/select?qt=foo to select the 
> proper requestHandler. I know handleSelect=true is deprecated now, but it'd be 
> pretty handy for testing to be able to be backwards compatible, at least until 
> some time after the initial release.
>
> So in my SolrCloud configuration, I set <requestDispatcher handleSelect="true"> 
> and deleted the /select requestHandler as suggested here: 
> http://wiki.apache.org/solr/SolrRequestHandler#Old_handleSelect.3Dtrue_Resolution_.28qt_param.29
>
> However, my /solr/collection1/select?qt=foo query throws an "unknown handler: 
> null" error with this configuration. Has anyone successfully tried 
> handleSelect=true with the collections API?


I'm pretty sure that if you don't have a handler named /select, then you 
need to have default="true" as an attribute on one of your other handler 
definitions.


See line 715 of the example solrconfig.xml for Solr 3.5:

http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_3_5/solr/example/solr/conf/solrconfig.xml?view=annotate
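The relevant line there looks like this:

<requestHandler name="search" class="solr.SearchHandler" default="true">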

Thanks,
Shawn



Re: handleSelect=true with SolrCloud

2014-02-11 Thread Jeff Wartes

Got it in one. Thanks!


On 2/11/14, 9:50 AM, "Shawn Heisey"  wrote:

>On 2/11/2014 10:21 AM, Jeff Wartes wrote:
>> I'm working on a port of a Solr service to SolrCloud. (Targeting v4.6.0
>>at present.) The old query style relied on using /solr/select?qt=foo to
>>select the proper requestHandler. I know handleSelect=true is deprecated
>>now, but it'd be pretty handy for testing to be able to be backwards
>>compatible, at least until some time after the initial release.
>>
>> So in my SolrCloud configuration, I set <requestDispatcher
>>handleSelect="true"> and deleted the /select requestHandler as suggested
>>here: 
>>http://wiki.apache.org/solr/SolrRequestHandler#Old_handleSelect.3Dtrue_Resolution_.28qt_param.29
>>
>> However, my /solr/collection1/select?qt=foo query throws an "unknown
>>handler: null" error with this configuration. Has anyone successfully
>>tried handleSelect=true with the collections api?
>
>I'm pretty sure that if you won't have a handler named /select, then you
>need to have default="true" as an attribute on one of your other handler
>definitions.
>
>See line 715 of the example solrconfig.xml for Solr 3.5:
>
>http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_3_5/solr/exam
>ple/solr/conf/solrconfig.xml?view=annotate
>
>Thanks,
>Shawn
>



Re: USER NAME Baruch Labunski

2014-02-11 Thread Baruch
Hello Wiki admin,

 I would like to add some valuable links. Can you please add me? My user name is 
Baruch Labunski.


Thank You,
Baruch!



On Thursday, January 16, 2014 2:12:32 PM, Baruch  wrote:
 
Hello Wiki admin,

 I would like to add some valuable links. Can you please add me? My user name is 
Baruch Labunski.


Thank You,

Baruch!

Re: Lowering query time

2014-02-11 Thread Joel Cohen
It's a custom ingestion process. It does a big DB query and then inserts
stuff in batches. The batch size is tuneable.
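Roughly this pattern (an illustrative sketch, not our actual code):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedIngest {

    public static void index(SolrServer solr, Iterable<SolrInputDocument> docs,
            int batchSize) throws Exception {
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(batchSize);
        for (SolrInputDocument doc : docs) {
            batch.add(doc);
            if (batch.size() >= batchSize) {
                solr.add(batch); // one round trip per batch, not per document
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);
        }
        solr.commit();
    }
}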


On Tue, Feb 11, 2014 at 11:23 AM, Erick Erickson wrote:

> Hmmm, I'm still a little puzzled BTW. 300K documents, unless they're
> huge, shouldn't be taking 100 minutes. I can index 11M documents on
> my laptop (Wikipedia dump) in 45 minutes for instance Of course
> that's a single core, not cloud and not replicas...
>
> So possibly it' on the data acquisition side? Is your Solr CPU pegged?
>
> YMMV of course.
>
> Erick
>
>
> On Tue, Feb 11, 2014 at 6:40 AM, Joel Cohen 
> wrote:
>
> > I'd like to thank you for lending a hand on my query time problem with
> > SolrCloud. By switching to a single shard with replicas setup, I've
> reduced
> > my query time to 18 msec. My full ingestion of 300k+ documents went down
> > from 2 hours 50 minutes to 1 hour 40 minutes. There are some code changes
> > that are going in that should help a bit as well. Big thanks to everyone
> > that had suggestions.
> >
> >
> > On Tue, Feb 4, 2014 at 8:11 PM, Alexandre Rafalovitch <
> arafa...@gmail.com
> > >wrote:
> >
> > > I suspect faceting is the issue here. The actual query you have shown
> > > seem to bring back a single document (or a single set of document for
> > > a product):
> > > fq=id:(320403401)
> > >
> > > On the other hand, you are asking for 4 field facets:
> > > facet.field=q_virtualCategory_ss
> > > facet.field=q_brand_s
> > > facet.field=q_color_s
> > > facet.field=q_category_ss
> > > AND 2 range facets, both clustered/grouped:
> > > facet.range=daysSinceStart_i
> > > facet.range=activePrice_l (e.g. f.activePrice_l.facet.range.gap=5000)
> > >
> > > And for all facets you have asked to bring back ALL of the results:
> > > facet.limit=-1
> > >
> > > Plus, you are doing a complex sort:
> > > sort=popularity_i desc,popularity_i desc
> > >
> > > So, you are probably spending quite a bit of time counting (especially
> > > in a shared setup) and then quite a bit more sending the response
> > > back.
> > >
> > > I would check the size of the result document (HTTP result) and see
> > > how large it is. Maybe you don't need all of the stuff that's coming
> > > back. I assume you are not actually querying Solr from the client's
> > > machine (that is I hope it is inside your data centre close to your
> > > web server), otherwise I would say to look at automatic content
> > > compression as well to minimize on-wire document size.
> > >
> > > Finally, if your documents have many stored fields (store=true in
> > > schema.xml) but you only return small subsets of them during search,
> > > you could look into using enableLazyFieldLoading flag in the
> > > solrconfig.
> > >
> > > Regards,
> > >Alex.
> > > P.s. As others said, you don't seem to have too many documents.
> > > Perhaps you want replication instead of sharding for improved
> > > performance.
> > > Personal website: http://www.outerthoughts.com/
> > > LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> > > - Time is the quality of nature that keeps events from happening all
> > > at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> > > book)
> > >
> > >
> > > On Wed, Feb 5, 2014 at 6:31 AM, Alexey Kozhemiakin
> > >  wrote:
> > > > Btw "timing" for distributed requests are broken at this moment, it
> > > doesn't combine values from requests to shards.  I'm working on a
> patch.
> > > >
> > > > https://issues.apache.org/jira/browse/SOLR-3644
> > > >
> > > > -Original Message-
> > > > From: Jack Krupansky [mailto:j...@basetechnology.com]
> > > > Sent: Tuesday, February 04, 2014 22:00
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: Lowering query time
> > > >
> > > > Add the debug=true parameter to some test queries and look at the
> > > "timing"
> > > > section to see which search components are taking the time.
> > > Traditionally, highlighting for large documents was a top culprit.
> > > >
> > > > Are you returning a lot of data or field values? Sometimes reducing
> the
> > > amount of data processed can help. Any multivalued fields with lots of
> > > values?
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > -Original Message-
> > > > From: Joel Cohen
> > > > Sent: Tuesday, February 4, 2014 1:43 PM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: Lowering query time
> > > >
> > > > 1. We are faceting. I'm not a developer so I'm not quite sure how
> we're
> > > doing it. How can I measure?
> > > > 2. I'm not sure how we'd force this kind of document partitioning. I
> > can
> > > see how my shards are partitioned by looking at the clusterstate.json
> > from
> > > Zookeeper, but I don't have a clue on how to get documents into
> specific
> > > shards.
> > > >
> > > > Would I be better off with fewer shards given the small size of my
> > > indexes?
> > > >
> > > >
> > > > On Tue, Feb 4, 2014 at 12:32 PM, Yonik Seeley  >
> > > wrote:
> > > >
> > > >> On Tue, Feb 4, 2014 at 12:12 PM, 

Re: handleSelect=true with SolrCloud

2014-02-11 Thread Joel Bernstein
Jeff,

I believe the shards.qt parameter is what you're looking for. For example
when using the "/elevate" handler with SolrCloud I use the following url to
tell Solr to use the "/elevate" handler on the shards:

http://localhost:8983/solr/collection1/elevate?q=ipod&wt=json&indent=true&shards.qt=/elevate







Joel Bernstein
Search Engineer at Heliosearch


On Tue, Feb 11, 2014 at 1:01 PM, Jeff Wartes  wrote:

>
> Got it in one. Thanks!
>
>
> On 2/11/14, 9:50 AM, "Shawn Heisey"  wrote:
>
> >On 2/11/2014 10:21 AM, Jeff Wartes wrote:
> >> I'm working on a port of a Solr service to SolrCloud. (Targeting v4.6.0
> >>at present.) The old query style relied on using /solr/select?qt=foo to
> >>select the proper requestHandler. I know handleSelect=true is deprecated
> >>now, but it'd be pretty handy for testing to be able to be backwards
> >>compatible, at least until some time after the initial release.
> >>
> >> So in my SolrCloud configuration, I set <requestDispatcher
> >>handleSelect="true"> and deleted the /select requestHandler as suggested
> >>here:
> >>
> http://wiki.apache.org/solr/SolrRequestHandler#Old_handleSelect.3Dtrue_Resolution_.28qt_param.29
> >>
> >> However, my /solr/collection1/select?qt=foo query throws an "unknown
> >>handler: null" error with this configuration. Has anyone successfully
> >>tried handleSelect=true with the collections api?
> >
> >I'm pretty sure that if you won't have a handler named /select, then you
> >need to have default="true" as an attribute on one of your other handler
> >definitions.
> >
> >See line 715 of the example solrconfig.xml for Solr 3.5:
> >
> >
> http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_3_5/solr/exam
> >ple/solr/conf/solrconfig.xml?view=annotate
> >
> >Thanks,
> >Shawn
> >
>
>


RE: handleSelect=true with SolrCloud

2014-02-11 Thread EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions)
Hi Jeff, it is not about elevation; I am asking along the lines of relevancy / 
boost / score.

Select productid from products where SKU = '101'
Select Productid from products where ManufactureSKU = '101'
Select Productid from product where SKU Like '101%'
Select Productid from Product where ManufactureSKU like '101%'
Select Productid from product where Name Like '101%'
Select Productid from Product where Description like '%101%'

Is there any way Solr can search for the exact match, starts-with, and 
anywhere (contains) in a single Solr query?
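I am wondering if something along these lines would work (a guess on my
part; boosts arbitrary, so exact beats starts-with beats a plain tokenized
"anywhere" match):

q=SKU:"101"^100 OR SKU:101*^10 OR Name:101*^5 OR Description:101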

-Original Message-
From: Joel Bernstein [mailto:joels...@gmail.com] 
Sent: Tuesday, February 11, 2014 3:11 PM
To: solr-user@lucene.apache.org
Subject: Re: handleSelect=true with SolrCloud

Jeff,

I believe the shards.qt parameter is what you're looking for. For example when 
using the "/elevate" handler with SolrCloud I use the following url to tell 
Solr to use the "/elevate" handler on the shards:

http://localhost:8983/solr/collection1/elevate?q=ipod&wt=json&indent=true&shards.qt=/elevate







Joel Bernstein
Search Engineer at Heliosearch


On Tue, Feb 11, 2014 at 1:01 PM, Jeff Wartes  wrote:

>
> Got it in one. Thanks!
>
>
> On 2/11/14, 9:50 AM, "Shawn Heisey"  wrote:
>
> >On 2/11/2014 10:21 AM, Jeff Wartes wrote:
> >> I'm working on a port of a Solr service to SolrCloud. (Targeting 
> >>v4.6.0 at present.) The old query style relied on using 
> >>/solr/select?qt=foo to select the proper requestHandler. I know 
> >>handleSelect=true is deprecated now, but it'd be pretty handy for 
> >>testing to be able to be backwards compatible, at least until some time 
> >>after the initial release.
> >>
> >> So in my SolrCloud configuration, I set <requestDispatcher 
> >>handleSelect="true"> and deleted the /select requestHandler as 
> >>suggested here:
> >>
> http://wiki.apache.org/solr/SolrRequestHandler#Old_handleSelect.3Dtrue_Resolution_.28qt_param.29
> >>
> >> However, my /solr/collection1/select?qt=foo query throws an "unknown 
> >>handler: null" error with this configuration. Has anyone 
> >>successfully tried handleSelect=true with the collections api?
> >
> >I'm pretty sure that if you won't have a handler named /select, then 
> >you need to have default="true" as an attribute on one of your other 
> >handler definitions.
> >
> >See line 715 of the example solrconfig.xml for Solr 3.5:
> >
> >
> http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_3_5/solr/
> exam
> >ple/solr/conf/solrconfig.xml?view=annotate
> >
> >Thanks,
> >Shawn
> >
>
>


Solr Autosuggest - Strange issue with leading numbers in query

2014-02-11 Thread Developer
I have a strange issue with Autosuggest.

Whenever I query for a keyword with leading numbers, it returns the
suggestion corresponding to the letters (ignoring the numbers). I was
under the assumption that it would return an empty result. I am not sure
what I am doing wrong. Can someone help?

*Query:*
/autocomplete?qt=/lucid&req_type=auto_complete&spellcheck.maxCollations=10&q="12342343243242ga"&spellcheck.count=10

*Result:*

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="ga">
        <int name="numFound">1</int>
        <int name="startOffset">15</int>
        <int name="endOffset">17</int>
        <arr name="suggestion">
          <str>galaxy</str>
        </arr>
      </lst>
      <str name="collation">"12342343243242galaxy"</str>
    </lst>
  </lst>
</response>
*My field configuration is as below:*








*SolrConfig.xml*

<searchComponent name="autocomplete" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">autocomplete</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">autocomplete_word</str>
    <str name="spellcheckIndexDir">autocomplete</str>
    <str name="buildOnCommit">true</str>
    <float name="threshold">.005</float>
  </lst>
</searchComponent>

<requestHandler name="/autocomplete" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">autocomplete</str>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.collate">false</str>
  </lst>
  <arr name="components">
    <str>autocomplete</str>
  </arr>
</requestHandler>


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Autosuggest-Strange-issue-with-leading-numbers-in-query-tp4116751.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing question on individual field update

2014-02-11 Thread shamik
Erick,

  Thanks for your reply. I should have given better context. I'm currently
running an incremental crawl daily on this particular source and indexing
the documents. The incremental crawl looks for any change since the last
crawl, based on the document publish date. But there's no way for me to know
if a document has been deleted. To handle that, I run a full crawl on a
weekend, which basically re-indexes the entire content. After the full index
is over, I call a purge script, which deletes any content that is more than
24 hours old, based on the indextimestamp field.

The issue with atomic update is that it doesn't alter the indextimestamp
field. So even if I run a full crawl with atomic updates, the timestamp will
stick to its old value. Unfortunately, I can't rely on another date field
coming from the source, as they are not consistent. That translates to the
fact that I can't remove stale content.

Let me know if I'm missing something here.

- Thanks,
Shamik





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-question-on-individual-field-update-tp4116605p4116757.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Solr server requirements for 100+ million documents

2014-02-11 Thread Susheel Kumar
Hi Otis,

Just to confirm, the 3 servers you mean here are 2 for shards/nodes and 1 for 
Zookeeper. Is that correct?

Thanks,
Susheel

-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: Friday, January 24, 2014 5:21 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr server requirements for 100+ million documents

Hi Susheel,

Like Erick said, it's impossible to give precise recommendations, but making a 
few assumptions and combining them with experience (+ a licked finger in the 
air):
* 3 servers
* 32 GB
* 2+ CPU cores
* Linux

Assuming docs are not bigger than a few KB, that they are not being reindexed 
over and over, that you don't have a search rate higher than a few dozen QPS, 
assuming your queries are not a page long, etc. assuming best practices are 
followed, the above should be sufficient.

I hope this helps.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics Solr & Elasticsearch 
Support * http://sematext.com/


On Fri, Jan 24, 2014 at 1:10 PM, Susheel Kumar < 
susheel.ku...@thedigitalgroup.net> wrote:

> Hi,
>
> Currently we are indexing 10 million document from database (10 db 
> data
> entities) & index size is around 8 GB on windows virtual box. Indexing 
> in one shot taking 12+ hours while indexing parallel in separate cores 
> & merging them together taking 4+ hours.
>
> We are looking to scale to 100+ million documents and looking for 
> recommendation on servers requirements on below parameters for a 
> Production environment. There can be 200+ users performing search same time.
>
> No of physical servers (considering solr cloud) Memory requirement 
> Processor requirement (# cores) Linux as OS oppose to windows
>
> Thanks in advance.
> Susheel
>
>


Re: Solr server requirements for 100+ million documents

2014-02-11 Thread Otis Gospodnetic
Hi Susheel,

No, we wouldn't want to go with just 1 ZK. :)

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Feb 11, 2014 at 5:18 PM, Susheel Kumar <
susheel.ku...@thedigitalgroup.net> wrote:

> Hi Otis,
>
> Just to confirm, the 3 servers you mean here are 2 for shards/nodes and 1
> for Zookeeper. Is that correct?
>
> Thanks,
> Susheel
>
> -Original Message-
> From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
> Sent: Friday, January 24, 2014 5:21 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr server requirements for 100+ million documents
>
> Hi Susheel,
>
> Like Erick said, it's impossible to give precise recommendations, but
> making a few assumptions and combining them with experience (+ a licked
> finger in the air):
> * 3 servers
> * 32 GB
> * 2+ CPU cores
> * Linux
>
> Assuming docs are not bigger than a few KB, that they are not being
> reindexed over and over, that you don't have a search rate higher than a
> few dozen QPS, assuming your queries are not a page long, etc. assuming
> best practices are followed, the above should be sufficient.
>
> I hope this helps.
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics Solr &
> Elasticsearch Support * http://sematext.com/
>
>
> On Fri, Jan 24, 2014 at 1:10 PM, Susheel Kumar <
> susheel.ku...@thedigitalgroup.net> wrote:
>
> > Hi,
> >
> > Currently we are indexing 10 million document from database (10 db
> > data
> > entities) & index size is around 8 GB on windows virtual box. Indexing
> > in one shot taking 12+ hours while indexing parallel in separate cores
> > & merging them together taking 4+ hours.
> >
> > We are looking to scale to 100+ million documents and looking for
> > recommendation on servers requirements on below parameters for a
> > Production environment. There can be 200+ users performing search same
> time.
> >
> > No of physical servers (considering solr cloud) Memory requirement
> > Processor requirement (# cores) Linux as OS oppose to windows
> >
> > Thanks in advance.
> > Susheel
> >
> >
>


RE: Solr server requirements for 100+ million documents

2014-02-11 Thread Susheel Kumar
Thanks for the quick reply, Otis. So for ZK, do you recommend separate servers, 
and if so, how many for an initial SolrCloud cluster setup?

-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: Tuesday, February 11, 2014 4:21 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr server requirements for 100+ million documents

Hi Susheel,

No, we wouldn't want to go with just 1 ZK. :)

Otis
--
Performance Monitoring * Log Analytics * Search Analytics Solr & Elasticsearch 
Support * http://sematext.com/


On Tue, Feb 11, 2014 at 5:18 PM, Susheel Kumar < 
susheel.ku...@thedigitalgroup.net> wrote:

> Hi Otis,
>
> Just to confirm, the 3 servers you mean here are 2 for shards/nodes 
> and 1 for Zookeeper. Is that correct?
>
> Thanks,
> Susheel
>
> -Original Message-
> From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
> Sent: Friday, January 24, 2014 5:21 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr server requirements for 100+ million documents
>
> Hi Susheel,
>
> Like Erick said, it's impossible to give precise recommendations, but 
> making a few assumptions and combining them with experience (+ a 
> licked finger in the air):
> * 3 servers
> * 32 GB
> * 2+ CPU cores
> * Linux
>
> Assuming docs are not bigger than a few KB, that they are not being 
> reindexed over and over, that you don't have a search rate higher than 
> a few dozen QPS, assuming your queries are not a page long, etc. 
> assuming best practices are followed, the above should be sufficient.
>
> I hope this helps.
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics Solr & 
> Elasticsearch Support * http://sematext.com/
>
>
> On Fri, Jan 24, 2014 at 1:10 PM, Susheel Kumar < 
> susheel.ku...@thedigitalgroup.net> wrote:
>
> > Hi,
> >
> > Currently we are indexing 10 million document from database (10 db 
> > data
> > entities) & index size is around 8 GB on windows virtual box. 
> > Indexing in one shot taking 12+ hours while indexing parallel in 
> > separate cores & merging them together taking 4+ hours.
> >
> > We are looking to scale to 100+ million documents and looking for 
> > recommendation on servers requirements on below parameters for a 
> > Production environment. There can be 200+ users performing search 
> > same
> time.
> >
> > No of physical servers (considering solr cloud) Memory requirement 
> > Processor requirement (# cores) Linux as OS oppose to windows
> >
> > Thanks in advance.
> > Susheel
> >
> >
>


Re: Indexing question on individual field update

2014-02-11 Thread shamik
Ok, I was wrong here. I can always set the indextimestamp field with the
current time (NOW) for every atomic update. On a similar note, is there any
performance penalty for atomic updates compared to regular adds?
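
For the archives, a minimal SolrJ sketch of that idea (the URL, id, and title
values are placeholders; "NOW" is Solr date math, so the stamp is applied
server-side; as with any atomic update this needs <updateLog/> and stored
fields):

import java.util.Collections;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class TouchTimestamp {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server =
        new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    // "set" atomically replaces the stored value of the field...
    doc.addField("title", Collections.singletonMap("set", "updated title"));
    // ...and "NOW" makes Solr stamp the current time on the date field.
    doc.addField("indextimestamp", Collections.singletonMap("set", "NOW"));
    server.add(doc);
    server.commit();
    server.shutdown();
  }
}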



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-question-on-individual-field-update-tp4116605p4116772.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr server requirements for 100+ million documents

2014-02-11 Thread svante karlsson
ZK needs a quorum to stay functional, so 3 servers handle one failure and 5
handle 2 node failures. If you run Solr with 1 replica per shard, then stick to
3 ZK nodes. If you use 2 replicas, use 5 ZK nodes.
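
For what it's worth, the client side against a 3-node ensemble looks like this
minimal SolrJ sketch (host names are made up) - all ZK nodes go into a single
comma-separated zkHost string, so the client keeps working as long as a quorum
(2 of 3) is up:

import org.apache.solr.client.solrj.impl.CloudSolrServer;

public class ZkClientExample {
  public static void main(String[] args) throws Exception {
    // One connect string listing the whole ensemble.
    CloudSolrServer server = new CloudSolrServer(
        "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
    server.setDefaultCollection("collection1");
    server.connect();  // throws if no ZK quorum is reachable
    server.shutdown();
  }
}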





>


Replica node down but zookeeper clusterstate not updated

2014-02-11 Thread Gopal Patwa
Solr = 4.6.1 (SolrCloud admin console view attached)
Zookeeper 3.4.5 = 3-node ensemble

In my test setup, I have a 3-node SolrCloud setup with 2 shards. Today we had
a power failure and all nodes went down.

I started the 3-node ZooKeeper ensemble first, followed by the 3-node
SolrCloud, and one replica's IP address changed due to dynamic IP
allocation, but the ZooKeeper
clusterstate was not updated with the new IP address; it was still holding
the old IP address for that bad node.

Do I need to manually update the clusterstate in ZooKeeper? What are my
options if this happens in production?

Bad node:
old IP:10.249.132.35 (still exist in zookeeper)
new IP: 10.249.133.10

Log from Node1:

11:26:25,242 INFO  [STDOUT] 49170786 [Thread-2-EventThread] INFO
 org.apache.solr.common.cloud.ZkStateReader  - A cluster state change:
WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 3)
11:26:41,072 INFO  [STDOUT] 49186615 [RecoveryThread] INFO
 org.apache.solr.cloud.ZkController  - publishing
core=genre_shard1_replica1 state=recovering
11:26:41,079 INFO  [STDOUT] 49186622 [RecoveryThread] ERROR
org.apache.solr.cloud.RecoveryStrategy  - Error while trying to recover.
core=genre_shard1_replica1:org.apache.solr.client.solrj.SolrServerException:
Server refused connection at: http://10.249.132.35:8080/solr
11:26:41,079 INFO  [STDOUT] at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:496)
11:26:41,079 INFO  [STDOUT] at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
11:26:41,079 INFO  [STDOUT] at
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:221)
11:26:41,079 INFO  [STDOUT] at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:367)
11:26:41,079 INFO  [STDOUT] at
org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:244)
11:26:41,079 INFO  [STDOUT] Caused by:
org.apache.http.conn.HttpHostConnectException: Connection to
http://10.249.132.35:8080 refused


11:27:14,036 INFO  [STDOUT] 49219580 [RecoveryThread] ERROR
org.apache.solr.cloud.RecoveryStrategy  - Recovery failed - trying again...
(9) core=geo_shard1_replica1
11:27:14,037 INFO  [STDOUT] 49219581 [RecoveryThread] INFO
 org.apache.solr.cloud.RecoveryStrategy  - Wait 600.0 seconds before trying
to recover again (10)
11:27:14,958 INFO  [STDOUT] 49220498 [Thread-40] INFO
 org.apache.solr.common.cloud.ZkStateReader  - Updating cloud state from
ZooKeeper...



Log from bad node with new ip address:

11:06:29,551 INFO  [STDOUT] 6234 [coreLoadExecutor-4-thread-10] INFO
 org.apache.solr.cloud.ShardLeaderElectionContext  - Enough replicas found
to continue.
11:06:29,552 INFO  [STDOUT] 6236 [coreLoadExecutor-4-thread-10] INFO
 org.apache.solr.cloud.ShardLeaderElectionContext  - I may be the new
leader - try and sync
11:06:29,554 INFO  [STDOUT] 6237 [coreLoadExecutor-4-thread-10] INFO
 org.apache.solr.cloud.SyncStrategy  - Sync replicas to
http://10.249.132.35:8080/solr/venue_shard2_replica2/
11:06:29,555 INFO  [STDOUT] 6239 [coreLoadExecutor-4-thread-10] INFO
 org.apache.solr.update.PeerSync  - PeerSync: core=venue_shard2_replica2
url=http://10.249.132.35:8080/solr START replicas=[
http://10.249.132.56:8080/solr/venue_shard2_replica1/] nUpdates=100
11:06:29,556 INFO  [STDOUT] 6240 [coreLoadExecutor-4-thread-10] INFO
 org.apache.solr.update.PeerSync  - PeerSync: core=venue_shard2_replica2
url=http://10.249.132.35:8080/solr DONE.  We have no versions.  sync failed.
11:06:29,556 INFO  [STDOUT] 6241 [coreLoadExecutor-4-thread-10] INFO
 org.apache.solr.cloud.SyncStrategy  - Leader's attempt to sync with shard
failed, moving to the next candidate
11:06:29,558 INFO  [STDOUT] 6241 [coreLoadExecutor-4-thread-10] INFO
 org.apache.solr.cloud.ShardLeaderElectionContext  - We failed sync, but we
have no versions - we can't sync in that case - we were active before, so
become leader anyway
11:06:29,559 INFO  [STDOUT] 6243 [coreLoadExecutor-4-thread-10] INFO
 org.apache.solr.cloud.ShardLeaderElectionContext  - I am the new leader:
http://10.249.132.35:8080/solr/venue_shard2_replica2/ shard2
11:06:29,561 INFO  [STDOUT] 6245 [coreLoadExecutor-4-thread-10] INFO
 org.apache.solr.common.cloud.SolrZkClient  - makePath:
/collections/venue/leaders/shard2
11:06:29,577 INFO  [STDOUT] 6261 [Thread-2-EventThread] INFO
 org.apache.solr.update.PeerSync  - PeerSync: core=event_shard2_replica2
url=http://10.249.132.35:8080/solr  Received 18 versions from
10.249.132.56:8080/solr/event_shard2_replica1/
11:06:29,578 INFO  [STDOUT] 6263 [Thread-2-EventThread] INFO
 org.apache.solr.update.PeerSync  - PeerSync: core=event_shard2_replica2
url=http://10.249.132.35:8080/solr Requesting updates from
10.249.132.56:8080/solr/event_shard2_replica1/ n=10 versions=[1457764666067386368,
1456709993140060160, 1456709989863260160,
1456709986075803648, 1456709971758546944, 1456709179685208064,
1456709137524064256, 1456709130

Re: Indexing question on individual field update

2014-02-11 Thread Shawn Heisey

On 2/11/2014 2:37 PM, shamik wrote:

Eric,

   Thanks for your reply. I should have given better context. I'm currently
running an incremental crawl daily on this particular source and indexing
the documents. The incremental crawl looks for any change since the last crawl
date, based on the document publish date. But there's no way for me to know if
a document has been deleted. To catch that, I run a full crawl on a weekend,
which basically re-indexes the entire content. After the full index is over, I
call a purge script, which deletes any content that is more than 24 hours
old, based on the indextimestamp field.

The issue with atomic update is that it doesn't alter the indextimestamp
field. So even if I run a full crawl with atomic updates, the timestamp will
stick to its old value. Unfortunately, I can't rely on another date field
coming from the source, as they are not consistent. That means
I can't remove stale content.


One possibility is this: When you send the atomic update to Solr, 
include a new value for the indextimestamp field.


Another option: You can write a custom update processor plugin for 
Solr.  When the custom code is used, it will be executed on each 
incoming document.  Depending on what it finds in the update request, it 
can make appropriate changes, like updating indextimestamp.  You can do 
pretty much anything.


http://wiki.apache.org/solr/UpdateRequestProcessor
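
As a rough, untested illustration of that (the indextimestamp field name is
just this thread's convention, and where you place the processor in the chain
matters for atomic updates), such a processor could look like:

import java.io.IOException;
import java.util.Date;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class TimestampUpdateProcessorFactory
    extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        // Stamp every incoming document, atomic update or not.
        SolrInputDocument doc = cmd.getSolrInputDocument();
        doc.setField("indextimestamp", new Date());
        super.processAdd(cmd);
      }
    };
  }
}

You would then wire it into an updateRequestProcessorChain in solrconfig.xml
ahead of RunUpdateProcessorFactory.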

Writing an update processor in Java typically gives the best results in 
terms of flexibility and performance, but there is also a way to use 
other programming languages:


http://wiki.apache.org/solr/ScriptUpdateProcessor

Thanks,
Shawn



Re: Solr server requirements for 100+ million documents

2014-02-11 Thread Jason Hellman
Whether you use the same machines as Solr or separate machines is a matter 
of taste.

If you are the CTO, then you should make this decision.  If not, inform 
management that risk conditions are greater when you share function and control 
on a single piece of hardware.  A single failure of a replica + zookeeper node 
will be more impactful than a single failure of a replica *or* a zookeeper 
node.  Let them earn the big bucks to make the risk decision.

The good news is, zookeeper hardware can be extremely lightweight for Solr 
Cloud.  Commodity hardware should work just fine…and thus scaling to 5 nodes 
for zookeeper is not that hard at all.

Jason


On Feb 11, 2014, at 3:00 PM, svante karlsson  wrote:

> ZK needs a quorum to stay functional, so 3 servers handle one failure and 5
> handle 2 node failures. If you run Solr with 1 replica per shard, then stick to
> 3 ZK nodes. If you use 2 replicas, use 5 ZK nodes.
> 
> 
> 
> 
> 
>> 



Re: Solr server requirements for 100+ million documents

2014-02-11 Thread Shawn Heisey

On 2/11/2014 3:28 PM, Susheel Kumar wrote:

Thanks, Otis, for the quick reply. So for ZK, do you recommend separate servers, 
and if so, how many for an initial SolrCloud cluster setup?


In a minimal 3-server setup, all servers would run ZooKeeper and two of 
them would also run Solr. With this setup, you can survive the failure of 
any of those three machines, even if it dies completely.


If the third machine is only running zookeeper, two fast CPU cores and 
2GB of RAM would be plenty.  For 100 million documents, I would 
personally recommend at least 8 CPU cores on the machines running Solr, 
ideally provided by at least two separate physical CPUs.  Otis 
recommended 32GB of RAM as a starting point.  You would very likely want 
more.


One copy of my 90 million document index uses two servers to run all the 
shards.  Because I have two copies of the index, I have four servers.  
Each server has 64GB of RAM.  This is **NOT** running SolrCloud, but if 
it were, I would have zookeeper running on three of those servers.


Thanks,
Shawn



Re: FuzzyLookupFactory with exactMatchFirst not giving the exact match.

2014-02-11 Thread Hamish Campbell
I've tried the new SuggestComponent; however, it doesn't work quite as
expected. It returns the full field value rather than a list of corrections
for the specific term. I can see how SuggestComponent would be excellent
for phrase suggestions and document lookups, but it doesn't seem to be
suitable for per-word spelling suggestions. Correct me if I'm wrong.

I'm taking another look at solr.SpellCheckComponent. I've switched on
`spellcheck.extendedResults`, but the response `correctlySpelled` is always
false, regardless of other settings. It seems to be an instance of SOLR-4278.
In that ticket James Dyer says:

> You can tell if the user's keywords exist in the index on a term-by-term
basis by specifying "spellcheck.extendedResults=true". Then look under each
term's entry for an origFreq of 0.

This would suit me perfectly - but `origFreq` does not appear in the
response at all. I'm looking at that code, but tracing down how the token
frequency is added is leading me down a deep and dark rabbit hole :). Am I
missing something basic here?
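
For reference, a minimal SolrJ equivalent of the request I'm making (handler
name and URL are placeholders), in case someone spots what's missing:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;

public class SpellcheckProbe {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server =
        new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery query = new SolrQuery("test");
    query.setRequestHandler("/spell");  // handler wired to the spellcheck component
    query.set("spellcheck", true);
    query.set("spellcheck.extendedResults", true);
    query.set("spellcheck.count", 10);
    QueryResponse rsp = server.query(query);
    SpellCheckResponse scr = rsp.getSpellCheckResponse();
    if (scr != null) {
      System.out.println("correctlySpelled: " + scr.isCorrectlySpelled());
    }
    server.shutdown();
  }
}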


On Tue, Feb 11, 2014 at 3:59 PM, Areek Zillur  wrote:

> Dont worry about the analysis chain, I realized you are using the
> spellcheck component for suggestions. The suggestion gets returned from the
> Lucene layer, but unfortunately the Spellcheck component strips the
> suggestion out as it is mainly built for spell checking (when the query
> token == suggestion; spelling is correct, so why suggest it!). You can try
> out the SuggestComponent (SOLR-5378), it does the right thing in this
> situation.
>
>
> On Mon, Feb 10, 2014 at 9:30 PM, Areek Zillur  wrote:
>
> > That should not be the case, Maybe the analysis-chain of 'text_spell' is
> > doing something before the key hits the suggester (you want to use
> > something like KeywordTokenizerFactory)? Also maybe specify the
> queryAnalyzerFieldType
> > in the suggest component config? you might want to do something similar
> to
> > solr-config: (
> >
> https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/test-files/solr/collection1/conf/solrconfig-phrasesuggest.xml
> )
> > [look at suggest_analyzing component] and schema: (
> >
> https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/test-files/solr/collection1/conf/schema-phrasesuggest.xml
> )
> > [look at phrase_suggest field type].
> >
> >
> > On Mon, Feb 10, 2014 at 8:44 PM, Hamish Campbell <
> > hamish.campb...@koordinates.com> wrote:
> >
> >> Same issue with AnalyzingLookupFactory - I'll get autocomplete
> suggestions
> >> but not the original query.
> >>
> >>
> >> On Tue, Feb 11, 2014 at 1:57 PM, Areek Zillur 
> wrote:
> >>
> >> > The FuzzyLookupFactory should accept all the options as that of as
> >> > AnalyzingLookupFactory (
> >> >
> >> >
> >>
> http://lucene.apache.org/solr/4_2_1/solr-core/org/apache/solr/spelling/suggest/fst/AnalyzingLookupFactory.html
> >> > ).
> >> > [FuzzySuggester is a direct subclass of the AnalyzingSuggester in
> >> lucene].
> >> > Have you tried the exactMatchFirst with the AnalyzingLookupFactory?
> Does
> >> > AnalyzingLookup have the same problem with the exactMatchFirst option?
> >> >
> >> >
> >> > On Mon, Feb 10, 2014 at 6:00 PM, Hamish Campbell <
> >> > hamish.campb...@koordinates.com> wrote:
> >> >
> >> > > Looking at:
> >> > >
> >> > >
> >> > >
> >> >
> >>
> http://lucene.apache.org/solr/4_2_1/solr-core/org/apache/solr/spelling/suggest/fst/FuzzyLookupFactory.html
> >> > >
> >> > > It seems that exactMatchFirst is not a valid option for
> >> > FuzzyLookupFactory.
> >> > > Potential workarounds?
> >> > >
> >> > >
> >> > > On Mon, Feb 10, 2014 at 5:04 PM, Hamish Campbell <
> >> > > hamish.campb...@koordinates.com> wrote:
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > I've got a FuzzyLookupFactory spellchecker with exactMatchFirst
> >> > enabled.
> >> > > A
> >> > > > query like "tes" will return "test" and "testing", but a query for
> >> > "test"
> >> > > > will *not* return "test" even though it is clearly in the
> >> dictionary.
> >> > Why
> >> > > > would this be?
> >> > > >
> >> > > > Relevant config follows
> >> > > >
> >> > > > 
> >> > > > 
> >> > > > suggest
> >> > > >
> >> > > > 
> >> > > >  >> > > > name="classname">org.apache.solr.spelling.suggest.Suggester
> >> > > >  >> > > >
> >> > >
> >> >
> >>
> name="lookupImpl">org.apache.solr.spelling.suggest.fst.FuzzyLookupFactory
> >> > > >
> >> > > > 
> >> > > > false
> >> > > > true
> >> > > > text_spell
> >> > > > 0.005
> >> > > >
> >> > > > 
> >> > > >
> >> > > > 
> >> > > > suggest
> >> > > > 
> >> > > > 
> >> > > >
> >> > > > 
> >> > > > 
> >> > > > true
> >> > > > suggest
> >> > > > true
> >> > > > 5
> >> > > > true
> >> > > > 
> >> > > >
> >> > > > 
> >> > > > suggest
> >> > > > 
> >> > > > 
> >> > > >
> >> > > > --
> >> > > > Hamish Campbell
> >> > > > Koordinates Ltd 
> >> > > > PH

Re: FuzzyLookupFactory with exactMatchFirst not giving the exact match.

2014-02-11 Thread Hamish Campbell
Ah, I think the term frequency is only available for the Spellcheckers
rather than the Suggesters - so I tried a DirectSolrSpellChecker. This gave
me good spelling suggestions for misspelt terms, but if the term is spelled
correctly I, again, get no term information and correctlySpelled is false.
Back to square 1.


On Wed, Feb 12, 2014 at 12:37 PM, Hamish Campbell <
hamish.campb...@koordinates.com> wrote:

> I've tried the new SuggestComponent; however, it doesn't work quite as
> expected. It returns the full field value rather than a list of corrections
> for the specific term. I can see how SuggestComponent would be excellent
> for phrase suggestions and document lookups, but it doesn't seem to be
> suitable for per-word spelling suggestions. Correct me if I'm wrong.
>
> I'm taking another look at solr.SpellCheckComponent. I've switched on
> `spellcheck.extendedResults`, but the response `correctlySpelled` is always
> false, regardless of other settings. It seems to be an instance of SOLR-4278.
> In that ticket James Dyer says:
>
> > You can tell if the user's keywords exist in the index on a term-by-term
> basis by specifying "spellcheck.extendedResults=true". Then look under each
> term's entry for an origFreq of 0.
>
> This would suit me perfectly - but `origFreq` does not appear in the
> response at all. I'm looking at that code, but tracing down how the token
> frequency is added is leading me down a deep and dark rabbit hole :). Am I
> missing something basic here?
>
>
> On Tue, Feb 11, 2014 at 3:59 PM, Areek Zillur  wrote:
>
>> Dont worry about the analysis chain, I realized you are using the
>> spellcheck component for suggestions. The suggestion gets returned from
>> the
>> Lucene layer, but unfortunately the Spellcheck component strips the
>> suggestion out as it is mainly built for spell checking (when the query
>> token == suggestion; spelling is correct, so why suggest it!). You can try
>> out the SuggestComponent (SOLR-5378), it does the right thing in this
>> situation.
>>
>>
>> On Mon, Feb 10, 2014 at 9:30 PM, Areek Zillur  wrote:
>>
>> > That should not be the case, Maybe the analysis-chain of 'text_spell' is
>> > doing something before the key hits the suggester (you want to use
>> > something like KeywordTokenizerFactory)? Also maybe specify the
>> queryAnalyzerFieldType
>> > in the suggest component config? you might want to do something similar
>> to
>> > solr-config: (
>> >
>> https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/test-files/solr/collection1/conf/solrconfig-phrasesuggest.xml
>> )
>> > [look at suggest_analyzing component] and schema: (
>> >
>> https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/test-files/solr/collection1/conf/schema-phrasesuggest.xml
>> )
>> > [look at phrase_suggest field type].
>> >
>> >
>> > On Mon, Feb 10, 2014 at 8:44 PM, Hamish Campbell <
>> > hamish.campb...@koordinates.com> wrote:
>> >
>> >> Same issue with AnalyzingLookupFactory - I'll get autocomplete
>> suggestions
>> >> but not the original query.
>> >>
>> >>
>> >> On Tue, Feb 11, 2014 at 1:57 PM, Areek Zillur 
>> wrote:
>> >>
>> >> > The FuzzyLookupFactory should accept all the options as that of as
>> >> > AnalyzingLookupFactory (
>> >> >
>> >> >
>> >>
>> http://lucene.apache.org/solr/4_2_1/solr-core/org/apache/solr/spelling/suggest/fst/AnalyzingLookupFactory.html
>> >> > ).
>> >> > [FuzzySuggester is a direct subclass of the AnalyzingSuggester in
>> >> lucene].
>> >> > Have you tried the exactMatchFirst with the AnalyzingLookupFactory?
>> Does
>> >> > AnalyzingLookup have the same problem with the exactMatchFirst
>> option?
>> >> >
>> >> >
>> >> > On Mon, Feb 10, 2014 at 6:00 PM, Hamish Campbell <
>> >> > hamish.campb...@koordinates.com> wrote:
>> >> >
>> >> > > Looking at:
>> >> > >
>> >> > >
>> >> > >
>> >> >
>> >>
>> http://lucene.apache.org/solr/4_2_1/solr-core/org/apache/solr/spelling/suggest/fst/FuzzyLookupFactory.html
>> >> > >
>> >> > > It seems that exactMatchFirst is not a valid option for
>> >> > FuzzyLookupFactory.
>> >> > > Potential workarounds?
>> >> > >
>> >> > >
>> >> > > On Mon, Feb 10, 2014 at 5:04 PM, Hamish Campbell <
>> >> > > hamish.campb...@koordinates.com> wrote:
>> >> > >
>> >> > > > Hi all,
>> >> > > >
>> >> > > > I've got a FuzzyLookupFactory spellchecker with exactMatchFirst
>> >> > enabled.
>> >> > > A
>> >> > > > query like "tes" will return "test" and "testing", but a query
>> for
>> >> > "test"
>> >> > > > will *not* return "test" even though it is clearly in the
>> >> dictionary.
>> >> > Why
>> >> > > > would this be?
>> >> > > >
>> >> > > > Relevant config follows
>> >> > > >
>> >> > > > 
>> >> > > > 
>> >> > > > suggest
>> >> > > >
>> >> > > > 
>> >> > > > > >> > > > name="classname">org.apache.solr.spelling.suggest.Suggester
>> >> > > > > >> > > >
>> >> > >
>> >> >
>> >>
>> name="lookupImpl">org.apache.solr.spelling.suggest.fst.FuzzyLookupFactory
>> >> > > >
>> >> > > > 
>> >> > > > false
>> >> > > > 

Solr performance on a very huge data set

2014-02-11 Thread neerajp
Hello,
I have 1000 GB of data that I want to index.
Assume I have enough space for storing the indexes on a single machine.
*I would like to get an idea of Solr's performance when searching for an item
in such a huge data set.
Do I need to use shards to improve Solr search efficiency, or is it OK
to search without sharding?*

I will use SolrCloud for high availability and fault tolerance with the help
of ZooKeeper.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-performance-on-a-very-huge-data-set-tp4116792.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: USER NAME Baruch Labunski

2014-02-11 Thread Erick Erickson
Baruch:

Is that your Wiki ID? We need that. But sure, we'll be happy to add you to
the list...


On Tue, Feb 11, 2014 at 11:03 AM, Baruch  wrote:

> Hello Wiki admin,
>
>  I would like to add some valuable links. Can you please add me? My user name is
> Baruch Labunski
>
>
> Thank You,
> Baruch!
>
>
>
> On Thursday, January 16, 2014 2:12:32 PM, Baruch 
> wrote:
>
> Hello Wiki admin,
>
>  I would like to add some valuable links. Can you please add me? My user name is
> Baruch Labunski
>
>
> Thank You,
>
> Baruch!
>


Re: Lowering query time

2014-02-11 Thread Erick Erickson
So my guess is you're spending by far the largest portion of your time doing
the DB query(ies), which makes sense.
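
FWIW, the batching pattern being described boils down to something like this
sketch (batch size, field names, and the ResultSet/SolrServer wiring are
placeholders):

import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedIngest {
  // Buffer rows from the big DB query and send them in chunks so network
  // round-trips don't dominate; batchSize is the tunable knob.
  static void index(ResultSet rs, SolrServer server, int batchSize)
      throws Exception {
    List<SolrInputDocument> batch =
        new ArrayList<SolrInputDocument>(batchSize);
    while (rs.next()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", rs.getString("id"));
      doc.addField("name", rs.getString("name"));
      batch.add(doc);
      if (batch.size() >= batchSize) {
        server.add(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      server.add(batch);
    }
    server.commit();  // one commit at the end, not per batch
  }
}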


On Tue, Feb 11, 2014 at 11:50 AM, Joel Cohen  wrote:

> It's a custom ingestion process. It does a big DB query and then inserts
> stuff in batches. The batch size is tuneable.
>
>
> On Tue, Feb 11, 2014 at 11:23 AM, Erick Erickson  >wrote:
>
> > Hmmm, I'm still a little puzzled BTW. 300K documents, unless they're
> > huge, shouldn't be taking 100 minutes. I can index 11M documents on
> > my laptop (Wikipedia dump) in 45 minutes for instance Of course
> > that's a single core, not cloud and not replicas...
> >
> > So possibly it' on the data acquisition side? Is your Solr CPU pegged?
> >
> > YMMV of course.
> >
> > Erick
> >
> >
> > On Tue, Feb 11, 2014 at 6:40 AM, Joel Cohen 
> > wrote:
> >
> > > I'd like to thank you for lending a hand on my query time problem with
> > > SolrCloud. By switching to a single shard with replicas setup, I've
> > reduced
> > > my query time to 18 msec. My full ingestion of 300k+ documents went
> down
> > > from 2 hours 50 minutes to 1 hour 40 minutes. There are some code
> changes
> > > that are going in that should help a bit as well. Big thanks to
> everyone
> > > that had suggestions.
> > >
> > >
> > > On Tue, Feb 4, 2014 at 8:11 PM, Alexandre Rafalovitch <
> > arafa...@gmail.com
> > > >wrote:
> > >
> > > > I suspect faceting is the issue here. The actual query you have shown
> > > > seem to bring back a single document (or a single set of document for
> > > > a product):
> > > > fq=id:(320403401)
> > > >
> > > > On the other hand, you are asking for 4 field facets:
> > > > facet.field=q_virtualCategory_ss
> > > > facet.field=q_brand_s
> > > > facet.field=q_color_s
> > > > facet.field=q_category_ss
> > > > AND 2 range facets, both clustered/grouped:
> > > > facet.range=daysSinceStart_i
> > > > facet.range=activePrice_l (e.g. f.activePrice_l.facet.range.gap=5000)
> > > >
> > > > And for all facets you have asked to bring back ALL of the results:
> > > > facet.limit=-1
> > > >
> > > > Plus, you are doing a complex sort:
> > > > sort=popularity_i desc,popularity_i desc
> > > >
> > > > So, you are probably spending quite a bit of time counting
> (especially
> > > > in a shared setup) and then quite a bit more sending the response
> > > > back.
> > > >
> > > > I would check the size of the result document (HTTP result) and see
> > > > how large it is. Maybe you don't need all of the stuff that's coming
> > > > back. I assume you are not actually querying Solr from the client's
> > > > machine (that is I hope it is inside your data centre close to your
> > > > web server), otherwise I would say to look at automatic content
> > > > compression as well to minimize on-wire document size.
> > > >
> > > > Finally, if your documents have many stored fields (store=true in
> > > > schema.xml) but you only return small subsets of them during search,
> > > > you could look into using enableLazyFieldLoading flag in the
> > > > solrconfig.
> > > >
> > > > Regards,
> > > >Alex.
> > > > P.s. As others said, you don't seem to have too many documents.
> > > > Perhaps you want replication instead of sharding for improved
> > > > performance.
> > > > Personal website: http://www.outerthoughts.com/
> > > > LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> > > > - Time is the quality of nature that keeps events from happening all
> > > > at once. Lately, it doesn't seem to be working.  (Anonymous  - via
> GTD
> > > > book)
> > > >
> > > >
> > > > On Wed, Feb 5, 2014 at 6:31 AM, Alexey Kozhemiakin
> > > >  wrote:
> > > > > Btw "timing" for distributed requests are broken at this moment, it
> > > > doesn't combine values from requests to shards.  I'm working on a
> > patch.
> > > > >
> > > > > https://issues.apache.org/jira/browse/SOLR-3644
> > > > >
> > > > > -Original Message-
> > > > > From: Jack Krupansky [mailto:j...@basetechnology.com]
> > > > > Sent: Tuesday, February 04, 2014 22:00
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Re: Lowering query time
> > > > >
> > > > > Add the debug=true parameter to some test queries and look at the
> > > > "timing"
> > > > > section to see which search components are taking the time.
> > > > Traditionally, highlighting for large documents was a top culprit.
> > > > >
> > > > > Are you returning a lot of data or field values? Sometimes reducing
> > the
> > > > amount of data processed can help. Any multivalued fields with lots
> of
> > > > values?
> > > > >
> > > > > -- Jack Krupansky
> > > > >
> > > > > -Original Message-
> > > > > From: Joel Cohen
> > > > > Sent: Tuesday, February 4, 2014 1:43 PM
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Re: Lowering query time
> > > > >
> > > > > 1. We are faceting. I'm not a developer so I'm not quite sure how
> > we're
> > > > doing it. How can I measure?
> > > > > 2. I'm not sure how we'd force this kind of document partitioning.
> I

Re: Solr Autosuggest - Strange issue with leading numbers in query

2014-02-11 Thread Erick Erickson
Hmmm, the example you post seems correct to me, the returned
suggestion is really close to the term. What are you expecting here?

The example is inconsistent with
"it returns the suggestion corresponding to the alphabets (ignoring the
numbers)"

It looks like it's considering the numbers just fine, which is what makes
the returned suggestion close to the term I think.

Best,
Erick


On Tue, Feb 11, 2014 at 1:01 PM, Developer  wrote:

> I have a strange issue with Autosuggest.
>
> Whenever I query for a keyword along with numbers (leading) it returns the
> suggestion corresponding to the alphabets (ignoring the numbers). I was
> under assumption that it will return an empty result back. I am not sure
> what I am doing wrong. Can someone help?
>
> *Query:*
>
> /autocomplete?qt=/lucid&req_type=auto_complete&spellcheck.maxCollations=10&q="12342343243242ga"&spellcheck.count=10
>
> *Result:*
>
> 
> 
> 0
> 1
> 
> 
> 
> 
> 1
> 15
> 17
> 
> galaxy
> 
> 
> "12342343243242galaxy"
> 
> 
> 
>
>
> *My field configuration is as below:*
>  positionIncrementGap="100">
> 
> 
> 
>  enablePositionIncrements="true"
> ignoreCase="true" words="stopwords_autosuggest.txt"/>
> 
> 
>
> *SolrConfig.xml*
>
>  name="autocomplete">
> 
> autocomplete
>  name="classname">org.apache.solr.spelling.suggest.Suggester
>  name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup
> autocomplete_word
> autocomplete
> true
> .005
>
> 
> 
>  class="org.apache.solr.handler.component.SearchHandler"
> name="/autocomplete">
> 
> true
>  name="spellcheck.dictionary">autocomplete
> true
> 10
> false
> 
> 
> autocomplete
> 
> 
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-Autosuggest-Strange-issue-with-leading-numbers-in-query-tp4116751.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Indexing question on individual field update

2014-02-11 Thread Erick Erickson
Update and add are basically the same thing if there's an existing document.
There will be some performance consequence since you're getting the stored
fields on the server as opposed to getting the full input from the external
source
and handing it to Solr. However, I know of at least one situation where the
atomic update rate is sky-high and it works, so I wouldn't worry about it
unless and
until I saw a problem.

Best,
Erick


On Tue, Feb 11, 2014 at 3:03 PM, Shawn Heisey  wrote:

> On 2/11/2014 2:37 PM, shamik wrote:
>
>> Eric,
>>
>>    Thanks for your reply. I should have given better context. I'm
>> currently
>> running an incremental crawl daily on this particular source and indexing
>> the documents. The incremental crawl looks for any change since the last
>> crawl date, based on the document publish date. But there's no way for me
>> to know if a document has been deleted. To catch that, I run a full crawl
>> on a weekend,
>> which basically re-indexes the entire content. After the full index is
>> over, I
>> call a purge script, which deletes any content that is more than 24 hours
>> old, based on the indextimestamp field.
>>
>> The issue with atomic update is that it doesn't alter the indextimestamp
>> field. So even if I run a full crawl with atomic updates, the timestamp
>> will
>> stick to its old value. Unfortunately, I can't rely on another date field
>> coming from the source, as they are not consistent. That means I
>> can't remove stale content.
>>
>
> One possibility is this: When you send the atomic update to Solr, include
> a new value for the indextimestamp field.
>
> Another option: You can write a custom update processor plugin for Solr.
>  When the custom code is used, it will be executed on each incoming
> document.  Depending on what it finds in the update request, it can make
> appropriate changes, like updating indextimestamp.  You can do pretty much
> anything.
>
> http://wiki.apache.org/solr/UpdateRequestProcessor
>
> Writing an update processor in Java typically gives the best results in
> terms of flexibility and performance, but there is also a way to use other
> programming languages:
>
> http://wiki.apache.org/solr/ScriptUpdateProcessor
>
> Thanks,
> Shawn
>
>


Re: Solr performance on a very huge data set

2014-02-11 Thread Erick Erickson
Can't answer that, there are just too many variables. Here's a helpful
resource:
http://searchhub.org/dev/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick


On Tue, Feb 11, 2014 at 5:23 PM, neerajp  wrote:

> Hello,
> I have 1000 GB of data that I want to index.
> Assume I have enough space for storing the indexes on a single machine.
> *I would like to get an idea of Solr's performance when searching for an
> item in such a huge data set.
> Do I need to use shards to improve Solr search efficiency, or is it OK
> to search without sharding?*
>
> I will use SolrCloud for high availability and fault tolerance with the
> help
> of ZooKeeper.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-performance-on-a-very-huge-data-set-tp4116792.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Need feedback: Browsing and searching solr-user list emails

2014-02-11 Thread Alexandre Rafalovitch
Hi Durgam,

You are asking a hard question. Yes, the idea looks interesting as an
experiment. Possibly even useful in some ways. And I love the fact
that you are eating your own dogfood (running Solr). And the interface
looks nice (I guess this is your hosted Nimeyo offering underneath).

Yet, I am having trouble seeing it stick around long term. Here are my reasons:
*) This offering feels like an inverse of StackExchange. SE is a
primary source of data and they actually get most of the search
traffic from Google. This proposal has the data coming from somewhere
else and is trying to add a search on top of it.
*) Furthermore, the SE voting/participation is heavily gamified and
they spend a lot of time and manpower keeping the balance of that
gamification vs. abuse. I think it is a lot harder to provide
incentives to vote in your approach
*) There are other dogfood-eating search websites.
http://search-lucene.com/ is one of them.
*) There are also other mailing-list navigational websites with
gateway ability to post message in. They suck, both in interface and
in monetisation around the interface. In fact, they feel like the SPAM
farms similar to those republishing Wikipedia. I am not saying this is
relevant to your effort directly, but it is an issue related to
discovery of good search website in the sea of bad ones. search-lucene
for example is discoverable because it is one of the search engines on
the Apache website. Even then, it took me (at least) very long time to
discover it.
*) In general, discoverability is a b*tch (try to multiterm this,
Solr! :-) as you need a very significant traction for people to use
your site before it becomes useful to more people. A bit of a
catch-22. Again, SE did it by having a large audience on StackOverflow
and then branching off into topics that people on SO were also
interested in. And even that was an issue (see area51 for how they do
it). You have people (who read mailing list), but are they the people
who need to search the archives? I think the mailing list is a more of
a 'flow' interface to most of the people.
*) You have Google Analytics - did you get much traction yet? I
suspect not, judging from the lack of replies on the mailing list.

I would step back and evaluate:
*) Who specifically is a target audience? I, for example, do star some
posts on the mailing list because they are just so good that I will
want to refer to them later. But, even then, I would have no incentive
right now to do it in public. Nor would I do 3-4 steps necessary to go
from email I like to some alternative interface to find the same email
again just to vote for it. And how do I find my voted emails later?
Requiring an account (to track) is even harder to swallow.
*) Again, who specifically is a target audience? Is it beginners?
Intermediates? Advanced? What are the pain points of those different
groups that you are trying to solve?
*) What can you offer to the first user before the voting actually
works (bootstrap phase). Pure search? Others do that already.
*) How would people find your service (SEO, etc).
*) Why are you doing it. It may not be a lot of effort to set it up,
but to actually grow any crowd-source resource is a significant task.
What does this build towards that will make it sustainable for you.
And, I really hope it is not page ads.
*) From Nimeyo's home page, you are targeting enterprises; are you
sure the offering maps to the public resource with dynamic transient
audience the same way?

Now, if you do want to help Solr community, that would be great. I am
trying to do that in my own way and really welcome anybody trying to
assist beyond their own needs. Grow the community, and so on.

Here is an example of how I thought of the above issues myself:
*) I just released the full list of UpdateRequestProcessor Factories (
http://www.solr-start.com/update-request-processor/4.6.1/ ).
*) This is information that anybody can discover for themselves, but
it takes a lot searching and clicking and getting lost. I have
discovered that problem on my own when writing my Solr book and it was
stuck with me as a problem to be solved. So, I solved it (in a very
basic way for this version) and I have more similar things on the way.
*) My target audience, just as with my book, are people trying to
skill up from the beginners to the intermediates. My goal is to reduce
the barrier of entry to the more advanced Solr knowledge.
*) My SEO (we'll see if it works) is to provide information that does
not exist anywhere else in one place and to be discoverable when
people search for the particular names of URP.
*) I also have an incentive to keep it going (version 4.7, 4.8, other
resources) because I want people to be on my mailing list for when I
do the next REALLY exciting Solr project (Github-based interactive
Solr training would be a strong hint). So, these resources are my
bootstrapping strategy as well.

Now, there is plenty of other things that can be done to assist Solr
community. Some of them would ali

Re: Join Scoring

2014-02-11 Thread David Smiley (@MITRE.org)
Hi Anand.

Solr's JOIN query, {!join}, constant-scores.  It's simpler and faster and
more memory efficient (particularly the worst-case memory use) to implement
the JOIN query without scoring, so that's why.  Of course, you might want it
to score and pay whatever penalty is involved.  For that you'll need to
write a Solr "QueryParser" that might use Lucene's "join" module which has
scoring variants.  I've taken this approach before.  You asked a specific
question about the purpose of JoinScorer when it doesn't actually score. 
Lucene's "Query" produces a "Weight" which in turn produces a "Scorer" that
is a DocIdSetIterator plus it returns a score.  So Queries have to have a
Scorer to match any document even if the score is always 1.
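
To make that concrete, here is a rough, untested sketch of such a parser on
4.x (the class name, local-param handling, and ScoreMode choice are mine, not
anything that ships with Solr):

import java.io.IOException;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.join.JoinUtil;
import org.apache.lucene.search.join.ScoreMode;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

public class ScoringJoinQParserPlugin extends QParserPlugin {
  @Override
  public void init(NamedList args) {}

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
      SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      @Override
      public Query parse() throws SyntaxError {
        String fromField = localParams.get("from");
        String toField = localParams.get("to");
        // Parse the wrapped query text with the default parser.
        Query fromQuery = subQuery(qstr, null).getQuery();
        try {
          // JoinUtil is the scoring variant in Lucene's join module;
          // ScoreMode.Max carries the best from-side score to the to-side.
          return JoinUtil.createJoinQuery(fromField,
              true /* multipleValuesPerDocument */, toField, fromQuery,
              req.getSearcher(), ScoreMode.Max);
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }
    };
  }
}

Register it as a <queryParser> in solrconfig.xml and invoke it like the stock
join parser, e.g. {!scorejoin from=manu_id to=id}text:foo. Note this sketch
joins within a single core, like {!join} does.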

Solr does indeed have a lot of caching; that may be in play here when
comparing against a quick attempt at using Lucene directly.  In particular,
the matching documents are likely to end up in Solr's DocumentCache. 
Returning stored fields that come back in search results is one of the more
expensive things Lucene/Solr does.

I also think you noted that the fields on documents from the "from" side of
the query are not available to be returned in search results, just the "to"
side.  Yup; that's true.  To remedy this, you might write a Solr
SearchComponent that adds fields from the "from" side.  That could be tricky
to do; it would probably need to re-run the from-side query but filtered to
the matching top-N documents being returned.

~ David


anand chandak wrote
> Resending, if somebody can please respond.
> 
> 
> Thanks,
> 
> Anand
> 
> 
> On 2/5/2014 6:26 PM, anand chandak wrote:
> Hi,
> 
> Having a question on join score, why doesn't the solr join query return 
> the scores. Looking at the code, I see there's JoinScorer defined in 
> the  JoinQParserPlugin class ? If its not used for scoring ? where is it 
> actually used.
> 
> Also, to evaluate the performance of solr join plugin vs lucene 
> joinutil, I filed same join query against same data-set and same schema 
> and in the results, I am always seeing the Qtime for Solr much lower 
than Lucene's. What is the reason behind this? Solr doesn't return 
> scores could that cause so much difference ?
> 
> My guess is solr has very sophisticated caching mechanism and that might 
> be coming in play, is that true ? or there's difference in the way JOIN 
> happens in the 2 approach.
> 
> If I understand correctly both the implementation are using 2 pass 
> approach - first all the terms from fromField and then returns all 
> documents that have matching terms in a toField
> 
> If somebody can throw some light, would highly appreciate.
> 
> Thanks,
> Anand





-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Join-Scoring-tp4115539p4116818.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Join Scoring

2014-02-11 Thread anand chandak

Thanks David, really helpful response.


You mentioned that if we have to add scoring support in Solr, then a 
possible approach would be to add a custom QueryParser, which might use 
Lucene's JOIN module.



Curious: is it possible instead to enhance Solr's existing 
JoinQParserPlugin and add the scoring support in the same class? Do 
you think it's feasible and recommended? If yes, what would it take in 
terms of code changes - any pointers?


Thanks,

Anand


On 2/12/2014 10:31 AM, David Smiley (@MITRE.org) wrote:

Hi Anand.

Solr's JOIN query, {!join}, constant-scores.  It's simpler and faster and
more memory efficient (particularly the worst-case memory use) to implement
the JOIN query without scoring, so that's why.  Of course, you might want it
to score and pay whatever penalty is involved.  For that you'll need to
write a Solr "QueryParser" that might use Lucene's "join" module which has
scoring variants.  I've taken this approach before.  You asked a specific
question about the purpose of JoinScorer when it doesn't actually score.
Lucene's "Query" produces a "Weight" which in turn produces a "Scorer" that
is a DocIdSetIterator plus it returns a score.  So Queries have to have a
Scorer to match any document even if the score is always 1.

Solr does indeed have a lot of caching; that may be in play here when
comparing against a quick attempt at using Lucene directly.  In particular,
the matching documents are likely to end up in Solr's DocumentCache.
Returning stored fields that come back in search results is one of the more
expensive things Lucene/Solr does.

I also think you noted that the fields on documents from the "from" side of
the query are not available to be returned in search results, just the "to"
side.  Yup; that's true.  To remedy this, you might write a Solr
SearchComponent that adds fields from the "from" side.  That could be tricky
to do; it would probably need to re-run the from-side query but filtered to
the matching top-N documents being returned.

~ David


anand chandak wrote

Resending, if somebody can please respond.


Thanks,

Anand


On 2/5/2014 6:26 PM, anand chandak wrote:
Hi,

Having a question on join score, why doesn't the solr join query return
the scores. Looking at the code, I see there's JoinScorer defined in
the  JoinQParserPlugin class ? If its not used for scoring ? where is it
actually used.

Also, to evaluate the performance of solr join plugin vs lucene
joinutil, I filed same join query against same data-set and same schema
and in the results, I am always seeing the Qtime for Solr much lower
than Lucene's. What is the reason behind this? Solr doesn't return
scores could that cause so much difference ?

My guess is solr has very sophisticated caching mechanism and that might
be coming in play, is that true ? or there's difference in the way JOIN
happens in the 2 approach.

If I understand correctly both the implementation are using 2 pass
approach - first all the terms from fromField and then returns all
documents that have matching terms in a toField

If somebody can throw some light, would highly appreciate.

Thanks,
Anand




-
  Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Join-Scoring-tp4115539p4116818.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Spatial Score by overlap area

2014-02-11 Thread Smiley, David W.
Hi,
BBoxStrategy is still only in “trunk” (not the 4x branch).  And
furthermore… the Solr portion, a FieldType, is over in
Spatial-Solr-Sandbox —
https://github.com/ryantxu/spatial-solr-sandbox/blob/master/LSE/src/main/ja
va/org/apache/solr/spatial/pending/BBoxFieldType.java  It should be quite
easy to port to 4x and put independently into a JAR file plug-in to Solr 4.

It’s lacking better tests, and until your question I haven’t seen interest
from users.  Ryan McKinley ported it from GeoServer.

~ David

On 2/10/14, 12:53 AM, "geoport"  wrote:

>Hi,
>I am using Solr 4.6 and I've indexed bounding boxes. Now I want to test
>the
>"area overlap sorting" approach from the spatial deep-dive presentation
>(slide 23). Does anyone have an example for me? Thanks for helping me.
>
>
>
>
>
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/Spatial-Score-by-overlap-area-tp4116439
>.html
>Sent from the Solr - User mailing list archive at Nabble.com.



Unable to index mysql table

2014-02-11 Thread Tarun Sharma
Hi,
I downloaded Solr and, without any changes to the directory structure, just
followed the Solr wiki and tried to import a MySQL table, but was unable to...
Actually, I'm using the directory as-is in the example folder, but copied the
contrib jar files here and there where required.

Please help in indexing my mysql table...

NOTE: I'm using a remote Linux server over ssh and am able to start the
Solr server.

---
Regards
*Tarun Sharma*


Re: Unable to index mysql table

2014-02-11 Thread Alexandre Rafalovitch
What's "unable to do" actually translates to? Are you having troubles
writing a particular config file? Are you getting an error message?
Are you getting only some of the data in?

Tell us exactly where you are stuck. Better, google first for exactly
what you are stuck with, maybe it's already been answered.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Wed, Feb 12, 2014 at 12:52 PM, Tarun Sharma
 wrote:
> Hi,
> I downloaded Solr and, without any changes to the directory structure, just
> followed the Solr wiki and tried to import a MySQL table, but was unable
> to...
> Actually, I'm using the directory as-is in the example folder, but copied
> the contrib jar files here and there where required.
>
> Please help in indexing my mysql table...
>
> NOTE: I'm using a remote Linux server over ssh and am able to start the
> Solr server.
>
> ---
> Regards
> *Tarun Sharma*


Re: Indexing question on individual field update

2014-02-11 Thread shamik
Thanks Eric and Shawn, appreciate your help.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-question-on-individual-field-update-tp4116605p4116831.html
Sent from the Solr - User mailing list archive at Nabble.com.