RE: Question on multi-threaded faceting

2014-08-02 Thread Toke Eskildsen
Vamsee Yarlagadda [vam...@cloudera.com] Wrote:
> I filed https://issues.apache.org/jira/browse/SOLR-6314 to track this issue
> going forward.
> Any ideas around this problem?

Apparently the distributed faceting code collapses the duplicate fields, while 
single-shard faceting does not. I guess your test case boils down to asking for 
the field f1_ws twice. That is legitimate in Solr, although your current request 
makes little sense logically, as it is just direct duplication. Maybe you could 
request the same field with different sort orders and check whether the results 
are the same for the distributed and single-shard cases?

- Toke Eskildsen


Re: SolrCloud Scale Struggle

2014-08-02 Thread Bill Bell
Seems way overkill. Are you using /get at all ? If you need the docs avail 
right away - why ? How about after 30 seconds ? How many docs do you get added 
per second during peak ? Even Google has a delay when you do Adwords. 

One idea is to have an empty core that you insert into and then shard into the 
queries. So one core would be called newdocs and then you would add this core 
into your query. There are a couple of issues with this around scoring but it works 
nicely. I would not even use SolrCloud for that core.
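
A rough sketch of the idea (host, port, and core names here are illustrative,
not from the original setup): the small write-heavy core is pulled into a
search over the main core via the shards parameter.

  http://host:8983/solr/maindocs/select?q=*:*
      &shards=host:8983/solr/maindocs,host:8983/solr/newdocs

Scores are computed per shard, which is the scoring caveat mentioned above.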

Try to reduce the number of Java instances running. Reduce memory and use one JVM 
per machine. 

Then if you need faster availability of docs you really need to ask why. Why not 
later? Do you need search, or just to show the user the info? If just for showing, 
maybe query an indexed table for the few docs not yet indexed? Or just store them in 
a db to show the user the info and index later?

Bill Bell
Sent from mobile


> On Aug 1, 2014, at 4:19 AM, "anand.mahajan"  wrote:
> 
> Hello all,
> 
> Struggling to get this going with SolrCloud - 
> 
> Requirement in brief :
> - Ingest about 4M Used Cars listings a day and track all unique cars for
> changes
> - 4M automated searches a day (during the ingestion phase to check if a doc
> exists in the index (based on values of 4-5 key fields) or it is a new one
> or an updated version)
> - Of the 4 M - About 3M Updates to existing docs (for every non-key value
> change)
> - About 1M inserts a day (I'm assuming these many new listings come in
> every day)
> - Daily Bulk CSV exports of inserts / updates in last 24 hours of various
> snapshots of the data to various clients
> 
> My current deployment : 
> i) I'm using Solr 4.8 and have set up a SolrCloud with 6 dedicated machines
> - 24 Core + 96 GB RAM each.
> ii)There are over 190M docs in the SolrCloud at the moment (for all
> replicas its consuming overall disk 2340GB which implies - each doc is at
> about 5-8kb in size.)
> iii) The docs are split into 36 Shards - and 3 replica per shard (in all
> 108 Solr Jetty processes split over 6 Servers leaving about 18 Jetty JVMs
> running on each host)
> iv) There are 60 fields per doc and all fields are stored at the moment  :( 
> (The backend is only Solr at the moment)
> v) The current shard/routing key is a combination of Car Year, Make and
> some other car level attributes that help classify the cars
> vi) We are mostly using the default Solr config as of now - no heavy caching
> as the search is pretty random in nature 
> vii) Autocommit is on - with maxDocs = 1
> 
> Current throughput & Issues :
> With the above mentioned deployment the daily throughput is only at about
> 1.5M on average (Inserts + Updates) - falling way short of what is required.
> Search is slow - Some queries take about 15 seconds to return - and since
> insert is dependent on at least one Search that degrades the write
> throughput too. (This is not a Solr issue - but the app demands it so)
> 
> Questions :
> 
> 1. Autocommit with maxDocs = 1 - is that a goof-up and could that be slowing
> down indexing? It's a requirement that all docs are available as soon as
> indexed.
> 
> 2. Should I have been better served had I deployed a Single Jetty Solr
> instance per server with multiple cores running inside? The servers do start
> to swap out after a couple of days of Solr uptime - right now we reboot the
> entire cluster every 4 days.
> 
> 3. The routing key is not able to effectively balance the docs on available
> shards - There are a few shards with just about 2M docs - and others over
> 11M docs. Shall I split the larger shards? But I do not have more nodes /
> hardware to allocate to this deployment. In such case would splitting up the
> large shards give better read-write throughput? 
> 
> 4. To remain with the current hardware - would it help if I remove 1 replica
> each from a shard? But that would mean even when just 1 node goes down for a
> shard there would be only 1 live node left that would not serve the write
> requests.
> 
> 5. Also, is there a way to control where the Split Shard replicas would go?
> Is there a pattern / rule that Solr follows when it creates replicas for
> split shards?
> 
> 6. I read somewhere that creating a Core would cost the OS one thread and a
> file handle. Since a core represents an index in its entirety, would it not be
> allocated the configured number of write threads? (The default being 8)
> 
> 7. The Zookeeper cluster is deployed on the same boxes as the Solr instance
> - Would separating the ZK cluster out help?
> 
> Sorry for the long thread - I thought of asking these all at once rather
> than posting separate ones.
> 
> Thanks,
> Anand
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SolrCloud-Scale-Struggle-tp4150592.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud Scale Struggle

2014-08-02 Thread Bill Bell
Auto correct not good

Corrected below 

Bill Bell
Sent from mobile


> On Aug 2, 2014, at 11:11 AM, Bill Bell  wrote:
> 
> Seems way overkill. Are you using /get at all ? If you need the docs avail 
> right away - why ? How about after 30 seconds ? How many docs do you get 
> added per second during peak ? Even Google has a delay when you do Adwords. 
> 
> One idea is to have an empty core that you insert into and then shard into 
> the queries. So one core would be called newdocs and then you would add this 
> core into your query. There are a couple issues with this with scoring but it 
> works nicely. I would not even use Solrcloud for that core.
> 
> Try to reduce number of Java instances running. Reduce memory and use one 
> java per machine. 
> 
> Then if you need faster avail of docs you really need to ask why. Why not 
> later? Do you need search or just to show the user the info? If for showing 
> maybe query an indexed table for the few not yet indexed? Or just store in a 
> db to show the user the info and index later?
> 
> Bill Bell
> Sent from mobile
> 
> 
>> On Aug 1, 2014, at 4:19 AM, "anand.mahajan"  wrote:
>> 
>> Hello all,
>> 
>> Struggling to get this going with SolrCloud - 
>> 
>> Requirement in brief :
>> - Ingest about 4M Used Cars listings a day and track all unique cars for
>> changes
>> - 4M automated searches a day (during the ingestion phase to check if a doc
>> exists in the index (based on values of 4-5 key fields) or it is a new one
>> or an updated version)
>> - Of the 4 M - About 3M Updates to existing docs (for every non-key value
>> change)
>> - About 1M inserts a day (I'm assuming these many new listings come in
>> every day)
>> - Daily Bulk CSV exports of inserts / updates in last 24 hours of various
>> snapshots of the data to various clients
>> 
>> My current deployment : 
>> i) I'm using Solr 4.8 and have set up a SolrCloud with 6 dedicated machines
>> - 24 Core + 96 GB RAM each.
>> ii)There are over 190M docs in the SolrCloud at the moment (for all
>> replicas its consuming overall disk 2340GB which implies - each doc is at
>> about 5-8kb in size.)
>> iii) The docs are split into 36 Shards - and 3 replica per shard (in all
>> 108 Solr Jetty processes split over 6 Servers leaving about 18 Jetty JVMs
>> running on each host)
>> iv) There are 60 fields per doc and all fields are stored at the moment  :( 
>> (The backend is only Solr at the moment)
>> v) The current shard/routing key is a combination of Car Year, Make and
>> some other car level attributes that help classify the cars
>> vi) We are mostly using the default Solr config as of now - no heavy caching
>> as the search is pretty random in nature 
>> vii) Autocommit is on - with maxDocs = 1
>> 
>> Current throughput & Issues :
>> With the above mentioned deployment the daily throughput is only at about
>> 1.5M on average (Inserts + Updates) - falling way short of what is required.
>> Search is slow - Some queries take about 15 seconds to return - and since
>> insert is dependent on at least one Search that degrades the write
>> throughput too. (This is not a Solr issue - but the app demands it so)
>> 
>> Questions :
>> 
>> 1. Autocommit with maxDocs = 1 - is that a goof-up and could that be slowing
>> down indexing? It's a requirement that all docs are available as soon as
>> indexed.
>> 
>> 2. Should I have been better served had I deployed a Single Jetty Solr
>> instance per server with multiple cores running inside? The servers do start
>> to swap out after a couple of days of Solr uptime - right now we reboot the
>> entire cluster every 4 days.
>> 
>> 3. The routing key is not able to effectively balance the docs on available
>> shards - There are a few shards with just about 2M docs - and others over
>> 11M docs. Shall I split the larger shards? But I do not have more nodes /
>> hardware to allocate to this deployment. In such case would splitting up the
>> large shards give better read-write throughput? 
>> 
>> 4. To remain with the current hardware - would it help if I remove 1 replica
>> each from a shard? But that would mean even when just 1 node goes down for a
>> shard there would be only 1 live node left that would not serve the write
>> requests.
>> 
>> 5. Also, is there a way to control where the Split Shard replicas would go?
>> Is there a pattern / rule that Solr follows when it creates replicas for
>> split shards?
>> 
>> 6. I read somewhere that creating a Core would cost the OS one thread and a
>> file handle. Since a core represents an index in its entirety, would it not be
>> allocated the configured number of write threads? (The default being 8)
>> 
>> 7. The Zookeeper cluster is deployed on the same boxes as the Solr instance
>> - Would separating the ZK cluster out help?
>> 
>> Sorry for the long thread - I thought of asking these all at once rather
>> than posting separate ones.
>> 
>> Thanks,
>> Anand
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.

Re: SolrCloud Scale Struggle

2014-08-02 Thread anand.mahajan
Thank you everyone for your responses. Increased the hard commit to 10 mins
and autoSoftCommit to 10 secs. (I won't really need a real-time get - tweaked
the app code to cache the doc and use the app-side cached version instead of
fetching it from Solr.) Will watch it for a day or two and clock the
throughput.
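
For reference, a minimal solrconfig.xml sketch of that commit policy (the
values below are just the 10 min / 10 sec figures mentioned above;
openSearcher=false keeps hard commits from opening new searchers):

  <autoCommit>
    <maxTime>600000</maxTime>            <!-- hard commit every 10 minutes -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>10000</maxTime>             <!-- soft commit every 10 seconds -->
  </autoSoftCommit>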

For this deployment the peak is throughout the day as more data keeps
streaming in - there are no direct users with search queries here (as of
now) - but every incoming doc is compared against the existing set of docs
in Solr to check whether it's a new one or an updated version of an
existing one, and only then is the doc inserted/updated. Right now it's adding
about 1100 docs a minute (~20 docs a second), but that's because it has to
run a search first to determine whether it's an insert or an update.

Also, since there are already 18 JVMs per machine - how do I go about
merging these existing cores under just 1 JVM? Would I need to
create 1 Solr instance with 18 cores inside and then migrate data from these
separate JVMs into the new instance?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-Scale-Struggle-tp4150592p4150810.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud Scale Struggle

2014-08-02 Thread anand.mahajan
Thanks Shawn. I'm using 2-level composite id routing right now. These are all
used-car listings and all search queries always have car year and make in
the search criteria - hence it made sense to have Year+Make as level 1 of
the composite id. Beyond that, the second level of the composite id is based on
about 8 car attributes, which means all listings for a similar type of car,
and all listings of any one car, are grouped together and co-located in the
SolrCloud. Even with this there is still an imbalance in the cluster - as
certain car makes are popular and there are more listings for such cars that
go to the same shard. Will splitting these up with the existing set of hardware
help at all?
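
For readers following along, a two-level composite id of the kind described
above looks roughly like this (the concrete values are made up):

  2012_TOYOTA!CAMRY_LE_4CYL!listing-000123

The part before the first '!' (Year+Make) picks the coarse bucket, the second
part groups similar cars within it, and the CompositeIdRouter hashes both
prefixes to pick the target shard.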



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-Scale-Struggle-tp4150592p4150811.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Query on Facet

2014-08-02 Thread Umesh Prasad
You can use pivot faceting.

https://wiki.apache.org/solr/SimpleFacetParameters#Pivot_.28ie_Decision_Tree.29_Faceting

There is no index-time work required and you can nest the facets at search
time as per your need.

PS: It won't work with SolrCloud / a sharded index. SOLR-2894 is in
progress if you need distributed pivot faceting.

Example:

&facet=true&facet.pivot=binding_string,language_string

"binding_string,language_string": [
  {
    "field": "binding_string",
    "value": "Paperback",
    "count": 882,
    "pivot": []
  },
  {
    "field": "binding_string",
    "value": "Hardcover",
    "count": 169,
    "pivot": []
  },
  {
    "field": "binding_string",
    "value": "P",
    "count": 454,
    "pivot": [
      { "field": "language_string", "value": "English", "count": 198 },
      { "field": "language_string", "value": "Spanish", "count": 44 },
      { "field": "language_string", "value": "German", "count": 27 }
    ]
  }
]



On 31 July 2014 11:25, Alexandre Rafalovitch  wrote:

> Now it sounds like maybe you have nested facets as opposed to just
> different ones. See if one of these fits your use case better:
> http://wiki.apache.org/solr/HierarchicalFaceting
>
> Regards,
>Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
> On Thu, Jul 31, 2014 at 12:22 PM, Smitha Rajiv 
> wrote:
> > Hi All,
> >
> > We have tried both the exclude option as well as the facet query. Both
> > approaches are not giving us the desired results.
> >
> > I will explain a little further. I have first level facets - Paperback
> and
> > Ebook, and second level facets include a list of languages like English,
> > French etc..
> >
> > When user selects 'Paperback', then currently i am getting all the
> > languages of paperback. But "Ebook" is not getting in result. But i can
> > resolve this by using exclude for the first level facet.
> >
> >  Now I am facing issue for the second level facet. When the user selects
> > 'Paperback' and 'English', then the query returns only 'Paperback' in
> first
> > level and 'English'  in second level. But I need other languages (which
> > satisfies paperback) also, so that I can show them in second level.
> >
> > We tried with facet.query as well as facet.field options. Please find the
> > query below.
> >
> >
> >
> http://localhost:8080/solr/collection1/select?q=software%20testing&fq=language%3A(%22English%22)&fq=Binding%3A(%22paperback%22)&facet=true&facet.mincount=1
> > &facet.field=Language&facet.field=latestArrivals&facet.f
> > ield=Binding&wt=json&indent=true&defType=edismax&json.nl=
> > map&facet.field=language&facet.field=binding.
> >
> >
> >
> http://localhost:8080/solr/collection1/select?q=software%20testing&fq=language%3A(%22English%22)&fq=Binding%3A(%22paperback%22)&facet=true&facet.mincount=1
> > &facet.field=Language&facet.field=latestArrivals&facet.f
> > ield=Binding&wt=json&indent=true&defType=edismax&json.nl=
> > map&facet.query=language&facet.query=binding.
> >
> >
> > Please provide your thoughts.
> >
> >
> > Thanks & Regards,
> >
> > Smitha
> >
> >
> >
> >
> >
> > On Wed, Jul 30, 2014 at 8:18 PM, Sujit Pal 
> wrote:
> >
> >> Hi Smitha,
> >>
> >> Have you looked at Facet queries? It allows you to attach Solr queries
> to
> >> facets. The problem with this is that you will need to know all possible
> >> combinations of language and binding (or make an initial query to find
> this
> >> information).
> >>
> >>
> >>
> https://wiki.apache.org/solr/SimpleFacetParameters#facet.query_:_Arbitrary_Query_Faceting
> >>
> >> Another alternative could be to bake in language+binding pairs into a
> field
> >> in your index and facet on that.
> >>
> >> -sujit
> >>
> >>
> >>
> >> On Wed, Jul 30, 2014 at 7:01 AM, vamshi kiran <
> mothevamshiki...@gmail.com>
> >> wrote:
> >>
> >> > Hi Alex,
> >> >
> >> > As you said If we exclude language facet field ,it will get all the
> >> > language facets with count right ?
> >> > It Will not filter by binding facet field of type 'paperback'  , how
> can
> >> we
> >> > do this ?
> >> >
> >> > Thanks & Regards,
> >> > Vamshi.
> >> > On Jul 30, 2014 4:11 PM, "Alexandre Rafalovitch" 
> >> > wrote:
> >> >
> >> > > I am not sure I fully understood your question, but I would start by
> >> > > looking at Tagging and Excluding first:
> >> > >
> >> > >
> >> >
> >>
> https://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters
> >> > >
> >> > > Regards,
> >> > >Alex.
> >> > > Personal: http://www.outerthoughts.com/ and @arafalov
> >> > > Solr resources and newsletter: http://www.solr-start.com/ and
> >> @solrstart
> >> > > Solr popularizers community:
> >> https://www.linkedin.com/groups?gid=6713853
> >> > >
> >>

Re: Solr gives the same fieldnorm for two different-size fields

2014-08-02 Thread Umesh Prasad
What you really need is a covering-type match. I feel your use case fits
this pattern:

Score (Exact match in order) >   Score ( Exact match without order ) >
Score (Non Exact Match)

Example  Query : a b c

Example docs :
  d1 :  a b c
  d2 :  a c b
  d3 :  c a b
  d4 : a b c d
  d5 : a b c d e

Use case 1 : Only exact match is a match. (So only d1 is a match)
Use case 2 : Only in order are matches. So d2, d3 aren't matches. Scores
are d1 > d4 > d5
Use case 3 : Only in order are matches. And only one extra term is allowed.
So d2, d3, d5  aren't matches. Scores are d1 > d4
Use case 4 : All are matches and d1 > d2 > d3 > d4 > d5

All of these use cases can be satisfied by using SpanQueries, which track
the positions at which terms match. For a covering match, you will need to
add start and end sentinel terms during indexing.

There is an excellent post by Mark Miller about span queries
http://searchhub.org/2009/07/18/the-spanquery/
 Solr's SurroundQuery Parser allows you to create SpanQueries
http://wiki.apache.org/solr/SurroundQueryParser
Or you can plug your own query parser into solr to do the same.
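
As a minimal illustration (field name, sentinel tokens and slop values are
placeholders), the in-order cases above map to Lucene SpanNearQuery along
these lines:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.spans.SpanNearQuery;
  import org.apache.lucene.search.spans.SpanQuery;
  import org.apache.lucene.search.spans.SpanTermQuery;

  // Use case 2: "a b c" must appear in this order with nothing in between,
  // so d1, d4 and d5 match but d2 and d3 do not.
  SpanQuery[] inOrder = new SpanQuery[] {
      new SpanTermQuery(new Term("title", "a")),
      new SpanTermQuery(new Term("title", "b")),
      new SpanTermQuery(new Term("title", "c"))
  };
  SpanNearQuery orderedMatch = new SpanNearQuery(inOrder, 0, true);

  // Use case 1 (covering/exact match): add sentinel terms at index time and
  // require the span to start and end on them, so only d1 matches.
  SpanQuery[] covering = new SpanQuery[] {
      new SpanTermQuery(new Term("title", "_START_")),
      new SpanTermQuery(new Term("title", "a")),
      new SpanTermQuery(new Term("title", "b")),
      new SpanTermQuery(new Term("title", "c")),
      new SpanTermQuery(new Term("title", "_END_"))
  };
  SpanNearQuery coveringMatch = new SpanNearQuery(covering, 0, true);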

Some more links you can get here ..
http://search-lucene.com/?q=span+queries&fc_project=Lucene&fc_project=Solr



On 1 August 2014 00:24, Erick Erickson  wrote:

> You can consider, say, a copyField directive and copy the field into a
> string type (or perhaps KeywordTokenizer followed by LowerCaseFilter) and
> then match or boost on an exact match rather than trying to make scoring
> fill this role.
>
> In any case, I'm thinking of normalizing the sensitive fields and indexing
> them as a single token (i.e. the string type or keywordtokenizer) to
> disambiguate these cases.
>
> Because otherwise I fear you'll get one situation to work, then fail on the
> next case. In your example, you're trying to use length normalization to
> influence scoring to get the doc with the shorter field to sort above the
> doc with the longer field. But what are you going to do when your target is
> "university of california berkley research"? Rely on matching all the
> terms? And so on...
>
> Best,
> Erick
>
>
> On Thu, Jul 31, 2014 at 10:26 AM, gorjida  wrote:
>
> > Thanks so much for your reply... In my case, it really matters because I
> am
> > going to find the correct institution match for an affiliation string...
> > For
> > example, if an author belongs to the "university of Toronto", his/her
> > affiliation should be normalized against the solr... In this case,
> > "University of California Berkley Research" is a different place to
> > "university of california berkeley"... I see top-matches are tied in the
> > score for this specific example... I can break the tie using other
> > techniques... However, I am keen to see if this is a common problem in
> > solr?
> >
> > Regards,
> >
> > Ali
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Solr-gives-the-same-fieldnorm-for-two-different-size-fields-tp4150418p4150430.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>



-- 
---
Thanks & Regards
Umesh Prasad


Re: Searching words with spaces for word without spaces in solr

2014-08-02 Thread Umesh Prasad
I would suggest breaking the problem into smaller parts:
1. Identify variations (say, compound words) offline, where you can combine
multiple sources to ensure much better quality.
2. Expand the user query at search time using your sources, so the query
will become
icecream OR (ice cream)   (with q.op=AND)
   Parse the query using the Lucene query parser. If you are using
dismax/edismax then I would suggest plugging in a custom query parser which
combines queries from LuceneQueryParser and the dismax query (dismax/edismax
doesn't support the full Lucene query syntax).
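
A hedged example of what the expanded request might look like (the field and
parameters are illustrative):

  q=icecream OR (ice cream)&q.op=AND&df=title&defType=lucene

With q.op=AND the parenthesised clause requires both terms, while the explicit
OR keeps either form of the word sufficient for a match.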





On 31 July 2014 22:39, sunshine glass  wrote:

> *Point 1:*
> On Thu, Jul 31, 2014 at 9:32 PM, Dyer, James  >
>  wrote:
>
> > If a user is searching on "ice cream" but your index has "icecream", you
> > can treat this like a spelling error.  WordBreakSolrSpellChecker would
> > identify the fact that  while "ice cream" is not in your index,
> "icecream"
> > and then you can re-query for the corrected version without the space.
> >
>
> What if I have 1M records for "ice cream" and the same number for "icecream"?
> Then the trick will not work here. What is desired in this case is that whether I
> search for "ice cream" or "icecream", Solr should return 2M results.
>
> *Point 2:*
> On Thu, Jul 31, 2014 at 9:32 PM, Dyer, James  >
>  wrote:
> The problem with solving this with analyers, is that you can analyze
> "ice-cream" as either "ice cream" or "icecream" (split or catenate on
> hyphen).  You can even analyze "IceCream > Ice Cream" (catenate on case
> change).  But how is your analyzer going to know that "icecream" should
> index as two tokens: "ice" "cream" ?  You're asking analysis to do too much
> in this case. This is where spellcheck can bridge the gap.
>
> I don't want "icecream" to be indexed as "ice" or "cream". I agree that
> this is not feasible. What I am looking for is to create shingles at
> query time as well. In other words, while querying "ice cream", can't it
> search as "ice" or "cream" or "icecream"?
> That is, forming shingles at query time.
>
> There is a long list of such words in my index, so I don't want to implement
> this via the synonym filter factory.
>
>
> On Thu, Jul 31, 2014 at 9:32 PM, Dyer, James  >
> wrote:
>
> > If a user is searching on "ice cream" but your index has "icecream", you
> > can treat this like a spelling error.  WordBreakSolrSpellChecker would
> > identify the fact that  while "ice cream" is not in your index,
> "icecream"
> > and then you can re-query for the corrected version without the space.
> >
> > The problem with solving this with analyers, is that you can analyze
> > "ice-cream" as either "ice cream" or "icecream" (split or catenate on
> > hyphen).  You can even analyze "IceCream > Ice Cream" (catenate on case
> > change).  But how is your analyzer going to know that "icecream" should
> > index as two tokens: "ice" "cream" ?  You're asking analysis to do too
> much
> > in this case.  This is where spellcheck can bridge the gap.
> >
> > Of course, if you have a discrete list of words you want split like this,
> > then you can do it with analysis using index-time synonyms.  In this
> case,
> > you need to provide it with the list.  See
> >
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
> > for more information.
> >
> > James Dyer
> > Ingram Content Group
> > (615) 213-4311
> >
> >
> > -Original Message-
> > From: sunshine glass [mailto:sunshineglassof2...@gmail.com]
> > Sent: Thursday, July 31, 2014 10:32 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Searching words with spaces for word without spaces in solr
> >
> > I am not clear with this. This link is related to spell check. Can you
> > elaborate it more ?
> >
> >
> > On Wed, Jul 30, 2014 at 9:17 PM, Dyer, James <
> james.d...@ingramcontent.com
> > >
> > wrote:
> >
> > > In addition to the analyzer configuration you're using, you might want
> to
> > > also use WordBreakSolrSpellChecker to catch possible matches that can't
> > > easily be solved through analysis.  For more information, see the
> section
> > > for it at
> > https://cwiki.apache.org/confluence/display/solr/Spell+Checking
> > >
> > > James Dyer
> > > Ingram Content Group
> > > (615) 213-4311
> > >
> > > -Original Message-
> > > From: sunshine glass [mailto:sunshineglassof2...@gmail.com]
> > > Sent: Wednesday, July 30, 2014 9:38 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Searching words with spaces for word without spaces in
> solr
> > >
> > > This is the new configuration:
> > >
> > >  > > > positionIncrementGap="100">
> > > >   
> > > > 
> > > > 
> > > >  > > > outputUnigrams="true" tokenSeparator=""/>
> > > >  > > > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > > > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > > > 
> > > >  > > > language="English" protected="protwords.txt"/>
> > > >> > > synonyms="stemmed_

Re: Bloom filter

2014-08-02 Thread Umesh Prasad
+1 to Guava's BloomFilter implementation.

You can hook into the UpdateProcessor chain and put the logic for updating
and checking the bloom filter there.

We had a somewhat similar use case. We were using DIH, and the same Solr
input document (meaning the same content) could arrive many times, leading
to a lot of unnecessary updates to the index. I introduced a DuplicateDetector
via the update processor chain which kept a map of unique ID --> Solr doc hash
code and dropped the document if it was a duplicate.
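
A stripped-down sketch of that kind of processor (class, field and map names
here are invented for illustration; a real implementation would bound the map
and would normally be created by an UpdateRequestProcessorFactory registered
in the update chain in solrconfig.xml):

  import java.io.IOException;
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;

  public class DuplicateDroppingProcessor extends UpdateRequestProcessor {

    // unique id -> hash of the last version of the document we saw
    private static final Map<Object, Integer> SEEN = new ConcurrentHashMap<>();

    public DuplicateDroppingProcessor(UpdateRequestProcessor next) {
      super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      Object id = cmd.getSolrInputDocument().getFieldValue("id");
      Integer hash = cmd.getSolrInputDocument().toString().hashCode();
      if (id != null && hash.equals(SEEN.put(id, hash))) {
        return;  // identical content already seen - drop it, skip the rest of the chain
      }
      super.processAdd(cmd);
    }
  }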

There is a nice video on other usages of the update chain:

https://www.youtube.com/watch?v=qoq2QEPHefo






On 30 July 2014 23:05, Shalin Shekhar Mangar  wrote:

> You're right. I misunderstood. I thought that you wanted to optimize the
> "finding by id" path which is typically done for comparing versions during
> inserts in Solr.
>
> Yes, it won't help with the case where the ID does not exist.
>
>
> On Wed, Jul 30, 2014 at 6:14 PM, Per Steffensen 
> wrote:
>
> > Hi
> >
> > I am not sure exactly what LUCENE-5675 does, but reading the description
> > it seems to me that it would help finding out that there is no document
> > (having an id-field) where version-field is less than . As
> > far as I can see this will not help finding out if a document with
> > id= exists. We want to ask "does a document with id 
> > exist", without knowing the value of its version-field (if it actually
> > exists). You do not know if it ever existed, either.
> >
> > Please elaborate. Thanks!
> >
> > Regarding " The only other choice today is bloom filters, which use up
> > huge amounts of memory", I guess a bloom filter only takes as much space
> > (disk or memory) as you want it to. The more space you allow it to use,
> > the less often it gives you a false positive (saying "this doc might exist"
> > in cases where the doc actually does not exist). So the space you need to
> > use for the bloom filter depends on how frequently you can live with false
> > positives (where you have to actually look it up in the real index).
> >
> > Regards, Per Steffensen
> >
> >
> > On 30/07/14 10:05, Shalin Shekhar Mangar wrote:
> >
> >> Hi Per,
> >>
> >> There's LUCENE-5675 which has added a new postings format for IDs.
> Trying
> >> it out in Solr is in my todo list but maybe you can get to it before me.
> >>
> >> https://issues.apache.org/jira/browse/LUCENE-5675
> >>
> >>
> >> On Wed, Jul 30, 2014 at 12:57 PM, Per Steffensen 
> >> wrote:
> >>
> >>  On 30/07/14 08:55, jim ferenczi wrote:
> >>>
> >>>  Hi Per,
>  First of all the BloomFilter implementation in Lucene is not exactly a
>  bloom filter. It uses only one hash function and you cannot set the
>  false
>  positive ratio beforehand. ElasticSearch has its own bloom filter
>  implementation (using "guava like" BloomFilter), you should take a
> look
>  at
>  their implementation if you really need this feature.
> 
>   Yes, I am looking into what Lucene can do and how to use it through
> >>> Solr.
> >>> If it does not fit our needs I will enhance it - potentially with
> >>> inspiration from ES implementation. Thanks
> >>>
> >>>   What is your use-case ? If your index fits in RAM the bloom filter
> >>> won't
> >>>
>  help (and it may have a negative impact if you have a lot of
> segments).
>  In
>  fact the only use case where the bloom filter can help is when your
> term
>  dictionary does not fit in RAM which is rarely the case.
> 
>   We have so many documents that it will never fit in memory. We use
> >>> optimistic locking (our own implementation) to do correct concurrent
> >>> assembly of documents and to do duplicate control. This require a lot
> of
> >>> finding docs from their id, and most of the time the document is not
> >>> there,
> >>> but to be sure we need to check both transactionlog and the actual
> index
> >>> (UpdateLog). We would like to use Bloom Filter to quickly tell that a
> >>> document with a particular id is NOT present.
> >>>
> >>>  Regards,
>  Jim
> 
>   Regards, Per Steffensen
> >>>
> >>>
> >>
> >>
> >
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
---
Thanks & Regards
Umesh Prasad


Re: Shuffle results a little

2014-08-02 Thread Umesh Prasad
What you are looking for is a distribution of search results. One way would be
a two-phase search:
Phase 1 : Search (with rows=0, no scoring, no grouping)
1. Find the groups (unique combinations) using pivot facets (won't work in a
distributed env yet)
2. Transform those groups into group.queries.

Phase 2 : Actual search ( with group.queries )

Pros : Readily available and well tested.
Cons : It will give you the exact same number of results for each group, which
may not be desired, specifically with pagination. And of course, you are
making two searches.
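
A hedged sketch of the two requests, using the tags/brand example from the
quoted question below (field names and values are illustrative):

  Phase 1:  q=...&rows=0&facet=true&facet.pivot=brand,tags
  Phase 2:  q=...&group=true&group.limit=2
                 &group.query=brand:acme AND tags:shoes
                 &group.query=brand:acme AND tags:bags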

The 2nd approach would be to implement this logic of distributing along
different dimensions as your own custom component. Solr's PostFilter /
delegating collector can be used for this. Basically, TopDocsCollector just
maintains a priority queue of matching documents. You can plug in your own
collector so that it sees all matching documents, identifies which group each
belongs to (if groups/pivots have already been identified), maintains a
priority queue for each of them and then finally merges them. Quite a bit of
customization if you ask me, but it can be done and it would be the most
powerful option.

PS : We use the 2nd approach.





On 30 July 2014 05:56, babenis  wrote:

> despite the fact that I upgraded to 4.9.0 - grouping doesn't seem to work on
> a multi-valued field, i.e.
>
> I was going to try to group by tags + brand (where tags is a multi-valued
> field) and spread results apart or select unique combinations only
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Shuffle-results-a-little-tp1891206p4149973.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
---
Thanks & Regards
Umesh Prasad


Re: To warm the whole cache of Solr other than the only autowarmcount

2014-08-02 Thread Umesh Prasad
@Erick : As you said, each use case is different. We actually autowarm our
caches to 80% and we have a 99% hit ratio on the filter cache. For the query
cache, hit ratios are more like 25%, but given that a cache hit saves us about
10X, we strive to increase the cache hit ratio.

@Yang : You can't do a direct copy of values. Values are tied to Lucene's
internal document ids, and those can change during an index update. The change
can happen because of documents being deleted, segments being merged or new
segments being created. Solr's caches refer to global doc ids, which are even
more prone to change (because of index merges).
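
For reference, the autowarm setting lives on the cache definitions in
solrconfig.xml; a sketch of an 80% filter-cache autowarm (sizes are
illustrative, and autowarmCount also accepts an absolute count if percentages
are not supported in your version):

  <filterCache class="solr.FastLRUCache"
               size="8192"
               initialSize="8192"
               autowarmCount="80%"/>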



On 28 July 2014 21:32, Erick Erickson  wrote:

> bq: autowarmcount=1024...
>
> That's the point, this is quite a high number in my
> experience.
>
> I've rarely seen numbers above 128 show much of
> any improvement. I've seen a large number of
> installations use much smaller autowarm numbers,
> as in the 16-32 range and be quite content.
>
> I _really_ recommend you try to use much smaller
> numbers then _measure_ whether the first few
> queries after a commit show unacceptable
> response times before trying to make things
> "better". This really feels like premature
> optimization.
>
> Of course you know your problem space better than
> I do, it's just that I've spent too much of my
> professional life fixing the wrong "problem"; I've
> become something of a "measure first" curmudgeon.
>
> FWIW,
> Erick
>
>
> On Sun, Jul 27, 2014 at 10:48 PM, YouPeng Yang 
> wrote:
>
> > Hi Erick
> >
> > We do the DIH job from the DB and committed frequently.It takes a long
> time
> > to autowarm the filterCaches after commit or soft commit  happened when
> > setting the autowarmcount=1024,which I do think is small enough.
> > So It comes up an idea that whether it  could  directly pass the
> reference
> > of the caches   over to the new caches so that the autowarm processing
> will
> > take much fewer time .
> >
> >
> >
> > 2014-07-28 2:30 GMT+08:00 Erick Erickson :
> >
> > > Why do you think you _need_ to autowarm the entire cache? It
> > > is, after all, an LRU cache, the theory being that the most recent
> > > queries are most likely to be reused.
> > >
> > > Personally I'd run some tests on using small autowarm counts
> > > before getting at all mixed up in some complex scheme that
> > > may not be useful at all. Say an autowarm count of 16. Then
> > > measure using that, then say 32 then... Insure you have a real
> > > problem before worrying about a solution! ;)
> > >
> > > Best,
> > > Erick
> > >
> > >
> > > On Fri, Jul 25, 2014 at 6:45 AM, Shawn Heisey 
> wrote:
> > >
> > > > On 7/24/2014 8:45 PM, YouPeng Yang wrote:
> > > > > To Matt
> > > > >
> > > > >   Thank you,your opinion is very valuable ,So I have checked the
> > source
> > > > > codes about how the cache warming  up. It seems to just put items
> of
> > > the
> > > > > old caches into the new caches.
> > > > >   I will pull Mark Miller into this discussion.He is the one of the
> > > > > developer of the Solr whom  I had  contacted with.
> > > > >
> > > > >  To Mark Miller
> > > > >
> > > > >Would you please check out what we are discussing in the last
> two
> > > > > posts.I need your help.
> > > >
> > > > Matt is completely right.  Any commit can drastically change the
> Lucene
> > > > document id numbers.  It would be too expensive to determine which
> > > > numbers haven't changed.  That means Solr must throw away all cache
> > > > information on commit.
> > > >
> > > > Two of Solr's caches support autowarming.  Those caches use queries
> as
> > > > keys and results as values.  Autowarming works by re-executing the
> top
> > N
> > > > queries (keys) in the old cache to obtain fresh Lucene document id
> > > > numbers (values).  The cache code does take *keys* from the old cache
> > > > for the new cache, but not *values*.  I'm very sure about this, as I
> > > > wrote the current (and not terribly good) LFUCache.
> > > >
> > > > Thanks,
> > > > Shawn
> > > >
> > > >
> > >
> >
>



-- 
---
Thanks & Regards
Umesh Prasad


Re: Implementing custom analyzer for multi-language stemming

2014-08-02 Thread Umesh Prasad
Also, take a look at the Lucene Revolution talk on the Typed Index:
https://www.youtube.com/watch?v=X93DaRfi790

 *Published on 25 Nov 2013*

Presented by Christoph Goller, Chief Scientist, IntraFind Software AG

If you want to search in a multilingual environment with high-quality
language-specific word-normalization, if you want to handle mixed-language
documents, if you want to add phonetic search for names if you need a
semantic search which distinguishes between a search for the color "brown"
and a person with the second name "brown", in all these cases you have to
deal with different types of terms. I will show why it makes much more
sense to attach types (prefixes) to Lucene terms instead of relying on
different fields or even different indexes for different kinds of terms.
Furthermore I will show how queries to such a typed index look and why e.g.
SpanQueries are needed to correctly treat compound words and phrases or
realize a reasonable phonetic search. The Analyzers and the QueryParser
described are available as plugins for Lucene, Solr, and elasticsearch.




On 31 July 2014 00:34, Sujit Pal  wrote:

> Hi Eugene,
>
> In a system we built couple of years ago, we had a corpus of English and
> French mixed (and Spanish on the way but that was implemented by client
> after we handed off). We had different fields for each language. So (title,
> body) for English docs was (title_en, body_en), for French (title_fr,
> body_fr) and for Spanish (title_es, body_es) - each of these were
> associated with a different Analyzer (that was associated with the field
> types in schema.xml, in case of Lucene you can use
> PerFieldAnalyzerWrapper). Our pipeline used Google translate to detect the
> language and write the contents into the appropriate field set for the
> language. Our analyzers were custom - but Lucene/Solr provides analyzer
> chains for many major languages. You can find a list here:
>
> https://wiki.apache.org/solr/LanguageAnalysis
>
> -sujit
>
>
>
> On Wed, Jul 30, 2014 at 10:52 AM, Chris Morley 
> wrote:
>
> > I know BasisTech.com has a plugin for elasticsearch that extends
> > stemming/lemmatization to work across 40 natural languages.
> > I'm not sure what they have for Solr, but I think something like that may
> > exist as well.
> >
> > Cheers,
> > -Chris.
> >
> > 
> >  From: "Eugene" 
> > Sent: Wednesday, July 30, 2014 1:48 PM
> > To: solr-user@lucene.apache.org
> > Subject: Implementing custom analyzer for multi-language stemming
> >
> > Hello, fellow Solr and Lucene users and developers!
> >
> > In our project we receive text from users in different languages. We
> > detect language automatically and use Google Translate APIs a lot (so
> > having arbitrary number of languages in our system doesn't concern us).
> > However we need to be able to search using stemming. Having nearly
> hundred
> > of fields (several fields for each language with language-specific
> > stemmers) listed in our search query is not an option. So we need a way
> to
> > have a single index which has stemmed tokens for different languages. I
> > have two questions:
> >
> > 1. Are there already (third-party) custom multi-language stemming
> > analyzers? (I doubt that no one else ran into this issue)
> >
> > 2. If I'm going to implement such analyzer myself, could you please
> > suggest a better way to 'pass' detected language value into such
> analyzer?
> > Detecting language in analyzer itself is not an option, because: a) we
> > already detect it in other place b) we do it based on combined values of
> > many fields ('name', 'topic', 'description', etc.), while current field
> > can
> > be to short for reliable detection c) sometimes we just want to specify
> > language explicitly. The obvious hack would be to prepend ISO 639-1 code
> > to
> > field value. But I'd like to believe that Solr allows for cleaner
> > solution.
> > I could think about either: a) custom query parameter (but I guess, it
> > will
> > require modifying request handlers, etc. which is highly undesirable) b)
> > getting value from other field (we obviously have 'language' field and we
> > do not have mixed-language records). If it is possible, could you please
> > describe the mechanism for doing this or point to relevant code examples?
> > Thank you very much and have a good day!
> >
> >
>



-- 
---
Thanks & Regards
Umesh Prasad


Re: Identify specific document insert error inside a solrj batch request

2014-08-02 Thread Umesh Prasad
The Solr schema is available over REST: https://wiki.apache.org/solr/SchemaRESTAPI

https://cwiki.apache.org/confluence/display/solr/Schema+API

You can use that to fetch the required fields and validate on the client side.
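
For example, the read-only schema API lists field definitions, including flags
such as required where they are set (host and core name are illustrative):

  http://localhost:8983/solr/collection1/schema/fields?wt=json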





On 31 July 2014 14:32, Liram Vardi  wrote:

> Hi Jack,
> Thank you for your reply.
> This is the Solr stack trace. As you can see, the missing field is
> "hourOfDay".
>
> Thanks,
> Liram
>
> 2014-07-30 14:27:54,934 ERROR [qtp-608368492-19] (SolrException.java:108)
> - org.apache.solr.common.SolrException:
> [doc=53b16126--0002-2b03-17ac4d4a07b6] missing required field: hourOfDay
> at
> org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:189)
> at
> org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:73)
> at
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:210)
> at
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
> at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:556)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:692)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:435)
> at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> at
> org.apache.solr.update.processor.AbstractDefaultValueUpdateProcessorFactory$DefaultValueUpdateProcessor.processAdd(AbstractDefaultValueUpdateProcessorFactory.java:94)
> at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> at
> com.checkpoint.solr_plugins.MulticoreUpdateRequestProcessor.processAdd(MulticoreUpdateRequestProcessorFactory.java:152)
> at
> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
> at
> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
> at
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
> at
> com.checkpoint.solr_plugins.MulticoreUpdateRequestProcessor.processAdd(MulticoreUpdateRequestProcessorFactory.java:248)
> at
> org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:86)
> at
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:143)
> at
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:123)
> at
> org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:220)
> at
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:108)
> at
> org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:185)
> at
> org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:111)
> at
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:150)
> at
> org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:96)
> at
> org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:55)
> at
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
> at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1474)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(

Re: SolrCloud Scale Struggle

2014-08-02 Thread Shawn Heisey
On 8/2/2014 2:46 PM, anand.mahajan wrote:
> Also, since there are already 18 JVMs per machine - how do I go about
> merging these existing cores under just 1 JVM? Would I need to
> create 1 Solr instance with 18 cores inside and then migrate data from these
> separate JVMs into the new instance?

Use the CoreAdmin API (or the ADDREPLICA action on the Collections API,
if the Solr version is new enough) to add replicas for all the shards to
one of the JVMs.  After the new replicas show green on the cloud graph
in the admin UI (which indicates that the index has been replicated),
unload the old cores (or use the DELETEREPLICA action) so they get
removed from the clusterstate, and once they're gone, stop/delete the
related JVMs.  You'll need some free disk space to complete these steps.
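
Roughly, the two Collections API calls look like this (collection, shard, node
and replica names are placeholders):

  http://host:8983/solr/admin/collections?action=ADDREPLICA
      &collection=mycollection&shard=shard1&node=host1:8983_solr

  http://host:8983/solr/admin/collections?action=DELETEREPLICA
      &collection=mycollection&shard=shard1&replica=core_node5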

Thanks,
Shawn