Re: SolrCloud delete by query performance

2015-05-20 Thread Ryan Cutter
Shawn, thank you very much for that explanation. It helps a lot. Cheers, Ryan On Wed, May 20, 2015 at 5:07 PM, Shawn Heisey wrote: > On 5/20/2015 5:57 PM, Ryan Cutter wrote: > > GC is operating the way I think it should but I am lacking memory. I am > > just surprised because indexing is perf

Re: Reindex of document leaves old fields behind

2015-05-20 Thread Erick Erickson
Well, let's see the code. Standard updates should replace the previous docs, reindexing the same unique ID with fewer fields should show fewer fields. So something's weird here. Although do, just for yucks, issue a query on some of the unique ids in question, I'd be curious if you get more than on

Re: solr 5.x on glassfish/tomcat instead of jetty

2015-05-20 Thread TK Solr
Never mind. I found that thread. Sorry for the noise. On 5/20/15, 5:56 PM, TK Solr wrote: On 5/20/15, 8:21 AM, Shawn Heisey wrote: As of right now, there is still a .war file. Look in the server/webapps directory for the .war, server/lib/ext for logging jars, and server/resources for the loggi

Re: solr 5.x on glassfish/tomcat instead of jetty

2015-05-20 Thread TK Solr
On 5/20/15, 8:21 AM, Shawn Heisey wrote: As of right now, there is still a .war file. Look in the server/webapps directory for the .war, server/lib/ext for logging jars, and server/resources for the logging configuration. Consult your container's documentation to learn where to place these thin

Re: Reindex of document leaves old fields behind

2015-05-20 Thread tuxedomoon
The uniqueKey value is the same. The new documents contain fewer fields than the already indexed ones. Could this cause the updates to be treated as atomic? With the persisting fields treated as un-updated? Routing should be implicit since the collection was created using numShards. Many req

Re: SolrCloud delete by query performance

2015-05-20 Thread Shawn Heisey
On 5/20/2015 5:57 PM, Ryan Cutter wrote: > GC is operating the way I think it should but I am lacking memory. I am > just surprised because indexing is performing fine (documents going in) but > deletions are really bad (documents coming out). > > Is it possible these deletes are hitting many seg

Re: SolrCloud delete by query performance

2015-05-20 Thread Ryan Cutter
GC is operating the way I think it should but I am lacking memory. I am just surprised because indexing is performing fine (documents going in) but deletions are really bad (documents coming out). Is it possible these deletes are hitting many segments, each of which I assume must be re-built? An

Re: SolrCloud delete by query performance

2015-05-20 Thread Shawn Heisey
On 5/20/2015 5:41 PM, Ryan Cutter wrote: > I have a collection with 1 billion documents and I want to delete 500 of > them. The collection has a dozen shards and a couple replicas. Using Solr > 4.4. > > Sent the delete query via HTTP: > > http://hostname:8983/solr/my_collection/update?stream.bo

SolrCloud delete by query performance

2015-05-20 Thread Ryan Cutter
I have a collection with 1 billion documents and I want to delete 500 of them. The collection has a dozen shards and a couple replicas. Using Solr 4.4. Sent the delete query via HTTP: http://hostname:8983/solr/my_collection/update?stream.body= source:foo Took a couple minutes and several repli
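
The delete request quoted above appears to have lost its XML payload in the archive. Below is a minimal sketch of how such a delete-by-query URL could be assembled. It assumes the stripped body was Solr's standard XML update message `<delete><query>...</query></delete>`. The host and collection are the placeholders from the post, and `build_delete_by_query_url` is a hypothetical helper name, not part of any Solr client.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape


def build_delete_by_query_url(base_url, query, commit=False):
    """Return an update URL carrying a delete-by-query message in stream.body.

    Assumption: the payload is Solr's XML update message
    <delete><query>...</query></delete>, as the post's URL suggests.
    """
    # Escape &, < and > so the query text cannot break the XML wrapper.
    body = "<delete><query>{}</query></delete>".format(escape(query))
    params = {"stream.body": body}
    if commit:
        params["commit"] = "true"
    # urlencode percent-encodes the XML so it is safe inside a URL.
    return base_url + "?" + urlencode(params)


url = build_delete_by_query_url(
    "http://hostname:8983/solr/my_collection/update", "source:foo"
)
print(url)
```

The sketch only builds the URL and sends nothing, so it can be inspected before being run against a cluster.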

SolrCloud Leader Election

2015-05-20 Thread Ryan Steele
My SolrCloud cluster isn't reassigning the collection leaders from downed cores--the downed cores are still listed as the leaders. The cluster has been in this state for a few hours and the logs continue to report "No registered leader was found after waiting for 4000ms." Is there a way to forc

Re: Reindex of document leaves old fields behind

2015-05-20 Thread Shawn Heisey
On 5/20/2015 4:43 PM, tuxedomoon wrote: > I'm reindexing Mongo docs into SolrCloud. The new docs have had a few fields > removed so upon reindexing those fields should be gone in Solr. They are > not. So the result is a new doc merged with an old doc rather than a > replacement which is what I n

Reindex of document leaves old fields behind

2015-05-20 Thread tuxedomoon
I'm reindexing Mongo docs into SolrCloud. The new docs have had a few fields removed so upon reindexing those fields should be gone in Solr. They are not. So the result is a new doc merged with an old doc rather than a replacement which is what I need. I do not know whether the issue is with my

Re: Edismax

2015-05-20 Thread Shawn Heisey
On 5/20/2015 3:35 PM, John Blythe wrote: > regarding the new question itself, i'd replied to this thread w more info > but had the system kick it back to me for some reason. maybe i replied too > much too soon? anyway, it ended up being a result of my query still being > in the primary query box in

Re: Edismax

2015-05-20 Thread Upayavira
A few things: Scores aren't confidence metrics, they are relative rankings, in relation to a single resultset, that's all. Secondly for edismax, boost does multiplicative boosting (whatever function you provide, the score is multiplied by that), whereas bf does additive boosting. Upayavira On W

Re: Edismax

2015-05-20 Thread John Blythe
Good call thank you On Wed, May 20, 2015 at 5:15 PM, Erick Erickson wrote: > John: > The spam filter is very aggressive. Try changing the type to "plain > text" rather than rich text or html... > Best, > Erick > On Wed, May 20, 2015 at 2:35 PM, John Blythe wrote: >> thanks guys. >> >> it doesn'

Re: Upgrading question

2015-05-20 Thread Erick Erickson
Yep. Solr/Lucene strives for one major revision backwards compatibility. So any 5x should be able to read any index produced with 4x, but no index produced with 3x. Best, Erick On Wed, May 20, 2015 at 2:44 PM, Craig Longman wrote: > > We've been using Solr a bit now for a year or so, 4.6 is the

Re: Edismax

2015-05-20 Thread Erick Erickson
John: The spam filter is very aggressive. Try changing the type to "plain text" rather than rich text or html... Best, Erick On Wed, May 20, 2015 at 2:35 PM, John Blythe wrote: > thanks guys. > > it doesn't depend on absolute scores, but it is leaning on the score as a > confident metric of sor

Upgrading question

2015-05-20 Thread Craig Longman
We've been using Solr a bit now for a year or so, 4.6 is the oldest version of Solr we've deployed. We're currently working through the process we'll use to upgrade to 5.1, an upgrade we need for the new facet.stats capabilities. Reading the Major Changes document, it indicates that there is n

Re: Edismax

2015-05-20 Thread John Blythe
thanks guys. it doesn't depend on absolute scores, but it is leaning on the score as a confident metric of sorts. we've found some good standard deviation info when plotting out the accuracy of the top result and the relative score with the analyzers currently in production and hope to strengthen

Re: Edismax

2015-05-20 Thread Walter Underwood
I was going to post the same advice. If your approach depends on absolute scores, you need to change your approach. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 2:09 PM, Shawn Heisey wrote: > On 5/20/2015 2:54 PM, John Blythe wro

Re: Edismax

2015-05-20 Thread Shawn Heisey
On 5/20/2015 2:54 PM, John Blythe wrote: > new question re edismax: when i turn it on (in solr admin) my score goes > wayy down. from 772 to 4.9. > > what in the edismax query parser would account for that huge nosedive? Scores are 100% relative, and the number only has meaning in the context

Re: Edismax

2015-05-20 Thread John Blythe
new question re edismax: when i turn it on (in solr admin) my score goes wayy down. from 772 to 4.9. what in the edismax query parser would account for that huge nosedive? -- *John Blythe* Product Manager & Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Eva

Re: ConfigSets and SolrCloud

2015-05-20 Thread Erick Erickson
What is "it"? There isn't one except zkcli and variants ;). Things are all automatic once you get things _to_ Zookeeper, but pushing the config sets up is a manual process. The "usual" process is to have the configs in some VCS somewhere so they're safe, and do the usual checkout/edit/checkin and

Re: Edismax

2015-05-20 Thread John Blythe
cool, will check into it some more via testing -- *John Blythe* Product Manager & Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 3:22 PM, Walter Underwood wrote: > I believe that boost is a superset of the bq funct

Re: Edismax

2015-05-20 Thread Walter Underwood
I believe that boost is a superset of the bq functionality. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 1:16 PM, John Blythe wrote: > could i do that the same way as my mention of using bq? the docs aren't > very rich in their exa

Re: Error on grouping result set

2015-05-20 Thread Erick Erickson
Possibly you changed the field type sometime without completely blowing away your index and re-indexing from scratch? Based on: "unexpected docvalues type SORTED_SET for field 'vendor' (expected=SORTED)" Because you can't group on multi-valued fields, which is I think what's going on here. Eithe

Re: Problem using a function with a multivalued field

2015-05-20 Thread Erick Erickson
bq: Keep a copy of the value into a non-multi-valued field, using an update processor: This involves indexing a new field Why can't you do this? You can't re-index the data perhaps? It's by far the easiest solution Best, Erick On Wed, May 20, 2015 at 2:45 AM, Fernando Agüero wrote: > Hi ev

Re: Edismax

2015-05-20 Thread John Blythe
could i do that the same way as my mention of using bq? the docs aren't very rich in their example or explanation of boost= here: https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser thanks! -- *John Blythe* Product Manager & Lead Developer 251.605.3071 | j...@curvo

Re: Edismax

2015-05-20 Thread Walter Underwood
I highly recommend using boost= in edismax rather than bq=. The multiplicative boost is stable with a wide range of scores. bq is additive and has problems with high or low scores. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 1:04
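
The point about multiplicative versus additive boosting can be illustrated with toy arithmetic. This is a sketch of the idea only, not Solr's scoring code: the base scores, the additive bonus of 2.0 and the factor of 1.5 are made-up values, and real bq clauses are also affected by query normalization.

```python
def additive(base_score, bonus):
    """bq-style boost: a fixed amount is added to the base score."""
    return base_score + bonus


def multiplicative(base_score, factor):
    """boost=-style boost: the base score is multiplied by a factor."""
    return base_score * factor


def relative_lift(boosted, base):
    """Fraction by which the boost raised the base score."""
    return (boosted - base) / base


# The same boost settings applied to low, medium and high base scores.
for base in (0.5, 5.0, 500.0):
    add_lift = relative_lift(additive(base, 2.0), base)
    mul_lift = relative_lift(multiplicative(base, 1.5), base)
    print(f"base={base:>6}: additive lift={add_lift:.3f}, multiplicative lift={mul_lift:.3f}")
```

The additive bonus swamps a low base score and barely registers against a high one, while the multiplicative factor gives the same relative lift in every case.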

Re: [solr 5.1] Looking for full text + collation search field

2015-05-20 Thread Ahmet Arslan
Hi Bjorn, solr.ICUCollationField is useful for *sorting*, and you cannot sort on tokenized fields. Your example looks like diacritics insensitive search. Please see : ASCIIFoldingFilterFactory Ahmet On Wednesday, May 20, 2015 2:53 PM, Björn Keil wrote: Hello, might anyone suggest a field

Edismax

2015-05-20 Thread John Blythe
Hi all, I've been fine tuning our current Solr implementation the last week or two to get more precise results. We are trying to get our implementation accurate enough to serve as a lightweight machine learning (obviously a misnomer) implementation of sorts. Actual user generated searching is far

Re: Solr Cloud: No live SolrServers available

2015-05-20 Thread Chetan Vora
Seems like the attachments get stripped off. Anyways, here is the 4.7 log on startup INFO - 2015-05-20 10:35:45.786; org.eclipse.jetty.server.Server; jetty-8.1.10.v20130312 INFO - 2015-05-20 10:35:45.804; org.eclipse.jetty.deploy.providers.ScanningAppProvider; Deployment monitor C:\apps\solr\so

Re: When is too many fields in "qf" is too many?

2015-05-20 Thread Doug Turnbull
Yeah a copyField into one could be a good space/time tradeoff. It can be more manageable to use an all field for both relevancy and performance, if you can handle the duplication of data. You could set tie=1.0, which effectively sums all the matches instead of picking the best match. You'll still
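
The effect of the tie parameter can be sketched as follows. A disjunction-max score is the best per-field score plus tie times the sum of the remaining field scores, so tie=0.0 is pure "winner takes all" and tie=1.0 is a plain sum. The per-field scores below are made-up numbers and `dismax_score` is a hypothetical helper name.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores the way a disjunction-max query does.

    score = max(scores) + tie * (sum of the other scores)
    """
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


scores = [3.0, 1.0, 0.5]  # hypothetical per-field scores for one document
print(dismax_score(scores, tie=0.0))  # only the best field counts
print(dismax_score(scores, tie=1.0))  # all fields are summed
```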

Re: When is too many fields in "qf" is too many?

2015-05-20 Thread Steven White
Hi Doug, Your blog write up on relevancy is very interesting, I didn't know this. Looks like I have to go back to my drawing board and figure out an alternative solution: somehow get those group-based-fields data into a single field using copyField. Thanks Steve On Wed, May 20, 2015 at 11:17 AM

Re: When is too many fields in "qf" is too many?

2015-05-20 Thread Steven White
Thanks for calling out maxBooleanClauses. The current default of 1024 has not caused me any issues (so far) in my testing. However, you probably saw Doug Turnbull's reply, it looks like my relevance will suffer. Steve On Wed, May 20, 2015 at 11:42 AM, Shawn Heisey wrote: > On 5/20/2015 9:24 AM

ConfigSets and SolrCloud

2015-05-20 Thread Jim . Musil
Hi, I need a little clarification on configSets in solr 5.x. According to this page: https://cwiki.apache.org/confluence/display/solr/Config+Sets I can create named configSets to be shared by other cores. If I create them using this method AND am operating in SolrCloud mode, will it automatica

Re: Need help with Nested docs situation

2015-05-20 Thread Mikhail Khludnev
Data scale and request rate determine the choice between block joins, plain joins and field collapsing. On Thu, Apr 30, 2015 at 1:07 PM, roySolr wrote: > Hello, > > I have a situation and i'm a little bit stuck on the way how to fix it. > For example the following data structure: > > *Deal* > All Coca Cola 20% o

Re: solr 5.x on glassfish/tomcat instead of jetty

2015-05-20 Thread Toke Eskildsen
Shawn Heisey wrote: > I'm wondering ... if Jetty is good enough for the Google App Engine, why > isn't it good enough for your infrastructure standards? Replace Jetty vs. Glassfish with Linux vs. Windows, Eclipse vs. Idea, emacs vs. vi, Java vs. C#... There are many reasons for a corporation to

Re: Help to index nested document

2015-05-20 Thread Mikhail Khludnev
I'm absolutely sure that you need to group them externally in the indexer eg like a child VALUES entity in DataImportHandler. On Mon, May 11, 2015 at 9:52 PM, Vishal Swaroop wrote: > Need your valuable inputs... > > I am indexing data from database (one table) which is in this example > format :

Re: Block Join Query update documents, how to do it correctly?

2015-05-20 Thread Mikhail Khludnev
On Thu, May 14, 2015 at 12:01 AM, Tom Devel wrote: > I tried to repost the whole modified document (the parent and ALL of its > children as one file), and it seems to work on a small toy example, but of > course I cannot be sure for a larger instance with thousands of documents, > and I would lik

Re: scoreMode ToParentBlockJoinQuery

2015-05-20 Thread Mikhail Khludnev
Hello, Here is the patch https://issues.apache.org/jira/browse/SOLR-5882 On Tue, May 12, 2015 at 1:11 PM, StrW_dev wrote: > Hi > > Is it possible to configure the scoreMode of the Parent block join query > parser (ToParentBlockJoinQuery)? > It seems it's set to none, while i would require max

Re: Looking up arrays in a sub-entity

2015-05-20 Thread rumford
I was able to get what I wanted by processing the column in question as massaged text, so that it was a comma-delimited series of IDs, and then passing that to a subentity query that went something like: SELECT value FROM othertable WHERE id IN (${master.ids}). It's slow but I think it's getting t
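
The sub-entity lookup described above (a comma-delimited list of IDs fed into an IN clause) can be sketched outside DataImportHandler. This sketch uses an in-memory SQLite database. The table name `othertable` and the `id`/`value` columns come from the post, while the sample rows and the helper name `lookup_values` are assumptions for illustration.

```python
import sqlite3


def lookup_values(conn, ids_csv):
    """Resolve a comma-delimited string of IDs to their values."""
    ids = [int(part) for part in ids_csv.split(",") if part.strip()]
    if not ids:
        return []
    # One placeholder per ID keeps the query parameterized.
    placeholders = ",".join("?" for _ in ids)
    sql = f"SELECT value FROM othertable WHERE id IN ({placeholders}) ORDER BY id"
    return [row[0] for row in conn.execute(sql, ids)]


# Hypothetical sample data standing in for the real lookup table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE othertable (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO othertable VALUES (?, ?)",
    [(1, "red"), (2, "green"), (3, "blue")],
)
print(lookup_values(conn, "1,3"))
```

Doing the lookup in indexing code allows the IDs to be batched or cached, which may help with the slowness the post mentions.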

Re: solr 5.x on glassfish/tomcat instead of jetty

2015-05-20 Thread Ravi Solr
Shawn I agree with you, but, some of the decisions in the corporate world are handed down through higher powers/pay grade, who do not always like to hear counter arguments. For example, this is the same reason why govt/federal restrict tech folks only use certified DBs/App Servers like Oracle,WSAD

Re: When is too many fields in "qf" is too many?

2015-05-20 Thread Steven White
> Also, is this 1500 fields that are always populated, or are there really a > larger number of different record types, each with a relatively small > number of fields populated in a particular document? Answer: This is a large number of different record types, each with a relatively small number

Re: Suggestion on field type

2015-05-20 Thread Vishal Swaroop
Thank you all... You all are experts... I will go with double as this seems to be more feasible. Regards On Tue, May 19, 2015 at 7:26 PM, Walter Underwood wrote: > A field type based on BigDecimal could be useful, but that would be a fair > amount more work. > > Double is usually sufficient fo

Re: When is too many fields in "qf" is too many?

2015-05-20 Thread Shawn Heisey
On 5/20/2015 9:24 AM, Steven White wrote: > I have already switched to using POST because I need to send a long list of > data in "qf". My question isn't about POST / GET, it's about Solr and > Lucene having to deal with such long list of fields. Here is the text of > my question reposted: > >>

[ANN] Relevant Search -- The Book on Search Relevance

2015-05-20 Thread Doug Turnbull
Hello fellow Solr users, We're writing a book on applied Lucene search relevance -- "Relevant Search" (http://manning.com/turnbull). We want to teach you to improve the quality of your Solr search results! We're trying to bridge the academic side of Information Retrieval from books like Intro. to

Re: When is too many fields in "qf" is too many?

2015-05-20 Thread Steven White
Thanks Shawn. I have already switched to using POST because I need to send a long list of data in "qf". My question isn't about POST / GET, it's about Solr and Lucene having to deal with such long list of fields. Here is the text of my question reposted: > Given the above, beside the fact that

Re: solr 5.x on glassfish/tomcat instead of jetty

2015-05-20 Thread Shawn Heisey
On 5/20/2015 9:07 AM, Ravi Solr wrote: > I have read that solr 5.x has moved away from deployable WAR architecture > to a runnable Java Application architecture. Our infrastructure/standards > folks are adamant about not running SOLR on Jetty (as we are about to > upgrade from 4.7.2 to 5.1), any id

Re: When is too many fields in "qf" is too many?

2015-05-20 Thread Doug Turnbull
Steven, I'd be concerned about your relevance with that many qf fields. Dismax takes a "winner takes all" point of view to search. Field scores can vary by an order of magnitude (or even two) despite the attempts of query normalization. You can read more here http://opensourceconnections.com/blog/

Re: When is too many fields in "qf" is too many?

2015-05-20 Thread Jack Krupansky
The uf parameter is used to specify which fields a user "may" query against - the "qf" parameter specifies the set of fields that an unfielded query term "must" be queried against. The user is free to specify fielded query terms, like "field1:term1 OR field2:term2". So, which use case are you reall

solr 5.x on glassfish/tomcat instead of jetty

2015-05-20 Thread Ravi Solr
I have read that solr 5.x has moved away from deployable WAR architecture to a runnable Java Application architecture. Our infrastructure/standards folks are adamant about not running SOLR on Jetty (as we are about to upgrade from 4.7.2 to 5.1), any ideas on how I can make it run on Glassfish or at

Re: When is too many fields in "qf" is too many?

2015-05-20 Thread Shawn Heisey
On 5/20/2015 6:27 AM, Steven White wrote: > My solution requires that users in group-A can only search against a set of > fields-A and users in group-B can only search against a set of fields-B, > etc. There can be several groups, as many as 100 even more. To meet this > need, I build my search b

Re: Solr Cloud: No live SolrServers available

2015-05-20 Thread Chetan Vora
Erick Thanks for your response. Logs don't seem to show any explicit errors (I have log level at INFO). I am attaching the logs from a 4.7 start and a 5.1 start here. Note that both logs seem to show the shards as "Down" initially but for 5.1, the state change to Active later on. Also, note tha

Re: Solr query which return only those docs whose all tokens are from given list

2015-05-20 Thread Mikhail Khludnev
Use an update processor to add the number of tags per doc, e.g. check CountFieldValuesUpdateProcessorFactory. Doc1 -> tags:T1 T2 ; tagNum: 2 Doc2 -> tags:T1 T3 ; tagNum: 2 Doc3 -> tags:T1 T4 ; tagNum: 2 Doc4 -> tags:T1 T2 T3 ; tagNum: 3 Then, when you search for tags, you need to get the number of tags matche
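
The tag-count idea can be sketched in plain Python using the four example documents from the message. A document qualifies only when the number of its tags found in the given list equals its stored tagNum. The allowed list `["T1", "T2", "T3"]` and both function names are assumptions for illustration. In Solr the count would be written at index time by the update processor rather than computed like this.

```python
def index_with_tag_count(docs):
    """Store each doc's tags together with a precomputed tag count."""
    return {
        doc_id: {"tags": tags, "tagNum": len(tags)}
        for doc_id, tags in docs.items()
    }


def docs_with_only_allowed_tags(index, allowed):
    """Return IDs of docs whose tags all come from the allowed list."""
    allowed = set(allowed)
    result = []
    for doc_id, doc in index.items():
        matched = sum(1 for tag in doc["tags"] if tag in allowed)
        # Every tag matched, so the doc has no tag outside the list.
        if matched > 0 and matched == doc["tagNum"]:
            result.append(doc_id)
    return result


index = index_with_tag_count({
    "Doc1": ["T1", "T2"],
    "Doc2": ["T1", "T3"],
    "Doc3": ["T1", "T4"],
    "Doc4": ["T1", "T2", "T3"],
})
print(docs_with_only_allowed_tags(index, ["T1", "T2", "T3"]))
```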

Re: Deduplication

2015-05-20 Thread Shalin Shekhar Mangar
On Wed, May 20, 2015 at 12:59 PM, Bram Van Dam wrote: > >> Write a custom update processor and include it in your update chain. > >> You will then have the ability to do anything you want with the entire > >> input document before it hits the code to actually do the indexing. > > This sounded lik

Error on grouping result set

2015-05-20 Thread Abhijit Deka
Hi, I am having some problem while grouping the result set. I have a solr schema like this       I am querying the schema and the result is like this product,Vendor,Invoice abc,vendor1,49206.758 abc,vendor2,35654.981 abc,vendor2,94861.258 abc,vendo

Re: Term Frequency Calculation - Clarification

2015-05-20 Thread ariya bala
Please ignore. On Wed, May 20, 2015 at 2:45 PM, ariya bala wrote: > Thanks Jack. > In my case there is only one document - Foo Foo is in bar > As per your comment, I should expect TF to be 2. > But I am getting one. > Is there any check where if one match is a subset of other, is calculated > o

When is too many fields in "qf" is too many?

2015-05-20 Thread Steven White
Hi everyone, My solution requires that users in group-A can only search against a set of fields-A and users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 even more. To meet this need, I build my search by passing in the list of fields via

[solr 5.1] Looking for full text + collation search field

2015-05-20 Thread Björn Keil
Hello, might anyone suggest a field type with which I may do both a full text search (i.e. there is an analyzer including a tokenizer) and apply a collation? An example for what I want to do: There is a field "composer" for which I passed the value "Dvořák, Antonín". I want the following queries

Problem using a function with a multivalued field

2015-05-20 Thread Fernando Agüero
Hi everyone, I’ve been reading answers around this problem but I wanted to make sure that there is another way out of my problem. The thing is that the solution shouldn’t be on index-time, involve indexing a new field or changing this multi-valued field to a single-valued one. Problem: I need

Re: Term Frequency Calculation - Clarification

2015-05-20 Thread ariya bala
Thanks Jack. In my case there is only one document - Foo Foo is in bar As per your comment, I should expect TF to be 2. But I am getting one. Is there any check where if one match is a subset of other, is calculated once? My class extends DefaultSimilarity. Cheers Ariya Bala S On Wed, May 20, 201

Re: Deduplication

2015-05-20 Thread Alessandro Benedetti
What Solr de-duplication offers is to calculate a hash for each input document (based on a set of fields). You can then choose between two options: - Index everything; documents with the same signature will be equal - Avoid the overwriting of duplicates. How the similarity hash is calculated is
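
The signature idea can be sketched as a hash over a chosen set of fields. MD5 is used here purely for illustration, since the message does not say which hash is configured. The field names and the `signature` helper are likewise hypothetical.

```python
import hashlib


def signature(doc, fields):
    """Hash the selected fields of a document into a hex signature."""
    digest = hashlib.md5()
    for name in fields:
        value = doc.get(name, "")
        digest.update(str(value).encode("utf-8"))
        # Separator so ("ab", "c") and ("a", "bc") hash differently.
        digest.update(b"\x00")
    return digest.hexdigest()


doc_a = {"id": "1", "title": "Solr dedupe", "body": "same text"}
doc_b = {"id": "2", "title": "Solr dedupe", "body": "same text"}
doc_c = {"id": "3", "title": "Solr dedupe", "body": "other text"}

fields = ["title", "body"]  # the id is deliberately left out of the hash
print(signature(doc_a, fields) == signature(doc_b, fields))
print(signature(doc_a, fields) == signature(doc_c, fields))
```

Two documents with different IDs but identical signature fields get the same signature, which is what allows duplicates to be detected or overwritten.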

Re: Term Frequency Calculation - Clarification

2015-05-20 Thread Jack Krupansky
Yes. tf is both 1 and 2 - tf is per document, which is 1 for the first document and 2 for the second document. See: http://lucene.apache.org/core/5_1_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html -- Jack Krupansky On Wed, May 20, 2015 at 6:13 AM, ariya bala wrote: > Hi, >

Term Frequency Calculation - Clarification

2015-05-20 Thread ariya bala
Hi, I have made a custom class for scoring the similarity (TermFrequencyBiasedSimilarity). The score was deduced by considering just the TF part (achieved by setting IDF=1). Question is: - *Document content:* Foo Foo is in bar *Search query:* Foo bar *slop:* 3 With Slop 3, There ar

Re: Deduplication

2015-05-20 Thread Bram Van Dam
On 19/05/15 14:47, Alessandro Benedetti wrote: > Hi Bram, > what do you mean with : > " I > would like it to provide the unique value myself, without having the > deduplicator create a hash of field values " . > > This is not deduplication, but simple document filtering based on a > constraint. >

Re: Deduplication

2015-05-20 Thread Bram Van Dam
>> Write a custom update processor and include it in your update chain. >> You will then have the ability to do anything you want with the entire >> input document before it hits the code to actually do the indexing. This sounded like the perfect option ... until I read Jack's comment: > > My und

Re: Looking up arrays in a sub-entity

2015-05-20 Thread Upayavira
Personally, I see this as a limit of the dataimporthandler. It gets you started, but when your needs get at all complicated, it can't help you. I would encourage you to write your own indexing code. A little bit of code that reads over your database, sorts it out in the right way, and pushes it to

Re: Solr query which return only those docs whose all tokens are from given list

2015-05-20 Thread Naresh Yadav
Requesting Solr experts again to suggest some solutions to my above problem as i am not able to solve this. On Tue, May 12, 2015 at 11:04 AM, Naresh Yadav wrote: > Thanks Andrew, You got my problem precisely But solutions you suggested > may not work for me. > > In my API i get only list of tags