De Duplication using Solr

2015-01-02 Thread Amit Jha
I am trying to find duplicate records based on distance and phonetic algorithms. Can I use Solr for that? I have the following fields and conditions to identify exact or possible duplicates. 1. Fields: prefix, suffix, firstname, lastname, email(primary_email1, email2, email3), phone(primary_phone1,
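
As a rough illustration of what "distance and phonetic" matching means in the question above, here is a minimal standalone Python sketch. It is not Solr code and not anything from the thread: the function names, the `id` key, the choice of American Soundex on `lastname` plus Levenshtein distance on `firstname`, and the threshold of 2 are all assumptions made for the example.

```python
from itertools import combinations


def soundex(name: str) -> str:
    """Return the American Soundex code (one letter + three digits)."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    letters_only = [c for c in name.upper() if c.isalpha()]
    if not letters_only:
        return ""
    result = letters_only[0]
    prev = codes.get(letters_only[0], "")
    for ch in letters_only[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (result + "000")[:4]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def possible_duplicates(records, max_distance=2):
    """Pair up records whose last names sound alike and whose first names
    are within max_distance edits (hypothetical rule for illustration)."""
    pairs = []
    for a, b in combinations(records, 2):
        same_sound = soundex(a["lastname"]) == soundex(b["lastname"])
        close = edit_distance(a["firstname"].lower(),
                              b["firstname"].lower()) <= max_distance
        if same_sound and close:
            pairs.append((a["id"], b["id"]))
    return pairs


records = [
    {"id": 1, "firstname": "Jon", "lastname": "Smith"},
    {"id": 2, "firstname": "John", "lastname": "Smyth"},
    {"id": 3, "firstname": "Maria", "lastname": "Lopez"},
]
print(possible_duplicates(records))
```

Comparing every pair is quadratic in the number of records, so a sketch like this only suits small data sets; at scale the usual approach is to group records by a phonetic key first and compare only within each group.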

Re: SolrCloud multi-datacenter failover?

2015-01-02 Thread Erick Erickson
bq: This is problematic because some portion of user activity will fail, queries that are in transit will not complete This is always interesting to think about, but is it a serious enough problem to spend resources trying to anticipate? I can imagine situations where even losing the queries in tr

Re: Garbage Collection tuning - G1 is now a good option

2015-01-02 Thread Mark Miller
bq. But tons of people on this mailing list do not recommend AggressiveOpts It's up to you to decide - that is why it's an option. It will enable more aggressive options that will tend to perform better. On the other hand, these more aggressive options and optimizations have a history of being mor

SolrCloud multi-datacenter failover?

2015-01-02 Thread jaime spicciati
All, At my current customer we have developed a custom federator that will federate queries between Endeca and Solr to ease the transition from an extremely large (TBs of data) Endeca index to Solr. (Endeca is similar to Solr in terms of search/faceted navigation/etc). During this transition pl

Re: UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?

2015-01-02 Thread jia gu
It's a single Solr instance. In my files I used 'doc_key' everywhere, but I changed it to "id" in the email I sent out to make it easier to read. Sorry, I didn't mean to confuse you :) On Fri, Jan 2, 2015 at 4:06 PM, Alexandre Rafalovitch wrote: > On 2 January 2015 at 15:43, wrote: >

Re: UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?

2015-01-02 Thread Alexandre Rafalovitch
On 2 January 2015 at 15:43, wrote: > id Your uniqueKey does not seem to be the 'doc_key' that the URP is asked to generate. I wonder if that is causing the issue. Are you deliberately generating a field different from one defined as unique id? Regards, Alex. Sign up for my Solr resourc

Re: UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?

2015-01-02 Thread Meraj A. Khan
Is this SolrCloud or single Solr Instance? On Jan 2, 2015 3:44 PM, wrote: > Happy New Year Everyone :) > > I am trying to automatically generate document Id when indexing a csv > file that contains multiple lines of documents. The desired case: if the > csv file contains 2 lines (each line is a d

UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?

2015-01-02 Thread jiag
Happy New Year Everyone :) I am trying to automatically generate document Id when indexing a csv file that contains multiple lines of documents. The desired case: if the csv file contains 2 lines (each line is a document), then the index should contain 2 documents. What I observed: If the csv fi

Re: Inconsistent document addition

2015-01-02 Thread Erick Erickson
Really impossible to say. Assuming you're generating correctly-formed documents, I don't see how this would fail. So, here's how I'd approach it: You're assuming that 1> you're getting all the docs back from server A that you have in there and 2> you're correctly sending them all to server B. So my

RE: UseLargePages

2015-01-02 Thread Toke Eskildsen
Shawn Heisey [apa...@elyograg.org] wrote: > All indications are that you should probably turn off the "transparent huge > pages" feature in the OS if you use them, though. > https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge Very interesting. We had severe performance p

Re: UseLargePages

2015-01-02 Thread Shawn Heisey
On 1/1/2015 10:12 PM, William Bell wrote: > Do you think setting aside 2GB for UseLargePages would generally help > indexing or not? > > I can imagine it might help. Allocating part of your operating system memory as huge pages and then turning on UseLargePages probably will help with general

Inconsistent document addition

2015-01-02 Thread Yashveer Rana
I have a SolrCloud setup with two collections, A and B, with different schemas (although the majority of fields are identical). Collection A has ~3.6 million documents. Using *solrj 4.7.0*. As per a requirement, my application - reads documents from collection A in batches of 10k - creates docs of ty

Re: SpellCheck (AutoComplete) Not Working In Distributed Environment

2015-01-02 Thread Shawn Heisey
On 1/1/2015 1:09 PM, Meraj A. Khan wrote: > When running SolrCloud do you even have to include the shards parameter? > Shouldn't the shards.qt parameter alone suffice? If you are using SolrCloud, no shards parameter is required ... all queries sent to either the collection or any shard replica will auto

Re: Garbage Collection tuning - G1 is now a good option

2015-01-02 Thread Shawn Heisey
On 1/1/2015 6:35 PM, William Bell wrote: > But tons of people on this mailing list do not recommend AggressiveOpts > > Why do you recommend it? I haven't done any comparisons with and without it. To call it a "recommendation" is a little bit strong. I use it, and I am seeing good results. My r

FOSDEM Open source search devroom

2015-01-02 Thread Bram Van Dam
Hi folks, There will be an Open source search devroom[1] at this year's FOSDEM in Brussels, 31st of January & 1st of February. I don't know if there will be a Lucene/Solr presence (there's no schedule for the dev room yet), but this seems like a good place to meet up and talk shop. I'll be th