Re: How large is your solr index?
For Solr 5 why don't we switch it to 64 bit ?? Bill Bell Sent from mobile > On Dec 29, 2014, at 1:53 PM, Jack Krupansky wrote: > > And that Lucene index document limit includes deleted and updated > documents, so even if your actual document count stays under 2^31-1, > deleting and updating documents can push the apparent document count over > the limit unless you very aggressively merge segments to expunge deleted > documents. > > -- Jack Krupansky > > On Mon, Dec 29, 2014 at 12:54 PM, Erick Erickson > wrote: > >> When you say 2B docs on a single Solr instance, are you talking only one >> shard? Because if you are, you're very close to the absolute upper limit of a >> shard; internally the doc id is an int, or 2^31. 2^31 + 1 will cause all sorts of problems. >> >> But yeah, your 100B documents are going to use up a lot of servers... >> >> Best, >> Erick >> >> On Mon, Dec 29, 2014 at 7:24 AM, Bram Van Dam >> wrote: >>> Hi folks, >>> >>> I'm trying to get a feel of how large Solr can grow without slowing down too >>> much. We're looking into a use-case with up to 100 billion documents >>> (SolrCloud), and we're a little afraid that we'll end up requiring 100 >>> servers to pull it off. >>> >>> The largest index we currently have is ~2 billion documents in a single Solr >>> instance. Documents are smallish (5k each) and we have ~50 fields in the >>> schema, with an index size of about 2TB. Performance is mostly OK. Cold >>> searchers take a while, but most queries are alright after warming up. I >>> wish I could provide more statistics, but I only have very limited access to >>> the data (...banks...). >>> >>> I'd be very grateful to anyone sharing statistics, especially on the larger end >>> of the spectrum -- with or without SolrCloud. >>> >>> Thanks, >>> >>> - Bram
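Jack's point about deleted and updated documents is easy to keep an eye on: Lucene's maxDoc includes deleted-but-not-yet-merged documents while numDocs counts only live ones, so the gap between the two is what quietly eats into the 2^31-1 ceiling. A minimal sketch of such a check, assuming a Solr 4.x core named "collection1" on localhost, the Luke request handler, and an expungeDeletes commit (standard parameters, but verify them against your version):

import requests

SOLR = "http://localhost:8983/solr/collection1"  # hypothetical core URL

def deleted_doc_pressure():
    # maxDoc = live docs plus deleted docs still held in segments; numDocs = live docs only.
    info = requests.get(SOLR + "/admin/luke",
                        params={"numTerms": 0, "wt": "json"}).json()["index"]
    headroom = (2**31 - 1) - info["maxDoc"]
    print("numDocs=%d maxDoc=%d headroom=%d" % (info["numDocs"], info["maxDoc"], headroom))
    return info["maxDoc"] - info["numDocs"]

def expunge_deletes():
    # Merge away deleted documents; expensive on a large index, so use sparingly.
    requests.get(SOLR + "/update", params={"commit": "true", "expungeDeletes": "true"})

if deleted_doc_pressure() > 0:
    expunge_deletes()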
RE: De Duplication using Solr
One possible "match" is using Python's FuzzyWuzzy https://github.com/seatgeek/fuzzywuzzy http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/ > Date: Sat, 3 Jan 2015 13:24:17 +0530 > Subject: De Duplication using Solr > From: shanuu@gmail.com > To: solr-user@lucene.apache.org > > I am trying to find out duplicate records based on distance and phonetic > algorithms. Can I utilize Solr for that? I have the following fields and > conditions to identify exact or possible duplicates. > > 1. Fields > prefix > suffix > firstname > lastname > email(primary_email1, email2, email3) > phone(primary_phone1, phone2, phone3) > 2. Conditions: > Two records are said to be exact duplicates if > > 1. IsExactMatchFunction(record1_prefix, record2_prefix) AND > IsExactMatchFunction(record1_suffix, record2_suffix) AND > IsExactMatchFunction(record1_firstname,record2_firstname) AND > IsExactMatchFunction(record1_lastname,record2_lastname) AND > IsExactMatchFunction(record1_primary_email,record2_primary_email) OR > IsExactMatchFunction(record1_primary_phone,record2_primary_phone) > Two records are said to be possible duplicates if > > 1. IsExactMatchFunction(record1_prefix, record2_prefix) OR > IsExactMatchFunction(record1_suffix, record2_suffix) OR > IsExactMatchFunction(record1_firstname,record2_firstname) AND > IsExactMatchFunction(record1_lastname,record2_lastname) AND > IsExactMatchFunction(record1_primary_email,record2_primary_email) OR > IsExactMatchFunction(record1_primary_phone,record2_primary_phone) > ELSE > 2. IsFuzzyMatchFunction(record1_firstname,record2_firstname) AND > IsExactMatchFunction(record1_lastname,record2_lastname) AND > IsExactMatchFunction(record1_primary_email,record2_primary_email) OR > IsExactMatchFunction(record1_primary_phone,record2_primary_phone) > ELSE > 3. IsFuzzyMatchFunction(record1_firstname,record2_firstname) AND > IsExactMatchFunction(record1_lastname,record2_lastname) AND > IsExactMatchFunction(record1_any_email,record2_any_email) OR > IsExactMatchFunction(record1_any_phone,record2_any_phone) > > IsFuzzyMatchFunction() will perform distance and phonetic algorithm > calculations and compare the result with a predefined threshold. > > For example: > > if the threshold defined for firstname is 85, the IsFuzzyMatchFunction() function > returns "true" if and only if one of the algorithms (distance or > phonetic) returns a similarity score >= 85. > > Can I use Solr to perform this job? Or can you suggest how I can > approach this problem? I have seen Duke (a de-duplication API) but I > cannot use Duke out of the box.
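For the IsFuzzyMatchFunction piece, here is a rough sketch of that threshold logic using FuzzyWuzzy for the edit-distance score and, as one arbitrary choice, the jellyfish library's Metaphone for the phonetic comparison. The 85 threshold and the field names come from the question above; everything else is illustrative.

from fuzzywuzzy import fuzz
import jellyfish

def is_exact_match(a, b):
    return (a or "").strip().lower() == (b or "").strip().lower()

def is_fuzzy_match(a, b, threshold=85):
    # "true" if and only if the distance OR the phonetic comparison clears the threshold.
    distance_score = fuzz.ratio((a or "").lower(), (b or "").lower())   # 0..100
    phonetic_score = 100 if jellyfish.metaphone(a or "") == jellyfish.metaphone(b or "") else 0
    return max(distance_score, phonetic_score) >= threshold

def classify(r1, r2):
    """Return 'exact', 'possible' or None for two record dicts."""
    contact_match = (is_exact_match(r1["primary_email"], r2["primary_email"])
                     or is_exact_match(r1["primary_phone"], r2["primary_phone"]))
    if (is_exact_match(r1["prefix"], r2["prefix"])
            and is_exact_match(r1["suffix"], r2["suffix"])
            and is_exact_match(r1["firstname"], r2["firstname"])
            and is_exact_match(r1["lastname"], r2["lastname"])
            and contact_match):
        return "exact"
    if (is_fuzzy_match(r1["firstname"], r2["firstname"])
            and is_exact_match(r1["lastname"], r2["lastname"])
            and contact_match):
        return "possible"
    return None

Whether this comparison runs inside Solr (as an update processor) or outside it against query results is a separate design choice; the replies below point at what Solr offers out of the box.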
RE: How large is your solr index?
Bill Bell [billnb...@gmail.com] wrote: [solr maxdoc limit of 2b] > For Solr 5 why don't we switch it to 64 bit ?? The biggest challenge for a switch is that Java's arrays can only hold 2b values. I support the idea of switching to much larger minimums throughout the code. But it is a larger fix than replacing int with long. - Toke Eskildsen
Re: De Duplication using Solr
First, see if you can get your requirements to align to the de-dupe feature that Solr already has: https://cwiki.apache.org/confluence/display/solr/De-Duplication -- Jack Krupansky On Sat, Jan 3, 2015 at 2:54 AM, Amit Jha wrote: > I am trying to find out duplicate records based on distance and phonetic > algorithms. Can I utilize Solr for that? [...]
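For reference, the linked feature works by computing a signature over chosen fields at index time (SignatureUpdateProcessorFactory) and optionally overwriting documents that share a signature. The gist of it, re-stated as a Python illustration rather than Solr's actual Java implementation, with arbitrary field and hash choices:

import hashlib

DEDUPE_FIELDS = ["firstname", "lastname", "primary_email"]  # hypothetical selection

def signature(doc):
    # Normalize and concatenate the configured fields, then hash them.
    joined = "|".join((doc.get(f) or "").strip().lower() for f in DEDUPE_FIELDS)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def add_signature(doc):
    # Writing the signature into the uniqueKey field means later copies of the
    # "same" record overwrite earlier ones (the effect of overwriteDupes=true).
    doc["id"] = signature(doc)
    return doc

Exact-hash signatures only cover the "exact duplicate" rules above; the fuzzy rules still need custom logic.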
Re: De Duplication using Solr
Thanks for the reply... I have already seen the wiki. My problem is more like record matching. On Sat, Jan 3, 2015 at 7:39 PM, Jack Krupansky wrote: > First, see if you can get your requirements to align to the de-dupe feature > that Solr already has: > https://cwiki.apache.org/confluence/display/solr/De-Duplication > > -- Jack Krupansky > > On Sat, Jan 3, 2015 at 2:54 AM, Amit Jha wrote: > > I am trying to find out duplicate records based on distance and phonetic > > algorithms. Can I utilize Solr for that? [...]
Re: How large is your solr index?
bq: For Solr 5 why don't we switch it to 64 bit ?? -1 on this for a couple of reasons: > it'd be pretty invasive, and 5.0 may be imminent. Far too big a change to > implement at the last second. > It's not clear that it's even useful. Once you get to that many documents, > performance usually suffers. Of course I wouldn't be doing the work so I really don't have much of a vote, but it's not clear to me at all that enough people would actually have a use-case for 2b+ docs in a single shard to make it worthwhile. At that scale GC potentially becomes really unpleasant, for instance. FWIW, Erick On Sat, Jan 3, 2015 at 2:45 AM, Toke Eskildsen wrote: > Bill Bell [billnb...@gmail.com] wrote: > > [solr maxdoc limit of 2b] > >> For Solr 5 why don't we switch it to 64 bit ?? > > The biggest challenge for a switch is that Java's arrays can only hold 2b > values. I support the idea of switching to much larger minimums throughout > the code. But it is a larger fix than replacing int with long. > > - Toke Eskildsen
Re: How large is your solr index?
On 1/3/2015 9:02 AM, Erick Erickson wrote: > bq: For Solr 5 why don't we switch it to 64 bit ?? > > -1 on this for a couple of reasons >> it'd be pretty invasive, and 5.0 may be imminent. Far too big a change to >> implement at the last second >> It's not clear that it's even useful. Once you get to that many documents, >> performance usually suffers > > Of course I wouldn't be doing the work so I really don't have much of > a vote, but it's not clear to me at > all that enough people would actually have a use-case for 2b+ docs in > a single shard to make it > worthwhile. At that scale GC potentially becomes really unpleasant for > instance I agree, 2 billion documents in a single index is MORE than enough. If you actually create an index that large, you're going to have performance problems, and most of those performance problems will likely be related to garbage collection. I can extrapolate one such problem from personal experience on a much smaller index. A filterCache entry for a 2 billion document index is 256MB in size. Assuming you're using the G1 collector, the maximum size for a G1 heap region is 32MB, which means that at that size, every single filter will result in an object that is allocated immediately from the old generation (it's called a humongous allocation). Allocating that much memory from the old generation will eventually (and frequently) result in a full garbage collection ... and you do not want your application to wait for a full garbage collection on the heap size that would be required for a 2 billion document index. It could easily exceed 30 or 60 seconds. When you consider the current limitations of G1GC, it would be advisable to keep each Solr index below 100 million documents. At 134,217,728 documents, each filter object will be too large (more than 16MB) to be considered a normal allocation on the max heap region size (32MB). Even with the older battle-tested CMS collector (assuming good tuning options), I think the huge object sizes (and the huge number of smaller objects) resulting from a 2 billion document index will have major garbage collection problems. Thanks, Shawn
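Shawn's numbers check out with simple arithmetic - a cached filter is essentially one bit per document in the index:

MAX_G1_REGION = 32 * 1024 * 1024        # largest G1 heap region (32MB)
HUMONGOUS = MAX_G1_REGION // 2          # allocations >= half a region are "humongous"

def filter_entry_bytes(num_docs):
    return num_docs // 8                # one bit per document

print(filter_entry_bytes(2**31) // 2**20)   # 256 (MB) per cached filter at 2 billion docs
print(HUMONGOUS * 8)                        # 134217728 docs -> 16MB filter entries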
RE: How large is your solr index?
Erick Erickson [erickerick...@gmail.com] wrote: > Of course I wouldn't be doing the work so I really don't have much of > a vote, but it's not clear to me at all that enough people would actually > have a use-case for 2b+ docs in a single shard to make it > worthwhile. At that scale GC potentially becomes really unpleasant for > instance Over the last few years we have seen a few use cases here on the mailing list. I would be very surprised if the number of such cases does not keep rising. Currently the work for a complete overhaul does not measure up to the rewards, but that is slowly changing. At the very least I find it prudent not to limit new Lucene/Solr interfaces to ints. As for GC: Right now a lot of structures are single-array oriented (for example using a long-array to represent bits in a bitset), which might not work well with current garbage collectors. A change to higher limits also means re-thinking such approaches: if the garbage collector likes objects below a certain size, then split the arrays into chunks of that size. Likewise, iterations over structures linear in size to the index could be threaded. These are issues even with the current 2b limitation. - Toke Eskildsen
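One way to picture the "split the arrays" suggestion - purely a toy sketch, not how Lucene's bitsets are actually implemented - is a bitset backed by many fixed-size chunks instead of one huge array, so no single allocation exceeds whatever size the collector handles gracefully:

CHUNK_BYTES = 8 * 1024 * 1024   # arbitrary, GC-friendly chunk size

class ChunkedBitSet:
    def __init__(self, num_bits):
        num_bytes = (num_bits + 7) // 8
        # Many small allocations instead of one array spanning the whole index.
        self.chunks = [bytearray(min(CHUNK_BYTES, num_bytes - i))
                       for i in range(0, num_bytes, CHUNK_BYTES)]

    def set(self, bit):
        byte_index, mask = bit // 8, 1 << (bit % 8)
        self.chunks[byte_index // CHUNK_BYTES][byte_index % CHUNK_BYTES] |= mask

    def get(self, bit):
        byte_index, mask = bit // 8, 1 << (bit % 8)
        return bool(self.chunks[byte_index // CHUNK_BYTES][byte_index % CHUNK_BYTES] & mask)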
Re: How large is your solr index?
I can't disagree. You bring up some of the points that make me _extremely_ reluctant to try to get this into 5.x though. 6.0 at the earliest, I should think. And who knows? Java may get a GC process that's geared to modern amounts of memory and gets past the current pain. Best, Erick On Sat, Jan 3, 2015 at 1:00 PM, Toke Eskildsen wrote: > Over the last few years we have seen a few use cases here on the mailing list. > I would be very surprised if the number of such cases does not keep rising. > [...]
Re: How large is your solr index?
Back in June on a similar thread I asked "Anybody care to forecast when hardware will catch up with Solr and we can routinely look forward to newbies complaining that they indexed "some" data and after only 10 minutes they hit this weird 2G document count limit?" Still not there. So the race is on between when Lucene will relax the 2G limit and when hardware gets fast enough that 2G documents can be indexed within a small number of hours. -- Jack Krupansky On Sat, Jan 3, 2015 at 4:00 PM, Toke Eskildsen wrote: > Over the last few years we have seen a few use cases here on the mailing list. > I would be very surprised if the number of such cases does not keep rising. > [...]
RE: How large is your solr index?
Erick Erickson [erickerick...@gmail.com] wrote: > I can't disagree. You bring up some of the points that make me _extremely_ > reluctant to try to get this in to 5.x though. 6.0 at the earliest I should > think. Ignoring the magic 2b number for a moment, I think the overall question is whether or not single shards should perform well in the hundreds of millions of documents range. The alternative is more shards, but it is quite an explicit process to handle shard-juggling. From an end-user perspective, the underlying technology matters little: Whatever the choice, it should be possible to install "something" on a machine and expect it to scale within the hardware limitations without much ado. - Toke Eskildsen
Re: How large is your solr index?
That's a laudable goal - to support low-latency queries - including faceting - for "hundreds of millions" of documents, using Solr "out of the box" on a random, commodity box selected by IT and just adding a dozen or two fields to the default schema that are both indexed and stored, without any "expert" tuning, by an "average" developer. The reality doesn't seem to be there today. 50 to 100 million documents, yes, but beyond that takes some kind of "heroic" effort, whether a much beefier box, very careful and limited data modeling or limiting of query capabilities or tolerance of higher latency, expert tuning, etc. The proof is always in the pudding - pick a box, install Solr, setup the schema, load 20 or 50 or 100 or 250 or 350 million documents, try some queries with the features you need, and you get what you get. But I agree that it would be highly desirable to push that 100 million number up to 350 million or even 500 million ASAP since the pain of unnecessarily sharding is unnecessarily excessive. I wonder what changes will have to occur in Lucene, or... what evolution in commodity hardware will be necessary to get there. -- Jack Krupansky On Sat, Jan 3, 2015 at 6:11 PM, Toke Eskildsen wrote: > Erick Erickson [erickerick...@gmail.com] wrote: > > I can't disagree. You bring up some of the points that make me > _extremely_ > > reluctant to try to get this in to 5.x though. 6.0 at the earliest I > should > > think. > > Ignoring the magic 2b number for a moment, I think the overall question is > whether or not single shards should perform well in the hundreds of > millions of documents range. The alternative is more shards, but it is > quite an explicit process to handle shard-juggling. From an end-user > perspective, the underlying technology matters little: Whatever the choice, > it should be possible to install "something" on a machine and expect it to > scale within the hardware limitations without much ado. > > - Toke Eskildsen >
Re: solr export get wrong results
Thanks a lot for your help, Joel. Just wondering, why does "export" have such limitations? It uses the same query handler as "select", doesn't it? 2014-12-31 10:28 GMT+08:00 Joel Bernstein : > For the initial release only JSON output format is supported with the > /export feature. Also there is no built-in distributed support yet. Both of > these features are likely to follow in future releases. > > For the initial release you'll need a client that can handle the JSON > format and distributed logic. The Heliosearch project includes a client > called CloudSolrStream that you can use for this purpose. Here are two > links to get started with CloudSolrStream: > > https://github.com/Heliosearch/heliosearch/blob/helio_4_10/solr/solrj/src/java/org/apache/solr/client/solrj/streaming/CloudSolrStream.java > http://heliosearch.org/streaming-aggregation-for-solrcloud/ > > Joel Bernstein > Search Engineer at Heliosearch > > On Mon, Dec 29, 2014 at 2:20 AM, Sandy Ding wrote: > > Hi, Joel > > > > Thanks for your reply. > > It seems that the weird export results are because I removed the "xsort" > > invariant of the export request handler in the default solrconfig.xml to > > get csv-format output. > > I don't quite understand the meaning of "xsort", but I removed it because I > > always get a JSON response (as you said) with the xsort invariant. > > Is there a way to get csv output using export? > > And also, can I get full results from all shards? (I tried to set > > "distrib=true" but get "SyntaxError:xport RankQuery is required for xsort: > > rq={!xport}", and I do have rq={!xport} in the export invariants) > > > > 2014-12-27 3:21 GMT+08:00 Joel Bernstein : > > > Hi Sandy, > > > > > > I pulled Solr 4.10.3 to see if I could recreate the issue you are seeing > > > with export and I wasn't able to recreate the bug you are seeing. For > > > example the following query: > > > > > > http://localhost:8983/solr/collection1/export?q=join_i:[50 TO 500010]&wt=json&indent=true&sort=join_i+asc&fl=join_i,ShopId_i > > > > > > Brings back the following result: > > > > > > {"responseHeader": {"status": 0}, "response":{"numFound":11, "docs":[{"join_i":50,"ShopId_i":578917},{"join_i":51,"ShopId_i":294217},{"join_i":52,"ShopId_i":199805},{"join_i":53,"ShopId_i":633461},{"join_i":54,"ShopId_i":472995},{"join_i":55,"ShopId_i":672122},{"join_i":56,"ShopId_i":394637},{"join_i":57,"ShopId_i":446443},{"join_i":58,"ShopId_i":697329},{"join_i":59,"ShopId_i":166988},{"join_i":500010,"ShopId_i":191261}]}} > > > > > > Notice the join_i values are all within the correct range. > > > > > > If you can post the export handler configuration we should be able to > > > see the issue. > > > > > > Joel Bernstein > > > Search Engineer at Heliosearch > > > > > > On Fri, Dec 26, 2014 at 1:50 PM, Joel Bernstein wrote: > > > > Hi Sandy, > > > > > > > > The export handler should only return documents in JSON format. The > > > > results in your second example are in XML format so something looks > > > > to be wrong in the configuration. Can you post what your solrconfig > > > > looks like?
> > > > > > > > Joel > > > > > > > > Joel Bernstein > > > > Search Engineer at Heliosearch > > > > > > > > On Fri, Dec 26, 2014 at 12:43 PM, Erick Erickson < > > > erickerick...@gmail.com> > > > > wrote: > > > > > > > >> I think you missed a very important part of Jack's reply: > > > >> > > > >> bq: I notice that you don't have distrib=false on your select, which > > > >> would make your select be from all nodes, while export would only be > > > >> docs from the specific node you sent the request to. > > > >> > > > >> And from the Reference Guide on export > > > >> > > > >> bq: The initial release treats all queries as non-distributed > > > >> requests. So the client is responsible for making the calls to each > > > >> Solr instance and merging the results. > > > >> > > > >> So the export statement you're sending is _only_ exporting the > results > > > >> from the shard on 8983 and completely ignoring the other (6?) > shards, > > > >> whereas the query you're sending is getting the results from all the > > > >> shards. > > > >> > > > >> As Jack said, add &distrib=false to the query, send it to the same > > > >> shard you send the export command to and the results should match. > > > >> > > > >> Also, be sure your configuration for the /select handler doesn't > have > > > >> any additional default parameters that might alter the results, but > I > > > >> doubt that's really a problem here. > > > >> > > > >> Best, > > > >> Erick > > > >> > > > >> On Fri, Dec 26, 2014 at 7:02 AM, Ahmet Arslan > > > > > > > > >> wrote: > > > >> > Hi, > > > >> > > > > >> > Do you have any custom solr components deployed? May be custom > > > response > > > >> writer? > > > >> > > > > >> > Ahmet > > > >> > > > > >> > > > > >> > > > > >>
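To close the loop on the original question: with the stock /export configuration restored (wt=json and rq={!xport} kept as invariants), a single-shard export can be consumed by any JSON-capable client, and the per-shard results merged by hand until distributed support lands (or via CloudSolrStream from SolrJ). A rough sketch, assuming the collection1 core and fields from Joel's example and the Python requests library; for very large result sets a streaming JSON parser would be a better fit than resp.json():

import requests

def export_docs(core_url, query, sort, fields):
    # /export is per-shard in this release: call each core directly and merge yourself.
    params = {"q": query, "sort": sort, "fl": ",".join(fields)}
    resp = requests.get(core_url + "/export", params=params, stream=True)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

docs = export_docs("http://localhost:8983/solr/collection1",
                   "join_i:[50 TO 500010]", "join_i asc", ["join_i", "ShopId_i"])
print(len(docs), "documents exported")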