You could also try the terms component, which provides a very efficient facet-like feature: counting the terms. You can set a minimum term frequency of 2, so only the dupes come back:

curl "http://localhost:8983/solr/terms?terms.fl=id&terms.mincount=2"
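
If you want to script that check, here is a minimal Python sketch of the same request. It rests on a few assumptions: that a /terms request handler is configured in solrconfig.xml, that the JSON response uses the default flat layout (term, count, term, count, ...), and that terms.limit=-1 is added because the terms component otherwise returns only a handful of terms by default. The function names are made up for illustration. Note also that the counts come from the term dictionary, so they may include deleted documents that have not been merged away yet.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_terms_url(base_url, field="id", mincount=2, limit=-1):
    """Build a terms-component URL that returns only terms seen at least `mincount` times."""
    params = {
        "terms.fl": field,           # field whose terms are counted
        "terms.mincount": mincount,  # 2 means "appears in at least two documents"
        "terms.limit": limit,        # assumption: -1 lifts the default cap on returned terms
        "wt": "json",
    }
    return base_url.rstrip("/") + "/terms?" + urlencode(params)


def parse_duplicates(response, field="id"):
    """Turn the terms section of a decoded JSON response into {term: count}."""
    terms = response.get("terms", {}).get(field, [])
    if isinstance(terms, dict):
        # Some json.nl settings return a map instead of a flat list.
        return dict(terms)
    # Default flat layout: [term1, count1, term2, count2, ...]
    return dict(zip(terms[0::2], terms[1::2]))


def find_duplicate_ids(base_url, field="id"):
    """Query a live Solr instance and return the IDs that occur more than once."""
    with urlopen(build_terms_url(base_url, field)) as resp:
        return parse_duplicates(json.load(resp), field)
```

For example, find_duplicate_ids("http://localhost:8983/solr") would issue the request above and return a dict mapping each duplicated ID to its count. With an index of this size, the list of dupes itself could be large, so you may prefer a positive terms.limit and paging.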

-- Jack Krupansky

-----Original Message----- From: Jack Krupansky
Sent: Tuesday, July 30, 2013 4:14 PM
To: solr-user@lucene.apache.org
Subject: Re: How might one search for dupe IDs other than faceting on the ID field?

The Solr SignatureUpdateProcessorFactory is designed to facilitate dedupe...
any particular reason you did not use it?

See:
http://wiki.apache.org/solr/Deduplication

and

https://cwiki.apache.org/confluence/display/solr/De-Duplication

And I give a bunch of examples in my book.

-- Jack Krupansky

-----Original Message----- From: Dotan Cohen
Sent: Tuesday, July 30, 2013 2:16 PM
To: solr-user@lucene.apache.org
Subject: How might one search for dupe IDs other than faceting on the ID field?

To search for duplicate IDs, I am running the following query:
select?q=*:*&facet=true&facet.field=id&rows=0

However, since upgrading from Solr 4.1 to Solr 4.3 I am receiving
OutOfMemoryError errors instead of the desired facet:

<response><lst name="error"><str
name="msg">java.lang.OutOfMemoryError: Java heap space</str><str
name="trace">java.lang.RuntimeException: java.lang.OutOfMemoryError:
Java heap space
   at
org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
   at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
   at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
   at ...

Might there be a less resource-intensive way to get this information?
This is Solr 4.3 running on Ubuntu Server 12.04 in Jetty. The index
has over 100,000,000 small records, for a total of about 95 GiB of
disk space, with Solr running on its own disk. Actually, the 'disk'
is an Amazon Web Service EBS volume.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com
