Re: Near Duplicate Documents
The duplication detection mechanism in Nutch is quite primitive. I think it uses an MD5 signature generated from the content of a field. The generation algorithm is described here: http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html. The problem with this approach is that an MD5 hash is very sensitive: a one-letter difference will generate a completely different hash. You probably have to roll your own near-duplicate detection algorithm. My advice is to have a look at the existing literature on near-duplicate detection techniques and then implement one of them. I know Google has some papers that describe a technique called MinHash. I read the paper and found it very interesting. I'm not sure if you can implement the algorithm, because they have patented it. That said, there is plenty of literature on near-dup detection, so you should be able to get one for free!

On Nov 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote: > Otis, > > Thanks for your response. > > I just gave a quick look to the Nutch Forum and find that there is an > implementation to obtain de-duplicate documents/pages but none for Near > Duplicates documents. Can you guide me a little further as to where exactly > under Nutch I should be concentrating, regarding near duplicate documents? > > Regards, > Rishabh > > On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]> > wrote: > > > > To whomever started this thread: look at Nutch. I believe something > > related to this already exists in Nutch for near-duplicate detection. > > > > Otis > > -- > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > - Original Message > > From: Mike Klaas <[EMAIL PROTECTED]> > > To: solr-user@lucene.apache.org > > Sent: Sunday, November 18, 2007 11:08:38 PM > > Subject: Re: Near Duplicate Documents > > > > On 18-Nov-07, at 8:17 AM, Eswar K wrote: > > > > > Is there any idea implementing that feature in the up coming > > releases? > > > > Not currently. Feel free to contribute something if you find a good > > solution . > > > > -Mike > > > > > > > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote: > > > > > >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote: > > >>> We have a scenario, where we want to find out documents which are > > >> similar in > > >>> content. To elaborate a little more on what we mean here, lets > > >>> take an > > >>> example. > > >>> > > >>> The example of this email chain in which we are interacting on, > > >>> can be > > >> best > > >>> used for illustrating the concept of near dupes (We are not getting > > >> confused > > >>> with threads, they are two different things.). Each email in this > > >>> thread > > >> is > > >>> treated as a document by the system. A reply to the original mail > > >>> also > > >>> includes the original mail in which case it becomes a near > > >>> duplicate of > > >> the > > >>> orginal mail (depending on the percentage of similarity). > > >>> Similarly it > > >> goes > > >>> on. The near dupes need not be limited to emails. > > >> > > >> I think this is what's known as "shingling." See > > >> http://en.wikipedia.org/wiki/W-shingling > > >> Lucene (and therefore Solr) does not implement shingling. The > > >> "MoreLikeThis" query might be close enough, however. > > >> > > >> -Stuart > > >> > > > > > > > > > > > -- Regards, Cuong Hoang
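For reference, here is a minimal sketch of the MinHash idea in Java (illustrative only; this class is hypothetical and not part of Solr or Nutch): each document's token set is reduced to a small signature, and the fraction of agreeing signature positions estimates the Jaccard similarity of the two token sets.

import java.util.Random;
import java.util.Set;

// Hypothetical MinHash sketch: the fraction of positions where two
// signatures agree approximates the Jaccard similarity of the token sets.
public class MinHash {
    private final int[] seeds;

    public MinHash(int numHashes, long randomSeed) {
        Random rnd = new Random(randomSeed);
        seeds = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            seeds[i] = rnd.nextInt();
        }
    }

    // One signature entry per hash function: the minimum hash over all tokens.
    public int[] signature(Set<String> tokens) {
        int[] sig = new int[seeds.length];
        for (int i = 0; i < seeds.length; i++) {
            int min = Integer.MAX_VALUE;
            for (String token : tokens) {
                int h = token.hashCode() ^ seeds[i];
                h ^= (h >>> 16); // cheap extra mixing so the seeds decorrelate
                if (h < min) min = h;
            }
            sig[i] = min;
        }
        return sig;
    }

    // Estimated Jaccard similarity: fraction of positions where the minima agree.
    public static double similarity(int[] a, int[] b) {
        int matches = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) matches++;
        }
        return (double) matches / a.length;
    }
}

Two documents whose estimated similarity comes back above a chosen threshold (say 0.9) would then be treated as near-duplicates.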
Re: Help with Debian solr/jetty install?
Make sure you have the JDK installed, not just the JRE. Also try setting the JAVA_HOME directory:

apt-get install sun-java5-jdk

On Nov 21, 2007 5:50 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > Phillip, > > I won't go into details, but I'll point out that the Java compiler is called > javac and if memory serves me well, it is defined in one of Jetty's XML > config files in its etc/ dir. The java compiler is used to compile JSPs that > Solr uses for the admin UI. So, make sure you have javac and make sure Jetty > can find it. > > Otis > > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > - Original Message > From: Phillip Farber <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Tuesday, November 20, 2007 5:55:27 PM > Subject: Help with Debian solr/jetty install? > > > Hi, > > I've successfully run as far as the example admin page on Debian linux > 2.6. > > So I installed the solr-jetty packaged for Debian testing which gives > me > Jetty 5.1.14-1 and Solr 1.2.0+ds1-1. Jetty starts fine and so does the > > Solr home page at http://localhost:8280/solr > > But I get an error when I try to run http://localhost:8280/solr/admin > > HTTP ERROR: 500 > No Java compiler available > > I have sun-java6-jre and sun-java6-jdk packages installed. I'm new to > servlet containers and java webapps. What should I be looking for to > fix this or what information could I provide the list to get me moving > forward from here? > > I've included the trace from the Jetty log, and the java properties > dump > from the example below. > > Thanks, > Phil > > --- > > Java properties (from the example): > -- > > sun.boot.library.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386 > java.vm.version = 1.6.0-b105 > java.vm.name = Java HotSpot(TM) Client VM > user.dir = /tmp/apache-solr-1.2.0/example > java.runtime.version = 1.6.0-b105 > os.arch = i386 > java.io.tmpdir = /tmp > > java.library.path = > /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386/client:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib > java.class.version = 50.0 > jetty.home = /tmp/apache-solr-1.2.0/example > sun.management.compiler = HotSpot Client Compiler > os.version = 2.6.22-2-686 > java.class.path = > /tmp/apache-solr-1.2.0/example:/tmp/apache-solr-1.2.0/example/lib/jetty-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jetty-util-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/servlet-api-2.5-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/ant-1.6.5.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/core-3.1.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-2.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-api-2.1.jar:/usr/share/ant/lib/ant.jar > java.home = /usr/lib/jvm/java-6-sun-1.6.0.00/jre > java.version = 1.6.0 > java.ext.dirs = > /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/ext:/usr/java/packages/lib/ext > sun.boot.class.path = > /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/resources.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/rt.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jsse.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jce.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/charsets.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/classes > > > > > Jetty log (from the error under Debian Solr/Jetty): > > > org.apache.jasper.JasperException: No Java compiler available > at > org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:460) > at > 
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:367) > at > org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329) > at > org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) > at > org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428) > at > org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473) > at > org.mortbay.jetty.servlet.Dispatcher.dispatch(Dispatcher.java:286) > at > org.mortbay.jetty.servlet.Dispatcher.forward(Dispatcher.java:171) > at org.mortbay.jetty.servlet.Default.handleGet(Default.java:302) > at org.mortbay.jetty.servlet.Default.service(Default.java:223) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) > at > org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428) > at > org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:830) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:185) > at > org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:821) > at > org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.j
Re: Near Duplicate Documents
Thanks for the info Cuong! Regards, Rishabh On Nov 21, 2007 1:59 PM, climbingrose <[EMAIL PROTECTED]> wrote: > The duplication detection mechanism in Nutch is quite primitive. I > think it uses a MD5 signature generated from the content of a field. > The generation algorithm is described here: > > http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html > . > > The problem with this approach is MD5 hash is very sensitive: one > letter difference will generate completely different hash. You > probably have to roll your own near duplication detection algorithm. > My advice is have a look at existing literature on near duplication > detection techniques and then implement one of them. I know Google has > some papers that describe a technique called minhash. I read the paper > and found it's very interesting. I'm not sure if you can implement the > algorithm because they have patented it. That said, there are plenty > literature on near dup detection so you should be able to get one for > free! > > On Nov 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote: > > Otis, > > > > Thanks for your response. > > > > I just gave a quick look to the Nutch Forum and find that there is an > > implementation to obtain de-duplicate documents/pages but none for Near > > Duplicates documents. Can you guide me a little further as to where > exactly > > under Nutch I should be concentrating, regarding near duplicate > documents? > > > > Regards, > > Rishabh > > > > On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]> > > wrote: > > > > > > > To whomever started this thread: look at Nutch. I believe something > > > related to this already exists in Nutch for near-duplicate detection. > > > > > > Otis > > > -- > > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > > > - Original Message > > > From: Mike Klaas <[EMAIL PROTECTED]> > > > To: solr-user@lucene.apache.org > > > Sent: Sunday, November 18, 2007 11:08:38 PM > > > Subject: Re: Near Duplicate Documents > > > > > > On 18-Nov-07, at 8:17 AM, Eswar K wrote: > > > > > > > Is there any idea implementing that feature in the up coming > > > releases? > > > > > > Not currently. Feel free to contribute something if you find a good > > > solution . > > > > > > -Mike > > > > > > > > > > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> > wrote: > > > > > > > >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote: > > > >>> We have a scenario, where we want to find out documents which are > > > >> similar in > > > >>> content. To elaborate a little more on what we mean here, lets > > > >>> take an > > > >>> example. > > > >>> > > > >>> The example of this email chain in which we are interacting on, > > > >>> can be > > > >> best > > > >>> used for illustrating the concept of near dupes (We are not > getting > > > >> confused > > > >>> with threads, they are two different things.). Each email in this > > > >>> thread > > > >> is > > > >>> treated as a document by the system. A reply to the original mail > > > >>> also > > > >>> includes the original mail in which case it becomes a near > > > >>> duplicate of > > > >> the > > > >>> orginal mail (depending on the percentage of similarity). > > > >>> Similarly it > > > >> goes > > > >>> on. The near dupes need not be limited to emails. > > > >> > > > >> I think this is what's known as "shingling." See > > > >> http://en.wikipedia.org/wiki/W-shingling > > > >> Lucene (and therefore Solr) does not implement shingling. 
The > > > >> "MoreLikeThis" query might be close enough, however. > > > >> > > > >> -Stuart > > > >> > > > > > > > > > > > > > > > > > > > > > -- > Regards, > > Cuong Hoang >
Re: Help with Debian solr/jetty install?
On Tue, 2007-11-20 at 22:50 -0800, Otis Gospodnetic wrote: > Phillip, > > I won't go into details, but I'll point out that the Java compiler is called > javac and if memory serves me well, it is defined in one of Jetty's XML > config files in its etc/ dir. The java compiler is used to compile JSPs that > Solr uses for the admin UI. So, make sure you have javac and make sure Jetty > can find it.

e.g.

cd ~
vim .bashrc
...
export JAVA_HOME=/home/thorsten/opt/java
export PATH=$JAVA_HOME/bin:$PATH

The important thing is that $JAVA_HOME points to the JDK and that it comes first in your PATH!

salu2

> Otis > > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > - Original Message > From: Phillip Farber <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Tuesday, November 20, 2007 5:55:27 PM > Subject: Help with Debian solr/jetty install? > > > Hi, > > I've successfully run as far as the example admin page on Debian linux > 2.6. > > So I installed the solr-jetty packaged for Debian testing which gives > me > Jetty 5.1.14-1 and Solr 1.2.0+ds1-1. Jetty starts fine and so does the > > Solr home page at http://localhost:8280/solr > > But I get an error when I try to run http://localhost:8280/solr/admin > > HTTP ERROR: 500 > No Java compiler available > > I have sun-java6-jre and sun-java6-jdk packages installed. I'm new to > servlet containers and java webapps. What should I be looking for to > fix this or what information could I provide the list to get me moving > forward from here? > > I've included the trace from the Jetty log, and the java properties > dump > from the example below. > > Thanks, > Phil > > --- > > Java properties (from the example): > -- > > sun.boot.library.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386 > java.vm.version = 1.6.0-b105 > java.vm.name = Java HotSpot(TM) Client VM > user.dir = /tmp/apache-solr-1.2.0/example > java.runtime.version = 1.6.0-b105 > os.arch = i386 > java.io.tmpdir = /tmp > > java.library.path = > /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386/client:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib > java.class.version = 50.0 > jetty.home = /tmp/apache-solr-1.2.0/example > sun.management.compiler = HotSpot Client Compiler > os.version = 2.6.22-2-686 > java.class.path = > /tmp/apache-solr-1.2.0/example:/tmp/apache-solr-1.2.0/example/lib/jetty-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jetty-util-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/servlet-api-2.5-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/ant-1.6.5.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/core-3.1.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-2.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-api-2.1.jar:/usr/share/ant/lib/ant.jar > java.home = /usr/lib/jvm/java-6-sun-1.6.0.00/jre > java.version = 1.6.0 > java.ext.dirs = > /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/ext:/usr/java/packages/lib/ext > sun.boot.class.path = > /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/resources.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/rt.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jsse.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jce.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/charsets.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/classes > > > > > Jetty log (from the error under Debian Solr/Jetty): > > > org.apache.jasper.JasperException: No Java compiler available > at > 
org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:460) > at > org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:367) > at > org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329) > at > org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) > at > org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428) > at > org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473) > at > org.mortbay.jetty.servlet.Dispatcher.dispatch(Dispatcher.java:286) > at > org.mortbay.jetty.servlet.Dispatcher.forward(Dispatcher.java:171) > at org.mortbay.jetty.servlet.Default.handleGet(Default.java:302) > at org.mortbay.jetty.servlet.Default.service(Default.java:223) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) > at > org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428) > at > org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:830) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:185) > at > org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplication
Re: Any tips for indexing large amounts of data?
Hi Otis, Thanks for the reply. I am using a pretty "vanilla approach" right now and it's taking about 30 hours to build an index of about 5.5 GB. Can you tell me about some of the changes you made to optimize the indexing process? Thanks Brendan

On Nov 21, 2007, at 2:27 AM, Otis Gospodnetic wrote: Just tried a search for "web" on this index - 1.1 seconds. This matches about 1MM of about 20MM docs. Redo the search, and it's 1 ms (cached). This is without any load nor serious benchmarking, clearly. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Eswar K <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, November 21, 2007 2:11:07 AM Subject: Re: Any tips for indexing large amounts of data? Hi otis, I understand that is slightly off track question, but I am just curious to know the performance of Search on a 20 GB index file. What has been your observation? Regards, Eswar On Nov 21, 2007 12:33 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: Mike is right about the occasional slow-down, which appears as a pause and is due to large Lucene index segment merging. This should go away with newer versions of Lucene where this is happening in the background. That said, we just indexed about 20MM documents on a single 8-core machine with 8 GB of RAM, resulting in nearly 20 GB index. The whole process took a little less than 10 hours - that's over 550 docs/second. The vanilla approach before some of our changes apparently required several days to index the same amount of data. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Mike Klaas <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Monday, November 19, 2007 5:50:19 PM Subject: Re: Any tips for indexing large amounts of data? There should be some slowdown in larger indices as occasionally large segment merge operations must occur. However, this shouldn't really affect overall speed too much. You haven't really given us enough data to tell you anything useful. I would recommend trying to do the indexing via a webapp to eliminate all your code as a possible factor. Then, look for signs to what is happening when indexing slows. For instance, is Solr high in cpu, is the computer thrashing, etc? -Mike On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote: Hi, Thanks for answering this question a while back. I have made some of the suggestions you mentioned. ie not committing until I've finished indexing. What I am seeing though, is as the index get larger (around 1Gb), indexing is taking a lot longer. In fact it slows down to a crawl. Have you got any pointers as to what I might be doing wrong? Also, I was looking at using MultiCore solr. Could this help in some way? Thank you Brendan On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote: : I would think you would see better performance by allowing auto commit : to handle the commit size instead of reopening the connection all the : time. if your goal is "fast" indexing, don't use autoCommit at all ... just index everything, and don't commit until you are completely done. autoCommitting will slow your indexing down (the benefit being that more results will be visible to searchers as you proceed) -Hoss
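To make the "commit once at the end" advice from this thread concrete, here is a hedged sketch of a bulk-indexing client posting update XML to a Solr 1.2 instance. The URL, field names, and document contents are assumptions for illustration, not taken from any poster's setup:

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URL;

// Bulk indexing sketch: many <add> posts, exactly one <commit/> at the
// very end, as recommended in this thread. No autoCommit involved.
public class BulkIndexer {
    private static final String UPDATE_URL = "http://localhost:8983/solr/update";

    static void post(String xml) throws Exception {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(UPDATE_URL).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        Writer w = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
        w.write(xml);
        w.close();
        if (conn.getResponseCode() != 200) {
            throw new RuntimeException("update failed: " + conn.getResponseCode());
        }
        conn.getInputStream().close();
    }

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 100000; i++) {
            // In practice you would batch several <doc> elements per <add>
            // to cut down on HTTP round trips.
            post("<add><doc><field name=\"id\">" + i + "</field>"
                + "<field name=\"text\">document body " + i + "</field>"
                + "</doc></add>");
        }
        post("<commit/>"); // commit once, when everything is in
    }
}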
Re: Solr cluster topology.
Thanks a lot for your responses! They were all very helpful! On Nov 20, 2007, at 5:52 PM, Norberto Meijome wrote: On Tue, 20 Nov 2007 16:26:27 -0600 Alexander Wallace <[EMAIL PROTECTED]> wrote: Interesting, this ALL MASTERS mode... I guess you don't do any replication then... correct In the single master, several slaves mode, I'm assuming the client still writes to one and reads from the others... right? Correct again. There is also another approach, which I think in Solr is called FederatedSearch, where a front end queries a number of index servers (each with overlapping or non-overlapping data sets) and puts together one result stream for the answer. There was some discussion on the list; http://www.mail-archive.com/solr- [EMAIL PROTECTED]/msg06081.html is the earliest link in the archive I can find. B _ {Beto|Norberto|Numard} Meijome "People demand freedom of speech to make up for the freedom of thought which they avoid." Soren Aabye Kierkegaard I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: Near Duplicate Documents
The duplication detection mechanism in Nutch is quite primitive. I think it uses a MD5 signature generated from the content of a field. The generation algorithm is described here: http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html. The problem with this approach is MD5 hash is very sensitive: one letter difference will generate completely different hash. I'm confused by your answer, assuming it's based on the page referenced by the URL you provided. The approach by TextProfileSignature would only generate a different MD5 hash with a single letter change if that change resulted in a change in the quantized frequency for that word. And if it's an uncommon word, then it wouldn't even show up in the signature. -- Ken You probably have to roll your own near duplication detection algorithm. My advice is have a look at existing literature on near duplication detection techniques and then implement one of them. I know Google has some papers that describe a technique called minhash. I read the paper and found it's very interesting. I'm not sure if you can implement the algorithm because they have patented it. That said, there are plenty literature on near dup detection so you should be able to get one for free! On Nov 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote: Otis, Thanks for your response. > I just gave a quick look to the Nutch Forum and find that there is an implementation to obtain de-duplicate documents/pages but none for Near Duplicates documents. Can you guide me a little further as to where exactly > under Nutch I should be concentrating, regarding near duplicate documents? > > Regards, Rishabh On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > To whomever started this thread: look at Nutch. I believe something > related to this already exists in Nutch for near-duplicate detection. > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > - Original Message > From: Mike Klaas <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Sunday, November 18, 2007 11:08:38 PM > Subject: Re: Near Duplicate Documents > > On 18-Nov-07, at 8:17 AM, Eswar K wrote: > > > Is there any idea implementing that feature in the up coming > releases? > > > Not currently. Feel free to contribute something if you find a good > solution . > > > -Mike > > > > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote: > > > >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote: > >>> We have a scenario, where we want to find out documents which are > >> similar in > >>> content. To elaborate a little more on what we mean here, lets > >>> take an > >>> example. > >>> > >>> The example of this email chain in which we are interacting on, > >>> can be > >> best > >>> used for illustrating the concept of near dupes (We are not getting > >> confused > >>> with threads, they are two different things.). Each email in this > >>> thread > >> is > >>> treated as a document by the system. A reply to the original mail > >>> also > >>> includes the original mail in which case it becomes a near > >>> duplicate of > >> the > >>> orginal mail (depending on the percentage of similarity). > >>> Similarly it > >> goes > >>> on. The near dupes need not be limited to emails. > >> > >> I think this is what's known as "shingling." See > >> http://en.wikipedia.org/wiki/W-shingling > >> Lucene (and therefore Solr) does not implement shingling. The > >> "MoreLikeThis" query might be close enough, however. 
> >> > > >> -Stuart -- Ken Krugler Krugle, Inc. +1 530-210-6378 "If you can't find it, you can't fix it"
Document update based on ID
Hello... I have a document indexed with Solr. Originally it had only a few fields. I want to add some more fields to the index later, based on ID, but I don't want to submit the original fields again. I use Solr 1.2, but I think there is no such functionality yet. However, I saw a feature here: https://issues.apache.org/jira/browse/SOLR-139 and it looks like what I need. Is it implemented already? How can I get the code? Would you suggest using it in production? How does it work? Thank you Gene
Re: Document update based on ID
Evgeniy Strokin wrote: Hello... I have a document indexed with Solr. Originally it had only a few fields. I want to add some more fields to the index later, based on ID, but I don't want to submit the original fields again. I use Solr 1.2, but I think there is no such functionality yet. However, I saw a feature here: https://issues.apache.org/jira/browse/SOLR-139 and it looks like what I need. Is it implemented already? How can I get the code? Would you suggest using it in production? How does it work? Yes, SOLR-139 will eventually do what you need. The most recent patch should not be *too* hard to get running (it may not apply cleanly, though). The patch as-is needs to be reworked before it will go into trunk. I hope this will happen in the next month or so. As for production? It depends ;) The API will most likely change, so if you base your code on the current patch, it will need to change when things are finalized. As for stability, it has worked well for me (and I think for Erik). ryan
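Until SOLR-139 lands, the usual workaround is a client-side read-modify-write: fetch the stored fields, overlay the new ones, and re-add the whole document (re-posting a doc with the same uniqueKey replaces the old one). A rough sketch; the two helpers are placeholders for the actual HTTP and XML plumbing, and this only works if every field you care about is stored:

import java.util.LinkedHashMap;
import java.util.Map;

// "Update" in Solr 1.2 means re-adding the complete document.
public class ReAddUpdate {
    public static void updateDocument(String id, Map<String, String> newFields)
            throws Exception {
        Map<String, String> doc = fetchStoredFields(id); // original fields
        doc.putAll(newFields);                           // overlay new values
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : doc.entrySet()) {
            // Real code must XML-escape the field values here.
            xml.append("<field name=\"").append(e.getKey()).append("\">")
               .append(e.getValue()).append("</field>");
        }
        xml.append("</doc></add>");
        postToUpdateHandler(xml.toString()); // then <commit/> when done
    }

    // Placeholder: query /select?q=id:<id> and parse out the stored fields.
    static Map<String, String> fetchStoredFields(String id) throws Exception {
        return new LinkedHashMap<String, String>();
    }

    // Placeholder: POST the XML to /update.
    static void postToUpdateHandler(String xml) throws Exception {
    }
}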
Memory use with sorting problem
Hi all, I've been struggling with this problem for over a month now, and although memory issues have been discussed often, I don't seem to be able to find a fitting solution. The index is merely 1.5 GB in size, but memory use quickly fills the 1 GB heap maximum on a 2 GB machine. This then works fine until auto-warming starts. Switching the latter off altogether is unattractive, as it leads to response times of up to 30 s. When auto-warming starts, I get this error:

> SEVERE: Error during auto-warming of key:org.apache.solr.search.QueryResultKey @e0b93139:java.lang.OutOfMemoryError: Java heap space

Now when I reduce the size of the caches (to a fraction of the default settings) and the number of warming searchers (to 2), memory use is not reduced and the problem stays. Only deactivating auto-warming helps. When I set the heap size limit higher (and go into swap space), all the extra memory seems to be used up right away, independently of auto-warming. This all seems to be closely connected to sorting by a numerical field, as switching that off makes memory use a lot more friendly. Is it normal to need that much memory for such a small index? I suspect the problem is in Lucene; would it be better to post on their list? Does anyone know a better way of getting the sorting done? Thanks in advance for your help, Chris

This is the field setup in schema.xml:

And this is a sample query: select/?q=solr&start=0&rows=20&sort=created+desc
Re: Memory use with sorting problem
On Nov 21, 2007 11:06 AM, Chris Laux <[EMAIL PROTECTED]> wrote: > Now when I reduce the size of caches (to a fraction of the default > settings) and number of warming Searchers (to 2), Set the max warming searchers to 1 to ensure that you never have more than one warming at the same time. > memory use is not > reduced and the problem stays. Only deactivating auto-warming will help. > When I set the heap size limit higher (and go into swap space), all the > extra memory seems to be used up right away, independently from > auto-warming. > > This all seems to be closely connected to sorting by a numerical field, > as switching this off does make memory use a lot more friendly. How many documents are in your index? If you don't need range queries on these numeric fields, you might try switching from "sfloat" to "float" and from "sint" to "int". The fieldCache representation will be smaller. > Is it normal to need that much memory for such a small index? Some things are related more to the number of unique terms or the number of documents than to the "size" of the index. -Yonik
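The reason sorting dominates memory here: the first sorted query populates Lucene's FieldCache with one entry per document in the index, regardless of how many rows the query returns. A back-of-the-envelope sketch, with the document count invented for illustration:

// Rough illustration of FieldCache sizing -- not actual Lucene code.
public class SortMemory {
    public static void main(String[] args) {
        int maxDoc = 5000000; // documents in the index (invented number)
        // Sorting on a true "int" field caches roughly one int per doc:
        long intBytes = 4L * maxDoc; // about 19 MB
        System.out.println("int sort cache: " + intBytes / (1024 * 1024) + " MB");
        // Sorting on "sint"/"sfloat" (string-based) additionally keeps an
        // order array plus the unique terms as String objects, which is
        // why Yonik suggests plain int/float when range queries aren't needed.
    }
}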
Re: Any tips for indexing large amounts of data?
Hi Otis, Thanks for this. Are you using a flavor of linux and is it 64bit? How much heap are you giving your jvm? Thanks again Brendan On Nov 21, 2007, at 2:03 AM, Otis Gospodnetic wrote: Mike is right about the occasional slow-down, which appears as a pause and is due to large Lucene index segment merging. This should go away with newer versions of Lucene where this is happening in the background. That said, we just indexed about 20MM documents on a single 8-core machine with 8 GB of RAM, resulting in nearly 20 GB index. The whole process took a little less than 10 hours - that's over 550 docs/second. The vanilla approach before some of our changes apparently required several days to index the same amount of data. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Mike Klaas <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Monday, November 19, 2007 5:50:19 PM Subject: Re: Any tips for indexing large amounts of data? There should be some slowdown in larger indices as occasionally large segment merge operations must occur. However, this shouldn't really affect overall speed too much. You haven't really given us enough data to tell you anything useful. I would recommend trying to do the indexing via a webapp to eliminate all your code as a possible factor. Then, look for signs to what is happening when indexing slows. For instance, is Solr high in cpu, is the computer thrashing, etc? -Mike On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote: Hi, Thanks for answering this question a while back. I have made some of the suggestions you mentioned. ie not committing until I've finished indexing. What I am seeing though, is as the index get larger (around 1Gb), indexing is taking a lot longer. In fact it slows down to a crawl. Have you got any pointers as to what I might be doing wrong? Also, I was looking at using MultiCore solr. Could this help in some way? Thank you Brendan On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote: : I would think you would see better performance by allowing auto commit : to handle the commit size instead of reopening the connection all the : time. if your goal is "fast" indexing, don't use autoCommit at all ... just index everything, and don't commit until you are completely done. autoCommitting will slow your indexing down (the benefit being that more results will be visible to searchers as you proceed) -Hoss
Re: Weird memory error.
Actually, when I look at the error message, this has nothing to do with memory. The error message: java.lang.OutOfMemoryError: unable to create new native thread means that the OS cannot create any new native threads for this JVM. So the limit you are running into is not the JVM memory. I guess you should rather look for a bottleneck inside your application that prevents your server threads from being reused when you fire concurrent batches at your server. Do you do all that in parallel? In the stack trace below, your connector cannot get any new threads from the pool, which has nothing to do with memory. Try to figure out what is taking so much time during the batch process on the server. simon On Nov 20, 2007 5:16 PM, Brian Carmalt <[EMAIL PROTECTED]> wrote: > Hello all, > > I started looking into the scalability of solr, and have started getting > weird results. > I am getting the following error: > > Exception in thread "btpool0-3" java.lang.OutOfMemoryError: unable to > create new native thread >at java.lang.Thread.start0(Native Method) >at java.lang.Thread.start(Thread.java:574) >at > org.mortbay.thread.BoundedThreadPool.newThread(BoundedThreadPool.java:377) >at > org.mortbay.thread.BoundedThreadPool.dispatch(BoundedThreadPool.java:94) >at > org.mortbay.jetty.bio.SocketConnector$Connection.dispatch( SocketConnector.java:187) >at > org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:101) >at > org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java :516) >at > org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java :442) > > This only occurs when I send docs to the server in batches of around 10 > as separate processes. > If I send them serially, the heap grows up to 1200M with no errors. > > When I observe the VM during its operation, it doesn't seem to run out > of memory. The VM starts > with 1024M and can allocate up to 1800M. I start getting the error > listed above when the memory > usage is right around 1 G. I have been using the Jconsole program on > windows to observe the > jetty server by using the com.sun.management.jmxremote* functions on the > server side. The number of threads > is always around 30, and jetty can create up to 250, so I don't think > that's the problem. I can't really imagine that > the monitoring process is using the other 800M of the allowable heap > memory, but it could be. > But the problem occurs without monitoring, even when the VM heap is set > to 1500M. > > Does anyone have an idea as to why this error is occurring? > > Thanks, > Brian >
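If the client really is spawning an unbounded number of parallel submissions, capping concurrency with a fixed pool is one way to keep native thread counts flat on both sides. A minimal sketch; sendBatch is a placeholder for whatever performs the HTTP POST:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Cap concurrent batch submissions instead of one thread/process per batch.
public class BoundedSubmitter {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4); // hard cap
        for (int batch = 0; batch < 100; batch++) {
            final int id = batch;
            pool.execute(new Runnable() {
                public void run() {
                    sendBatch(id); // placeholder: POST one batch to /update
                }
            });
        }
        pool.shutdown();                        // no new tasks
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void sendBatch(int id) {
        // placeholder for the actual HTTP work
    }
}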
Re: Near Duplicate Documents
On 21-Nov-07, at 12:29 AM, climbingrose wrote: The problem with this approach is MD5 hash is very sensitive: one letter difference will generate completely different hash. You probably have to roll your own near duplication detection algorithm. My advice is have a look at existing literature on near duplication detection techniques and then implement one of them. I know Google has some papers that describe a technique called minhash. I read the paper and found it's very interesting. I'm not sure if you can implement the algorithm because they have patented it. That said, there are plenty literature on near dup detection so you should be able to get one for free! To help your googling: the main algorithm used for this is called 'shingling' or 'shingle printing'. -Mike
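A minimal illustration of w-shingling in Java (hypothetical helper, not a Lucene or Solr API): a document becomes the set of its overlapping w-word windows, and near-duplicates are pairs whose shingle sets have high resemblance.

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// w-shingling sketch: overlapping word n-grams as the document fingerprint.
public class Shingler {
    public static Set<String> shingles(String text, int w) {
        List<String> words = new ArrayList<String>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (t.length() > 0) words.add(t);
        }
        Set<String> result = new LinkedHashSet<String>();
        for (int i = 0; i + w <= words.size(); i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < w; j++) {
                if (j > 0) sb.append(' ');
                sb.append(words.get(i + j));
            }
            result.add(sb.toString());
        }
        return result;
    }

    // Resemblance = |A intersect B| / |A union B| (Broder's measure).
    public static double resemblance(Set<String> a, Set<String> b) {
        Set<String> inter = new LinkedHashSet<String>(a);
        inter.retainAll(b);
        Set<String> union = new LinkedHashSet<String>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }
}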
Re: Finding the right place to start ...
On 20-Nov-07, at 8:51 PM, Tracy Flynn wrote: I'm trying to find the right place to start in this community. I recently posted a question in the thread on SOLR-236. In that posting I mentioned that I was hoping to persuade my management to move from a FAST installation to a SOLR-based one. The changeover was approved in principle today. Great! Welcome to the Solr world. Our application is a large Rails application. I integrated Solr and created a proof-of-concept that covered almost all existing functionality and projected new functionality for 2008. So, I have a few requests for information and possibly help. I will need the result collapsing described in SOLR 236 to deploy Solr. It's an absolute requirement. I understand that it's to be available in Solr 1.3. Is there updated information for the timetable for Solr 1.3, and what's to be included? Not exactly. It mostly depends on what is stable and tested in the next few months. It also depends somewhat on the timing of the next lucene release. One of the main dependencies for SOLR-236 has been committed to trunk, so in theory it should be relatively easy to patch a copy of solr yourself to add the needed functionality. One of the great things about Solr is that you can add your own plugins and handlers relatively easily (for instance, you could add the patch locally to your copy to create the demo). The best way to help is to try out the patch, make sure it applies, see if the functionality is working, and review the code changes. Review is usually the biggest bottleneck in open-source development. I would very much also like to have SOLR 103 - SQL Upload plugin available, though I think I have a work around if it isn't in Solr 1.3. This one is less likely as it depends on other components which are not yet included. -Mike
Performance problems for OR-queries
I have N keywords and execute a query of the form

keyword1 OR keyword2 OR ... OR keywordN

The search result would be very large (some millions of hits), so I defined a result limit of 100. However, Solr now seems to calculate, for every possible result document, the number of matched keywords, to order the documents by that number, and to give back the 100 documents with the highest number of matched keywords. This seems to take time linear in the number of all possible matched documents. So my questions are: 1. Does Solr support this kind of index access with better performance? Is there anything special to define in schema.xml? 2. Can one switch off this ordering and just return any 100 documents fulfilling the query (though getting the best-matching documents would be a nice feature if it were fast)? Thanks
Re: Performance problems for OR-queries
On Nov 21, 2007 3:09 PM, Jörg Kiegeland <[EMAIL PROTECTED]> wrote: > I have N keywords and execute a query of the form > > keyword1 OR keyword2 OR ... OR keywordN [...] > This seems to take time linear in the number of all possible matched > documents. Yes. > 1. Does Solr support this kind of index access with better performance? > Is there anything special to define in schema.xml? No... Solr uses Lucene at its core, and all matching documents for a query are scored. > 2. Can one switch off this ordering and just return any 100 documents > fulfilling the query (though getting the best-matching documents would be > a nice feature if it were fast)? A feature like this could be developed... but what is the use case for it? What are you trying to accomplish where neither relevancy nor complete matching matters? There may be an easier workaround for your specific case. -Yonik
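At the Lucene level (not something Solr exposes out of the box; a custom request handler would be needed), "any 100 matches" can be approximated with a collector that aborts early. Each collected doc is still scored once, but the ordering work over millions of hits is skipped. A sketch against the Lucene 2.x HitCollector API:

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Collects the first N matching doc ids (in index order, not by relevance)
// and then aborts the search by throwing an unchecked exception.
public class FirstNCollector extends HitCollector {
    static class EnoughHits extends RuntimeException {}

    private final int limit;
    final List<Integer> docIds = new ArrayList<Integer>();

    FirstNCollector(int limit) { this.limit = limit; }

    public void collect(int doc, float score) {
        docIds.add(doc);
        if (docIds.size() >= limit) throw new EnoughHits(); // stop early
    }

    static List<Integer> firstN(IndexSearcher searcher, Query query, int n)
            throws Exception {
        FirstNCollector collector = new FirstNCollector(n);
        try {
            searcher.search(query, collector);
        } catch (EnoughHits expected) {
            // aborted on purpose after n hits
        }
        return collector.docIds;
    }
}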
Re: Help with Debian solr/jetty install?
After following Otis' and Thorsten's advice, I still get:

HTTP ERROR: 500
No Java compiler available

running http://localhost:8280/solr/admin out of the Debian solr-jetty package. I have *both* the Sun 5 and 6 JDK and JRE installed, and both have javac:

/usr/lib/jvm/java-1.5.0-sun/bin/javac
/usr/lib/jvm/java-6-sun/bin/javac

I get the same error with JAVA_HOME set to either the Sun JDK 5 or 6. I have made sure to stop and start Jetty so it reads the environment.

% echo $JAVA_HOME
/usr/lib/jvm/java-1.5.0-sun
% echo $PATH
/usr/lib/jvm/java-1.5.0-sun:/usr/lib/jvm/java-1.5.0-sun/bin:/root/local/bin:/l/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
% which javac
/usr/lib/jvm/java-1.5.0-sun/bin/javac
% javac -version
javac 1.5.0_13
% cd /etc/init.d
% ./jetty stop
Stopping Jetty servlet engine: ...jetty.
% ./jetty start
Starting Jetty servlet engine: jetty.
% firefox http://localhost:8280/solr/admin &

HTTP ERROR: 500
No Java compiler available

I see in /etc/jetty/start.config some lines to put tools.jar into the classpath:

$(java.home)/lib/tools.jar ! available com.sun.tools.javac.Main
$(java.home)/../lib/tools.jar ! available com.sun.tools.javac.Main

and noticed that java.home was not defined in this file. I defined:

java.home=/usr/lib/jvm/java-1.5.0-sun

No change. I see in the /etc/jetty/webdefault.xml JSP servlet definition a note about compilers: "The JSP page compiler and execution servlet, which is the mechanism used by Tomcat to support JSP pages. Traditionally, this servlet is mapped to URL pattern "*.jsp". This servlet supports the following initialization parameters (default values are in square brackets): [...] compiler - Which compiler Ant should use to compile JSP pages. See the Ant documentation for more information. [javac]" I added "compiler" to the definition, so in full that looks like:

<servlet>
  <servlet-name>jsp</servlet-name>
  <servlet-class>org.apache.jasper.servlet.JspServlet</servlet-class>
  <init-param>
    <param-name>logVerbosityLevel</param-name>
    <param-value>DEBUG</param-value>
  </init-param>
  <init-param>
    <param-name>fork</param-name>
    <param-value>false</param-value>
  </init-param>
  <init-param>
    <param-name>xpoweredBy</param-name>
    <param-value>false</param-value>
  </init-param>
  <init-param>
    <param-name>compiler</param-name>
    <param-value>javac</param-value>
  </init-param>
  <load-on-startup>0</load-on-startup>
</servlet>

I still get the error. Can anyone suggest where I go from here? Thanks, Phil

Thorsten Scherler wrote: On Tue, 2007-11-20 at 22:50 -0800, Otis Gospodnetic wrote: Phillip, I won't go into details, but I'll point out that the Java compiler is called javac and if memory serves me well, it is defined in one of Jetty's XML config files in its etc/ dir. The java compiler is used to compile JSPs that Solr uses for the admin UI. So, make sure you have javac and make sure Jetty can find it.

e.g.

cd ~
vim .bashrc
...
export JAVA_HOME=/home/thorsten/opt/java
export PATH=$JAVA_HOME/bin:$PATH

The important thing is that $JAVA_HOME points to the JDK and it is first in your path!

salu2

Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Phillip Farber <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Tuesday, November 20, 2007 5:55:27 PM Subject: Help with Debian solr/jetty install? Hi, I've successfully run as far as the example admin page on Debian linux 2.6. So I installed the solr-jetty packaged for Debian testing which gives me Jetty 5.1.14-1 and Solr 1.2.0+ds1-1. Jetty starts fine and so does the Solr home page at http://localhost:8280/solr But I get an error when I try to run http://localhost:8280/solr/admin HTTP ERROR: 500 No Java compiler available I have sun-java6-jre and sun-java6-jdk packages installed. I'm new to servlet containers and java webapps. What should I be looking for to fix this or what information could I provide the list to get me moving forward from here?
I've included the trace from the Jetty log, and the java properties dump from the example below. Thanks, Phil --- Java properties (from the example): -- sun.boot.library.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386 java.vm.version = 1.6.0-b105 java.vm.name = Java HotSpot(TM) Client VM user.dir = /tmp/apache-solr-1.2.0/example java.runtime.version = 1.6.0-b105 os.arch = i386 java.io.tmpdir = /tmp java.library.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386/client:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib java.class.version = 50.0 jetty.home = /tmp/apache-solr-1.2.0/example sun.management.compiler = HotSpot Client Compiler os.version = 2.6.22-2-686 java.class.path = /tmp/apache-solr-1.2.0/example:/tmp/apache-solr-1.2.0/example/lib/jetty-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jetty-util-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/servlet-api-2.5-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/ant-1.6.5.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/core-3.1.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-2.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-api-2.1.jar:/usr/share/ant/lib
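Since start.config keys the JSP compiler on whether com.sun.tools.javac.Main is loadable (it lives in the JDK's tools.jar, which those two start.config lines try to add), a quick way to test the exact JVM Jetty runs under is a tiny diagnostic like this (hypothetical helper, not part of Jetty):

// Diagnostic sketch: is the JSP compiler class reachable from this JVM?
public class CompilerCheck {
    public static void main(String[] args) {
        // Note: even under a JDK, java.home usually points at the jre/
        // subdirectory, hence start.config checking both
        // $(java.home)/lib/tools.jar and $(java.home)/../lib/tools.jar.
        System.out.println("java.home = " + System.getProperty("java.home"));
        try {
            Class.forName("com.sun.tools.javac.Main");
            System.out.println("javac found -- JSP compilation should work");
        } catch (ClassNotFoundException e) {
            System.out.println("tools.jar missing from the classpath -- "
                + "point Jetty at the JDK or add tools.jar explicitly");
        }
    }
}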
Re: Help with Debian solr/jetty install?
: After following Otis' and Thorsten's advice, I still get:
:
: HTTP ERROR: 500 No Java compiler available

Just so i'm clear, you:

1) downloaded solr, tried out the tutorial, and had the url http://localhost:8983/solr/admin/ work when you ran:

> cd $DIR_CONTAINING_SOLR/example
> java -jar start.jar

2) you then installed the debian packaging of jetty (which apparently uses port 8280).

3) you copied the solr WAR into the debian install of jetty, and now you get an error about no compiler when you hit the url http://localhost:8280/solr/admin

did i sum that up correctly? have you by any chance attempted to get the debian install of jetty to compile/run a simple helloworld.jsp? If that doesn't work, then you have a much more fundamental problem with the way Jetty is set up than anything related to Solr. This really sounds like maybe there is a problem with the debian packaging of jetty, and nothing specific to Solr ... perhaps people on the jetty user list or one of the debian user lists might have some ideas? -Hoss
Re: Help with Debian solr/jetty install?
Chris Hostetter wrote: : After following Otis' and Thorsten's advice, I still get: : : HTTP ERROR: 500 No Java compiler available Just so i'm clear, you: 1) downloaded solr, tried out the tutorial, and had the url http://localhost:8983/solr/admin/ work when you ran: > cd $DIR_CONTAINING_SOLR/example > java -jar start.jar 2) you then installed the debian packaging of jetty (which apparently uses port 8280).

Yes exactly.

3) you copied the solr WAR into the debian install of jetty, and now you get an error about no compiler when you hit the url http://localhost:8280/solr/admin

I did not copy the WAR to the Debian install of Jetty. It looked like the Debian install took care of that:

% ls -al /usr/share/jetty/webapps
drwxr-xr-x 3 root root 4096 Nov 20 15:32 .
drwxr-xr-x 6 root root 4096 Nov 20 15:32 ..
drwxr-xr-x 15 root root 4096 Nov 20 15:32 root
lrwxrwxrwx 1 root root 10 Nov 20 15:32 solr -> ../../solr

where ../../solr is

% ls -al ../../solr
total 32
drwxr-xr-x 6 root root 4096 2007-11-20 15:38 ./
drwxr-xr-x 379 root root 8192 2007-11-21 16:34 ../
drwxr-xr-x 2 root root 4096 2007-11-20 15:38 admin/
drwxr-xr-x 2 root root 4096 2007-11-20 15:38 bin/
lrwxrwxrwx 1 root root 14 2007-11-20 15:32 conf -> /etc/solr/conf/
-rw-r--r-- 1 root root 1213 2007-09-07 03:55 index.html
drwxr-xr-x 2 root root 4096 2007-11-20 15:38 META-INF/
drwxr-xr-x 3 root root 4096 2007-11-20 15:38 WEB-INF/

Hmmm. I'm not seeing solr.war anywhere under the solr directory symlinked from /usr/share/jetty/webapps. Is that the problem here? Phil

did i sum that up correctly? have you by any chance attempted to get the debian install of jetty to compile/run a simple helloworld.jsp? If that doesn't work, then you have a much more fundamental problem with the way Jetty is set up than anything related to Solr.

I haven't tried that. I'd have to get proficient in JSP :-)

This really sounds like maybe there is a problem with the debian packaging of jetty, and nothing specific to Solr ... perhaps people on the jetty user list or one of the debian user lists might have some ideas?

I'll check that out. Thanks

-Hoss
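For what it's worth, the hello-world JSP Hoss suggests needs no JSP proficiency. Something like this dropped into a served webapp directory should be enough to tell whether the container can compile JSPs at all (the path and port below come from this thread's Debian setup, so treat them as assumptions):

<%-- hello.jsp: place under e.g. /usr/share/jetty/webapps/root/.
     If http://localhost:8280/hello.jsp renders a date, the JSP compiler
     works and the problem is elsewhere; if it fails with
     "No Java compiler available", the problem is Jetty's setup, not Solr. --%>
<html>
  <body>
    <% out.println("Hello from JSP, compiled at " + new java.util.Date()); %>
  </body>
</html>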
Re: Near Duplicate Documents
Hi Ken, It's correct that uncommon words will most likely not show up in the signature. However, I was trying to say that if two documents have 99% of their tokens in common and differ in one token whose frequency is above the quantized frequency, the two resulting hashes are completely different. If you want true near-dup detection, what you would like to have is two hashes that differ in only 1-2 bytes. That way, the signatures truly reflect the content of the documents they represent. However, with this approach you need a bit more work to cluster near-dup documents. Basically, once you have a hash function as I described above, finding similar documents comes down to a Hamming distance problem: two docs are near-dups if their hashes differ in k positions (with k small, perhaps < 3). On Nov 22, 2007 2:35 AM, Ken Krugler <[EMAIL PROTECTED]> wrote: > >The duplication detection mechanism in Nutch is quite primitive. I > >think it uses a MD5 signature generated from the content of a field. > >The generation algorithm is described here: > >http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html. > > > >The problem with this approach is MD5 hash is very sensitive: one > >letter difference will generate completely different hash. > > I'm confused by your answer, assuming it's based on the page > referenced by the URL you provided. > > The approach by TextProfileSignature would only generate a different > MD5 hash with a single letter change if that change resulted in a > change in the quantized frequency for that word. And if it's an > uncommon word, then it wouldn't even show up in the signature. > > -- Ken > > > >You > >probably have to roll your own near duplication detection algorithm. > >My advice is have a look at existing literature on near duplication > >detection techniques and then implement one of them. I know Google has > >some papers that describe a technique called minhash. I read the paper > >and found it's very interesting. I'm not sure if you can implement the > >algorithm because they have patented it. That said, there are plenty > >literature on near dup detection so you should be able to get one for > >free! > > > >On Nov 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote: > >> Otis, > >> > >> Thanks for your response. > >> > > > I just gave a quick look to the Nutch Forum and find that there is an > >> implementation to obtain de-duplicate documents/pages but none for Near > >> Duplicates documents. Can you guide me a little further as to where > >> exactly > > > under Nutch I should be concentrating, regarding near duplicate > > documents? > > > > > > Regards, > >> Rishabh > >> > >> On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]> > >> wrote: > >> > >> > >> > To whomever started this thread: look at Nutch. I believe something > >> > related to this already exists in Nutch for near-duplicate detection. > >> > > >> > Otis > >> > -- > >> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > >> > > >> > - Original Message > >> > From: Mike Klaas <[EMAIL PROTECTED]> > >> > To: solr-user@lucene.apache.org > >> > Sent: Sunday, November 18, 2007 11:08:38 PM > >> > Subject: Re: Near Duplicate Documents > >> > > >> > On 18-Nov-07, at 8:17 AM, Eswar K wrote: > >> > > >> > > Is there any idea implementing that feature in the up coming > >> > releases? > > > > Not currently. Feel free to contribute something if you find a good > >> > solution .
> > > > > >> > -Mike > >> > > >> > > >> > > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote: > >> > > > >> > >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote: > >> > >>> We have a scenario, where we want to find out documents which are > >> > >> similar in > >> > >>> content. To elaborate a little more on what we mean here, lets > >> > >>> take an > >> > >>> example. > >> > >>> > >> > >>> The example of this email chain in which we are interacting on, > >> > >>> can be > >> > >> best > >> > >>> used for illustrating the concept of near dupes (We are not getting > >> > >> confused > >> > >>> with threads, they are two different things.). Each email in this > >> > >>> thread > >> > >> is > >> > >>> treated as a document by the system. A reply to the original mail > >> > >>> also > >> > >>> includes the original mail in which case it becomes a near > >> > >>> duplicate of > >> > >> the > >> > >>> orginal mail (depending on the percentage of similarity). > >> > >>> Similarly it > >> > >> goes > >> > >>> on. The near dupes need not be limited to emails. > >> > >> > >> > >> I think this is what's known as "shingling." See > >> > >> http://en.wikipedia.org/wiki/W-shingling > >> > >> Lucene (and therefore Solr) does not implement shingling. The > >> > >> "MoreLikeThis" query might be close enough, however. > >> > >> > > > > >> -Stuart > > -- > Ken Krugler > Krugle, Inc. > +1 530-210-6378 > "If you c
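The Hamming-distance test described above is cheap once each document carries a similarity-preserving fingerprint (for example a simhash- or minhash-derived 64-bit value; plain MD5 deliberately does not have this property). A small sketch, with the fingerprint values invented for illustration:

// Hamming distance between two 64-bit fingerprints = number of differing
// bit positions. With a similarity-preserving hash, distance <= k
// (small k) flags near-duplicates.
public class Fingerprints {
    public static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static boolean nearDuplicate(long a, long b, int k) {
        return hammingDistance(a, b) <= k;
    }

    public static void main(String[] args) {
        long doc1 = 0x7f3a99c2d4e10b56L;
        long doc2 = 0x7f3a99c2d4e10b57L; // differs in exactly one bit
        System.out.println(hammingDistance(doc1, doc2)); // prints 1
        System.out.println(nearDuplicate(doc1, doc2, 3)); // prints true
    }
}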