Re: Near Duplicate Documents

2007-11-21 Thread climbingrose
The duplication detection mechanism in Nutch is quite primitive. I
think it uses an MD5 signature generated from the content of a field.
The generation algorithm is described here:
http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html.

The problem with this approach is that the MD5 hash is very sensitive: a
one-letter difference will generate a completely different hash. You
probably have to roll your own near-duplication detection algorithm.
My advice is to have a look at the existing literature on near-duplication
detection techniques and then implement one of them. I know Google has
some papers that describe a technique called minhash. I read the paper
and found it very interesting. I'm not sure if you can implement the
algorithm because they have patented it. That said, there is plenty of
literature on near-dup detection, so you should be able to get one for
free!
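
For illustration only, here is a toy sketch of the minhash idea (the class
and its parameters are invented for this email, and this is not the patented
variant from the Google paper):

import java.util.*;

// Each of numHashes random hash functions maps every token to an int; the
// signature keeps the minimum per function. Two documents' signatures agree
// in roughly the same fraction of positions as the Jaccard similarity of
// their token sets.
public class MinHashSketch {
    private final int[] seedsA, seedsB;   // parameters of the random hashes

    public MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        seedsA = new int[numHashes];
        seedsB = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            seedsA[i] = rnd.nextInt() | 1;   // odd multiplier
            seedsB[i] = rnd.nextInt();
        }
    }

    public int[] signature(Set<String> tokens) {
        int[] sig = new int[seedsA.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String t : tokens) {
            int h = t.hashCode();
            for (int i = 0; i < sig.length; i++) {
                int hi = seedsA[i] * h + seedsB[i];   // cheap hash of a hash
                if (hi < sig[i]) sig[i] = hi;
            }
        }
        return sig;
    }

    public static double estimatedJaccard(int[] s1, int[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) if (s1[i] == s2[i]) same++;
        return (double) same / s1.length;
    }
}

A value of estimatedJaccard(sig1, sig2) close to 1 then flags the two
documents as near dups.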

On Nov 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote:
> Otis,
>
> Thanks for your response.
>
> I just gave a quick look to the Nutch Forum and find that there is an
> implementation to obtain de-duplicate documents/pages but none for Near
> Duplicates documents. Can you guide me a little further as to where exactly
> under Nutch I should be concentrating, regarding near duplicate documents?
>
> Regards,
> Rishabh
>
> On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
> wrote:
>
>
> > To whomever started this thread: look at Nutch.  I believe something
> > related to this already exists in Nutch for near-duplicate detection.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > - Original Message 
> > From: Mike Klaas <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Sent: Sunday, November 18, 2007 11:08:38 PM
> > Subject: Re: Near Duplicate Documents
> >
> > On 18-Nov-07, at 8:17 AM, Eswar K wrote:
> >
> > > Is there any idea implementing that feature in the up coming
> >  releases?
> >
> > Not currently.  Feel free to contribute something if you find a good
> > solution .
> >
> > -Mike
> >
> >
> > > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> > >
> > >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> > >>> We have a scenario, where we want to find out documents which are
> > >> similar in
> > >>> content. To elaborate a little more on what we mean here, lets
> > >>> take an
> > >>> example.
> > >>>
> > >>> The example of this email chain in which we are interacting on,
> > >>> can be
> > >> best
> > >>> used for illustrating the concept of near dupes (We are not getting
> > >> confused
> > >>> with threads, they are two different things.). Each email in this
> > >>> thread
> > >> is
> > >>> treated as a document by the system. A reply to the original mail
> > >>> also
> > >>> includes the original mail in which case it becomes a near
> > >>> duplicate of
> > >> the
> > >>> orginal mail (depending on the percentage of similarity).
> > >>> Similarly it
> > >> goes
> > >>> on. The near dupes need not be limited to emails.
> > >>
> > >> I think this is what's known as "shingling."  See
> > >> http://en.wikipedia.org/wiki/W-shingling
> > >> Lucene (and therefore Solr) does not implement shingling.  The
> > >> "MoreLikeThis" query might be close enough, however.
> > >>
> > >> -Stuart
> > >>
> >
> >
> >
> >
> >
>



-- 
Regards,

Cuong Hoang


Re: Help with Debian solr/jetty install?

2007-11-21 Thread climbingrose
Make sure you have the JDK installed, not just the JRE. Also try setting
the JAVA_HOME environment variable.

apt-get install sun-java5-jdk




On Nov 21, 2007 5:50 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Phillip,
>
> I won't go into details, but I'll point out that the Java compiler is called 
> javac and if memory serves me well, it is defined in one of Jetty's XML 
> config files in its etc/ dir.  The java compiler is used to compile JSPs that 
> Solr uses for the admin UI.  So, make sure you have javac and make sure Jetty 
> can find it.
>
> Otis
>
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> - Original Message 
> From: Phillip Farber <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, November 20, 2007 5:55:27 PM
> Subject: Help with Debian solr/jetty install?
>
>
> Hi,
>
> I've successfully run as far as the example admin page on Debian linux
>  2.6.
>
> So I installed the solr-jetty packaged for Debian testing which gives
>  me
> Jetty 5.1.14-1 and Solr 1.2.0+ds1-1.  Jetty starts fine and so does the
>
> Solr home page at http://localhost:8280/solr
>
> But I get an error when I try to run http://localhost:8280/solr/admin
>
> HTTP ERROR: 500
> No Java compiler available
>
> I have sun-java6-jre and sun-java6-jdk packages installed.  I'm new to
> servlet containers and java webapps.  What should I be looking for to
> fix this or what information could I provide the list to get me moving
> forward from here?
>
> I've included the trace from the Jetty log, and the java properties
>  dump
> from the example below.
>
> Thanks,
> Phil
>
> ---
>
> Java properties (from the example):
> --
>
> sun.boot.library.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386
> java.vm.version = 1.6.0-b105
> java.vm.name = Java HotSpot(TM) Client VM
> user.dir = /tmp/apache-solr-1.2.0/example
> java.runtime.version = 1.6.0-b105
> os.arch = i386
> java.io.tmpdir = /tmp
>
> java.library.path =
> /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386/client:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib
> java.class.version = 50.0
> jetty.home = /tmp/apache-solr-1.2.0/example
> sun.management.compiler = HotSpot Client Compiler
> os.version = 2.6.22-2-686
> java.class.path =
> /tmp/apache-solr-1.2.0/example:/tmp/apache-solr-1.2.0/example/lib/jetty-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jetty-util-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/servlet-api-2.5-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/ant-1.6.5.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/core-3.1.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-2.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-api-2.1.jar:/usr/share/ant/lib/ant.jar
> java.home = /usr/lib/jvm/java-6-sun-1.6.0.00/jre
> java.version = 1.6.0
> java.ext.dirs =
> /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/ext:/usr/java/packages/lib/ext
> sun.boot.class.path =
> /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/resources.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/rt.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jsse.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jce.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/charsets.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/classes
>
>
>
>
> Jetty log (from the error under Debian Solr/Jetty):
> 
>
> org.apache.jasper.JasperException: No Java compiler available
> at
> org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:460)
> at
> org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:367)
> at
>  org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329)
> at
>  org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> at
>  org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
> at
> org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473)
> at
>  org.mortbay.jetty.servlet.Dispatcher.dispatch(Dispatcher.java:286)
> at
>  org.mortbay.jetty.servlet.Dispatcher.forward(Dispatcher.java:171)
> at org.mortbay.jetty.servlet.Default.handleGet(Default.java:302)
> at org.mortbay.jetty.servlet.Default.service(Default.java:223)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> at
>  org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
> at
> org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:830)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:185)
> at
> org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:821)
> at
> org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.j

Re: Near Duplicate Documents

2007-11-21 Thread Rishabh Joshi
Thanks for the info Cuong!

Regards,
Rishabh

On Nov 21, 2007 1:59 PM, climbingrose <[EMAIL PROTECTED]> wrote:

> The duplication detection mechanism in Nutch is quite primitive. I
> think it uses a MD5 signature generated from the content of a field.
> The generation algorithm is described here:
>
> http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html
> .
>
> The problem with this approach is MD5 hash is very sensitive: one
> letter difference will generate completely different hash. You
> probably have to roll your own near duplication detection algorithm.
> My advice is have a look at existing literature on near duplication
> detection techniques and then implement one of them. I know Google has
> some papers that describe a technique called minhash. I read the paper
> and found it's very interesting. I'm not sure if you can implement the
> algorithm because they have patented it. That said, there are plenty
> literature on near dup detection so you should be able to get one for
> free!
>
> On Nov 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote:
> > Otis,
> >
> > Thanks for your response.
> >
> > I just gave a quick look to the Nutch Forum and find that there is an
> > implementation to obtain de-duplicate documents/pages but none for Near
> > Duplicates documents. Can you guide me a little further as to where
> exactly
> > under Nutch I should be concentrating, regarding near duplicate
> documents?
> >
> > Regards,
> > Rishabh
> >
> > On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
> > wrote:
> >
> >
> > > To whomever started this thread: look at Nutch.  I believe something
> > > related to this already exists in Nutch for near-duplicate detection.
> > >
> > > Otis
> > > --
> > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > >
> > > - Original Message 
> > > From: Mike Klaas <[EMAIL PROTECTED]>
> > > To: solr-user@lucene.apache.org
> > > Sent: Sunday, November 18, 2007 11:08:38 PM
> > > Subject: Re: Near Duplicate Documents
> > >
> > > On 18-Nov-07, at 8:17 AM, Eswar K wrote:
> > >
> > > > Is there any idea implementing that feature in the up coming
> > >  releases?
> > >
> > > Not currently.  Feel free to contribute something if you find a good
> > > solution .
> > >
> > > -Mike
> > >
> > >
> > > > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]>
> wrote:
> > > >
> > > >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> > > >>> We have a scenario, where we want to find out documents which are
> > > >> similar in
> > > >>> content. To elaborate a little more on what we mean here, lets
> > > >>> take an
> > > >>> example.
> > > >>>
> > > >>> The example of this email chain in which we are interacting on,
> > > >>> can be
> > > >> best
> > > >>> used for illustrating the concept of near dupes (We are not
> getting
> > > >> confused
> > > >>> with threads, they are two different things.). Each email in this
> > > >>> thread
> > > >> is
> > > >>> treated as a document by the system. A reply to the original mail
> > > >>> also
> > > >>> includes the original mail in which case it becomes a near
> > > >>> duplicate of
> > > >> the
> > > >>> orginal mail (depending on the percentage of similarity).
> > > >>> Similarly it
> > > >> goes
> > > >>> on. The near dupes need not be limited to emails.
> > > >>
> > > >> I think this is what's known as "shingling."  See
> > > >> http://en.wikipedia.org/wiki/W-shingling
> > > >> Lucene (and therefore Solr) does not implement shingling.  The
> > > >> "MoreLikeThis" query might be close enough, however.
> > > >>
> > > >> -Stuart
> > > >>
> > >
> > >
> > >
> > >
> > >
> >
>
>
>
> --
> Regards,
>
> Cuong Hoang
>


Re: Help with Debian solr/jetty install?

2007-11-21 Thread Thorsten Scherler
On Tue, 2007-11-20 at 22:50 -0800, Otis Gospodnetic wrote:
> Phillip,
> 
> I won't go into details, but I'll point out that the Java compiler is called 
> javac and if memory serves me well, it is defined in one of Jetty's XML 
> config files in its etc/ dir.  The java compiler is used to compile JSPs that 
> Solr uses for the admin UI.  So, make sure you have javac and make sure Jetty 
> can find it.
>  

e.g. 

cd ~
vim .bashrc

...
export JAVA_HOME=/home/thorsten/opt/java
export PATH=$JAVA_HOME/bin:$PATH

The important thing is that $JAVA_HOME points to the JDK and it is first
in your path!

salu2

> Otis
> 
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> - Original Message 
> From: Phillip Farber <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, November 20, 2007 5:55:27 PM
> Subject: Help with Debian solr/jetty install?
> 
> 
> Hi,
> 
> I've successfully run as far as the example admin page on Debian linux
>  2.6.
> 
> So I installed the solr-jetty packaged for Debian testing which gives
>  me 
> Jetty 5.1.14-1 and Solr 1.2.0+ds1-1.  Jetty starts fine and so does the
>  
> Solr home page at http://localhost:8280/solr
> 
> But I get an error when I try to run http://localhost:8280/solr/admin
> 
> HTTP ERROR: 500
> No Java compiler available
> 
> I have sun-java6-jre and sun-java6-jdk packages installed.  I'm new to 
> servlet containers and java webapps.  What should I be looking for to 
> fix this or what information could I provide the list to get me moving 
> forward from here?
> 
> I've included the trace from the Jetty log, and the java properties
>  dump 
> from the example below.
> 
> Thanks,
> Phil
> 
> ---
> 
> Java properties (from the example):
> --
> 
> sun.boot.library.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386
> java.vm.version = 1.6.0-b105
> java.vm.name = Java HotSpot(TM) Client VM
> user.dir = /tmp/apache-solr-1.2.0/example
> java.runtime.version = 1.6.0-b105
> os.arch = i386
> java.io.tmpdir = /tmp
> 
> java.library.path = 
> /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386/client:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib
> java.class.version = 50.0
> jetty.home = /tmp/apache-solr-1.2.0/example
> sun.management.compiler = HotSpot Client Compiler
> os.version = 2.6.22-2-686
> java.class.path = 
> /tmp/apache-solr-1.2.0/example:/tmp/apache-solr-1.2.0/example/lib/jetty-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jetty-util-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/servlet-api-2.5-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/ant-1.6.5.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/core-3.1.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-2.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-api-2.1.jar:/usr/share/ant/lib/ant.jar
> java.home = /usr/lib/jvm/java-6-sun-1.6.0.00/jre
> java.version = 1.6.0
> java.ext.dirs = 
> /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/ext:/usr/java/packages/lib/ext
> sun.boot.class.path = 
> /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/resources.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/rt.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jsse.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jce.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/charsets.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/classes
> 
> 
> 
> 
> Jetty log (from the error under Debian Solr/Jetty):
> 
> 
> org.apache.jasper.JasperException: No Java compiler available
> at 
> org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:460)
> at 
> org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:367)
> at
>  org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329)
> at
>  org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> at
>  org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
> at 
> org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473)
> at
>  org.mortbay.jetty.servlet.Dispatcher.dispatch(Dispatcher.java:286)
> at
>  org.mortbay.jetty.servlet.Dispatcher.forward(Dispatcher.java:171)
> at org.mortbay.jetty.servlet.Default.handleGet(Default.java:302)
> at org.mortbay.jetty.servlet.Default.service(Default.java:223)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> at
>  org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
> at 
> org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:830)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:185)
> at 
> org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplication

Re: Any tips for indexing large amounts of data?

2007-11-21 Thread Brendan Grainger

HI Otis,

Thanks for the reply. I am using a pretty "vanilla approach" right
now, and it's taking about 30 hours to build an index of about 5.5 GB.
Can you tell me about some of the changes you made to optimize
the indexing process?


Thanks
Brendan

On Nov 21, 2007, at 2:27 AM, Otis Gospodnetic wrote:

Just tried a search for "web" on this index - 1.1 seconds.  This
matches about 1MM of about 20MM docs.  Redo the search, and it's 1
ms (cached).  This is without any load or serious benchmarking,
clearly.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Eswar K <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, November 21, 2007 2:11:07 AM
Subject: Re: Any tips for indexing large amounts of data?

Hi otis,

I understand this is a slightly off-track question, but I am just curious
to know the performance of search on a 20 GB index file. What has been
your observation?

Regards,
Eswar

On Nov 21, 2007 12:33 PM, Otis Gospodnetic  
<[EMAIL PROTECTED]>

wrote:


Mike is right about the occasional slow-down, which appears as a pause and
is due to large Lucene index segment merging.  This should go away with
newer versions of Lucene where this is happening in the background.

That said, we just indexed about 20MM documents on a single 8-core machine
with 8 GB of RAM, resulting in nearly 20 GB index.  The whole process took a
little less than 10 hours - that's over 550 docs/second.  The vanilla
approach before some of our changes apparently required several days to
index the same amount of data.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Mike Klaas <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, November 19, 2007 5:50:19 PM
Subject: Re: Any tips for indexing large amounts of data?

There should be some slowdown in larger indices as occasionally large
segment merge operations must occur.  However, this shouldn't really
affect overall speed too much.

You haven't really given us enough data to tell you anything useful.
I would recommend trying to do the indexing via a webapp to eliminate
all your code as a possible factor.  Then, look for signs to what is
happening when indexing slows.  For instance, is Solr high in cpu, is
the computer thrashing, etc?

-Mike

On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:


Hi,

Thanks for answering this question a while back. I have made some
of the suggestions you mentioned. ie not committing until I've
finished indexing. What I am seeing though, is as the index get
larger (around 1Gb), indexing is taking a lot longer. In fact it
slows down to a crawl. Have you got any pointers as to what I might
be doing wrong?

Also, I was looking at using MultiCore solr. Could this help in
some way?

Thank you
Brendan

On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:



: I would think you would see better performance by allowing auto commit
: to handle the commit size instead of reopening the connection all the
: time.

if your goal is "fast" indexing, don't use autoCommit at all ... just
index everything, and don't commit until you are completely done.

autoCommitting will slow your indexing down (the benefit being that more
results will be visible to searchers as you proceed)




-Hoss

















Re: Solr cluster topology.

2007-11-21 Thread Alexander Wallace

Thanks a lot for your responses! They were all very helpful!

On Nov 20, 2007, at 5:52 PM, Norberto Meijome wrote:


On Tue, 20 Nov 2007 16:26:27 -0600
Alexander Wallace <[EMAIL PROTECTED]> wrote:


Interesting, this ALL MASTERS mode... I guess you don't do any
replication then...


correct


In the single master, several slaves mode, I'm assuming the client
still writes to one and reads from the others... right?


Correct again.

There is also another approach, which I think in SOLR is called
FederatedSearch, where a front end queries a number of index
servers (each with overlapping or non-overlapping data sets) and
puts together one result stream for the answer. There was some
discussion on the list; http://www.mail-archive.com/solr-
[EMAIL PROTECTED]/msg06081.html is the earliest link in the
archive I can find.


B
_
{Beto|Norberto|Numard} Meijome

"People demand freedom of speech to make up for the freedom of  
thought which they avoid. "

  Soren Aabye Kierkegaard

I speak for myself, not my employer. Contents may be hot. Slippery  
when wet. Reading disclaimers makes you go blind. Writing them is  
worse. You have been Warned.






Re: Near Duplicate Documents

2007-11-21 Thread Ken Krugler

The duplication detection mechanism in Nutch is quite primitive. I
think it uses a MD5 signature generated from the content of a field.
The generation algorithm is described here:
http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html.

The problem with this approach is MD5 hash is very sensitive: one
letter difference will generate completely different hash.


I'm confused by your answer, assuming it's based on the page 
referenced by the URL you provided.


The approach by TextProfileSignature would only generate a different 
MD5 hash with a single letter change if that change resulted in a 
change in the quantized frequency for that word. And if it's an 
uncommon word, then it wouldn't even show up in the signature.
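
To make that concrete, here is a simplified sketch of the quantized-frequency
profile the javadoc describes (illustrative only, not the actual Nutch code;
the divisor of 8 and the profile formatting are arbitrary choices):

import java.security.MessageDigest;
import java.util.*;

// Count word frequencies, quantize them, drop words that round down to
// zero, and hash the sorted profile. Small edits only change the MD5 when
// they move a word across a quantization boundary.
public class ProfileSignatureSketch {
    public static byte[] signature(String text) throws Exception {
        Map<String, Integer> freq = new HashMap<String, Integer>();
        int maxFreq = 0;
        for (String tok : text.toLowerCase().split("\\W+")) {
            if (tok.length() == 0) continue;
            int f = freq.containsKey(tok) ? freq.get(tok) + 1 : 1;
            freq.put(tok, f);
            maxFreq = Math.max(maxFreq, f);
        }
        int quant = Math.max(maxFreq / 8, 1);             // quantization step
        List<String> profile = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            int q = (e.getValue() / quant) * quant;       // round down
            if (q > 0) profile.add(e.getKey() + ":" + q); // rare words drop out
        }
        Collections.sort(profile);                        // order-independent
        return MessageDigest.getInstance("MD5")
                .digest(profile.toString().getBytes("UTF-8"));
    }
}

A one-letter edit only changes the output when it pushes some word across a
quantization boundary, which is why small edits usually map to the same
signature.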


-- Ken


You
probably have to roll your own near duplication detection algorithm.
My advice is have a look at existing literature on near duplication
detection techniques and then implement one of them. I know Google has
some papers that describe a technique called minhash. I read the paper
and found it's very interesting. I'm not sure if you can implement the
algorithm because they have patented it. That said, there are plenty
literature on near dup detection so you should be able to get one for
free!

On Nov 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote:

 Otis,

 Thanks for your response.


 > I just gave a quick look to the Nutch Forum and find that there is an

 implementation to obtain de-duplicate documents/pages but none for Near
 Duplicates documents. Can you guide me a little further as to where exactly

 > under Nutch I should be concentrating, regarding near duplicate documents?
 >
 > Regards,

 Rishabh

 On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
 wrote:


 > To whomever started this thread: look at Nutch.  I believe something
 > related to this already exists in Nutch for near-duplicate detection.
 >
 > Otis
 > --
 > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 >
 > - Original Message 
 > From: Mike Klaas <[EMAIL PROTECTED]>
 > To: solr-user@lucene.apache.org
 > Sent: Sunday, November 18, 2007 11:08:38 PM
 > Subject: Re: Near Duplicate Documents
 >
 > On 18-Nov-07, at 8:17 AM, Eswar K wrote:
 >
 > > Is there any idea implementing that feature in the up coming
 >  releases?
 >

 > > Not currently.  Feel free to contribute something if you find a good

 > solution .

 > >

 > -Mike
 >
 >
 > > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
 > >
 > >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
 > >>> We have a scenario, where we want to find out documents which are
 > >> similar in
 > >>> content. To elaborate a little more on what we mean here, lets
 > >>> take an
 > >>> example.
 > >>>
 > >>> The example of this email chain in which we are interacting on,
 > >>> can be
 > >> best
 > >>> used for illustrating the concept of near dupes (We are not getting
 > >> confused
 > >>> with threads, they are two different things.). Each email in this
 > >>> thread
 > >> is
 > >>> treated as a document by the system. A reply to the original mail
 > >>> also
 > >>> includes the original mail in which case it becomes a near
 > >>> duplicate of
 > >> the
 > >>> orginal mail (depending on the percentage of similarity).
 > >>> Similarly it
 > >> goes
 > >>> on. The near dupes need not be limited to emails.
 > >>
 > >> I think this is what's known as "shingling."  See
 > >> http://en.wikipedia.org/wiki/W-shingling
 > >> Lucene (and therefore Solr) does not implement shingling.  The
 > >> "MoreLikeThis" query might be close enough, however.
 > >>

 > > >> -Stuart


--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"


Document update based on ID

2007-11-21 Thread Evgeniy Strokin
Hello,
I have a document indexed with Solr. Originally it had only a few fields. I want
to add some more fields to the index later, based on ID, but I don't want to
submit the original fields again. I use Solr 1.2, but I think there is no such
functionality yet. However, I saw a feature here
https://issues.apache.org/jira/browse/SOLR-139 and it looks like what I need.
Is it implemented already? How can I get the code? Would you suggest using it
in production? How does it work?

Thank you
Gene

Re: Document update based on ID

2007-11-21 Thread Ryan McKinley

Evgeniy Strokin wrote:
Hello,.. 
I have a document indexed with Solr. Originally it had only few fields. I want to add some more fields to the index later, based on ID but I don't want to submit original fields again. I use Solr 1.2, but I think there is no such functionality yet. But I saw a feature here https://issues.apache.org/jira/browse/SOLR-139 and it looks like what I need.
Is it implemented already? How can I get the code? Would you suggest to use it in production? How it works? 
 


Yes, SOLR-139 will eventually do what you need.

The most recent patch should not be *too* hard to get running (it may
not apply cleanly, though).  The patch as is needs to be reworked before
it will go into trunk.  I hope this will happen in the next month or so.


As for production?  It depends ;)  The API will most likely change, so if
you base your code on the current patch, it will need to change when
things finalize.  As for stability, it has worked well for me (and I
think for Erik).


ryan


Memory use with sorting problem

2007-11-21 Thread Chris Laux
Hi all,

I've been struggling with this problem for over a month now, and
although memory issues have been discussed often, I don't seem to be
able to find a fitting solution.

The index is merely 1.5 GB large, but memory use quickly fills out the
heap max of 1 GB on a 2 GB machine. This then works fine until
auto-warming starts. Switching the latter off altogether is unattractive
as it leads to response times of up to 30 s. When auto-warming starts, I
get this error:

> SEVERE: Error during auto-warming of
> key:org.apache.solr.search.QueryResultKey@e0b93139:
> java.lang.OutOfMemoryError: Java heap space

Now when I reduce the size of caches (to a fraction of the default
settings) and number of warming Searchers (to 2), memory use is not
reduced and the problem stays. Only deactivating auto-warming will help.
When I set the heap size limit higher (and go into swap space), all the
extra memory seems to be used up right away, independently from
auto-warming.

This all seems to be closely connected to sorting by a numerical field,
as switching this off does make memory use a lot more friendly.

Is it normal to need that much memory for such a small index?

I suspect the problem is in Lucene, would it be better to post on their
list?

Does anyone know a better way of getting the sorting done?

Thanks in advance for your help,

Chris


This is the field setup in schema.xml:






And this is a sample query:

select/?q=solr&start=0&rows=20&sort=created+desc




Re: Memory use with sorting problem

2007-11-21 Thread Yonik Seeley
On Nov 21, 2007 11:06 AM, Chris Laux <[EMAIL PROTECTED]> wrote:
> Now when I reduce the size of caches (to a fraction of the default
> settings) and number of warming Searchers (to 2),

Set the max warming searchers to 1 to ensure that you never have more
than one warming at the same time.


> memory use is not
> reduced and the problem stays. Only deactivating auto-warming will help.
> When I set the heap size limit higher (and go into swap space), all the
> extra memory seems to be used up right away, independently from
> auto-warming.
>
> This all seems to be closely connected to sorting by a numerical field,
> as switching this off does make memory use a lot more friendly.

How many documents are in your index?

If you don't need range queries on these numeric fields, you might try
switching from "sfloat" to "float" and from "sint" to "int".  The
fieldCache representation will be smaller.
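
As a back-of-the-envelope illustration of why (hypothetical numbers, and
the per-term cost is a rough guess; the assumption is that a plain int/float
field is cached as one 4-byte value per document, while a sortable-string
field also caches the unique term values):

public class SortMemoryEstimate {
    public static void main(String[] args) {
        long maxDoc = 10000000L;        // hypothetical document count
        long uniqueTerms = 1000000L;    // hypothetical unique sort values
        long intSort = maxDoc * 4;      // one int[] entry per doc for "int"
        long stringSort = maxDoc * 4    // order array per doc, plus...
                + uniqueTerms * 40;     // ...~40 bytes per cached term string
        System.out.println("int sort:    ~" + intSort / (1024 * 1024) + " MB");
        System.out.println("string sort: ~" + stringSort / (1024 * 1024) + " MB");
    }
}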

> Is it normal to need that much Memory for such a small index?

Some things depend more on the number of unique terms or the number
of documents than on the "size" of the index.

-Yonik


Re: Any tips for indexing large amounts of data?

2007-11-21 Thread Brendan Grainger

Hi Otis,

Thanks for this. Are you using a flavor of Linux, and is it 64-bit? How
much heap are you giving your JVM?


Thanks again
Brendan

On Nov 21, 2007, at 2:03 AM, Otis Gospodnetic wrote:

Mike is right about the occasional slow-down, which appears as a  
pause and is due to large Lucene index segment merging.  This  
should go away with newer versions of Lucene where this is  
happening in the background.


That said, we just indexed about 20MM documents on a single 8-core  
machine with 8 GB of RAM, resulting in nearly 20 GB index.  The  
whole process took a little less than 10 hours - that's over 550  
docs/second.  The vanilla approach before some of our changes  
apparently required several days to index the same amount of data.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Mike Klaas <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, November 19, 2007 5:50:19 PM
Subject: Re: Any tips for indexing large amounts of data?

There should be some slowdown in larger indices as occasionally large
segment merge operations must occur.  However, this shouldn't really
affect overall speed too much.

You haven't really given us enough data to tell you anything useful.
I would recommend trying to do the indexing via a webapp to eliminate
all your code as a possible factor.  Then, look for signs to what is
happening when indexing slows.  For instance, is Solr high in cpu, is
the computer thrashing, etc?

-Mike

On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:


Hi,

Thanks for answering this question a while back. I have made some
of the suggestions you mentioned. ie not committing until I've
finished indexing. What I am seeing though, is as the index get
larger (around 1Gb), indexing is taking a lot longer. In fact it
slows down to a crawl. Have you got any pointers as to what I might
be doing wrong?

Also, I was looking at using MultiCore solr. Could this help in
some way?

Thank you
Brendan

On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:



: I would think you would see better performance by allowing auto commit
: to handle the commit size instead of reopening the connection all the
: time.

if your goal is "fast" indexing, don't use autoCommit at all ... just
index everything, and don't commit until you are completely done.

autoCommitting will slow your indexing down (the benefit being that more
results will be visible to searchers as you proceed)




-Hoss












Re: Weird memory error.

2007-11-21 Thread Simon Willnauer
Actually, when I look at the error message, this has nothing to do with
memory. The error message:

java.lang.OutOfMemoryError: unable to create new native thread

means that the OS cannot create any new native threads for this JVM. So the
limit you are running into is not the JVM memory.
I guess you should rather look for a bottleneck inside your application that
prevents your server threads from being reused when you fire concurrent
batches at your server. Do you do all that in parallel?

In the stack trace below, your connector cannot get any new threads from the
pool, which has nothing to do with memory.

Try to figure out what is taking so much time during the batch process on
the server.
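
One way to check (a hypothetical stand-alone diagnostic, not part of Solr or
Jetty; run it in the server JVM, or read the same MBean attributes over the
remote JMX connection you already have) is to poll the thread counts while
the concurrent batches run:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Prints live and peak JVM thread counts every five seconds, so you can
// see whether threads pile up instead of being reused.
public class ThreadWatch {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        while (true) {
            System.out.println("live=" + mx.getThreadCount()
                    + " peak=" + mx.getPeakThreadCount());
            Thread.sleep(5000);
        }
    }
}

If the live count keeps climbing toward the OS per-process limit instead of
plateauing around the pool size, the threads are not being reused.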

simon

On Nov 20, 2007 5:16 PM, Brian Carmalt <[EMAIL PROTECTED]> wrote:

> Hello all,
>
> I started looking into the scalability of solr, and have started getting
> weird  results.
> I am getting the following error:
>
> Exception in thread "btpool0-3" java.lang.OutOfMemoryError: unable to
> create new native thread
>at java.lang.Thread.start0(Native Method)
>at java.lang.Thread.start(Thread.java:574)
>at
> org.mortbay.thread.BoundedThreadPool.newThread(BoundedThreadPool.java:377)
>at
> org.mortbay.thread.BoundedThreadPool.dispatch(BoundedThreadPool.java:94)
>at
> org.mortbay.jetty.bio.SocketConnector$Connection.dispatch(
> SocketConnector.java:187)
>at
> org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:101)
>at
> org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java
> :516)
>at
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java
> :442)
>
> This only occurs when I send docs to the server in batches of around 10
> as separate processes.
> If I send the serially, the heap grows up to 1200M and with no errors.
>
> When I observe the VM during it's operation, It doesn't seem to run out
> of memory.  The VM starts
> with 1024M and can allocate up to 1800M. I start getting the error
> listed above when the memory
> usage is right around 1 G. I have been using the Jconsole program on
> windows to observe the
> jetty server by using the com.sun.management.jmxremote* functions on the
> server side. The number of threads
> is always around 30, and jetty can create up 250, so I don't think
> that's the problem. I can't really image that
> the monitoring process is using the other 800M of the allowable heap
> memory, but it could be.
> But the problem occurs without monitoring, even when the VM heap is set
> to 1500M.
>
> Does anyone have an idea as to why this error is occurring?
>
> Thanks,
> Brian
>


Re: Near Duplicate Documents

2007-11-21 Thread Mike Klaas

On 21-Nov-07, at 12:29 AM, climbingrose wrote:


The problem with this approach is MD5 hash is very sensitive: one
letter difference will generate completely different hash. You
probably have to roll your own near duplication detection algorithm.
My advice is have a look at existing literature on near duplication
detection techniques and then implement one of them. I know Google has
some papers that describe a technique called minhash. I read the paper
and found it's very interesting. I'm not sure if you can implement the
algorithm because they have patented it. That said, there are plenty
literature on near dup detection so you should be able to get one for
free!


To help your googling: the main algorithm used for this is called  
'shingling' or 'shingle printing'.
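
For illustration, a toy sketch of that computation (the class name and the
window size are invented, not from Lucene or Solr):

import java.util.*;

// Minimal w-shingling sketch: slide a window of w consecutive tokens over
// the text; the resemblance of two documents is the Jaccard overlap of
// their shingle sets.
public class ShingleSketch {
    public static Set<String> shingles(String text, int w) {
        String[] tokens = text.toLowerCase().split("\\s+");
        Set<String> out = new HashSet<String>();
        for (int i = 0; i + w <= tokens.length; i++) {
            StringBuilder sb = new StringBuilder(tokens[i]);
            for (int j = 1; j < w; j++) sb.append(' ').append(tokens[i + j]);
            out.add(sb.toString());
        }
        return out;
    }

    public static double resemblance(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<String>(a);   // intersection
        inter.retainAll(b);
        Set<String> union = new HashSet<String>(a);   // union
        union.addAll(b);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }
}

Two documents are near dups when resemblance(shingles(a, 4), shingles(b, 4))
is close to 1; the minhash idea mentioned earlier in this thread is the usual
way to make that comparison cheap at scale.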


-Mike


Re: Finding the right place to start ...

2007-11-21 Thread Mike Klaas

On 20-Nov-07, at 8:51 PM, Tracy Flynn wrote:


I'm trying to find the right place to start in this community.

I recently posted a question in the thread on SOLR-236.  In that  
posting I mentioned that I was hoping to persuade my management to  
move from a FAST installation to a SOLR-based one.  The changeover  
was approved in principle today.


Great!  Welcome to the Solr world.

Our application is a large Rails application. I integrated Solr and  
created a proof-of-concept that covered almost all existing  
functionality and projected new functionality for 2008.


So, I have a few requests for information and possibly help.

I will need the result collapsing described in SOLR-236 to deploy
Solr. It's an absolute requirement. I understand that it's to be
available in Solr 1.3. Is there updated information on the
timetable for Solr 1.3, and what's to be included?


Not exactly.  It mostly depends on what is stable and tested in the  
next few months.  It also depends somewhat on the timing of the next  
lucene release.


One of the main dependencies for SOLR-236 has been committed to  
trunk, so in theory it should be relatively easy to patch a copy of  
solr yourself to add the needed functionality.  One of the great  
things about Solr is that you can add your own plugins and handlers  
relatively easily (for instance, you could add the patch locally to  
your copy to create the demo).


The best way to help is to try out the patch, make sure it applies,  
see if the functionality is working, and review the code changes.   
Review is usually the biggest bottleneck in open-source development.


I would very much also like to have SOLR-103 - SQL Upload plugin
available, though I think I have a workaround if it isn't in Solr
1.3.


This one is less likely as it depends on other components which are  
not yet included.


-Mike


Performance problems for OR-queries

2007-11-21 Thread Jörg Kiegeland

I have N keywords and execute a query of the form

keyword1 OR keyword2 OR .. OR keywordN

The search result would be very large (several million documents), so I
defined a result limit of 100.
However, Solr now seems to calculate, for every possible result document,
the number of matched keywords, order the documents, and return the 100
with the highest number of matched keywords.
This seems to take time linear in the number of all possible matched
documents. So my questions are:


1. Does Solr support this kind of index access with better performance?
Is there anything special to define in schema.xml?


2. Can one switch off this ordering and just return any 100 documents
fulfilling the query (though getting the best-matching documents would be
a nice feature if it were fast)?


Thanks





Re: Performance problems for OR-queries

2007-11-21 Thread Yonik Seeley
On Nov 21, 2007 3:09 PM, Jörg Kiegeland <[EMAIL PROTECTED]> wrote:
> I have N keywords and execute a query of the form
>
> keyword1 OR keyword2 OR .. OR keywordN
[...]
> This seems to take linear time to the size of all possible matched
> documents.

Yes.

> 1. Does Solr support this kind of index access with better performance ?
> Is there anything special to define in schema.xml?

No... Solr uses Lucene at its core, and all matching documents for a
query are scored.

> 2. Can one switch off this ordering and just return any 100 documents
> fullfilling the query (though  getting best-matching documents would be
> a nice feature if it would be fast)?

a feature like this could be developed... but what is the use case for
this?  What are you trying to accomplish where either relevancy or
complete matching doesn't matter?  There may be an easier workaround
for your specific case.
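
For what it's worth, a rough sketch of such a workaround against the Lucene
2.x HitCollector API (FirstNCollector and StopEarly are invented names, and
aborting via an exception is a hack, not a supported Solr feature):

import org.apache.lucene.search.HitCollector;

// Gathers the first n matching doc ids and aborts the search early by
// throwing an unchecked exception, so the remaining matches are never
// scored or collected.
public class FirstNCollector extends HitCollector {
    public static class StopEarly extends RuntimeException {}

    private final int[] docs;
    private int count = 0;

    public FirstNCollector(int n) { docs = new int[n]; }

    public void collect(int doc, float score) {
        docs[count++] = doc;
        if (count == docs.length) throw new StopEarly();
    }

    public int[] docs() {   // the doc ids collected so far
        int[] out = new int[count];
        System.arraycopy(docs, 0, out, 0, count);
        return out;
    }
}

You would call searcher.search(query, new FirstNCollector(100)) inside a
try/catch for StopEarly and then read back docs(); the results come back
unranked, which is exactly the trade-off being asked about.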

-Yonik


Re: Help with Debian solr/jetty install?

2007-11-21 Thread Phillip Farber

After following Otis' and Thorsten's advice, I still get:

HTTP ERROR: 500 No Java compiler available

running http://localhost:8280/solr/admin out of the Debian solr-jetty 
package.


I have *both* the sun 5 and 6 JDK and JRE installed and both have javac

/usr/lib/jvm/java-1.5.0-sun/bin/javac
/usr/lib/jvm/java-6-sun/bin/javac

I get the same error with JAVA_HOME set to either the sun JDK 5 or 6.

I have made sure to stop and start Jetty so it reads the environment.

% echo $JAVA_HOME
/usr/lib/jvm/java-1.5.0-sun

% echo $PATH
/usr/lib/jvm/java-1.5.0-sun:/usr/lib/jvm/java-1.5.0-sun/bin:/root/local/bin:/l/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

% which javac
/usr/lib/jvm/java-1.5.0-sun/bin/javac

% javac -version
javac 1.5.0_13

% cd /etc/init.d
% ./jetty stop
Stopping Jetty servlet engine: ...jetty.

% ./jetty start
Starting Jetty servlet engine: jetty.

% firefox http://localhost:8280/solr/admin &
HTTP ERROR: 500 No Java compiler available


I see in /etc/jetty/start.config some lines to put tools.jar into the 
classpath:


$(java.home)/lib/tools.jar ! available com.sun.tools.javac.Main
$(java.home)/../lib/tools.jar ! available com.sun.tools.javac.Main

and noticed that java.home was not defined in this file.  I defined:

java.home=/usr/lib/jvm/java-1.5.0-sun

No change.

I see a note about compilers in the JSP servlet definition in
/etc/jetty/webdefault.xml:


The JSP page compiler and execution servlet, which is the mechanism
used by Tomcat to support JSP pages.  Traditionally, this servlet is
mapped to URL pattern "*.jsp".  This servlet supports the following
initialization parameters (default values are in square brackets):

[...]
compiler  Which compiler Ant should use to compile JSP pages.  See
the Ant documentation for more information. [javac]


I added "compiler" to the definition so in full that looks like:

  
jsp
org.apache.jasper.servlet.JspServlet

logVerbosityLevel
DEBUG


fork
false


xpoweredBy
false


compiler
javac

0
  

I still get the error.


Can anyone suggest where I go from here?

Thanks,

Phil



Thorsten Scherler wrote:

On Tue, 2007-11-20 at 22:50 -0800, Otis Gospodnetic wrote:

Phillip,

I won't go into details, but I'll point out that the Java compiler is
called javac and if memory serves me well, it is defined in one of
Jetty's XML config files in its etc/ dir.  The java compiler is used
to compile JSPs that Solr uses for the admin UI.  So, make sure you
have javac and make sure Jetty can find it.


e.g. 


cd ~
vim .bashrc

...
export JAVA_HOME=/home/thorsten/opt/java
export PATH=$JAVA_HOME/bin:$PATH

The important thing is that $JAVA_HOME points to the JDK and it is first
in your path!

salu2


Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Phillip Farber <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, November 20, 2007 5:55:27 PM
Subject: Help with Debian solr/jetty install?


Hi,

I've successfully run as far as the example admin page on Debian linux
 2.6.

So I installed the solr-jetty packaged for Debian testing which gives
 me 
Jetty 5.1.14-1 and Solr 1.2.0+ds1-1.  Jetty starts fine and so does the
 
Solr home page at http://localhost:8280/solr


But I get an error when I try to run http://localhost:8280/solr/admin

HTTP ERROR: 500
No Java compiler available

I have sun-java6-jre and sun-java6-jdk packages installed.  I'm new to 
servlet containers and java webapps.  What should I be looking for to 
fix this or what information could I provide the list to get me moving 
forward from here?


I've included the trace from the Jetty log, and the java properties
 dump 
from the example below.


Thanks,
Phil

---

Java properties (from the example):
--

sun.boot.library.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386
java.vm.version = 1.6.0-b105
java.vm.name = Java HotSpot(TM) Client VM
user.dir = /tmp/apache-solr-1.2.0/example
java.runtime.version = 1.6.0-b105
os.arch = i386
java.io.tmpdir = /tmp

java.library.path = 
/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386/client:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib

java.class.version = 50.0
jetty.home = /tmp/apache-solr-1.2.0/example
sun.management.compiler = HotSpot Client Compiler
os.version = 2.6.22-2-686
java.class.path = 
/tmp/apache-solr-1.2.0/example:/tmp/apache-solr-1.2.0/example/lib/jetty-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jetty-util-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/servlet-api-2.5-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/ant-1.6.5.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/core-3.1.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-2.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-api-2.1.jar:/usr/share/ant/lib

Re: Help with Debian solr/jetty install?

2007-11-21 Thread Chris Hostetter

: After following Otis' and Thorsten's advice, I still get:
: 
: HTTP ERROR: 500 No Java compiler available

Just so i'm clear, you:
   1) downloaded solr, tried out the tutorial, and had the 
  url http://localhost:8983/solr/admin/ work when you ran:
> cd $DIR_CONTAINING_SOLR/example
> java -jar start.jar
   2) you then installed the debian packaging of jetty (which apparently 
  uses port 8280).
   3) you copied the solr WAR into the debian install of jetty, and now 
  you get an error about no compiler when you hit the url 
  http://localhost:8280/solr/admin

did i sum that up correctly?

have you by any chance attempted to get the debian install of jetty to 
compile/run a simple helloworld.jsp?  If that doesn't work, then 
you have a much more fundamental problem with the way Jetty is set up than 
anything related to Solr.
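
For reference, a minimal test page along those lines might look like this
(hypothetical file; any name with a .jsp extension dropped into a served
webapp directory will do):

<%-- helloworld.jsp: if the container can compile JSPs, this page renders;
     if not, you should see the same "No Java compiler available" 500. --%>
<html><body>
Hello from JSP at <%= new java.util.Date() %>
</body></html>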

This really sounds like maybe there is a problem with the Debian packaging 
of jetty, and nothing specific to Solr ... perhaps people on the jetty 
user list or one of the debian user lists might have some ideas?


-Hoss



Re: Help with Debian solr/jetty install?

2007-11-21 Thread Phillip Farber



Chris Hostetter wrote:

: After following Otis' and Thorsten's advice, I still get:
: 
: HTTP ERROR: 500 No Java compiler available


Just so i'm clear, you:
   1) downloaded solr, tried out the tutorial, and had the 
  url http://localhost:8983/solr/admin/ work when you ran:

> cd $DIR_CONTAINING_SOLR/example
> java -jar start.jar
   2) you then installed the debian packaging of jetty (which aparently 
  uses port 8280).


Yes exactly.

   3) you copied the solr WAR into the debian install of jetty, and how 
  you get an error about no compiler when you hit the url 
  http://localhost:8280/solr/admin


I did not copy the WAR into the Debian install of Jetty.  It looked like the 
Debian install took care of that:


 % ls -al /usr/share/jetty/webapps

  drwxr-xr-x  3 root root 4096 Nov 20 15:32 .
  drwxr-xr-x  6 root root 4096 Nov 20 15:32 ..
  drwxr-xr-x 15 root root 4096 Nov 20 15:32 root
  lrwxrwxrwx  1 root root   10 Nov 20 15:32 solr -> ../../solr

where ../../solr is

% ls -al ../../solr

total 32
drwxr-xr-x   6 root root 4096 2007-11-20 15:38 ./
drwxr-xr-x 379 root root 8192 2007-11-21 16:34 ../
drwxr-xr-x   2 root root 4096 2007-11-20 15:38 admin/
drwxr-xr-x   2 root root 4096 2007-11-20 15:38 bin/
lrwxrwxrwx   1 root root   14 2007-11-20 15:32 conf -> /etc/solr/conf/
-rw-r--r--   1 root root 1213 2007-09-07 03:55 index.html
drwxr-xr-x   2 root root 4096 2007-11-20 15:38 META-INF/
drwxr-xr-x   3 root root 4096 2007-11-20 15:38 WEB-INF/


Hmmm.  I'm not seeing solr.war anywhere under the solr directory symlinked 
from /usr/share/jetty/webapps.  Is that the problem here?


Phil




did i sum that up correctly?

have you by any chance attempted to get the debian install of jetty to  
compile/run a simple helloworld.jsp ?  If that doesn't work, then 
you have a much more fundemental problem with the way Jetty is setup then 
anything related to Solr.


I haven't tried that.  I'd have to get proficient in JSP :-)



This really sounds like maybe there is a problem with the debain packaging 
of jetty, and nothing specific to Solr ... perhaps people on the jetty 
user list or one of the debian user lists might have some ideas?




I'll check that out.  Thanks



-Hoss



Re: Near Duplicate Documents

2007-11-21 Thread climbingrose
Hi Ken,

It's correct that uncommon words most likely don't show up in the
signature. However, I was trying to say that if two documents have 99% of
their tokens in common and differ in one token whose frequency exceeds the
quantised frequency, the two resulting hashes are completely different. If
you want true near-dup detection, what you would like to have is two
hashes that differ in only 1-2 bytes. That way, the signatures will
truly reflect the content of the documents they represent. However, with
this approach, you need a bit more work to cluster near-dup documents.
Basically, once you have a hash function as I described above,
finding similar documents comes down to a Hamming distance problem: two
docs are near dups if their hashes differ in k positions (with k
small, perhaps < 3).
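
A sketch of just that comparison step (the locality-sensitive hash design
itself is the hard part and is not shown; the class name is invented, and
"positions" here are the bits of a 64-bit hash):

// Near-dup test between two precomputed document hashes.
public class HammingCheck {
    public static int hammingDistance(long h1, long h2) {
        return Long.bitCount(h1 ^ h2);   // number of differing bit positions
    }

    public static boolean nearDuplicate(long h1, long h2, int k) {
        return hammingDistance(h1, h2) <= k;   // e.g. k < 3 as suggested above
    }
}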


On Nov 22, 2007 2:35 AM, Ken Krugler <[EMAIL PROTECTED]> wrote:
> >The duplication detection mechanism in Nutch is quite primitive. I
> >think it uses a MD5 signature generated from the content of a field.
> >The generation algorithm is described here:
> >http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html.
> >
> >The problem with this approach is MD5 hash is very sensitive: one
> >letter difference will generate completely different hash.
>
> I'm confused by your answer, assuming it's based on the page
> referenced by the URL you provided.
>
> The approach by TextProfileSignature would only generate a different
> MD5 hash with a single letter change if that change resulted in a
> change in the quantized frequency for that word. And if it's an
> uncommon word, then it wouldn't even show up in the signature.
>
> -- Ken
>
>
> >You
> >probably have to roll your own near duplication detection algorithm.
> >My advice is have a look at existing literature on near duplication
> >detection techniques and then implement one of them. I know Google has
> >some papers that describe a technique called minhash. I read the paper
> >and found it's very interesting. I'm not sure if you can implement the
> >algorithm because they have patented it. That said, there are plenty
> >literature on near dup detection so you should be able to get one for
> >free!
> >
> >On Nov 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote:
> >>  Otis,
> >>
> >>  Thanks for your response.
> >>
> >  > I just gave a quick look to the Nutch Forum and find that there is an
> >>  implementation to obtain de-duplicate documents/pages but none for Near
> >>  Duplicates documents. Can you guide me a little further as to where 
> >> exactly
> >  > under Nutch I should be concentrating, regarding near duplicate 
> > documents?
> >  >
> >  > Regards,
> >>  Rishabh
> >>
> >>  On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
> >>  wrote:
> >>
> >>
> >>  > To whomever started this thread: look at Nutch.  I believe something
> >>  > related to this already exists in Nutch for near-duplicate detection.
> >>  >
> >>  > Otis
> >>  > --
> >>  > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >>  >
> >>  > - Original Message 
> >>  > From: Mike Klaas <[EMAIL PROTECTED]>
> >>  > To: solr-user@lucene.apache.org
> >>  > Sent: Sunday, November 18, 2007 11:08:38 PM
> >>  > Subject: Re: Near Duplicate Documents
> >>  >
> >>  > On 18-Nov-07, at 8:17 AM, Eswar K wrote:
> >>  >
> >>  > > Is there any idea implementing that feature in the up coming
> >>  >  releases?
> >>  >
> >  > > Not currently.  Feel free to contribute something if you find a good
> >>  > solution .
> >  > >
> >>  > -Mike
> >>  >
> >>  >
> >>  > > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> >>  > >
> >>  > >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> >>  > >>> We have a scenario, where we want to find out documents which are
> >>  > >> similar in
> >>  > >>> content. To elaborate a little more on what we mean here, lets
> >>  > >>> take an
> >>  > >>> example.
> >>  > >>>
> >>  > >>> The example of this email chain in which we are interacting on,
> >>  > >>> can be
> >>  > >> best
> >>  > >>> used for illustrating the concept of near dupes (We are not getting
> >>  > >> confused
> >>  > >>> with threads, they are two different things.). Each email in this
> >>  > >>> thread
> >>  > >> is
> >>  > >>> treated as a document by the system. A reply to the original mail
> >>  > >>> also
> >>  > >>> includes the original mail in which case it becomes a near
> >>  > >>> duplicate of
> >>  > >> the
> >>  > >>> orginal mail (depending on the percentage of similarity).
> >>  > >>> Similarly it
> >>  > >> goes
> >>  > >>> on. The near dupes need not be limited to emails.
> >>  > >>
> >>  > >> I think this is what's known as "shingling."  See
> >>  > >> http://en.wikipedia.org/wiki/W-shingling
> >>  > >> Lucene (and therefore Solr) does not implement shingling.  The
> >>  > >> "MoreLikeThis" query might be close enough, however.
> >>  > >>
> >  > > >> -Stuart
>
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you c