Re: Near Duplicate Documents

2007-11-21 Thread climbingrose
Hi Ken, It's correct that uncommon words will most likely not show up in the signature. However, I was trying to say that if two documents have 99% of their tokens in common and differ in one token with frequency > quantised frequency, the two resulting hashes are completely different. If you want true near d
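
The sensitivity described above can be illustrated with a small Python sketch. This is a simplified frequency-profile signature, not Nutch's TextProfileSignature code: the name `profile_signature`, the `quant` parameter, and the rounding rule are assumptions made for illustration only.

```python
import hashlib
import re
from collections import Counter


def profile_signature(text, quant=2):
    """Hypothetical, simplified profile signature (not Nutch's implementation)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    # Round each frequency down to a multiple of `quant` (assumed rule).
    profile = {tok: (n // quant) * quant for tok, n in counts.items()}
    # Tokens whose quantised frequency is zero drop out of the profile.
    profile = {tok: n for tok, n in profile.items() if n > 0}
    # Sort so that the hash input is deterministic.
    ordered = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
    payload = "\n".join(f"{tok} {n}" for tok, n in ordered)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


base = "solr index " * 50 + "rare"
# Differs only in a token that occurs once: it falls below the quantum.
rare_variant = "solr index " * 50 + "unusual"
# Differs in a token that occurs twice: it survives quantisation.
frequent_variant = "solr index " * 50 + "lucene lucene"

print(profile_signature(base) == profile_signature(rare_variant))
print(profile_signature(base) == profile_signature(frequent_variant))
```

In this sketch the rare-token variant keeps the same signature, while the variant whose extra token reaches the quantum gets an unrelated MD5 value, even though the two documents share almost all of their tokens.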

Re: Help with Debian solr/jetty install?

2007-11-21 Thread Phillip Farber
Chris Hostetter wrote: : After following Otis' and Thorsten's advice, I still get: : : HTTP ERROR: 500 No Java compiler available Just so I'm clear, you: 1) downloaded Solr, tried out the tutorial, and had the URL http://localhost:8983/solr/admin/ work when you ran: > cd $

Re: Help with Debian solr/jetty install?

2007-11-21 Thread Chris Hostetter
: After following Otis' and Thorsten's advice, I still get: : : HTTP ERROR: 500 No Java compiler available Just so I'm clear, you: 1) downloaded Solr, tried out the tutorial, and had the URL http://localhost:8983/solr/admin/ work when you ran: > cd $DIR_CONTAINING_SOLR/example

Re: Help with Debian solr/jetty install?

2007-11-21 Thread Phillip Farber
After following Otis' and Thorsten's advice, I still get: HTTP ERROR: 500 No Java compiler available running http://localhost:8280/solr/admin out of the Debian solr-jetty package. I have *both* the Sun 5 and 6 JDK and JRE installed and both have javac /usr/lib/jvm/java-1.5.0-sun/bin/javac /u

Re: Performance problems for OR-queries

2007-11-21 Thread Yonik Seeley
On Nov 21, 2007 3:09 PM, Jörg Kiegeland <[EMAIL PROTECTED]> wrote: > I have N keywords and execute a query of the form > > keyword1 OR keyword2 OR .. OR keywordN [...] > This seems to take linear time to the size of all possible matched > documents. Yes. > 1. Does Solr support this kind of index

Performance problems for OR-queries

2007-11-21 Thread Jörg Kiegeland
I have N keywords and execute a query of the form keyword1 OR keyword2 OR .. OR keywordN The search result would be very large (several million documents), so I defined a result limit of 100. However, Solr now seems to calculate, for every possible result document, the number of matched keywords and to order
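
The behaviour described here, scoring every matching document before applying the limit, can be sketched in Python. This is a toy model and not Solr's or Lucene's scoring code: the `postings` dictionary, the function name `top_k_or_query`, and the use of a plain match count as the score are all assumptions for illustration.

```python
import heapq
from collections import Counter


def top_k_or_query(postings, keywords, k=100):
    """Toy OR-query: rank documents by number of matched keywords."""
    matches = Counter()
    # Every posting of every keyword is visited, so the work grows with
    # the total number of matching documents, not with k.
    for keyword in keywords:
        for doc_id in postings.get(keyword, ()):
            matches[doc_id] += 1
    # Only the final selection depends on k; ties go to the lower doc id.
    return heapq.nlargest(k, matches.items(), key=lambda item: (item[1], -item[0]))


# Hypothetical in-memory index: keyword -> list of document ids.
postings = {"a": [1, 2, 3], "b": [2, 3], "c": [3]}
print(top_k_or_query(postings, ["a", "b", "c"], k=2))  # [(3, 3), (2, 2)]
```

The result limit only trims the final list; the counting loop still touches each candidate document once per matching keyword, which is consistent with the linear cost observed in the question.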

Re: Finding the right place to start ...

2007-11-21 Thread Mike Klaas
On 20-Nov-07, at 8:51 PM, Tracy Flynn wrote: I'm trying to find the right place to start in this community. I recently posted a question in the thread on SOLR-236. In that posting I mentioned that I was hoping to persuade my management to move from a FAST installation to a SOLR-based one.

Re: Near Duplicate Documents

2007-11-21 Thread Mike Klaas
On 21-Nov-07, at 12:29 AM, climbingrose wrote: The problem with this approach is that an MD5 hash is very sensitive: a one-letter difference will generate a completely different hash. You probably have to roll your own near-duplicate detection algorithm. My advice is to have a look at existing literature on
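
The point about MD5 sensitivity is easy to check with Python's standard `hashlib` module. The two sample strings and the helper name `md5_hex` are made up for this illustration.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 digest of a string as 32 hex characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


first = md5_hex("near duplicate documents")
second = md5_hex("near duplicate document")  # one letter shorter

# Count how many of the 32 hex positions differ between the two digests.
differing = sum(x != y for x, y in zip(first, second))
print(first)
print(second)
print(differing, "of 32 hex digits differ")
```

Because the digests of two nearly identical inputs share no useful structure, comparing raw MD5 values can detect exact duplicates only, which is why a separate similarity measure is needed for near duplicates.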

Re: Weird memory error.

2007-11-21 Thread Simon Willnauer
Actually, when I look at the error message, this has nothing to do with memory. The error message: java.lang.OutOfMemoryError: unable to create new native thread means that the OS cannot create any new native threads for this JVM. So the limit you are running into is not the JVM memory. I guess you

Re: Any tips for indexing large amounts of data?

2007-11-21 Thread Brendan Grainger
Hi Otis, Thanks for this. Are you using a flavor of Linux, and is it 64-bit? How much heap are you giving your JVM? Thanks again Brendan On Nov 21, 2007, at 2:03 AM, Otis Gospodnetic wrote: Mike is right about the occasional slow-down, which appears as a pause and is due to large Lucene ind

Re: Memory use with sorting problem

2007-11-21 Thread Yonik Seeley
On Nov 21, 2007 11:06 AM, Chris Laux <[EMAIL PROTECTED]> wrote: > Now when I reduce the size of caches (to a fraction of the default > settings) and number of warming Searchers (to 2), Set the max warming searchers to 1 to ensure that you never have more than one warming at the same time. > memo

Memory use with sorting problem

2007-11-21 Thread Chris Laux
Hi all, I've been struggling with this problem for over a month now, and although memory issues have been discussed often, I don't seem to be able to find a fitting solution. The index is only 1.5 GB, but memory use quickly fills the 1 GB maximum heap on a 2 GB machine. This then works

Re: Document update based on ID

2007-11-21 Thread Ryan McKinley
Evgeniy Strokin wrote: Hello, I have a document indexed with Solr. Originally it had only a few fields. I want to add some more fields to the index later, based on ID, but I don't want to submit the original fields again. I use Solr 1.2, but I think there is no such functionality yet. But I saw a feat

Document update based on ID

2007-11-21 Thread Evgeniy Strokin
Hello, I have a document indexed with Solr. Originally it had only a few fields. I want to add some more fields to the index later, based on ID, but I don't want to submit the original fields again. I use Solr 1.2, but I think there is no such functionality yet. But I saw a feature here https://iss

Re: Near Duplicate Documents

2007-11-21 Thread Ken Krugler
The duplication detection mechanism in Nutch is quite primitive. I think it uses an MD5 signature generated from the content of a field. The generation algorithm is described here: http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html. The problem with this

Re: Solr cluster topology.

2007-11-21 Thread Alexander Wallace
Thanks a lot for your responses! They were all very helpful! On Nov 20, 2007, at 5:52 PM, Norberto Meijome wrote: On Tue, 20 Nov 2007 16:26:27 -0600 Alexander Wallace <[EMAIL PROTECTED]> wrote: Interesting, this ALL MASTERS mode... I guess you don't do any replication then... correct In t

Re: Any tips for indexing large amounts of data?

2007-11-21 Thread Brendan Grainger
Hi Otis, Thanks for the reply. I am using a pretty "vanilla approach" right now and it's taking about 30 hours to build an index of about 5.5 GB. Can you please tell me some of the changes you made to optimize the indexing process? Thanks Brendan On Nov 21, 2007, at 2:27 AM, Otis Gos

Re: Help with Debian solr/jetty install?

2007-11-21 Thread Thorsten Scherler
On Tue, 2007-11-20 at 22:50 -0800, Otis Gospodnetic wrote: > Phillip, > > I won't go into details, but I'll point out that the Java compiler is called > javac and if memory serves me well, it is defined in one of Jetty's XML > config files in its etc/ dir. The Java compiler is used to compile J

Re: Near Duplicate Documents

2007-11-21 Thread Rishabh Joshi
Thanks for the info Cuong! Regards, Rishabh On Nov 21, 2007 1:59 PM, climbingrose <[EMAIL PROTECTED]> wrote: > The duplication detection mechanism in Nutch is quite primitive. I > think it uses an MD5 signature generated from the content of a field. > The generation algorithm is described here: >

Re: Help with Debian solr/jetty install?

2007-11-21 Thread climbingrose
Make sure you have the JDK installed, not just the JRE. Also try setting the JAVA_HOME directory. apt-get install sun-java5-jdk On Nov 21, 2007 5:50 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > Phillip, > > I won't go into details, but I'll point out that the Java compiler is called > javac and if mem

Re: Near Duplicate Documents

2007-11-21 Thread climbingrose
The duplication detection mechanism in Nutch is quite primitive. I think it uses an MD5 signature generated from the content of a field. The generation algorithm is described here: http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html. The problem with this a